Python-based research interface for blackbox
A Heterogeneous Benchmark for Information Retrieval
Agentic, Reasoning, and Coding (ARC) foundation models
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
MTEB: Massive Text Embedding Benchmark
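As a quick illustration of how MTEB is typically driven, here is a minimal sketch assuming the mteb Python package and a sentence-transformers model; the task and model names are illustrative choices, not prescribed by this list:

```python
# Hedged sketch: evaluate a sentence-transformers model on one MTEB task.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
evaluation = MTEB(tasks=["Banking77Classification"])  # any subset of MTEB tasks
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
```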
A.S.E (AICGSecEval) is a repository-level AI-generated code security evaluation benchmark
Meta Agents Research Environments is a comprehensive platform
LongBench v2 and LongBench (ACL '25 & '24)
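A minimal sketch of pulling one LongBench subset, assuming the data is published on the Hugging Face Hub under THUDM/LongBench; the subset name and field accesses below are illustrative:

```python
# Hedged sketch: load one LongBench task split via Hugging Face datasets
# (newer datasets versions may additionally need trust_remote_code=True).
from datasets import load_dataset

data = load_dataset("THUDM/LongBench", "hotpotqa", split="test")
example = data[0]
print(len(example["context"]), example["input"][:80])
```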
Leaderboard Comparing LLM Performance at Producing Hallucinations
Visual Causal Flow
Benchmarking synthetic data generation methods
Code for the paper "Evaluating Large Language Models Trained on Code"
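Assuming this entry is OpenAI's human-eval harness released with that paper, the usual flow is to attach a completion to every problem and write the results out for functional-correctness scoring; a hedged sketch with a stub generator in place of a real model call:

```python
# Hedged sketch assuming the human-eval package: read problems, attach
# completions, and dump them for the correctness-evaluation step.
from human_eval.data import read_problems, write_jsonl

def generate_completion(prompt: str) -> str:
    # Stand-in for a real model call; a real run returns generated code.
    return "    pass\n"

problems = read_problems()
samples = [
    dict(task_id=task_id, completion=generate_completion(problems[task_id]["prompt"]))
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)
# Scoring is a separate step, e.g. the bundled CLI:
#   evaluate_functional_correctness samples.jsonl
```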
Reference implementations of MLPerf™ training benchmarks
Geometric deep learning extension library for PyTorch
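This description matches PyTorch Geometric; assuming torch_geometric is what is meant, a minimal sketch of building a tiny graph and running a single GCN layer over it:

```python
# Hedged sketch assuming PyTorch Geometric: a 3-node graph with 1-d node
# features, passed through one GCNConv layer.
import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)  # COO edge list
x = torch.tensor([[-1.0], [0.0], [1.0]])                      # node features
data = Data(x=x, edge_index=edge_index)

conv = GCNConv(in_channels=1, out_channels=4)
out = conv(data.x, data.edge_index)  # shape: [3, 4]
```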
CodeGeeX: An Open Multilingual Code Generation Model (KDD 2023)
A fast serialization and validation library, with builtin
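The wording suggests a library in the style of msgspec; if so (an assumption, since the entry does not name it), a minimal encode/decode round trip with schema validation might look like:

```python
# Hedged sketch: assumes the library is msgspec; User is an illustrative schema.
import msgspec

class User(msgspec.Struct):
    name: str
    email: str
    admin: bool = False

payload = msgspec.json.encode(User(name="alice", email="alice@example.com"))
user = msgspec.json.decode(payload, type=User)  # raises on schema mismatch
print(user.admin)  # False
```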
Collection of reference environments for offline reinforcement learning
General plug-and-play inference library for Recursive Language Models
bsuite is a collection of carefully-designed experiments
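A hedged sketch of loading a single bsuite experiment and stepping it through the dm_env interface; 'catch/0' is one of the standard experiment ids, and the random policy is a stand-in for a real agent:

```python
# Hedged sketch: load one bsuite experiment and take random actions.
import numpy as np
import bsuite

env = bsuite.load_from_id("catch/0")        # returns a dm_env.Environment
num_actions = env.action_spec().num_values

timestep = env.reset()
while not timestep.last():
    action = np.random.randint(num_actions)  # stub agent
    timestep = env.step(action)
```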
Provider-agnostic, open-source evaluation infrastructure
Benchmark LLMs by fighting in Street Fighter 3
Collections of robotics environments
Designed for text embedding and ranking tasks
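The entry does not name the model, so the sketch below only illustrates what "embedding and ranking" means in practice, using sentence-transformers as an assumed stand-in: embed a query and candidate documents, then rank by cosine similarity.

```python
# Generic illustration only; sentence-transformers is an assumed stand-in,
# not necessarily the library this entry refers to.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query = "how do I reset my password?"
docs = ["Resetting your password", "Pricing plans", "Password reset steps"]

q_emb = model.encode(query, convert_to_tensor=True)
d_emb = model.encode(docs, convert_to_tensor=True)
scores = util.cos_sim(q_emb, d_emb)[0]          # one score per document
ranked = sorted(zip(docs, scores.tolist()), key=lambda p: p[1], reverse=True)
print(ranked[0][0])  # best-matching document
```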
Simulation framework for accelerating research
Minimal examples of data structures and algorithms in Python
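In the spirit of that collection, one such minimal example (a generic iterative binary search, written here for illustration rather than copied from the repo):

```python
# Iterative binary search over a sorted list, returning the index or -1.
def binary_search(items, target):
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

assert binary_search([1, 3, 5, 7, 11], 7) == 3
assert binary_search([1, 3, 5, 7, 11], 4) == -1
```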