A Heterogeneous Benchmark for Information Retrieval
Agentic, Reasoning, and Coding (ARC) foundation models
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
LongBench v2 and LongBench (ACL '25 & '24)
A.S.E (AICGSecEval) is a repository-level AI-generated code security
Strong, Economical, and Efficient Mixture-of-Experts Language Model
Code for the paper "Evaluating Large Language Models Trained on Code"
AI coding agent optimized for small LLMs. 87% benchmark
Meta Agents Research Environments is a comprehensive platform
Visual Causal Flow
MTEB: Massive Text Embedding Benchmark
Code for running inference and finetuning with the SAM 3 model
Benchmarking synthetic data generation methods
Leaderboard Comparing LLM Performance at Producing Hallucinations
CodeGeeX: An Open Multilingual Code Generation Model (KDD 2023)
Autonomous harness engineering
Geometric deep learning extension library for PyTorch
Collection of reference environments, offline reinforcement learning
Collections of robotics environments
Benchmark LLMs by fighting in Street Fighter 3
Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
A reinforcement learning package for Julia
Provider-agnostic, open-source evaluation infrastructure
Utility package for accessing common Machine Learning datasets
Advanced language and coding AI model