Reference implementations of MLPerf™ training benchmarks
Agentic, Reasoning, and Coding (ARC) foundation models
A Heterogeneous Benchmark for Information Retrieval
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
LongBench v2 and LongBench (ACL'25 & ACL'24)
Code for the paper "Evaluating Large Language Models Trained on Code"
Code for running inference and fine-tuning with the SAM 3 model
A.S.E (AICGSecEval) is a repository-level AI-generated code security evaluation benchmark
Visual Causal Flow
MTEB: Massive Text Embedding Benchmark
Meta Agents Research Environments is a comprehensive platform for evaluating AI agents in dynamic, realistic scenarios
Benchmarking synthetic data generation methods
Leaderboard Comparing LLM Performance at Producing Hallucinations when Summarizing Short Documents
CodeGeeX: An Open Multilingual Code Generation Model (KDD 2023)
A Strong, Economical, and Efficient Mixture-of-Experts Language Model
A collection of robotics environments
Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
A collection of reference environments for offline reinforcement learning
Designed for text embedding and ranking tasks
Benchmark LLMs by fighting in Street Fighter 3
A reinforcement learning package for Julia
Geometric deep learning extension library for PyTorch
A Modular Simulation Framework and Benchmark for Robot Learning
Advanced language and coding AI model
Provider-agnostic, open-source evaluation infrastructure