Agentic, Reasoning, and Coding (ARC) foundation models
A Heterogeneous Benchmark for Information Retrieval
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
LongBench v2 and LongBench (ACL '25 & '24)
A.S.E (AICGSecEval) is a repository-level AI-generated code security
Code for the paper "Evaluating Large Language Models Trained on Code"
Visual Causal Flow
Meta Agents Research Environments is a comprehensive platform
Code for running inference and finetuning with the SAM 3 model
MTEB: Massive Text Embedding Benchmark
Benchmarking synthetic data generation methods
Leaderboard Comparing LLM Performance at Producing Hallucinations
CodeGeeX: An Open Multilingual Code Generation Model (KDD 2023)
Autonomous harness engineering
Geometric deep learning extension library for PyTorch
Collection of reference environments, offline reinforcement learning
Collections of robotics environments
Designed for text embedding and ranking tasks
Benchmark LLMs by fighting in Street Fighter 3
Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
Provider-agnostic, open-source evaluation infrastructure
Advanced language and coding AI model
A Python toolbox for scalable outlier detection
Advanced Privacy-Preserving Federated Learning framework
SOTA Open Source TTS