A microbenchmark support library
Agentic, Reasoning, and Coding (ARC) foundation models
A benchmarking framework for the Julia language
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
LongBench v2 and LongBench (ACL '25 & '24)
A.S.E (AICGSecEval) is a repository-level security evaluation benchmark for AI-generated code
Code for running inference and finetuning with the SAM 3 model
Visual Causal Flow
Meta Agents Research Environments is a comprehensive platform for evaluating AI agents
Integrates the JMH benchmarking framework with Gradle
Leaderboard Comparing LLM Performance at Producing Hallucinations
The Abstraction and Reasoning Corpus
Strong, Economical, and Efficient Mixture-of-Experts Language Model
Import public NYC taxi and for-hire vehicle (Uber, Lyft) trip data
Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
General plug-and-play inference library for Recursive Language Models
A collection of reference environments for offline reinforcement learning
Benchmark LLMs by fighting in Street Fighter 3
A Modular Simulation Framework and Benchmark for Robot Learning
Advanced language and coding AI model
Provider-agnostic, open-source evaluation infrastructure
GLM-4.5: Open-source LLM for intelligent agents by Z.ai
Minimal examples of data structures and algorithms in Python
An experimental version of the DeepSeek model
Cluster computing framework for processing large-scale geospatial data