Showing 177 open source projects for "benchmark"

View related business solutions
  • Compliant and Reliable File Transfers Backed by Top Security Certifications Icon
    Compliant and Reliable File Transfers Backed by Top Security Certifications

    Cerberus FTP Server delivers SOC 2 Type II certified security and FIPS 140-2 validated encryption.

    Stop relying on non-certified, legacy file transfer tools that creak under the weight of modern security demands. Get full audit trails, advanced access controls and more supported by an award-winning team of experts. Start your free 25-day trial today.
    Start Free Trial
  • $300 Free Credits for Your Google Cloud Projects Icon
    $300 Free Credits for Your Google Cloud Projects

    Start building on Google Cloud with $300 in free credits. No commitment, no credit card required until you're ready to scale.

    Launch your next project with $300 in free Google Cloud credits—no strings attached. Test, build, and deploy without risk. Use your credits across the entire Google Cloud platform to find what works best for your needs. After your credits are used, continue with always-free tier services. Only pay when you're ready to scale. Sign up in minutes and start exploring.
    Start Free Trial
  • 1
    BEIR

    BEIR

    A Heterogeneous Benchmark for Information Retrieval

    BEIR is a benchmark framework for evaluating information retrieval models across various datasets and tasks, including document ranking and question answering.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 2
    GLM-4.6

    GLM-4.6

    Agentic, Reasoning, and Coding (ARC) foundation models

    ...Its reasoning capabilities have been strengthened, including improved tool usage during inference and more effective integration within agent frameworks. GLM-4.6 also enhances writing quality, producing outputs that better align with human preferences and role-playing scenarios. Benchmark evaluations demonstrate that it not only outperforms GLM-4.5 but also rivals leading global models such as DeepSeek-V3.1-Terminus and Claude Sonnet 4.
    Downloads: 39 This Week
    Last Update:
    See Project
  • 3
    AgentBench

    AgentBench

    A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)

    AgentBench is an open-source benchmark designed to evaluate the capabilities of large language models when used as autonomous agents. Unlike traditional language model benchmarks that focus on static text tasks, AgentBench measures how models perform in interactive environments that require planning, reasoning, and decision-making. The benchmark includes multiple environments that simulate realistic scenarios such as web interaction, database querying, and problem solving tasks. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 4
    LongBench

    LongBench

    LongBench v2 and LongBench (ACL 25'&24')

    ...LongBench addresses this gap by providing datasets that require models to process and reason over long sequences of text across multiple tasks. The benchmark includes multiple categories such as single-document question answering, multi-document reasoning, summarization, long dialogue understanding, and code analysis. It supports bilingual evaluation in English and Chinese to assess multilingual capabilities across extended contexts. Newer versions of the benchmark introduce extremely long context windows ranging from thousands to millions of tokens, enabling researchers to test the limits of modern long-context models.
    Downloads: 0 This Week
    Last Update:
    See Project
  • AI-powered service management for IT and enterprise teams Icon
    AI-powered service management for IT and enterprise teams

    Enterprise-grade ITSM, for every business

    Give your IT, operations, and business teams the ability to deliver exceptional services—without the complexity. Maximize operational efficiency with refreshingly simple, AI-powered Freshservice.
    Try it Free
  • 5
    AICGSecEval

    AICGSecEval

    A.S.E (AICGSecEval) is a repository-level AI-generated code security

    AICGSecEval is an open-source benchmark framework designed to evaluate the security of code generated by artificial intelligence systems. The project was developed to address concerns that AI-assisted programming tools may produce insecure code containing vulnerabilities such as injection flaws or unsafe logic. The framework constructs evaluation tasks based on real-world software repositories and known vulnerability cases derived from CVE records.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 6
    DeepSeek V2

    DeepSeek V2

    Strong, Economical, and Efficient Mixture-of-Experts Language Model

    ...The V2 model is expected to support more advanced features like better context window handling, more efficient inference, better performance on challenging tasks, and stronger alignment with human feedback. Because DeepSeek is pushing open-weight competition, this V2 iteration is meant to solidify its position in benchmark rankings and in developer adoption. The code in the repository may include description files, support for tool use or plug-in architectures, and artifacts showing fine-tuning or prompt templates.
    Downloads: 36 This Week
    Last Update:
    See Project
  • 7
    HumanEval

    HumanEval

    Code for the paper "Evaluating Large Language Models Trained on Code"

    ...The benchmark has become a standard for evaluating code generation models, including those in the Codex and GPT families. Researchers can use the dataset to run reproducible comparisons across models and track improvements in functional code synthesis. By focusing on correctness through execution, human-eval provides a rigorous and practical way to evaluate programming capabilities in AI systems.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 8
    SmallCode

    SmallCode

    AI coding agent optimized for small LLMs. 87% benchmark

    ...Its workflow is built around terminal usage, which makes it suitable for developers who prefer command-line control and local project context. smallcode emphasizes efficient agent behavior, careful tool use, and benchmark-driven improvements for constrained models. Its main value is giving developers a compact coding-agent environment that treats small local models as first-class tools.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 9
    Meta Agents Research Environments (ARE)

    Meta Agents Research Environments (ARE)

    Meta Agents Research Environments is a comprehensive platform

    ...Unlike static benchmarks, ARE supports environments where agents must adapt to changes over time and reason over sequences of actions. It interacts with applications and faces uncertainty. The included Gaia2 benchmark offers 800 scenarios across multiple “universes”. It can test reasoning, memory, tool use, and adaptability. Integration with simulated applications/agent APIs (email, file system, etc.). Support for multiple AI model backends/providers.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Error to trace to log to deploy. One click. No SSH. Icon
    Error to trace to log to deploy. One click. No SSH.

    Catch the cause before the pager goes off.

    AppSignal links every error to the trace, the trace to the log, the log to the deploy that shipped it.
    Free 30 days.
  • 10
    DeepSeek-OCR 2

    DeepSeek-OCR 2

    Visual Causal Flow

    ...It is designed to handle complex layouts and noisy documents by giving the model causal reasoning capabilities that mimic human visual scanning behavior, enhancing OCR performance on documents with rich spatial structure. The repository provides model code and inference scripts that let researchers and developers run and benchmark the system on both images and PDFs, with support for batch evaluation and optimized pipelines leveraging vLLM and transformers.
    Downloads: 8 This Week
    Last Update:
    See Project
  • 11
    MTEB

    MTEB

    MTEB: Massive Text Embedding Benchmark

    ...This makes progress in the field difficult to track, as various models are constantly being proposed without proper evaluation. To solve this problem, we introduce the Massive Text Embedding Benchmark (MTEB). MTEB spans 8 embedding tasks covering a total of 58 datasets and 112 languages. Through the benchmarking of 33 models on MTEB, we establish the most comprehensive benchmark of text embeddings to date. We find that no particular text embedding method dominates across all tasks. This suggests that the field has yet to converge on a universal text embedding method and scale it up sufficiently to provide state-of-the-art results on all embedding tasks.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 12
    SAM 3

    SAM 3

    Code for running inference and finetuning with SAM 3 model

    ...This capability is grounded in a new data engine that automatically annotated over four million unique concepts, producing a massive open-vocabulary segmentation dataset and enabling the model to achieve 75–80% of human performance on the SA-CO benchmark, which itself spans 270K unique concepts.
    Downloads: 27 This Week
    Last Update:
    See Project
  • 13
    SDGym

    SDGym

    Benchmarking synthetic data generation methods

    The Synthetic Data Gym (SDGym) is a benchmarking framework for modeling and generating synthetic data. Measure performance and memory usage across different synthetic data modeling techniques – classical statistics, deep learning and more! The SDGym library integrates with the Synthetic Data Vault ecosystem. You can use any of its synthesizers, datasets or metrics for benchmarking. You also customize the process to include your own work. Select any of the publicly available datasets from the...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 14
    Hallucination Leaderboard

    Hallucination Leaderboard

    Leaderboard Comparing LLM Performance at Producing Hallucinations

    ...By focusing on hallucination rates rather than traditional metrics such as accuracy or fluency, the benchmark highlights an important aspect of AI system safety and trustworthiness. The leaderboard is regularly updated as new models are released and evaluation methods evolve.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 15
    CodeGeeX

    CodeGeeX

    CodeGeeX: An Open Multilingual Code Generation Model (KDD 2023)

    ...Developed with MindSpore and later made PyTorch-compatible, it is capable of multilingual code generation, cross-lingual code translation, code completion, summarization, and explanation. It has been benchmarked on HumanEval-X, a multilingual program synthesis benchmark introduced alongside the model, and achieves state-of-the-art performance compared to other open models like InCoder and CodeGen. CodeGeeX also powers IDE plugins for VS Code and JetBrains, offering features like code completion, translation, debugging, and annotation. The model supports Ascend 910 and NVIDIA GPUs, with optimizations like quantization and FasterTransformer acceleration for faster inference.
    Downloads: 12 This Week
    Last Update:
    See Project
  • 16
    AutoAgent AI

    AutoAgent AI

    Autonomous harness engineering

    ...Instead of manually tuning prompts or workflows, developers define high-level goals in a configuration file, and the system continuously modifies its own tools, orchestration, and logic based on benchmark performance. It operates through a loop of testing, analyzing failures, and refining the agent’s configuration to maximize a scoring metric. The framework uses a single-file agent harness combined with structured tasks and evaluation suites to guide optimization. It runs inside Docker for safe execution and reproducibility. This approach shifts agent development from manual design to automated optimization. ...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 17
    PyTorch Geometric

    PyTorch Geometric

    Geometric deep learning extension library for PyTorch

    It consists of various methods for deep learning on graphs and other irregular structures, also known as geometric deep learning, from a variety of published papers. In addition, it consists of an easy-to-use mini-batch loader for many small and single giant graphs, a large number of common benchmark datasets (based on simple interfaces to create your own), and helpful transforms, both for learning on arbitrary graphs as well as on 3D meshes or point clouds. We have outsourced a lot of functionality of PyTorch Geometric to other packages, which needs to be additionally installed. These packages come with their own CPU and GPU kernel implementations based on C++/CUDA extensions. ...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 18
    D4RL

    D4RL

    Collection of reference environments, offline reinforcement learning

    D4RL (Datasets for Deep Data-Driven Reinforcement Learning) is a benchmark suite focused on offline reinforcement learning — i.e., learning policies from fixed datasets rather than via online interaction with the environment. It contains standardized environments, tasks and datasets (observations, actions, rewards, terminals) aimed at enabling reproducible research in offline RL. Researchers can load a dataset for a given task (e.g., maze navigation, manipulation) and apply their algorithm without the need to collect fresh transitions, which accelerates experimentation and comparison. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 19
    Meta-World

    Meta-World

    Collections of robotics environments

    Meta-World is an open-source benchmark suite of robotic manipulation environments focused on multi-task and meta reinforcement learning. It provides a large collection of continuous-control tasks, such as reaching, pushing, opening doors, and manipulating objects with a simulated robot arm. The library defines standardized benchmarks like MT1, MT10, and MT50 for multi-task learning, where a single policy is trained across different numbers of tasks.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 20
    LLM Colosseum

    LLM Colosseum

    Benchmark LLMs by fighting in Street Fighter 3

    LLM-Colosseum is an experimental benchmarking framework designed to evaluate the capabilities of large language models through gameplay interactions rather than traditional text-based benchmarks. The system places language models inside the environment of the classic video game Street Fighter III, where they must interpret the game state and decide which actions to perform during combat. This setup creates a dynamic environment that tests reasoning, situational awareness, and decision-making...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 21
    CUDA Agent

    CUDA Agent

    Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

    ...The system operates in a ReAct-style loop where the agent profiles baseline implementations, writes CUDA code, compiles it in a sandbox, and iteratively refines performance. CUDA-Agent has demonstrated strong benchmark results, achieving high pass rates and significant speedups compared with compiler baselines such as torch.compile.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 22
    ReinforcementLearning.jl

    ReinforcementLearning.jl

    A reinforcement learning package for Julia

    A collection of tools for doing reinforcement learning research in Julia. Provide elaborately designed components and interfaces to help users implement new algorithms. Make it easy for new users to run benchmark experiments, compare different algorithms, and evaluate and diagnose agents. Facilitate reproducibility from traditional tabular methods to modern deep reinforcement learning algorithms. Make it easy for new users to run benchmark experiments, compare different algorithms, and evaluate and diagnose agents. Facilitate reproducibility from traditional tabular methods to modern deep reinforcement learning algorithms. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 23
    openbench

    openbench

    Provider-agnostic, open-source evaluation infrastructure

    ...It bundles dozens of evaluation suites — covering knowledge, reasoning, math, code, science, reading comprehension, long-context recall, graph reasoning, and more — so users don’t need to assemble disparate datasets themselves. With a simple CLI interface (e.g. bench eval <benchmark> --model <model-id>), you can quickly evaluate any model supported by Groq or other providers (OpenAI, Anthropic, HuggingFace, local models, etc.). openbench also supports private/local evaluations: you can integrate your own custom benchmarks or data (e.g. internal test suites, domain-specific tasks) to evaluate models in a privacy-preserving way.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 24
    MLDatasets.jl

    MLDatasets.jl

    Utility package for accessing common Machine Learning datasets

    This package represents a community effort to provide a common interface for accessing common Machine Learning (ML) datasets. In contrast to other data-related Julia packages, the focus of MLDatasets.jl is specifically on downloading, unpacking, and accessing benchmark datasets. Functionality for the purpose of data processing or visualization is only provided to a degree that is special to some datasets.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 25
    GLM-4.7

    GLM-4.7

    Advanced language and coding AI model

    GLM-4.7 is an advanced agent-oriented large language model designed as a high-performance coding and reasoning partner. It delivers significant gains over GLM-4.6 in multilingual agentic coding, terminal-based workflows, and real-world developer benchmarks such as SWE-bench and Terminal Bench 2.0. The model introduces stronger “thinking before acting” behavior, improving stability and accuracy in complex agent frameworks like Claude Code, Cline, and Roo Code. GLM-4.7 also advances “vibe...
    Downloads: 68 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • 3
  • 4
  • 5
  • Next
Auth0 Logo