Showing 50 open source projects for "benchmark"

View related business solutions
  • $300 Free Credits for Your Google Cloud Projects Icon
    $300 Free Credits for Your Google Cloud Projects

    Start building on Google Cloud with $300 in free credits. No commitment, no credit card required until you're ready to scale.

    Launch your next project with $300 in free Google Cloud credits—no strings attached. Test, build, and deploy without risk. Use your credits across the entire Google Cloud platform to find what works best for your needs. After your credits are used, continue with always-free tier services. Only pay when you're ready to scale. Sign up in minutes and start exploring.
    Start Free Trial
  • MongoDB Atlas runs apps anywhere Icon
    MongoDB Atlas runs apps anywhere

    Deploy in 115+ regions with the modern database for every enterprise.

    MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.
    Start Free
  • 1
    GLM-4.6

    GLM-4.6

    Agentic, Reasoning, and Coding (ARC) foundation models

    ...Its reasoning capabilities have been strengthened, including improved tool usage during inference and more effective integration within agent frameworks. GLM-4.6 also enhances writing quality, producing outputs that better align with human preferences and role-playing scenarios. Benchmark evaluations demonstrate that it not only outperforms GLM-4.5 but also rivals leading global models such as DeepSeek-V3.1-Terminus and Claude Sonnet 4.
    Downloads: 54 This Week
    Last Update:
    See Project
  • 2
    AgentBench

    AgentBench

    A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)

    AgentBench is an open-source benchmark designed to evaluate the capabilities of large language models when used as autonomous agents. Unlike traditional language model benchmarks that focus on static text tasks, AgentBench measures how models perform in interactive environments that require planning, reasoning, and decision-making. The benchmark includes multiple environments that simulate realistic scenarios such as web interaction, database querying, and problem solving tasks. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 3
    LongBench

    LongBench

    LongBench v2 and LongBench (ACL 25'&24')

    ...LongBench addresses this gap by providing datasets that require models to process and reason over long sequences of text across multiple tasks. The benchmark includes multiple categories such as single-document question answering, multi-document reasoning, summarization, long dialogue understanding, and code analysis. It supports bilingual evaluation in English and Chinese to assess multilingual capabilities across extended contexts. Newer versions of the benchmark introduce extremely long context windows ranging from thousands to millions of tokens, enabling researchers to test the limits of modern long-context models.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 4
    AICGSecEval

    AICGSecEval

    A.S.E (AICGSecEval) is a repository-level AI-generated code security

    AICGSecEval is an open-source benchmark framework designed to evaluate the security of code generated by artificial intelligence systems. The project was developed to address concerns that AI-assisted programming tools may produce insecure code containing vulnerabilities such as injection flaws or unsafe logic. The framework constructs evaluation tasks based on real-world software repositories and known vulnerability cases derived from CVE records.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Ship Agents Faster Icon
    Ship Agents Faster

    Transform your applications and workflows into powerful agentic systems at global scale.

    Gemini Enterprise Agent Platform lets you rapidly build, scale, govern and optimize production-ready agents grounded in your organization's data. The platform enables developers to build custom or pre-built agents for virtually any use case. New customers get $300 in free credits.
    Get Started Free
  • 5
    HumanEval

    HumanEval

    Code for the paper "Evaluating Large Language Models Trained on Code"

    ...The benchmark has become a standard for evaluating code generation models, including those in the Codex and GPT families. Researchers can use the dataset to run reproducible comparisons across models and track improvements in functional code synthesis. By focusing on correctness through execution, human-eval provides a rigorous and practical way to evaluate programming capabilities in AI systems.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 6
    Hallucination Leaderboard

    Hallucination Leaderboard

    Leaderboard Comparing LLM Performance at Producing Hallucinations

    ...By focusing on hallucination rates rather than traditional metrics such as accuracy or fluency, the benchmark highlights an important aspect of AI system safety and trustworthiness. The leaderboard is regularly updated as new models are released and evaluation methods evolve.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 7
    CodeGeeX

    CodeGeeX

    CodeGeeX: An Open Multilingual Code Generation Model (KDD 2023)

    ...Developed with MindSpore and later made PyTorch-compatible, it is capable of multilingual code generation, cross-lingual code translation, code completion, summarization, and explanation. It has been benchmarked on HumanEval-X, a multilingual program synthesis benchmark introduced alongside the model, and achieves state-of-the-art performance compared to other open models like InCoder and CodeGen. CodeGeeX also powers IDE plugins for VS Code and JetBrains, offering features like code completion, translation, debugging, and annotation. The model supports Ascend 910 and NVIDIA GPUs, with optimizations like quantization and FasterTransformer acceleration for faster inference.
    Downloads: 10 This Week
    Last Update:
    See Project
  • 8
    LLM Colosseum

    LLM Colosseum

    Benchmark LLMs by fighting in Street Fighter 3

    LLM-Colosseum is an experimental benchmarking framework designed to evaluate the capabilities of large language models through gameplay interactions rather than traditional text-based benchmarks. The system places language models inside the environment of the classic video game Street Fighter III, where they must interpret the game state and decide which actions to perform during combat. This setup creates a dynamic environment that tests reasoning, situational awareness, and decision-making...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 9
    GLM-4.7

    GLM-4.7

    Advanced language and coding AI model

    GLM-4.7 is an advanced agent-oriented large language model designed as a high-performance coding and reasoning partner. It delivers significant gains over GLM-4.6 in multilingual agentic coding, terminal-based workflows, and real-world developer benchmarks such as SWE-bench and Terminal Bench 2.0. The model introduces stronger “thinking before acting” behavior, improving stability and accuracy in complex agent frameworks like Claude Code, Cline, and Roo Code. GLM-4.7 also advances “vibe...
    Downloads: 73 This Week
    Last Update:
    See Project
  • Error to trace to log to deploy. One click. No SSH. Icon
    Error to trace to log to deploy. One click. No SSH.

    Catch the cause before the pager goes off.

    AppSignal links every error to the trace, the trace to the log, the log to the deploy that shipped it.
    Free 30 days.
  • 10
    GLM-4.5

    GLM-4.5

    GLM-4.5: Open-source LLM for intelligent agents by Z.ai

    GLM-4.5 is a cutting-edge open-source large language model designed by Z.ai for intelligent agent applications. The flagship GLM-4.5 model has 355 billion total parameters with 32 billion active parameters, while the compact GLM-4.5-Air version offers 106 billion total parameters and 12 billion active parameters. Both models unify reasoning, coding, and intelligent agent capabilities, providing two modes: a thinking mode for complex reasoning and tool usage, and a non-thinking mode for...
    Downloads: 68 This Week
    Last Update:
    See Project
  • 11
    Qwen2.5-Omni

    Qwen2.5-Omni

    Capable of understanding text, audio, vision, video

    ...It holds state-of-the-art performance in many multimodal benchmarks, particularly spoken language understanding, audio reasoning, image/video understanding, etc. Very strong benchmark performance across modalities (audio understanding, speech recognition, image/video reasoning) and often outperforming or matching single-modality models at a similar scale. Real-time streaming responses, including natural speech synthesis (text-to-speech) and chunked inputs for low latency interaction.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 12
    Qwen

    Qwen

    The official repo of Qwen chat & pretrained large language model

    Qwen is a series of large language models developed by Alibaba Cloud, consisting of various pretrained versions like Qwen-1.8B, Qwen-7B, Qwen-14B, and Qwen-72B. These models, which range from smaller to larger configurations, are designed for a wide range of natural language processing tasks. They are openly available for research and commercial use, with Qwen's code and model weights shared on GitHub. Qwen's capabilities include text generation, comprehension, and conversation, making it a...
    Downloads: 8 This Week
    Last Update:
    See Project
  • 13
    ChatGLM2-6B

    ChatGLM2-6B

    ChatGLM2-6B: An Open Bilingual Chat LLM

    ChatGLM2-6B is the second-gen Chinese-English conversational LLM from ZhipuAI/Tsinghua. It upgrades the base model with GLM’s hybrid pretraining objective, 1.4 TB bilingual data, and preference alignment—delivering big gains on MMLU, CEval, GSM8K, and BBH. The context window extends up to 32K (FlashAttention), and Multi-Query Attention improves speed and memory use. The repo includes Python APIs, CLI & web demos, OpenAI-style/FASTAPI servers, and quantized checkpoints for lightweight local...
    Downloads: 3 This Week
    Last Update:
    See Project
  • 14
    whichllm

    whichllm

    Find the local LLM that actually runs and performs best

    whichllm is a command-line tool for finding local large language models that can realistically run on a user’s hardware. It detects the machine’s available resources, including GPU, CPU, memory, and storage, then recommends models based on practical fit rather than parameter count alone. The project is useful for users who are unsure which local LLM will perform well on their system. It focuses on real, recency-aware benchmarks so recommendations better reflect current model performance....
    Downloads: 1 This Week
    Last Update:
    See Project
  • 15
    GLM-V

    GLM-V

    GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning

    GLM-V is an open-source vision-language model (VLM) series from ZhipuAI that extends the GLM foundation models into multimodal reasoning and perception. The repository provides both GLM-4.5V and GLM-4.1V models, designed to advance beyond basic perception toward higher-level reasoning, long-context understanding, and agent-based applications. GLM-4.5V builds on the flagship GLM-4.5-Air foundation (106B parameters, 12B active), achieving state-of-the-art results on 42 benchmarks across image,...
    Downloads: 4 This Week
    Last Update:
    See Project
  • 16
    LongWriter

    LongWriter

    Unleashing 10,000+ Word Generation from Long Context LLMs

    LongWriter is an open-source framework and set of large language models designed to enable ultra-long text generation that can exceed 10,000 words while maintaining coherence and structure. Traditional large language models can process large inputs but often struggle to generate long outputs due to limitations in training data and alignment strategies. LongWriter addresses this challenge by introducing a specialized dataset and training approach that encourages models to produce longer...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 17
    mistral.rs

    mistral.rs

    Fast, flexible LLM inference

    ...It provides multiple entry points for developers, including a CLI for running models locally and an HTTP server that exposes an OpenAI-compatible API surface for easy integration with existing clients. The project includes hardware-aware tooling that can benchmark a system and choose sensible quantization and device-mapping strategies, helping users get strong performance without manual tuning. It also supports serving multiple models from the same server process, enabling routing or quick switching between models depending on workload needs. For user-facing testing, mistral.rs can provide a built-in web UI, and it also offers a dedicated lightweight web chat interface that supports richer interaction patterns.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 18
    VLMEvalKit

    VLMEvalKit

    Open-source evaluation toolkit of large multi-modality models (LMMs)

    ...The toolkit provides a unified framework that allows researchers and developers to evaluate multimodal models across a wide range of datasets and standardized benchmarks with minimal setup. Instead of requiring complex data preparation pipelines or multiple repositories for each benchmark, the system enables evaluation through simple commands that automatically handle dataset loading, model inference, and metric computation. VLMEvalKit supports generation-based evaluation methods, allowing models to produce textual responses to visual inputs while measuring performance through techniques such as exact matching or language-model-assisted answer extraction.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 19
    CodeGen

    CodeGen

    Open-source model for program synthesis

    ...The project also includes training infrastructure and model checkpoints that allow researchers to experiment with different model sizes and training configurations. Its architecture and training approach enable the models to perform competitively with proprietary coding models on benchmark tasks.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 20
    Coconut

    Coconut

    Training Large Language Model to Reason in a Continuous Latent Space

    Coconut is the official PyTorch implementation of the research paper “Training Large Language Models to Reason in a Continuous Latent Space.” The framework introduces a novel method for enhancing large language models (LLMs) with continuous latent reasoning steps, enabling them to generate and refine reasoning chains within a learned latent space rather than relying solely on discrete symbolic reasoning. It supports training across multiple reasoning paradigms—including standard...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 21
    Qwen-Image

    Qwen-Image

    Qwen-Image is a powerful image generation foundation model

    Qwen-Image is a powerful 20-billion parameter foundation model designed for advanced image generation and precise editing, with a particular strength in complex text rendering across diverse languages, especially Chinese. Built on the MMDiT architecture, it achieves remarkable fidelity in integrating text seamlessly into images while preserving typographic details and layout coherence. The model excels not only in text rendering but also in a wide range of artistic styles, including...
    Downloads: 7 This Week
    Last Update:
    See Project
  • 22
    MiniMax-M2.1

    MiniMax-M2.1

    MiniMax M2.1, a SOTA model for real-world dev & agents.

    MiniMax-M2.1 is an open-source, state-of-the-art agentic language model released to democratize high-performance AI capabilities. It goes beyond a simple parameter upgrade, delivering major gains in coding, tool use, instruction following, and long-horizon planning. The model is designed to be transparent, controllable, and accessible, enabling developers to build autonomous systems without relying on closed platforms. MiniMax-M2.1 excels in real-world software engineering tasks, including...
    Downloads: 3 This Week
    Last Update:
    See Project
  • 23
    TTRL

    TTRL

    Test-Time Reinforcement Learning

    TTRL is an open-source framework for test-time reinforcement learning in large language models, with a particular focus on reasoning tasks where ground-truth labels are not available during inference. The project addresses the problem of how to generate useful reward signals from unlabeled test-time data, and its central insight is that common test-time scaling practices such as majority voting can be repurposed into reward estimates for online reinforcement learning. This makes the...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 24
    BrowserGym

    BrowserGym

    A Gym environment for web task automation

    BrowserGym is an open framework for web task automation research that exposes browser interaction as a Gym-style environment for training and evaluating agents. It is intended for researchers building web agents rather than for end users looking for a consumer automation product. The project provides a common environment where agents can interact with websites, execute tasks, and be evaluated against standardized benchmarks. One of its main strengths is that it bundles several important...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 25
    KG-LLM-Papers

    KG-LLM-Papers

    Papers integrating knowledge graphs (KGs) and large language models

    ...The repository functions as a continuously updated index of scholarly work that investigates how structured knowledge representations can enhance the reasoning, factual accuracy, and interpretability of language models. It includes surveys, benchmark studies, and cutting-edge research that examine topics such as knowledge graph-guided prompting, retrieval-augmented generation, reasoning over structured data, and hybrid architectures combining symbolic and neural systems. By gathering these papers into a single organized repository, the project helps researchers quickly discover relevant literature and track the evolution of the field.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • Next
Auth0 Logo