openbench is an open-source, provider-agnostic evaluation framework for running standardized, reproducible benchmarks on large language models (LLMs), so that models from different providers can be compared on equal footing. It bundles dozens of evaluation suites covering knowledge, reasoning, math, code, science, reading comprehension, long-context recall, graph reasoning, and more, so you don't have to assemble disparate datasets yourself. With a simple CLI (e.g. bench eval <benchmark> --model <model-id>), you can quickly evaluate any model supported by Groq or other providers (OpenAI, Anthropic, HuggingFace, local models, etc.). openbench also supports private and local evaluations: you can plug in your own custom benchmarks or data (e.g. internal test suites, domain-specific tasks) to evaluate models in a privacy-preserving way.
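A typical session looks roughly like the sketch below. The benchmark name, model identifier, and API-key variable are placeholders for illustration, not fixed requirements; substitute whatever your configured provider supports.

```bash
# Provider credentials are read from environment variables
# (GROQ_API_KEY is shown here as an assumption for a Groq-hosted model).
export GROQ_API_KEY=your-key-here

# Discover the built-in benchmarks and inspect one of them
# (the benchmark name is a placeholder).
bench list
bench describe mmlu

# Run an evaluation; the model identifier is a placeholder.
bench eval mmlu --model groq/llama-3.1-8b-instant
```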
Features
- 30+ built-in benchmark suites spanning knowledge, math, reasoning, code, science, graph tasks, and more
- Provider-agnostic: works with many LLM providers including Groq, OpenAI, Anthropic, HuggingFace, local models, and others
- Simple CLI commands for listing, describing, and evaluating benchmarks (bench list, bench describe, bench eval, etc.)
- Support for custom/local benchmarks so users can evaluate domain-specific tasks privately (see the sketch after this list)
- Consistent scoring and result logging for reproducible, comparable evaluation outcomes
- Extensible architecture that simplifies adding new benchmarks or evaluation metrics
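For private or domain-specific work, the same CLI workflow can be reused with your own benchmark. The sketch below assumes a custom eval can be passed to bench eval as a local path; the exact integration mechanism may differ, so treat this as illustrative and check the openbench documentation for how custom benchmarks are registered.

```bash
# Point bench eval at a local benchmark definition instead of a built-in suite.
# The path below is hypothetical; replace it with your own eval file or package.
bench eval ./evals/internal_support_qa.py --model groq/llama-3.1-8b-instant

# Scoring and result logging happen locally, so proprietary test data
# stays within your environment.
```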