Arthur Bench

Bench is a tool for evaluating LLMs for production use cases. Whether you are comparing different LLMs, considering different prompts, or testing generation hyperparameters like temperature and number of tokens, Bench provides one touch point for all your LLM performance evaluation.

Features

To standardize the workflow of LLM evaluation with a common interface across tasks and use cases
To test whether open source LLMs can do as well as the top closed-source LLM API providers on your specific data
To translate the rankings on LLM leaderboards and benchmarks into scores that you care about for your actual use case
Install Bench into your Python environment with optional dependencies for serving results locally
Alternatively, install Bench into your Python environment with minimal dependencies
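
The comparison workflow described in the features above can be sketched in plain Python. This is a minimal illustration only and does not use Bench's own API: the `evaluate` and `exact_match` helpers, the model names, and the sample data are all hypothetical, chosen to show the idea of scoring several candidate models against the same references through one common interface.

```python
from typing import Callable, Dict, List

# Hypothetical scorer signature: (candidate, reference) -> score in [0, 1].
Scorer = Callable[[str, str], float]


def exact_match(candidate: str, reference: str) -> float:
    """Return 1.0 when the texts match, ignoring case and outer whitespace."""
    return 1.0 if candidate.strip().lower() == reference.strip().lower() else 0.0


def evaluate(
    candidates_by_model: Dict[str, List[str]],
    references: List[str],
    scorer: Scorer,
) -> Dict[str, float]:
    """Score every model's outputs against the same references.

    Returns the mean score per model, so different models (or prompts,
    or temperature settings) can be compared on one scale.
    """
    if not references:
        raise ValueError("references must not be empty")
    results: Dict[str, float] = {}
    for name, outputs in candidates_by_model.items():
        if len(outputs) != len(references):
            raise ValueError(f"{name}: expected {len(references)} outputs")
        scores = [scorer(c, r) for c, r in zip(outputs, references)]
        results[name] = sum(scores) / len(scores)
    return results


# Illustrative data only: reference answers and two hypothetical models.
references = ["Paris", "4", "blue"]
candidates = {
    "open_source_model": ["Paris", "5", "Blue"],
    "closed_api_model": ["paris", "4", "blue"],
}

scores = evaluate(candidates, references, exact_match)
for model, score in scores.items():
    print(f"{model}: {score:.2f}")
```

Swapping `exact_match` for another function with the same signature changes the metric without touching the comparison loop, which is the sense in which a common interface standardizes evaluation across tasks.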

License

MIT License

Additional Project Details

Programming Language

TypeScript

Related Categories

TypeScript Artificial Intelligence Software

Registered

2023-08-21

Similar Business Software

Parea

The prompt engineering platform to experiment with different prompt versions, evaluate and compare prompts across a suite of tests, optimize prompts with one-click, share, and more. Optimize your AI development workflow. Key features to help you get and identify the best prompts for your...

BenchLLM

Use BenchLLM to evaluate your code on the fly. Build test suites for your models and generate quality reports. Choose between automated, interactive or custom evaluation strategies. We are a team of engineers who love building AI products. We don't want to compromise between the power and...

Athina AI

Monitor your LLMs in production, and discover and fix hallucinations, accuracy, and quality-related errors with LLM outputs seamlessly. Evaluate your outputs for hallucinations, misinformation, quality issues, and other bad outputs. Configurable for any LLM use case. Segment your data to analyze...
