Evals
Evals is a framework for evaluating LLMs and LLM systems, as well as an open-source registry of benchmarks.
The openai/evals repository is a framework and registry for evaluating large language models and systems built with LLMs. It lets you define "evals" (evaluation tasks) in a structured way and run them against different models or agents, then score, compare, and analyze the results. The framework supports templated YAML eval definitions, solver-based evaluations, custom metrics, and composition of multi-step evaluations, and it includes utilities and APIs for plugging in your own models and completion functions.
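In the registry, an eval is typically declared in YAML, pointing at an eval class and a samples file. A minimal sketch (the names `arithmetic-basic` and `arithmetic/samples.jsonl` are hypothetical; real entries live under `evals/registry/evals/`):

```yaml
arithmetic-basic:
  id: arithmetic-basic.dev.v0
  description: Exact-match answers to simple arithmetic questions.
  metrics: [accuracy]

arithmetic-basic.dev.v0:
  class: evals.elsuite.basic.match:Match   # built-in exact-match eval class
  args:
    samples_jsonl: arithmetic/samples.jsonl  # samples with "input" and "ideal" fields
```

A run is then kicked off with the `oaieval` CLI, naming a completion function (such as a model) and an eval:

```bash
oaieval gpt-3.5-turbo arithmetic-basic
```

Custom models or agents hook in through the `CompletionFn` protocol: a callable that takes a prompt and returns a `CompletionResult` exposing the completion strings. A minimal sketch, assuming the interface in `evals/api.py` (the echo behavior is purely illustrative):

```python
from evals.api import CompletionFn, CompletionResult


class EchoResult(CompletionResult):
    """Wraps a single completion string."""

    def __init__(self, text: str):
        self.text = text

    def get_completions(self) -> list[str]:
        return [self.text]


class EchoCompletionFn(CompletionFn):
    """Toy completion function that echoes the last user message back."""

    def __call__(self, prompt, **kwargs) -> CompletionResult:
        # Prompts may be plain strings or chat-style lists of {"role": ..., "content": ...} dicts.
        if isinstance(prompt, str):
            return EchoResult(prompt)
        return EchoResult(prompt[-1]["content"])
```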