performance testing free download

AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)

...These environments require agents to interpret instructions, take actions, and adapt their strategies based on feedback from the environment. AgentBench also includes an evaluation framework that measures success rates, rewards, and task completion performance across different agent implementations. By testing models across diverse scenarios, the benchmark highlights strengths and weaknesses in reasoning, long-term planning, and tool usage.

Downloads: 0 This Week

Last Update: 2026-03-05

langrocks

Tools like web browser, computer access and code runner for LLMs

Langrocks is a collection of tools, such as a web browser, computer access, and a code runner, that LLMs can use.

Downloads: 1 This Week

Last Update: 2024-11-21

Agent Behavior Monitoring

The open source post-building layer for agents

Agent Behavior Monitoring is an open-source framework designed to monitor, evaluate, and improve the behavior of AI agents operating in real or simulated environments. It collects interaction data and analyzes how agents perform across different scenarios and tasks. Developers can use the framework to observe agent actions in both online production environments and offline evaluation settings, making it useful for debugging and performance...

Downloads: 5 This Week

Last Update: 2026-04-09

Mosec

A high-performance ML model serving framework that offers dynamic batching

Mosec is a high-performance, flexible model-serving framework for building ML-model-enabled backends and microservices. It bridges the gap between the machine learning models you have just trained and an efficient online service API.

Downloads: 1 This Week

Last Update: 2026-04-15

Hallucination Leaderboard

Leaderboard Comparing LLM Performance at Producing Hallucinations

Hallucination Leaderboard is an open research project that tracks and compares the tendency of large language models to produce hallucinated or inaccurate information when generating summaries. The project provides a standardized benchmark that evaluates different models using a dedicated hallucination detection system known as the Hallucination Evaluation Model. Each model is tested on document summarization tasks to measure how often generated responses introduce information that is not...

Downloads: 1 This Week

Last Update: 2026-04-29

Grade School Math

8.5K high quality grade school math problems

The grade-school-math repository (sometimes called GSM8K) is a curated dataset of 8,500 high-quality grade school math word problems intended for evaluating mathematical reasoning capabilities of language models. It is structured into 7,500 training problems and 1,000 test problems. These aren’t trivial exercises — many require multi-step reasoning, combining arithmetic operations, and handling intermediate steps (e.g. “If she sold half as many in May… how many in total?”). The problems are...

Downloads: 0 This Week

Last Update: 2025-10-03

Search Results for "performance testing"

Showing 6 open source projects for "performance testing"
