performance testing free download

AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)

...These environments require agents to interpret instructions, take actions, and adapt their strategies based on feedback from the environment. AgentBench also includes an evaluation framework that measures success rates, rewards, and task completion performance across different agent implementations. By testing models across diverse scenarios, the benchmark highlights strengths and weaknesses in reasoning, long-term planning, and tool usage.

Downloads: 0 This Week

Last Update: 2026-03-05


LangWatch

The platform for LLM evaluations and AI agent testing

LangWatch is an open-source observability and monitoring platform designed to help developers evaluate and improve applications built with large language models. The platform provides tools for tracking model interactions, analyzing prompt behavior, and identifying issues such as hallucinations, latency problems, or unexpected responses. By collecting telemetry data from AI applications, LangWatch allows developers to understand how their systems perform in real-world usage scenarios. The...

Downloads: 2 This Week

Last Update: 6 days ago


Agent Behavior Monitoring

The open source post-building layer for agents

Agent Behavior Monitoring is an open-source framework designed to monitor, evaluate, and improve the behavior of AI agents operating in real or simulated environments. The system collects interaction data and analyzes how agents perform across different scenarios and tasks. Developers can use the framework to observe agent actions in both online production environments and offline evaluation settings, making it useful for debugging and performance...

Downloads: 5 This Week

Last Update: 2026-04-09


Paddler

Open-source LLM load balancer and serving platform for hosting LLMs

Paddler is an open-source LLM infrastructure platform designed to deploy, manage, and scale large language models on private infrastructure. The system acts as a specialized load balancer and serving layer for language models, enabling organizations to run inference workloads without relying on external API providers. It supports running models locally through engines such as llama.cpp while distributing requests across multiple compute nodes to improve performance and reliability. The...

Downloads: 0 This Week

Last Update: 2026-04-30


Hallucination Leaderboard

Leaderboard Comparing LLM Performance at Producing Hallucinations

Hallucination Leaderboard is an open research project that tracks and compares the tendency of large language models to produce hallucinated or inaccurate information when generating summaries. The project provides a standardized benchmark that evaluates different models using a dedicated hallucination detection system known as the Hallucination Evaluation Model. Each model is tested on document summarization tasks to measure how often generated responses introduce information that is not...

Downloads: 1 This Week

Last Update: 2026-04-29


Grade School Math

8.5K high quality grade school math problems

The grade-school-math repository (sometimes called GSM8K) is a curated dataset of 8,500 high-quality grade school math word problems intended for evaluating mathematical reasoning capabilities of language models. It is structured into 7,500 training problems and 1,000 test problems. These aren’t trivial exercises — many require multi-step reasoning, combining arithmetic operations, and handling intermediate steps (e.g. “If she sold half as many in May… how many in total?”). The problems are...

Downloads: 0 This Week

Last Update: 2025-10-03


Search Results for "performance testing"

6 projects for "performance testing" with 2 filters applied:

AgentBench

LangWatch

Agent Behavior Monitoring

Paddler

Hallucination Leaderboard

Grade School Math

