tests free download - SourceForge

HumanEval

Code for the paper "Evaluating Large Language Models Trained on Code"

human-eval is a benchmark dataset and evaluation framework created by OpenAI for measuring the ability of language models to generate correct code. It consists of hand-written programming problems with unit tests, designed to assess functional correctness rather than superficial metrics like text similarity. Each task includes a natural language prompt and a function signature, requiring the model to generate an implementation that passes all provided tests. The benchmark has become a standard for evaluating code generation models, including those in the Codex and GPT families. ...
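As a rough illustration of what "functional correctness" means here, the sketch below runs a candidate implementation against unit tests and reports pass or fail. It is not the human-eval harness: the names `passes_tests`, `PROMPT`, `TESTS`, and the `check(candidate)` layout are assumptions for illustration, and a real harness would run untrusted code in a sandbox with timeouts.

```python
def passes_tests(candidate_source: str, test_source: str, entry_point: str) -> bool:
    """Return True only if the candidate passes every assertion in the tests."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # defines the candidate function
        exec(test_source, namespace)       # defines check(candidate) (assumed layout)
        namespace["check"](namespace[entry_point])
    except Exception:
        # Any failed assertion, syntax error, or runtime error counts as a failure.
        return False
    return True


# Hypothetical task: a prompt (signature + docstring) and two model completions.
PROMPT = 'def add(a, b):\n    """Return the sum of a and b."""\n'
GOOD_COMPLETION = "    return a + b\n"
BAD_COMPLETION = "    return a - b\n"
TESTS = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(-1, 1) == 0\n"
)

good = passes_tests(PROMPT + GOOD_COMPLETION, TESTS, "add")
bad = passes_tests(PROMPT + BAD_COMPLETION, TESTS, "add")
print(good, bad)
```

The second completion reads much like the first as text, yet it fails the tests, which is the distinction from similarity-based metrics that the description draws.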

Downloads: 3 This Week

Last Update: 6 days ago

rtk

CLI proxy that reduces LLM token consumption

rtk is an open-source command-line proxy designed to optimize interactions between AI coding agents and the terminal by reducing unnecessary token consumption. When AI assistants execute shell commands during software development tasks, the resulting terminal output often contains large amounts of repetitive or irrelevant information that can overwhelm the model’s context window. rtk intercepts these command outputs and compresses them into concise summaries before sending them to the...
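To make the idea of compressing repetitive terminal output concrete, here is a minimal sketch that collapses runs of identical consecutive lines. It is one simple technique chosen for illustration, not rtk's actual algorithm; `collapse_repeats` and the sample output are hypothetical.

```python
from itertools import groupby


def collapse_repeats(output: str) -> str:
    """Collapse runs of identical consecutive lines into one annotated line."""
    compact = []
    for line, group in groupby(output.splitlines()):
        count = sum(1 for _ in group)
        compact.append(line if count == 1 else f"{line}  [repeated {count}x]")
    return "\n".join(compact)


# Hypothetical command output with a repeated warning.
raw = (
    "Compiling...\n"
    "warning: unused variable\n"
    "warning: unused variable\n"
    "warning: unused variable\n"
    "Done"
)
summary = collapse_repeats(raw)
print(summary)
```

Fewer lines reach the model, while the annotation keeps the information that the warning occurred several times.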

Downloads: 44 This Week

Last Update: 18 hours ago

Agentless

An agentless approach to automatically solving software development problems

...It then generates multiple candidate patches for the identified locations using language model reasoning and diff-style edits. In the final stage, the framework validates potential patches by running regression tests and additional reproduction tests to confirm whether the fix resolves the original error. Based on these results, the system ranks the candidate patches and selects the most reliable solution to submit.
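The validation and ranking stage described above could be sketched as follows. The description does not state the actual ranking criterion, so the ordering used here (reproduction test first, then regression pass rate) is an assumption, and `PatchResult` and `rank_patches` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PatchResult:
    patch_id: str
    regression_passed: int      # regression tests that still pass with the patch
    regression_total: int       # regression tests that were run
    reproduction_passed: bool   # whether the reproduction test no longer fails


def rank_patches(results: list[PatchResult]) -> list[PatchResult]:
    """Order candidates best-first (assumed criterion, see lead-in)."""
    def score(result: PatchResult) -> tuple[bool, float]:
        rate = (
            result.regression_passed / result.regression_total
            if result.regression_total
            else 0.0
        )
        return (result.reproduction_passed, rate)

    return sorted(results, key=score, reverse=True)


candidates = [
    PatchResult("p1", regression_passed=10, regression_total=10, reproduction_passed=False),
    PatchResult("p2", regression_passed=9, regression_total=10, reproduction_passed=True),
    PatchResult("p3", regression_passed=10, regression_total=10, reproduction_passed=True),
]
ranked = rank_patches(candidates)
best = ranked[0]
print([r.patch_id for r in ranked])
```

Here a patch that resolves the original error without breaking existing behaviour is preferred over one that only keeps the regression suite green.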

Downloads: 0 This Week

Last Update: 2026-03-06

LLM Colosseum

Benchmark LLMs by fighting in Street Fighter 3

...The system places language models inside the environment of the classic video game Street Fighter III, where they must interpret the game state and decide which actions to perform during combat. This setup creates a dynamic environment that tests reasoning, situational awareness, and decision-making abilities in real time. Instead of relying purely on reward signals as in reinforcement learning agents, the models analyze contextual information and generate strategic actions based on the game environment. Performance is evaluated using a competitive ranking system that assigns models an Elo rating based on their results across matches against other models.
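For readers unfamiliar with Elo ratings, the sketch below shows the standard Elo update after a single match. The K-factor of 32 and the starting rating of 1500 are conventional choices assumed for illustration; the listing does not say which parameters the project uses.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return both new ratings; score_a is 1 for a win, 0.5 for a draw, 0 for a loss."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two models start level at an assumed rating of 1500; model A wins the match.
new_a, new_b = update_elo(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)  # 1516.0 1484.0
```

Because the winner's gain equals the loser's loss, the total rating in the pool stays constant, and an upset against a higher-rated opponent moves both ratings further than an expected result does.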

Downloads: 6 This Week

Last Update: 2026-03-07

GLM-4.5

GLM-4.5: Open-source LLM for intelligent agents by Z.ai

GLM-4.5 is a cutting-edge open-source large language model designed by Z.ai for intelligent agent applications. The flagship GLM-4.5 model has 355 billion total parameters with 32 billion active parameters, while the compact GLM-4.5-Air version offers 106 billion total parameters and 12 billion active parameters. Both models unify reasoning, coding, and intelligent agent capabilities, providing two modes: a thinking mode for complex reasoning and tool usage, and a non-thinking mode for...

1 Review

Downloads: 37 This Week

Last Update: 2026-02-01

promptmap2

A security scanner for custom LLM applications

promptmap is an automated security scanner for custom LLM applications that focuses on prompt injection and related attack classes. The project supports both white-box and black-box testing, which means it can either run tests directly against a known model and system prompt configuration or attack an external HTTP endpoint without internal access. Its scanning workflow uses a dual-LLM architecture in which one model acts as the target being tested and another acts as a controller that evaluates whether an attack succeeded. The repository emphasizes broad coverage, including test rules for prompt stealing, jailbreaks, harmful content generation, hate-related outputs, social bias, and distraction attacks. ...

Downloads: 0 This Week

Last Update: 2026-03-10

Empirical

Test and evaluate LLMs and model configurations

Empirical is a tool for testing different LLMs and model configurations across the scenarios that matter for your application.

Downloads: 0 This Week

Last Update: 2024-11-13

OSS-Fuzz Gen

LLM powered fuzzing via OSS-Fuzz

OSS-Fuzz-Gen is a companion project that helps automatically create or improve fuzz targets for open-source codebases, aiming to increase coverage in OSS-Fuzz with minimal maintainer effort. It analyses a library’s APIs, examples, and tests to propose harnesses that exercise parsers, decoders, or protocol handlers—precisely the code where fuzzing pays off. The system integrates with modern LLM-assisted workflows to draft harness code and then iterates based on build errors or low coverage signals. Importantly, it aligns with OSS-Fuzz conventions, generating corpus seeds, build rules, and sanitizer settings so projects can plug in quickly. ...

Downloads: 0 This Week

Last Update: 2025-10-12

Search Results for "tests"

Showing 8 open source projects for "tests"
