test free download - SourceForge

Empirical

Test and evaluate LLMs and model configurations

Empirical is the fastest way to test different LLMs and model configurations, across all the scenarios that matter for your application.

Downloads: 0 This Week

Last Update: 2024-11-13


promptfoo

Evaluate and compare LLM outputs, catch regressions, improve prompts

...Use built-in metrics, LLM-graded evals, or define your own custom metrics. Compare prompts and model outputs side-by-side, or integrate the library into your existing test/CI workflow. Use OpenAI, Anthropic, and open-source models like Llama and Vicuna, or integrate custom API providers for any LLM API.

Downloads: 2 This Week

Last Update: 2 days ago


Opik

Open-source end-to-end LLM Development Platform

...Run experiments with different prompts and evaluate against a test set. Choose and run pre-configured evaluation metrics or define your own with our convenient SDK library. Consult built-in LLM judges for complex issues like hallucination detection, factuality, and moderation.

Downloads: 3 This Week

Last Update: 1 day ago


Langflow

Low-code app builder for RAG and multi-agent AI applications

Langflow is a low-code app builder for RAG and multi-agent AI applications. It’s Python-based and agnostic to any model, API, or database.

Downloads: 6 This Week

Last Update: 2 days ago


langrocks

Tools like web browser, computer access and code runner for LLMs

Langrocks is a collection of tools for LLMs, such as a web browser, computer access, and a code runner.

Downloads: 0 This Week

Last Update: 2024-11-21


MiniMax-M1

Open-weight, large-scale hybrid-attention reasoning model

MiniMax-M1 is presented as the world’s first open-weight, large-scale hybrid-attention reasoning model, designed to push the frontier of long-context, tool-using, and deeply “thinking” language models. It is built on the MiniMax-Text-01 foundation and keeps the same massive parameter budget, but reworks the attention and training setup for better reasoning and test-time compute scaling. Architecturally, it combines Mixture-of-Experts layers with lightning attention, enabling the model to support a native context length of 1 million tokens while using far fewer FLOPs than comparable reasoning models for very long generations. The team emphasizes efficient scaling of test-time compute: at 100K-token generation lengths, M1 reportedly uses only about 25 percent of the FLOPs of some competing models, making extended “think step” traces more feasible. ...

Downloads: 0 This Week

Last Update: 2025-12-01


Automated Interpretability

Code for the "Language models can explain neurons in language models" paper

...It includes a “neuron explainer” component that, given a target neuron or latent feature, proposes natural language explanations or heuristics (e.g. “this neuron activates when the input has property X”) and then simulates activation behavior across example inputs to test whether the explanation holds. The project also contains a “neuron viewer” web component for browsing neurons, explanations, and activation patterns, making it more interactive and exploratory.

Downloads: 2 This Week

Last Update: 2025-10-03


Pezzo

Open-source, developer-first LLMOps platform

Pezzo enables you to build, test, monitor, and instantly ship AI all in one platform, while constantly optimizing for cost and performance. It is packed with features that streamline your workflow, so you can focus on what matters. Pezzo is a fully cloud-native and open-source LLMOps platform. Seamlessly observe and monitor your AI operations, troubleshoot issues, save up to 90% on costs and latency, collaborate and manage your prompts in one place, and instantly deliver AI changes.

Downloads: 0 This Week

Last Update: 2024-11-13


Streamline Analyst

AI agent that streamlines the entire process of data analysis

...This Data Analysis Agent automates tasks such as data cleaning and preprocessing, as well as more complex operations like identifying target objects, partitioning test sets, and selecting the best-fit models based on your data. With Streamline Analyst, results visualization and evaluation become seamless.

Downloads: 0 This Week

Last Update: 2024-09-23


Cake

Distributed LLM and StableDiffusion inference

...Unlike many simple proxies, Cake can act as a full connection broker: it can bind to arbitrary interfaces, handle simultaneous upstream/downstream sessions, and apply traffic rules on the fly. This makes it suitable for troubleshooting tricky network behavior, simulating network conditions, or chaining services in a modular test environment.

Downloads: 0 This Week

Last Update: 2025-12-12


BIG-bench

Beyond the Imitation Game collaborative benchmark for measuring

BIG-bench (Beyond the Imitation Game Benchmark) is a large, collaborative benchmark suite designed to probe the capabilities and limitations of large language models across hundreds of diverse tasks. Rather than focusing on a single metric or domain, it aggregates many hand-authored tasks that test reasoning, commonsense, math, linguistics, ethics, and creativity. Tasks are intentionally heterogeneous: some are multiple-choice with exact scoring, others are free-form generation judged by model-based or human evaluation. The suite provides a common JSON task format and an evaluation harness so research groups can contribute new tasks and reproduce results consistently. ...
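As a rough illustration of the "multiple-choice with exact scoring" case described above, the following sketch scores a toy task by exact match. The task dictionary, its field names (`input`, `choices`, `target`), and the `always_second` stand-in model are hypothetical simplifications for illustration; they are not the official BIG-bench JSON schema or evaluation harness.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical, simplified task; NOT the official BIG-bench JSON schema.
task = {
    "name": "toy_arithmetic",
    "examples": [
        {"input": "2 + 2 =", "choices": ["3", "4", "5"], "target": "4"},
        {"input": "3 * 3 =", "choices": ["6", "9", "12"], "target": "9"},
    ],
}


def exact_match_accuracy(task: dict, model: Callable[[str, list[str]], str]) -> float:
    """Fraction of examples where the model's chosen answer equals the target."""
    examples = task["examples"]
    correct = sum(model(ex["input"], ex["choices"]) == ex["target"] for ex in examples)
    return correct / len(examples)


def always_second(prompt: str, choices: list[str]) -> str:
    """Stand-in 'model' that always picks the second choice (illustration only)."""
    return choices[1]


print(exact_match_accuracy(task, always_second))
```

A real harness would call an actual language model in place of `always_second`; free-form generation tasks need a different scorer than exact match.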

Downloads: 1 This Week

Last Update: 2025-10-09


Aviary

Ray Aviary - evaluate multiple LLMs easily

Aviary is an LLM serving solution that makes it easy to deploy and manage a variety of open source LLMs. It provides an extensive suite of pre-configured open source LLMs, with defaults that work out of the box, and supports Transformer models hosted on Hugging Face Hub or present on local disk. Aviary has native support for autoscaling and multi-node deployments thanks to Ray and Ray Serve. Aviary can scale to zero and create new model replicas (each composed of multiple GPU workers) in...

Downloads: 0 This Week

Last Update: 2024-01-18


unit-minions

AI R&D Efficiency Improvement Research: Do-It-Yourself Training LoRA

"AI R&D Efficiency Improvement Research: Do-It-Yourself Training LoRA", including LoRA training for Llama (Alpaca LoRA) and ChatGLM (ChatGLM Tuning) models. Training content: user story generation, test code generation, code-assisted generation, text-to-SQL, and text-to-code generation.

Downloads: 0 This Week

Last Update: 2023-08-25


LLaMA.go

llama.go is like llama.cpp in pure Golang

llama.go is like llama.cpp in pure Golang. The project's code is based on Georgi Gerganov's legendary ggml.cpp framework, written in C++, and shares its attitude to performance and elegance. Both models store FP32 weights, so you'll need at least 32 GB of RAM (not VRAM or GPU RAM) for LLaMA-7B. Double that to 64 GB for LLaMA-13B.
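The RAM figures follow from simple arithmetic: an FP32 weight occupies 4 bytes. The sketch below estimates the weight footprint, assuming nominal parameter counts of 7 billion and 13 billion (the exact counts are not given in the listing) and ignoring activations and runtime overhead, which is why the stated requirements are higher than the raw weight sizes.

```python
BYTES_PER_FP32 = 4  # one 32-bit float


def fp32_weight_gb(n_params: float) -> float:
    """Decimal gigabytes needed to hold n_params FP32 weights."""
    return n_params * BYTES_PER_FP32 / 1e9


# Nominal parameter counts (assumption; actual model sizes differ slightly).
for name, n_params in [("LLaMA-7B", 7e9), ("LLaMA-13B", 13e9)]:
    print(f"{name}: ~{fp32_weight_gb(n_params):.0f} GB of weights")
```

Under these assumptions the weights alone come to about 28 GB and 52 GB, which is consistent with the 32 GB and 64 GB recommendations.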

Downloads: 1 This Week

Last Update: 2023-08-25


PromptCraft-Robotics

Community for applying LLMs to robotics and a robot simulator

The PromptCraft-Robotics repository serves as a community for people to test and share interesting prompting examples for large language models (LLMs) within the robotics domain. We also provide a sample robotics simulator (built on Microsoft AirSim) with ChatGPT integration for users to get started. We currently focus on OpenAI's ChatGPT, but we also welcome examples from other LLMs (for example open-sourced models or others with API access such as GPT-3 and Codex).

Downloads: 0 This Week

Last Update: 2023-08-25


DomE

Implements a reference architecture for creating information systems

DomE Experiment is an implementation of a reference architecture for creating information systems from the automated evolution of the domain model. The architecture comprises elements that guarantee user access through automatically generated interfaces for various devices, integration with external information sources, data and operations security, automatic generation of analytical information, and automatic control of business processes. All these features are generated from the domain...

Downloads: 0 This Week

Last Update: 2023-03-22


Grade School Math

8.5K high quality grade school math problems

The grade-school-math repository (sometimes called GSM8K) is a curated dataset of 8,500 high-quality grade school math word problems intended for evaluating mathematical reasoning capabilities of language models. It is structured into 7,500 training problems and 1,000 test problems. These aren’t trivial exercises — many require multi-step reasoning, combining arithmetic operations, and handling intermediate steps (e.g. “If she sold half as many in May… how many in total?”). The problems are written by human authors (not automatically generated) to ensure linguistic variety and realism. The repository maintains strict formatting (e.g. ...

Downloads: 0 This Week

Last Update: 2025-10-03


Search Results for "test"

Showing 17 open source projects for "test"
