scoring free download

Prometheus-Eval

Evaluate your LLM's response with Prometheus and GPT4

...It also provides training data and utilities for fine-tuning evaluator models so they can assess outputs according to custom scoring rubrics such as helpfulness, accuracy, or style.

Downloads: 0 This Week

Last Update: 2026-03-09

WebGLM

An Efficient Web-enhanced Question Answering System

...WebGLM introduces several components that coordinate this process, including a retrieval module that selects relevant web documents, a generator that produces answers, and a scoring system that evaluates the quality of generated responses. The architecture aims to improve the reliability and usefulness of AI systems that answer questions about current or external knowledge sources.

Downloads: 0 This Week

Last Update: 2026-03-06

VLMEvalKit

Open-source evaluation toolkit of large multi-modality models (LMMs)

VLMEvalKit is an open-source evaluation toolkit designed for benchmarking large vision-language models that combine visual understanding with natural language reasoning. The toolkit provides a unified framework that allows researchers and developers to evaluate multimodal models across a wide range of datasets and standardized benchmarks with minimal setup. Instead of requiring complex data preparation pipelines or multiple repositories for each benchmark, the system enables evaluation...

Downloads: 1 This Week

Last Update: 2026-03-05

Agent Behavior Monitoring

The open source post-building layer for agents

Agent Behavior Monitoring is an open-source framework designed to monitor, evaluate, and improve the behavior of AI agents operating in real or simulated environments. It collects interaction data and analyzes how agents perform across different scenarios and tasks. Developers can use the framework to observe agent actions in both online production environments and offline evaluation settings, making it useful for debugging and performance...

Downloads: 1 This Week

Last Update: 2026-05-27

LLM-Pruner

On the Structural Pruning of Large Language Models

LLM-Pruner is an open-source framework designed to compress large language models through structured pruning techniques while maintaining their general capabilities. Large language models often require enormous computational resources, making them expensive to deploy and inefficient for many practical applications. LLM-Pruner addresses this issue by identifying and removing non-essential components within transformer architectures, such as redundant attention heads or feed-forward...

Downloads: 0 This Week

Last Update: 2026-03-09

uqlm

Uncertainty Quantification for Language Models, a Python package

UQLM is a Python library developed to detect hallucinations and quantify uncertainty in the outputs of large language models. The system implements a variety of uncertainty quantification techniques that assign confidence scores to model responses. These scores help developers determine how likely a generated answer is to contain errors or fabricated information. The library includes both black-box and white-box approaches to uncertainty estimation. Black-box methods evaluate model outputs...
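
To make the idea of a confidence score concrete, here is a minimal sketch of one common black-box heuristic: sample several responses to the same prompt and treat the share that agree with the majority answer as a rough confidence. This is an illustrative assumption, not UQLM's API or necessarily one of its methods; the function name `consistency_confidence` and the sample data are hypothetical.

```python
from collections import Counter


def consistency_confidence(responses):
    """Return (majority_answer, share of sampled responses that agree with it).

    Hypothetical helper for illustration only; it is not part of UQLM.
    """
    if not responses:
        raise ValueError("need at least one sampled response")
    # Normalize lightly so trivial formatting differences do not count as disagreement.
    normalized = [r.strip().lower() for r in responses]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)


# Four hypothetical samples for the same prompt; three of them agree.
samples = ["Paris", "paris ", "Lyon", "Paris"]
answer, confidence = consistency_confidence(samples)
print(answer, confidence)
```

A low score suggests the model is inconsistent on that prompt, so the answer may deserve a second look. Exact string matching is a deliberately crude agreement test; free-form answers would likely need a semantic comparison instead.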

Downloads: 0 This Week

Last Update: 2026-06-08

Hallucination Leaderboard

Leaderboard Comparing LLM Performance at Producing Hallucinations

Hallucination Leaderboard is an open research project that tracks and compares the tendency of large language models to produce hallucinated or inaccurate information when generating summaries. The project provides a standardized benchmark that evaluates different models using a dedicated hallucination detection system known as the Hallucination Evaluation Model. Each model is tested on document summarization tasks to measure how often generated responses introduce information that is not...

Downloads: 0 This Week

Last Update: 2026-05-11

Empirical

Test and evaluate LLMs and model configurations

Empirical is the fastest way to test different LLMs and model configurations, across all the scenarios that matter for your application.

Downloads: 0 This Week

Last Update: 2024-11-13

Automated Interpretability

Code for the "Language models can explain neurons in language models" paper

The automated-interpretability repository implements tools and pipelines for automatically generating, simulating, and scoring explanations of neuron (or latent feature) behavior in neural networks. Instead of relying purely on manual, ad hoc interpretability probing, this repo aims to scale interpretability by using algorithmic methods that produce candidate explanations and assess their quality. It includes a “neuron explainer” component that, given a target neuron or latent feature, proposes natural language explanations or heuristics (e.g. ...

Downloads: 0 This Week

Last Update: 2025-10-03

BIG-bench

Beyond the Imitation Game collaborative benchmark for measuring

...Rather than focusing on a single metric or domain, it aggregates many hand-authored tasks that test reasoning, commonsense, math, linguistics, ethics, and creativity. Tasks are intentionally heterogeneous: some are multiple-choice with exact scoring, others are free-form generation judged by model-based or human evaluation. The suite provides a common JSON task format and an evaluation harness so research groups can contribute new tasks and reproduce results consistently. It emphasizes robustness analysis—looking at scale trends, calibration, and areas where models systematically fail—to guide model development beyond raw accuracy. ...
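
As a rough illustration of multiple-choice tasks with exact scoring, the sketch below computes exact-match accuracy over a small JSON-style task. The field names (`examples`, `input`, `target_scores`) and the function `exact_match_accuracy` are assumptions made for this example; this is not the BIG-bench evaluation harness.

```python
def exact_match_accuracy(examples, predictions):
    """Fraction of examples whose prediction equals the highest-scored choice.

    Hypothetical helper for illustration; field names are assumed, not official.
    """
    if len(examples) != len(predictions):
        raise ValueError("need exactly one prediction per example")
    if not examples:
        return 0.0
    correct = 0
    for example, predicted in zip(examples, predictions):
        scores = example["target_scores"]
        # Treat the choice with the highest target score as the correct answer.
        best_choice = max(scores, key=scores.get)
        if predicted.strip() == best_choice:
            correct += 1
    return correct / len(examples)


# A tiny hypothetical task with two multiple-choice examples.
task = {
    "examples": [
        {"input": "2 + 2 =", "target_scores": {"4": 1, "5": 0}},
        {"input": "Capital of France?", "target_scores": {"Paris": 1, "Rome": 0}},
    ]
}
print(exact_match_accuracy(task["examples"], ["4", "Rome"]))  # one of two correct
```

Free-form generation tasks do not fit this scheme, since there is no single string to match against. That is why the description mentions model-based or human evaluation for them.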

Downloads: 0 This Week

Last Update: 2025-10-09

Search Results for "scoring"

Showing 10 open source projects for "scoring"
