HumanEval

human-eval is a benchmark dataset and evaluation framework created by OpenAI for measuring the ability of language models to generate correct code. It consists of hand-written programming problems with unit tests, designed to assess functional correctness rather than superficial metrics like text similarity. Each task includes a natural language prompt and a function signature, requiring the model to generate an implementation that passes all provided tests. The benchmark has become a standard for evaluating code generation models, including those in the Codex and GPT families. Researchers can use the dataset to run reproducible comparisons across models and track improvements in functional code synthesis. By focusing on correctness through execution, human-eval provides a rigorous and practical way to evaluate programming capabilities in AI systems.

Features

Collection of hand-written programming problems with unit tests
Measures functional correctness of generated code
Includes natural language prompts and function signatures
Standard benchmark for evaluating code generation models
Enables reproducible comparisons across different models
Widely used in assessing Codex, GPT, and other code models

Project Samples

Project Activity

See All Activity >

License

MIT License

Follow HumanEval

HumanEval Web Site

Other Useful Business Software

MongoDB Atlas runs apps anywhere

Deploy in 115+ regions with the modern database for every enterprise.

MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.

Start Free

Rate This Project

User Reviews

Be the first to post a review of HumanEval!

Additional Project Details

Operating Systems

Linux, Mac, Windows

Programming Language

Python

Related Categories

Python Artificial Intelligence Software

Registered

4 days ago

Similar Business Software

DeepEval

DeepEval is a simple-to-use, open source LLM evaluation framework, for evaluating and testing large-language model systems. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs based on metrics such as G-Eval,...

See Software
Windsurf Editor

The Windsurf Editor is a free AI-powered IDE and AI coding assistant that accelerates development by providing intelligent code generation and agents in over 70 programming languages and more than 40 IDEs, including VSCode, JetBrains, and Jupyter Notebooks. With Windsurf, developers can write...

See Software
GPT-J

GPT-J is a cutting-edge language model created by the research organization EleutherAI. In terms of performance, GPT-J exhibits a level of proficiency comparable to that of OpenAI's renowned GPT-3 model in a range of zero-shot tasks. Notably, GPT-J has demonstrated the ability to surpass GPT-3...

See Software
Traceloop

Traceloop is a comprehensive observability platform designed to monitor, debug, and test the quality of outputs from Large Language Models (LLMs). It offers real-time alerts for unexpected output quality changes, execution tracing for every request, and the ability to gradually roll out changes...

See Software
LM-Kit.NET

LM-Kit.NET is a cutting-edge, high-level inference SDK designed specifically to bring the advanced capabilities of Large Language Models (LLM) into the C# ecosystem. Tailored for developers working within .NET, LM-Kit.NET provides a comprehensive suite of powerful Generative AI tools, making...

See Software
Vertex AI

Build, deploy, and scale machine learning (ML) models faster, with fully managed ML tools for any use case. Through Vertex AI Workbench, Vertex AI is natively integrated with BigQuery, Dataproc, and Spark. You can use BigQuery ML to create and execute machine learning models in BigQuery...

See Software

Report inappropriate content

HumanEval

Code for the paper "Evaluating Large Language Models Trained on Code"

Get an email when there's a new version of HumanEval

Features

Project Samples

Project Activity

Categories

License

Follow HumanEval

User Reviews

Additional Project Details

Operating Systems

Programming Language

Related Categories

Registered