EvalPlus: auto-generated tests reveal up to ~29% lower pass rates and 11% bad 'ground-truth' in HumanEval

Overview

Decision Snapshot: Needs Validation

The method meaningfully strengthens functional testing for Python benchmarks and shows consistent drops across 26 models, but it depends on correct ground-truths and has costs for large-scale execution.

Citations: 171

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, Lingming Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Small test suites give falsely high confidence in AI-generated code; automated, larger testing exposes real failure rates and helps select safer models for production.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

EvalPlus is an automated testing framework that strengthens code-generation benchmarks by combining ChatGPT-produced seed inputs with type-aware mutation and differential testing. Applied to HumanEval, EvalPlus expands the test suite by ~80× (avg 9.6 → 764.1 tests per task), catches previously undetected incorrect model outputs, which lowers pass@k by up to 19.3–28.9% across the evaluated models, uncovers 18 faulty reference implementations (11% of tasks), and produces a compact mini-suite (47× smaller) that retains most of the detection power. The code and datasets are open-sourced.
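
One common way to build a compact mini-suite like the one mentioned above is greedy set cover: repeatedly keep the test that satisfies the most still-unmet requirements, such as a covered branch or a killed mutant. The sketch below is illustrative only. The function name `reduce_suite`, the requirement labels, and the toy data are hypothetical and are not taken from the EvalPlus code base.

```python
from typing import Dict, Hashable, List, Set


def reduce_suite(covers: Dict[str, Set[Hashable]]) -> List[str]:
    """Greedy set cover: keep tests until every requirement is satisfied.

    `covers` maps a test id to the set of requirements that test satisfies
    (hypothetical labels such as covered branches or killed mutants).
    """
    remaining: Set[Hashable] = set().union(*covers.values())
    kept: List[str] = []
    while remaining:
        # Pick the test that satisfies the most unmet requirements.
        # Iterating over sorted ids makes tie-breaking deterministic.
        best = max(sorted(covers), key=lambda t: len(covers[t] & remaining))
        kept.append(best)
        remaining -= covers[best]
    return kept


# Toy data (hypothetical): four tests and the requirements each one meets.
toy = {
    "t1": {"branch_a", "branch_b"},
    "t2": {"branch_b"},
    "t3": {"branch_c", "mutant_1"},
    "t4": {"branch_a", "branch_b", "mutant_1"},
}
print(reduce_suite(toy))  # ['t4', 't3']
```

Here two of the four tests preserve every requirement the full toy suite met. Greedy set cover is an approximation, so the kept set is small but not guaranteed to be the minimum.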

Problem Statement

Existing code-generation benchmarks use too few and too-simple tests (on average <10 per problem), which lets incorrect solutions pass and hides real model weaknesses. This paper asks whether reported pass rates actually reflect functional correctness and proposes a larger, automated test suite plus reduction techniques to answer that.

Main Contribution

Diagnosed test insufficiency in popular code benchmarks and measured its practical impact on model rankings.

Designed EvalPlus: a test-generation pipeline that combines LLM-generated (ChatGPT) seed inputs, type-aware mutation, and differential testing against ground-truth code.
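
The pipeline described above can be sketched in a few lines: start from seed inputs, mutate each argument according to its type, and flag any input on which a candidate solution disagrees with the reference. This is a minimal sketch under assumptions, not the EvalPlus implementation. The names `mutate` and `differential_test`, the specific mutation rules, the toy task, and the choice to skip inputs on which the reference raises are all hypothetical.

```python
import copy
import random
from typing import Any, Callable, List, Sequence


def mutate(value: Any, rng: random.Random) -> Any:
    """Return a mutated copy of `value`; the mutation depends on its type."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return not value
    if isinstance(value, int):
        return rng.choice([value + 1, value - 1, -value, 0])
    if isinstance(value, float):
        return value + rng.uniform(-1.0, 1.0)
    if isinstance(value, str):
        if value and rng.random() < 0.5:
            i = rng.randrange(len(value))
            return value[:i] + value[i + 1:]  # drop one character
        return value + rng.choice("ab ")  # or append one
    if isinstance(value, list):
        if value and rng.random() < 0.5:
            i = rng.randrange(len(value))  # mutate one element in a copy
            return value[:i] + [mutate(value[i], rng)] + value[i + 1:]
        return value + [mutate(value[-1], rng) if value else 0]  # or grow
    return value  # unknown types are left unchanged


def differential_test(candidate: Callable, reference: Callable,
                      seeds: Sequence[Sequence[Any]],
                      rounds: int = 200, seed: int = 0) -> List[list]:
    """Return the generated argument lists on which candidate != reference."""
    rng = random.Random(seed)
    pool = [list(args) for args in seeds]
    failures: List[list] = []
    if not pool:
        return failures
    for _ in range(rounds):
        args = [mutate(a, rng) for a in rng.choice(pool)]
        try:
            expected = reference(*copy.deepcopy(args))
        except Exception:
            continue  # assumption: an input the reference rejects is invalid
        pool.append(args)  # valid inputs become seeds for later rounds
        try:
            ok = candidate(*copy.deepcopy(args)) == expected
        except Exception:
            ok = False  # a crash on a valid input counts as a failure
        if not ok:
            failures.append(args)
    return failures


# Toy task (hypothetical): largest absolute value in a list of ints.
def reference_max_abs(xs):
    return max((abs(x) for x in xs), default=0)


def buggy_max_abs(xs):
    return max(xs, default=0)  # bug: ignores the sign of negative numbers


seeds = [[[1, 2, 3]]]  # one seed call with a single list argument
failures = differential_test(buggy_max_abs, reference_max_abs, seeds)
print(len(failures), "disagreements found")
```

The buggy candidate passes the seed input, so any disagreement reported comes from mutated inputs. Fixing the random seed keeps a run repeatable.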

Key Findings

Automated augmentation increases tests per task from single-digit to hundreds.

Numbers: HumanEval avg tests 9.6 → HUMANEVAL+ avg 764.1

Practical Use: Use large, diverse test suites instead of a few handcrafted tests to reduce false confidence in model correctness.

Evidence Ref: Table 2

Measured pass@k for many models falls substantially under stronger testing.

Numbers: pass@1/10/100 drops up to 19.3% / 24.9% / 28.9%

Practical Use: Expect published pass@k numbers on small test suites to overestimate real functional correctness; re-evaluate models with richer tests.

Evidence Ref: Abstract & Table 3
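
For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: with n samples per task of which c pass all tests, pass@k = 1 − C(n−c, k) / C(n, k). The sketch below shows how a stricter test suite, which can only lower c, lowers the estimate. The sample counts are made up for illustration and are not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which pass all tests."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts for one task: 10 samples, 5 pass the original tests,
# but only 3 still pass once the extra tests are added.
n = 10
original = pass_at_k(n, c=5, k=1)
stricter = pass_at_k(n, c=3, k=1)
print(f"pass@1 original: {original:.2f}, stricter: {stricter:.2f}")
```

Benchmark-level pass@k is the mean of this per-task value over all tasks.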

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
average tests per task	764.1	9.6 (original HUMANEVAL)	+754.5 tests (≈80×)	HUMANEVAL → HUMANEVAL+	EvalPlus augments HumanEval with ChatGPT seeds + mutation	Table 2
pass@1 drop (maximum observed)	up to 19.3%	pass@1 on original HUMANEVAL	-19.3% (drop vs original)	HUMANEVAL vs HUMANEVAL+	Stronger test suite finds previously undetected failures	Abstract & Table 3

What To Try In 7 Days

Run EvalPlus (or similar) on your internal code-gen evaluation to expand tests automatically.

Use an LLM to create structured seed inputs, then apply type-aware mutation to scale tests.

Cross-check model outputs against a validated ground-truth implementation before deployment decisions.

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/evalplus/evalplus

Data URLs

https://github.com/evalplus/evalplus

Risks & Boundaries

Limitations

Seed generation uses ChatGPT which incurs cost and may reflect its biases.

EvalPlus relies on a correct ground-truth implementation for differential testing.

When Not To Use

You lack a trusted ground-truth oracle for differential testing.

You require formal proofs of correctness rather than empirical testing.

Failure Modes

False positives from invalid inputs that slip past contract checks.

Bias in ChatGPT seeds could steer mutations away from rare but important cases.
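
The first failure mode above is easier to see with a concrete contract check. In this sketch a contract is a precondition function that raises `AssertionError` on an invalid input, and generated inputs are filtered through it before they reach differential testing. The task, the contract, and the helper name `filter_valid` are hypothetical. A contract that is too weak lets an invalid input through, and any disagreement on that input is then a spurious failure.

```python
from typing import Any, Callable, Iterable, List, Tuple


def contract_non_negative_int(n: Any) -> None:
    """Hypothetical precondition for a task that takes a non-negative int.

    Note: plain `assert` statements are stripped under `python -O`, so a
    production check would raise an explicit exception instead.
    """
    assert isinstance(n, int) and not isinstance(n, bool), "n must be an int"
    assert n >= 0, "n must be non-negative"


def filter_valid(inputs: Iterable[Tuple[Any, ...]],
                 contract: Callable[..., None]) -> List[Tuple[Any, ...]]:
    """Keep only the argument tuples that satisfy the contract."""
    valid = []
    for args in inputs:
        try:
            contract(*args)
        except AssertionError:
            continue  # invalid input: drop it instead of testing on it
        valid.append(args)
    return valid


generated = [(4,), (-1,), (2.5,), (True,), (0,)]
print(filter_valid(generated, contract_non_negative_int))  # [(4,), (0,)]
```

If the `isinstance(n, bool)` clause were missing, `True` would pass the contract, which is the kind of gap that produces false positives.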

Core Entities

Models

GPT-4, ChatGPT, WizardCoder-CodeLlama, Phind-CodeLlama, CODELLAMA, CodeGen, CodeGen2, StarCoder, INCODER, SantaCoder, VICUNA, MISTRAL, PolyCoder, GPT-J, GPT-NEO

Metrics

pass@1, pass@10, pass@100, pass@1⋆ (greedy), test count, mutation kills, branch coverage

Datasets

HUMANEVAL, HUMANEVAL+, HUMANEVAL+-MINI

Benchmarks

HUMANEVAL, HUMANEVAL+

Context Entities

Models

CodeGen, CodeGen2, StarCoder, CODET5+, StableLM

Datasets

MBPP, HUMANEVAL-X, MultiPL-E
