EvalPlus: auto-generated tests reveal up to ~29% lower pass rates and 11% bad 'ground-truth' in HumanEval

May 2, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The method meaningfully strengthens functional testing for Python benchmarks and shows consistent pass@k drops across 26 models, but it depends on correct ground-truth implementations and adds execution cost at scale.

Citations: 171

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, Lingming Zhang

Why It Matters For Business

Small test suites give falsely high confidence in AI-generated code; automated, larger testing exposes real failure rates and helps select safer models for production.

Summary TLDR

EvalPlus is an automated testing framework that strengthens code-generation benchmarks by combining ChatGPT-produced seed inputs with type-aware mutation and differential testing. Applied to HumanEval, EvalPlus expands the test suite by ~80× (avg 9.6 → 764.1 tests per task), exposes previously undetected incorrect model outputs that lower pass@k by up to 19.3–28.9% on the evaluated models, uncovers 18 flawed reference implementations (11% of tasks), and produces a compact mini-suite (47× smaller) that retains most of the detection power. The code and datasets are open-sourced.
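The summary does not spell out how the mini-suite is built. One common way to shrink a test suite while keeping its detection power is greedy set cover: keep picking the test that satisfies the most still-unmet requirements (covered branches, killed mutants, caught wrong samples). The sketch below illustrates that idea only; the function name, the test names, and the requirement labels are hypothetical.

```python
def reduce_suite(requirements_by_test):
    """Greedy set cover over a {test_name: set_of_requirements} mapping.

    Returns a list of test names that together satisfy every requirement
    satisfied by the full suite.
    """
    uncovered = set().union(*requirements_by_test.values())
    kept = []
    while uncovered:
        # Pick the test that covers the most requirements not yet covered.
        best = max(
            requirements_by_test,
            key=lambda t: len(requirements_by_test[t] & uncovered),
        )
        kept.append(best)
        uncovered -= requirements_by_test[best]
    return kept


# Hypothetical data: what each test covers or catches.
suite = {
    "t1": {"branch_a", "bug_1"},
    "t2": {"branch_a"},
    "t3": {"branch_b", "bug_2"},
    "t4": {"bug_1", "bug_2"},
}
mini = reduce_suite(suite)  # ["t1", "t3"]: two tests cover all four requirements
```

Greedy set cover is not guaranteed to find the smallest possible suite, but it is simple and usually gets close.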

Problem Statement

Existing code-generation benchmarks rely on too few, overly simple tests (on average fewer than 10 per problem), which lets incorrect solutions pass and hides real model weaknesses. This paper asks whether reported pass rates actually reflect functional correctness, and answers the question with a larger, automatically generated test suite plus test-suite reduction techniques.

Main Contribution

Diagnosed test insufficiency in popular code benchmarks and measured its practical impact on model rankings.

Designed EvalPlus: a test-generation pipeline that combines LLM-generated (ChatGPT) seed inputs, type-aware mutation, and differential testing against ground-truth code.
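To make the type-aware mutation step concrete, here is a minimal sketch, not EvalPlus's actual implementation. It assumes each test input is a tuple of arguments, and that a mutation keeps the argument's type while changing its value slightly. The names `mutate`, `generate_inputs`, and `is_valid` are hypothetical, and the specific mutation rules are illustrative choices.

```python
import random
import string


def mutate(value, rng):
    """Return a variant of `value` that has the same type."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return not value
    if isinstance(value, int):
        return value + rng.choice([-1, 1])
    if isinstance(value, float):
        return value + rng.choice([-1.0, 1.0])
    if isinstance(value, str):
        if value and rng.random() < 0.5:
            i = rng.randrange(len(value))
            return value[:i] + value[i + 1:]  # drop one character
        return value + rng.choice(string.ascii_lowercase)  # append one character
    if isinstance(value, list):
        if not value:
            return list(value)  # no element to infer a type from
        i = rng.randrange(len(value))
        op = rng.choice(["mutate", "drop", "dup"])
        if op == "mutate":
            return value[:i] + [mutate(value[i], rng)] + value[i + 1:]
        if op == "drop":
            return value[:i] + value[i + 1:]
        return value[:i] + [value[i]] + value[i:]  # duplicate one element
    return value  # other types are left unchanged in this sketch


def generate_inputs(seeds, budget, is_valid=lambda args: True, seed=0):
    """Grow a pool of argument tuples from seed inputs by repeated mutation."""
    rng = random.Random(seed)
    pool = [tuple(s) for s in seeds]
    if not pool:
        return []
    attempts = 0
    while len(pool) < budget and attempts < 10 * budget:
        attempts += 1
        parent = rng.choice(pool)
        child = tuple(mutate(arg, rng) for arg in parent)
        if is_valid(child):  # only inputs that satisfy the task's contract are kept
            pool.append(child)  # mutants rejoin the pool and can be mutated again
    return pool


# One seed input for a hypothetical task taking (list of ints, str).
pool = generate_inputs([([1, 2, 3], "ab")], budget=50)
```

Because mutants are added back to the pool, later mutations can drift further from the original seeds, which is how a handful of seeds can grow into hundreds of inputs.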

Key Findings

Automated augmentation increases tests per task from single digits to hundreds.

Numbers: HumanEval avg tests 9.6 → HUMANEVAL+ avg 764.1

Practical Use: Use large, diverse test suites instead of a few handcrafted tests to reduce false confidence in model correctness.

Evidence Ref: Table 2

Measured pass@k for many models falls substantially under stronger testing.

Numbers: pass@1/10/100 drops up to 19.3% / 24.9% / 28.9%

Practical Use: Expect published pass@k numbers on small test suites to overestimate real functional correctness; re-evaluate models with richer tests.

Evidence Ref: Abstract & Table 3
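For readers unfamiliar with the metric: pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated per task and c is the number that pass all tests. The sketch below assumes that estimator; the sample counts are made up for illustration and are not taken from the paper. A stricter test suite can only lower c, which is why pass@k falls.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one task: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 200 samples, 80 pass the original tests,
# but only 60 still pass once the extra tests are added.
base = pass_at_k(n=200, c=80, k=1)   # 0.4
plus = pass_at_k(n=200, c=60, k=1)   # 0.3
```

The benchmark-level score is the mean of this per-task value over all tasks.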

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
average tests per task | 764.1 | 9.6 (original HUMANEVAL) | +754.5 tests (≈80×) | HUMANEVAL → HUMANEVAL+ | EvalPlus augments HumanEval with ChatGPT seeds + mutation | Table 2
pass@1 drop (avg observed max) | up to 19.3% | pass@1 on original HUMANEVAL | -19.3% (drop vs original) | HUMANEVAL vs HUMANEVAL+ | Stronger test-suite finds previously undetected failures | Abstract & Table 3

What To Try In 7 Days

Run EvalPlus (or similar) on your internal code-gen evaluation to expand tests automatically.

Use an LLM to create structured seed inputs, then apply type-aware mutation to scale tests.

Cross-check model outputs against a validated ground-truth implementation before deployment decisions.
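The cross-checking step can be sketched as a small differential-testing harness: run the candidate and the trusted reference on the same inputs and record every disagreement. This is a minimal illustration under stated assumptions, not the EvalPlus harness. It has no sandboxing or timeouts, which untrusted generated code would need, and the helper names and example functions are hypothetical.

```python
import copy


def run(fn, args):
    """Call fn on a fresh copy of args; capture the result or the exception type."""
    try:
        return ("ok", fn(*copy.deepcopy(args)))
    except Exception as exc:
        return ("error", type(exc).__name__)


def differential_test(candidate, reference, inputs):
    """Return (args, expected, actual) for every input where the two disagree."""
    failures = []
    for args in inputs:
        status, expected = run(reference, args)
        if status == "error":
            continue  # the reference rejects this input, so treat it as invalid
        actual = run(candidate, args)
        if actual != ("ok", expected):
            failures.append((args, expected, actual[1]))
    return failures


def reference(xs):
    return sorted(set(xs))


def candidate(xs):
    return sorted(xs)  # bug: forgets to remove duplicates


inputs = [([3, 1, 2],), ([],), ([2, 2, 1],)]
failures = differential_test(candidate, reference, inputs)
# Only the input with duplicates exposes the bug.
```

Inputs are deep-copied before each call so that a function that mutates its arguments cannot affect the other run. Exact equality is used for comparison; tasks returning floats would need a tolerance instead.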

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Seed generation uses ChatGPT, which incurs cost and may reflect its biases.

EvalPlus relies on a correct ground-truth implementation for differential testing.

When Not To Use

You lack a trusted ground-truth oracle for differential testing.

You require formal proofs of correctness rather than empirical testing.

Failure Modes

False positives from invalid inputs that slip past contract checks.

Bias in ChatGPT seeds could steer mutations away from rare but important cases.
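The first failure mode is easier to see with a concrete contract. A contract here means a set of precondition checks that a generated input must satisfy before it is used as a test; any input the checks fail to reject is silently treated as valid. The sketch below is an assumed, simplified form using plain assertions. The contract and helper names are hypothetical.

```python
def contract(xs):
    """Preconditions for a hypothetical task that takes a non-empty list of ints."""
    assert isinstance(xs, list), "input must be a list"
    assert all(isinstance(x, int) for x in xs), "elements must be ints"
    assert len(xs) > 0, "list must be non-empty"


def filter_valid(inputs, contract):
    """Keep only the argument tuples that satisfy the contract."""
    valid = []
    for args in inputs:
        try:
            contract(*args)
        except AssertionError:
            continue  # contract violated: discard this input
        valid.append(args)
    return valid


generated = [([1, 2],), ([],), (["a"],), ("notalist",)]
valid = filter_valid(generated, contract)  # only ([1, 2],) survives
```

If the contract omitted the non-empty check, `([],)` would slip through and could cause a correct solution to be flagged as wrong. Note that `assert` statements are skipped when Python runs with `-O`, so a production checker would raise explicitly.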

Core Entities

Models

GPT-4, ChatGPT, WizardCoder-CodeLlama, Phind-CodeLlama, CODELLAMA, CodeGen, CodeGen2, StarCoder, INCODER, SantaCoder, VICUNA, MISTRAL, PolyCoder, GPT-J, GPT-NEO

Metrics

pass@1, pass@10, pass@100, pass@1⋆ (greedy), test count, mutation kills, branch coverage

Datasets

HUMANEVAL, HUMANEVAL+, HUMANEVAL+-MINI

Benchmarks

HUMANEVAL, HUMANEVAL+

Context Entities

Models

CodeGen, CodeGen2, StarCoder, CODET5+, StableLM

Datasets

MBPP, HUMANEVAL-X, MultiPL-E