EvalPlus: auto-generated tests reveal up to ~29% lower pass rates and 11% bad 'ground-truth' in HumanEval

May 2, 2023 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

171

Authors

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, Lingming Zhang

Links

Abstract / PDF

Why It Matters For Business

Small test suites give falsely high confidence in AI-generated code; automated, larger testing exposes real failure rates and helps select safer models for production.

Summary TLDR

EvalPlus is an automated testing framework that strengthens code-generation benchmarks by combining ChatGPT-produced seed inputs with type-aware mutation and differential testing. Applied to HumanEval, EvalPlus expands the test suite by ~80× (avg 9.6 → 764.1 tests per task), catches previously undetected incorrect model outputs that lower pass@k by up to 19.3–28.9% on the evaluated models, uncovers 18 defective reference implementations (11% of tasks), and produces a compact mini-suite (~47× smaller) that retains most of the detection power. The code and datasets are open-sourced.

Problem Statement

Existing code-generation benchmarks rely on too few tests (fewer than 10 per problem on average), and those tests are often too simple. As a result, incorrect solutions pass and real model weaknesses stay hidden. This paper asks whether reported pass rates actually reflect functional correctness, and answers the question with a much larger, automatically generated test suite plus test-suite reduction techniques.

Main Contribution

Diagnosed test insufficiency in popular code benchmarks and measured its practical impact on model rankings.

Designed EvalPlus: a test-generation pipeline that uses LLM (ChatGPT) seed inputs, type-aware mutation, and differential testing against ground-truth code.

Built HUMANEVAL+ (avg 764.1 tests/task, ~80× larger) and HUMANEVAL+-MINI (avg 16.1 tests/task, ~47× smaller) and fixed buggy ground-truths.

Evaluated 26 LLMs across sampling and greedy modes; showed maximum pass@k drops of 19.3% (pass@1), 24.9% (pass@10), and 28.9% (pass@100), along with changes in model ranking.

Released tools, tests and generated code to the public repository for reproducible evaluation.
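
The type-aware mutation step listed above can be illustrated with a minimal sketch. The mutation rules below (plus or minus one for numbers, dropping or appending a character for strings, mutating or duplicating an element for lists) are illustrative assumptions, not the exact rules used by EvalPlus, and the names `mutate` and `generate_inputs` are hypothetical rather than part of the released tooling.

```python
import random

def mutate(value, rng):
    """Return a variant of `value` that keeps its type (hypothetical rules)."""
    if isinstance(value, bool):            # bool must be checked before int
        return not value
    if isinstance(value, int):
        return value + rng.choice([-1, 1])
    if isinstance(value, float):
        return value + rng.choice([-1.0, 1.0])
    if isinstance(value, str):
        if value and rng.random() < 0.5:
            i = rng.randrange(len(value))
            return value[:i] + value[i + 1:]   # drop one character
        return value + rng.choice("abc ")      # append one character
    if isinstance(value, list):
        if not value:
            return value                       # nothing to mutate in an empty list
        if rng.random() < 0.5:
            i = rng.randrange(len(value))
            # recursively mutate a single element, keeping the rest intact
            return value[:i] + [mutate(value[i], rng)] + value[i + 1:]
        return value + [rng.choice(value)]     # duplicate an existing element
    if isinstance(value, tuple):
        return tuple(mutate(list(value), rng))
    return value                               # unknown type: leave unchanged

def generate_inputs(seeds, budget, seed=0):
    """Grow a pool of inputs by repeatedly mutating randomly chosen members."""
    rng = random.Random(seed)
    pool = list(seeds)
    for _ in range(budget):
        child = mutate(rng.choice(pool), rng)
        if child not in pool:                  # keep only new inputs
            pool.append(child)
    return pool

pool = generate_inputs([[1, 2, 3]], budget=50)
```

Because every mutation preserves the type of the value it starts from, a pool seeded with lists of integers only ever contains lists of integers, which keeps generated inputs close to what the task expects.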

Key Findings

Automated augmentation increases tests per task from single-digit to hundreds.

Numbers: HumanEval avg tests 9.6 → HUMANEVAL+ avg 764.1

Measured pass@k for many models falls substantially under stronger testing.

Numbers: pass@1/10/100 drops of up to 19.3% / 24.9% / 28.9%

Stronger tests change relative model rankings.

Numbers: WizardCoder-CodeLlama and Phind-CodeLlama outperform ChatGPT on HUMANEVAL+ though not on original HumanEval

Reference implementations in HumanEval contain real bugs.

Numbers: 18 defective ground-truths = 11% of tasks

A small, optimized subset of tests preserves most detection power.

Numbers: HUMANEVAL+-MINI avg tests 16.1 vs HUMANEVAL+ 764.1 (≈47× reduction)
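
One way to picture how a small subset can keep most of the detection power is to treat reduction as a set-cover problem: each test "covers" a set of requirements (for example, the wrong model samples it catches), and tests are picked greedily until everything is covered. The sketch below assumes that framing; the function name `reduce_suite` and the toy data are hypothetical.

```python
def reduce_suite(covers):
    """Greedy set cover.

    covers: dict mapping a test id to the set of requirement ids it satisfies
            (e.g. wrong samples detected, branches covered).
    Returns a list of selected test ids that together satisfy every requirement.
    """
    remaining = set().union(*covers.values())
    selected = []
    while remaining:
        # pick the test that satisfies the most still-uncovered requirements
        best = max(covers, key=lambda t: len(covers[t] & remaining))
        selected.append(best)
        remaining -= covers[best]
    return selected

# Toy data: four tests, three wrong samples ("a", "b", "c") to detect.
toy = {
    "t1": {"a", "b"},
    "t2": {"b"},
    "t3": {"c"},
    "t4": {"a", "b", "c"},
}
mini = reduce_suite(toy)   # ["t4"]: one test detects all three samples
```

Greedy selection does not guarantee the smallest possible suite, but it is simple and usually close. Note that the result is only as good as the requirements fed in: a suite reduced against known wrong samples may miss failures of a very different model.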

Results

average tests per task

Value: 764.1

Baseline: 9.6 (original HUMANEVAL)

pass@1 drop (max observed)

Value: up to 19.3%

Baseline: pass@1 on original HUMANEVAL

pass@10 drop (max observed)

Value: up to 24.9%

Baseline: pass@10 on original HUMANEVAL

ground-truth defects found

Value: 18

Baseline: 164 tasks

reduced test-suite size

Value: 16.1

Baseline: 764.1 (HUMANEVAL+)
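
For readers who want to reproduce this kind of comparison, pass@k is commonly computed with the unbiased estimator introduced alongside the original HumanEval benchmark: with n samples per task, of which c pass all tests, pass@k = 1 − C(n−c, k) / C(n, k). The sketch below assumes that estimator. The sample counts are made up for illustration and are not results from the paper.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one task: n samples, c of them correct."""
    if n - c < k:
        return 1.0          # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical task: 200 samples; 60 pass the original tests,
# but only 45 still pass once the augmented tests are added.
base = pass_at_k(n=200, c=60, k=1)   # 0.3
plus = pass_at_k(n=200, c=45, k=1)   # 0.225
```

The only thing that changes between the two evaluations is c: stronger tests reclassify some previously "passing" samples as failures, which lowers the estimate. Benchmark-level pass@k is the mean of the per-task values.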

What To Try In 7 Days

Run EvalPlus (or similar) on your internal code-gen evaluation to expand tests automatically.

Use an LLM to create structured seed inputs, then apply type-aware mutation to scale tests.

Cross-check model outputs against a validated ground-truth implementation before deployment decisions.
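
The cross-checking step can be prototyped in a few lines. The sketch below is a minimal differential-testing harness under simplifying assumptions: outputs are compared with `==`, each input is a tuple of positional arguments, and there is no sandboxing or timeout. The names `differential_test`, `reference_dedup_sort`, and `candidate_dedup_sort` are hypothetical.

```python
def differential_test(candidate, reference, inputs):
    """Return (args, expected, actual) for every input where the two disagree."""
    failures = []
    for args in inputs:
        expected = reference(*args)
        try:
            actual = candidate(*args)
        except Exception as exc:               # a crash counts as a failure
            failures.append((args, expected, repr(exc)))
            continue
        if actual != expected:
            failures.append((args, expected, actual))
    return failures

def reference_dedup_sort(xs):
    """Trusted implementation: sorted unique elements."""
    return sorted(set(xs))

def candidate_dedup_sort(xs):
    """Model-generated implementation with a bug: forgets to deduplicate."""
    return sorted(xs)

inputs = [([1, 2, 3],), ([3, 1, 1],)]
failures = differential_test(candidate_dedup_sort, reference_dedup_sort, inputs)
```

Here the first input hides the bug and only the input with a duplicate exposes it, which is the situation a small hand-written test suite tends to miss. In a real setup, untrusted generated code should run in an isolated process with time and memory limits.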

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Seed generation uses ChatGPT which incurs cost and may reflect its biases.
  • EvalPlus relies on a correct ground-truth implementation for differential testing.
  • Current evaluation focuses on Python HumanEval; adapting to other languages requires work.
  • Generated tests may still miss semantic constraints not expressible in contracts or seeds.
  • One-hour mutation budget per task limits exhaustive exploration of very large input spaces.

When Not To Use

  • You lack a trusted ground-truth oracle for differential testing.
  • You require formal proofs of correctness rather than empirical testing.
  • Your code tasks are in languages or domains where type-aware mutation and seeds are hard to define.

Failure Modes

  • False positives from invalid inputs that slip past contract checks.
  • Bias in ChatGPT seeds could steer mutations away from rare but important cases.
  • Test-suite reduction tuned on known models may miss failures of a new, very different model.
  • Timing and resource limits can hide time-related correctness or performance bugs.
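
The first failure mode depends on how tightly input contracts are written. As a rough illustration, the sketch below filters generated inputs through an assertion-based precondition before they are used as tests. The contract shown (a sorted list of integers) and the names `contract_sorted_ints` and `filter_valid` are hypothetical examples, not contracts taken from HUMANEVAL+.

```python
def contract_sorted_ints(xs):
    """Precondition for a hypothetical task that expects a sorted list of ints."""
    assert isinstance(xs, list)
    assert all(isinstance(x, int) and not isinstance(x, bool) for x in xs)
    assert all(a <= b for a, b in zip(xs, xs[1:]))

def filter_valid(inputs, contract):
    """Keep only the inputs (argument tuples) that satisfy the contract."""
    valid = []
    for args in inputs:
        try:
            contract(*args)
        except AssertionError:
            continue                      # invalid input: drop it
        valid.append(args)
    return valid

generated = [([1, 2, 3],), ([3, 1],), (["a"],), ("abc",)]
valid = filter_valid(generated, contract_sorted_ints)   # [([1, 2, 3],)]
```

Any precondition the contract forgets to state (for instance, a non-empty list) lets invalid inputs through, and a correct solution may then be flagged as wrong on an input the task never promised to handle.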

Core Entities

Models

  • GPT-4
  • ChatGPT
  • WizardCoder-CodeLlama
  • Phind-CodeLlama
  • CodeLlama
  • CodeGen
  • CodeGen2
  • StarCoder
  • InCoder
  • SantaCoder
  • Vicuna
  • Mistral
  • PolyCoder
  • GPT-J
  • GPT-Neo

Metrics

  • pass@1
  • pass@10
  • pass@100
  • pass@1⋆ (greedy)
  • test count
  • mutation kills
  • branch coverage

Datasets

  • HUMANEVAL
  • HUMANEVAL+
  • HUMANEVAL+-MINI

Benchmarks

  • HUMANEVAL
  • HUMANEVAL+

Context Entities

Models

  • CodeGen
  • CodeGen2
  • StarCoder
  • CodeT5+
  • StableLM

Datasets

  • MBPP
  • HUMANEVAL-X
  • MultiPL-E