Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
171
Why It Matters For Business
Small test suites give falsely high confidence in AI-generated code; automated, larger testing exposes real failure rates and helps select safer models for production.
Summary TLDR
EvalPlus is an automated testing framework that strengthens code-generation benchmarks by combining ChatGPT-produced seed inputs with type-aware mutations and differential testing. Applied to HumanEval, EvalPlus expands the test suite by ~80× (avg 9.6 → 764.1 tests), finds previously undetected incorrect model outputs that lower pass@k by 19–29% on evaluated models, discovers 18 bad reference implementations (11% of tasks), and produces a compact mini-suite (47× smaller) that retains most detection power. The code and datasets are open-sourced.
Problem Statement
Existing code-generation benchmarks rely on too few tests (fewer than 10 per problem on average), and those tests are too simple, so incorrect solutions pass and real model weaknesses stay hidden. This paper asks whether reported pass rates actually reflect functional correctness, and answers the question with a larger, automatically generated test suite plus test-suite reduction techniques.
Main Contribution
Diagnosed test insufficiency in popular code benchmarks and measured its practical impact on model rankings.
Designed EvalPlus: a test-generation pipeline that combines LLM-generated (ChatGPT) seed inputs, type-aware mutation, and differential testing against ground-truth code.
Built HumanEval+ (avg 764.1 tests/task, ~80× larger than HumanEval) and HumanEval+-Mini (avg 16.1 tests/task, ~47× smaller than HumanEval+), and fixed buggy ground-truth implementations.
Evaluated 26 LLMs across sampling and greedy modes; showed pass@k drops of up to 19.3–28.9% (depending on k) and model re-ranking effects.
Released tools, tests and generated code to the public repository for reproducible evaluation.
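The type-aware mutation step listed above can be illustrated with a minimal sketch. The mutation rules shown here (plus or minus one on numbers, single-character edits on strings, element removal, duplication, or recursive mutation on lists) are illustrative assumptions rather than EvalPlus's actual rule set, and `mutate` and `grow_inputs` are hypothetical names.

```python
import random

def mutate(value, rng):
    """Return a new value of the same type as `value` (illustrative rules only)."""
    if isinstance(value, bool):            # bool first: bool is a subclass of int
        return not value
    if isinstance(value, int):
        return value + rng.choice([-1, 1])
    if isinstance(value, float):
        return value + rng.choice([-1.0, 1.0])
    if isinstance(value, str):
        if value and rng.random() < 0.5:
            i = rng.randrange(len(value))
            return value[:i] + value[i + 1:]        # drop one character
        return value + rng.choice("abc ")           # append one character
    if isinstance(value, list):
        if not value:
            return value
        i = rng.randrange(len(value))
        op = rng.randrange(3)
        if op == 0:
            return value[:i] + value[i + 1:]                  # remove an element
        if op == 1:
            return value[:i] + [value[i]] + value[i:]         # duplicate an element
        return value[:i] + [mutate(value[i], rng)] + value[i + 1:]  # mutate in place
    if isinstance(value, tuple):
        if not value:
            return value
        i = rng.randrange(len(value))      # keep arity so tuples can act as argument lists
        return value[:i] + (mutate(value[i], rng),) + value[i + 1:]
    return value                           # unknown types are left unchanged

def grow_inputs(seeds, rounds, seed=0):
    """Grow a pool of inputs by repeatedly mutating a randomly chosen pool member."""
    rng = random.Random(seed)
    pool = list(seeds)
    if not pool:
        return pool
    for _ in range(rounds):
        child = mutate(rng.choice(pool), rng)
        if child not in pool:              # keep only inputs not seen before
            pool.append(child)
    return pool

if __name__ == "__main__":
    for test_input in grow_inputs([[1, 2, 3], [0]], rounds=10):
        print(test_input)
```

Because each mutant stays in the pool and can be mutated again, a handful of seeds can fan out into many structurally valid inputs, which is the scaling effect the pipeline relies on.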
Key Findings
Automated augmentation increases tests per task from single-digit to hundreds.
Measured pass@k for many models falls substantially under stronger testing.
Stronger tests change relative model rankings.
Reference implementations in HumanEval contain real bugs.
A small, optimized subset of tests preserves most detection power.
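To make the pass@k finding concrete, the sketch below assumes the standard unbiased pass@k estimator introduced alongside HumanEval, 1 - C(n-c, k)/C(n, k) for n samples of which c pass; this summary does not restate the formula. The per-task counts are hypothetical and only show how stricter tests, by lowering c, lower the metric.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples drawn, c of them pass every test."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0            # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(per_task, k: int) -> float:
    """Average pass@k over tasks; per_task is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in per_task) / len(per_task)

if __name__ == "__main__":
    # Hypothetical (n, c) counts for three tasks under the base and the extended tests.
    base_counts = [(10, 6), (10, 10), (10, 3)]
    plus_counts = [(10, 4), (10, 10), (10, 0)]
    for k in (1, 5):
        print(k, mean_pass_at_k(base_counts, k), mean_pass_at_k(plus_counts, k))
```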
Results
average tests per task
9.6 → 764.1 (HumanEval → HumanEval+)
pass@1 drop (avg observed max)
pass@10 drop (max observed)
ground-truth defects found
18 (11% of tasks)
reduced test-suite size
~47× smaller (avg 16.1 tests/task)
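One common way to shrink a test suite while keeping its detection power is greedy set cover: repeatedly pick the test that satisfies the most still-unsatisfied requirements. The sketch below assumes each test is described by the set of requirements it meets (for example branches covered, mutants killed, or known-wrong samples detected); the `reduce_suite` name and the data layout are hypothetical, not EvalPlus's API.

```python
def reduce_suite(requirements_by_test):
    """Greedy set cover.

    requirements_by_test maps a test id to the set of requirement ids it satisfies.
    Returns a list of test ids that together satisfy everything the full suite does.
    """
    uncovered = set().union(*requirements_by_test.values())
    selected = []
    while uncovered:
        # Iterate over sorted ids so ties are broken deterministically.
        best = max(
            sorted(requirements_by_test),
            key=lambda t: len(requirements_by_test[t] & uncovered),
        )
        selected.append(best)
        uncovered -= requirements_by_test[best]
    return selected

if __name__ == "__main__":
    # Hypothetical requirements: b* are branches, m1 is a mutant killed.
    coverage = {
        "t1": {"b1", "b2"},
        "t2": {"b2"},
        "t3": {"b3", "m1"},
        "t4": {"b1", "b2", "b3"},
    }
    print(reduce_suite(coverage))  # ['t4', 't3']
```

Greedy selection is not guaranteed to find the smallest possible subset, but it preserves every requirement the full suite satisfied, which is the property that matters when the reduced suite is used as a cheaper stand-in.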
Who Should Care
What To Try In 7 Days
Run EvalPlus (or a similar tool) on your internal code-generation evaluation to expand its tests automatically.
Use an LLM to create structured seed inputs, then apply type-aware mutation to scale up the number of tests.
Cross-check model outputs against a validated ground-truth implementation before making deployment decisions.
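The cross-checking step can be sketched as a small differential-testing loop: run the candidate and a trusted reference on the same inputs and record every disagreement. The `differential_test` helper, the `contract` precondition filter, and the toy `max` functions are all hypothetical illustrations, not part of EvalPlus.

```python
def differential_test(candidate, reference, inputs, contract=lambda *args: True):
    """Return (args, expected, actual) for every input where candidate != reference.

    inputs   : iterable of argument tuples
    contract : precondition; inputs that violate it are skipped as invalid
    """
    failures = []
    for args in inputs:
        if not contract(*args):
            continue                      # invalid input: not a fair test
        expected = reference(*args)
        try:
            actual = candidate(*args)
        except Exception as exc:          # a crash on a valid input counts as a failure
            failures.append((args, expected, repr(exc)))
            continue
        if actual != expected:
            failures.append((args, expected, actual))
    return failures

def reference_max(xs):
    return max(xs)

def candidate_max(xs):
    best = 0                              # bug: wrong when every element is negative
    for x in xs:
        if x > best:
            best = x
    return best

def non_empty(xs):
    return len(xs) > 0

if __name__ == "__main__":
    base_inputs = [([1, 2, 3],), ([5, 3],)]
    extra_inputs = base_inputs + [([-3, -1],), ([],)]
    print(differential_test(candidate_max, reference_max, base_inputs, non_empty))
    print(differential_test(candidate_max, reference_max, extra_inputs, non_empty))
```

The buggy candidate passes the small base set and fails only once a negative-only list is added, which mirrors the way a thin test suite can overstate correctness. This sketch does not enforce timeouts or sandboxing, both of which matter when running untrusted generated code.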
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Seed generation uses ChatGPT, which incurs cost and may carry over the model's biases.
- EvalPlus relies on a correct ground-truth implementation for differential testing.
- Current evaluation focuses on Python HumanEval; adapting to other languages requires work.
- Generated tests may still miss semantic constraints not expressible in contracts or seeds.
- One-hour mutation budget per task limits exhaustive exploration of very large input spaces.
When Not To Use
- You lack a trusted ground-truth oracle for differential testing.
- You require formal proofs of correctness rather than empirical testing.
- Your code tasks are in languages or domains where type-aware mutation and seeds are hard to define.
Failure Modes
- False positives from invalid inputs that slip past contract checks.
- Bias in ChatGPT seeds could steer mutations away from rare but important cases.
- Test-suite reduction tuned on known models may miss failures of a new, very different model.
- Timing and resource limits can hide time-related correctness or performance bugs.
Core Entities
Models
- GPT-4
- ChatGPT
- WizardCoder-CodeLlama
- Phind-CodeLlama
- CodeLlama
- CodeGen
- CodeGen2
- StarCoder
- InCoder
- SantaCoder
- Vicuna
- Mistral
- PolyCoder
- GPT-J
- GPT-Neo
Metrics
- pass@1
- pass@10
- pass@100
- pass@1⋆ (greedy)
- test count
- mutation kills
- branch coverage
Datasets
- HumanEval
- HumanEval+
- HumanEval+-Mini
Benchmarks
- HumanEval
- HumanEval+
Context Entities
Models
- CodeGen
- CodeGen2
- StarCoder
- CodeT5+
- StableLM
Datasets
- MBPP
- HumanEval-X
- MultiPL-E

