EvalPlus: auto-generated tests reveal pass rates up to ~29% lower and flawed 'ground-truth' solutions in 11% of HumanEval tasks
Small test suites give falsely high confidence in AI-generated code; automated, larger testing exposes real failure rates and helps select safer models for production.
Key finding
Automated augmentation increases the number of tests per task from single digits to hundreds.
Numbers: HumanEval averages 9.6 tests per task → HumanEval+ averages 764.1 (roughly 80x more)
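Illustration (not EvalPlus's actual code): a minimal sketch of why a larger, auto-generated suite changes the verdict. It assumes a toy task, "return the sorted distinct values of a list", and every name in it (reference, candidate, generate_inputs, failures) is hypothetical. Plain random generation stands in for whatever input generator a real pipeline would use; each output is checked against the reference solution (differential testing).

```python
import random


def reference(xs):
    """Hypothetical ground-truth solution: sorted distinct values."""
    return sorted(set(xs))


def candidate(xs):
    """Hypothetical model output: drops only adjacent duplicates and never sorts."""
    out = []
    for x in xs:
        if not out or out[-1] != x:
            out.append(x)
    return out


# A small hand-written suite, similar in size to a single-digit test set.
# Every input happens to be sorted already, so the bug stays hidden.
BASE_TESTS = [[], [1], [1, 1, 2], [1, 2, 2, 3], [5, 5, 5]]


def generate_inputs(n, seed=0):
    """Generate n random integer lists (a stand-in for automated test generation)."""
    rng = random.Random(seed)
    return [[rng.randint(0, 9) for _ in range(rng.randint(0, 8))] for _ in range(n)]


def failures(fn, inputs):
    """Differential testing: return the inputs where fn disagrees with the reference."""
    return [xs for xs in inputs if fn(list(xs)) != reference(list(xs))]


base_failures = failures(candidate, BASE_TESTS)
augmented_failures = failures(candidate, BASE_TESTS + generate_inputs(750))

print(f"base suite: {len(base_failures)} failing inputs out of {len(BASE_TESTS)}")
print(f"augmented suite: {len(augmented_failures)} failing inputs")
```

The candidate passes every base test, yet the generated inputs include unsorted lists that expose the bug. Note that this approach trusts the reference solution: if the ground truth is itself wrong, the comparison inherits that error.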

