Overview
The benchmark is a practical, reproducible dataset that exposes the gap between LLM performance on standalone-function benchmarks and on real repositories. The evidence comes from experiments on 10 models and a manual error analysis.
Citations: 5
Evidence Strength: 0.70
Confidence: 0.90
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 3/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 50%
Production readiness: 30%
Novelty: 60%
Why It Matters For Business
EvoCodeBench shows that state-of-the-art LLMs often fail on real repository tasks. Evaluate models on repository-aligned data, and include local-file context in prompts, before relying on them in deployment.
Summary TLDR
EvoCodeBench is a new, evolving Python code-generation benchmark built from recent open-source repositories. The first release (EvoCodeBench-2403) has 275 function-level samples from 25 repos, each annotated with a natural-language requirement, the full repository, ground-truth code, dependencies (with file paths), and executable tests. It measures functional correctness (Pass@k) and dependency recall (Recall@k). Evaluations of 10 popular LLMs show much lower performance on real repositories than on classic benchmarks (e.g., gpt-4 reaches 20.73% Pass@1 on EvoCodeBench versus roughly 80% on HumanEval). Adding local-file context, or simply retrieving similar functions from the repository, substantially improves results. The benchmark is periodically updated to cut
Problem Statement
Existing code-generation benchmarks are not aligned with real-world repositories: they over-represent standalone functions, lack dependency and repository context, and are vulnerable to data leakage. This makes it hard to measure how LLMs perform in real development workflows.
Main Contribution
EvoCodeBench: an evolving, repository-aligned Python benchmark with requirements, full repo, reference code, dependency paths, and tests.
Repository-level code generation task: ask models to implement functions given a full repository and a natural-language requirement.
Key Findings
EvoCodeBench-2403 size and distribution match recent repositories.
LLMs perform much worse on repository tasks than on classic function benchmarks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Pass@1 (gpt-4, Local File Infilling) | 20.73% | HumanEval gpt-4 ~80% (cited) | approx. -59 pp | EvoCodeBench-2403 | Table 4 (Local File Infilling row for gpt-4) | Table 4 |
| Pass@1 (gpt-4, Without Context) | 7.27% | — | — | EvoCodeBench-2403 | Table 4 (Without Context row for gpt-4) | Table 4 |
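The Pass@1 values in the table are instances of Pass@k. As a minimal sketch, the widely used unbiased estimator (introduced alongside HumanEval) can be computed as below. The source does not state how many samples per task were generated, so the function names and the `(n, c)` input format here are illustrative assumptions, not the benchmark's own tooling.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one task.

    n: number of programs generated for the task
    c: number of those programs that pass all tests
    k: the k in Pass@k (must satisfy 1 <= k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset contains at least one passing program.
        return 1.0
    # 1 - P(all k drawn programs fail)
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average Pass@k over tasks; `results` holds one (n, c) pair per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With one sample per task, Pass@1 reduces to the fraction of tasks solved. For example, 57 solved tasks out of 275 would give about 20.73%, which is consistent with the gpt-4 Local File Infilling row, although the source does not report the raw count.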
What To Try In 7 Days
Run your model on EvoCodeBench-2403 to gauge real-repo performance quickly.
Add nearby local-file context to prompts and compare Pass@1 uplift.
Implement simple retrieval of name-similar functions from the repo as extra context (RAG).
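The retrieval step in the last item can be prototyped with the standard library alone. The sketch below is one possible reading of "name-similar functions", not the paper's implementation: it assumes the repository is already loaded as a path-to-source mapping, scores names with `difflib.SequenceMatcher`, and uses a hypothetical prompt layout in `build_prompt`.

```python
import ast
from difflib import SequenceMatcher


def list_functions(source: str) -> dict[str, str]:
    """Map each function name in a Python source string to its source text."""
    tree = ast.parse(source)
    functions = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions[node.name] = ast.get_source_segment(source, node) or ""
    return functions


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity of two identifiers, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def retrieve_similar(target_name: str, files: dict[str, str], top_n: int = 3):
    """Return up to top_n (score, path, name, code) tuples, best match first."""
    candidates = []
    for path, source in files.items():
        for name, code in list_functions(source).items():
            if name == target_name:
                continue  # skip the function we are asked to implement
            score = name_similarity(target_name, name)
            candidates.append((score, path, name, code))
    candidates.sort(key=lambda item: item[0], reverse=True)
    return candidates[:top_n]


def build_prompt(requirement: str, signature: str, retrieved) -> str:
    """Assemble a prompt: retrieved functions first, then the task (hypothetical layout)."""
    context = "\n\n".join(
        f"# from {path}\n{code}" for _, path, _, code in retrieved
    )
    return (
        "# Similar functions from the repository:\n"
        f"{context}\n\n"
        f"# Requirement: {requirement}\n"
        f"{signature}\n"
    )
```

Comparing Pass@1 with and without the retrieved block in the prompt gives a quick read on how much this kind of context helps a given model.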
Risks & Boundaries
Limitations
Monolingual: English requirements and Python code only.
Auto-generated requirements sometimes miss details like hyperparameters.
When Not To Use
If you need multilingual requirements or non-Python languages today.
If your use case requires cross-file reasoning over the full repository and you have no retrieval strategy to go beyond local-file context.
Failure Modes
Logic/implementation errors in generated code.
Incomplete or missing cross-file contexts leading to wrong dependencies.
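The second failure mode, wrong dependencies, is the kind of error that the benchmark's Recall@k metric is meant to surface. The sketch below assumes Recall@k is the best per-sample recall of reference dependencies among k generated programs, averaged over tasks. That definition, the set-based inputs, and the dotted dependency names are assumptions for illustration. Extracting dependencies from generated code needs a separate static-analysis step that is not shown.

```python
def dependency_recall(predicted: set[str], reference: set[str]) -> float:
    """Fraction of reference dependencies that the generated program also uses."""
    if not reference:
        return 1.0  # assumption: nothing to recall counts as full recall
    return len(predicted & reference) / len(reference)


def recall_at_k(samples: list[set[str]], reference: set[str], k: int) -> float:
    """Best dependency recall among the first k generated programs for one task."""
    if k < 1 or k > len(samples):
        raise ValueError("require 1 <= k <= number of samples")
    return max(dependency_recall(sample, reference) for sample in samples[:k])


def mean_recall_at_k(tasks: list[tuple[list[set[str]], set[str]]], k: int) -> float:
    """Average Recall@k over tasks; each task is (samples, reference)."""
    return sum(recall_at_k(samples, ref, k) for samples, ref in tasks) / len(tasks)
```

A low Recall@k paired with a low Pass@k suggests the model is not finding the right repository symbols at all, whereas high recall with low Pass@k points toward logic or implementation errors.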

