Overview
Production Readiness
0.3
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
5
Why It Matters For Business
EvoCodeBench shows that state-of-the-art LLMs often fail on real repository tasks. Evaluate models on repository-aligned data and include local-file context in prompts to avoid surprises after deployment.
Summary TLDR
EvoCodeBench is a new, evolving Python code-generation benchmark built from recent open-source repositories. The first release (EvoCodeBench-2403) has 275 function-level samples from 25 repos, annotated with natural-language requirements, full repositories, ground-truth code, dependencies (with file paths), and executable tests. It measures functional correctness (Pass@k) and dependency recall (Recall@k). Evaluations of 10 popular LLMs show much lower real-repo performance than on classic benchmarks (e.g., gpt-4 Pass@1 20.73% on EvoCodeBench vs ~80% on HumanEval), and contexts or simple retrieval of similar functions substantially improve results. The benchmark is periodically updated to cut
Problem Statement
Existing code-generation benchmarks are not aligned with real-world repositories: they over-represent standalone functions, lack dependency and repository context, and are vulnerable to data leakage. This makes it hard to measure how LLMs perform in real development workflows.
Main Contribution
EvoCodeBench: an evolving, repository-aligned Python benchmark with requirements, full repo, reference code, dependency paths, and tests.
Repository-level code generation task: ask models to implement functions given a full repository and a natural-language requirement.
Metrics: functional Pass@k plus Recall@k to measure whether generated code invokes the same dependencies as reference code.
Public release (EvoCodeBench-2403): 275 samples from 25 repositories created 2023-10 to 2024-02 and an automatic pipeline to update the benchmark.
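The two metrics named above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own evaluation code. `pass_at_k` uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k). `recall_at_k` assumes that the best dependency recall among the k candidates is the one reported, and that dependencies are compared as sets of name strings. The function names and both assumptions are hypothetical.

```python
from math import comb
from typing import Iterable, Set


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n generated samples, c of which pass all tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def dependency_recall(reference: Set[str], generated: Set[str]) -> float:
    """Fraction of reference dependencies that the generated code also invokes."""
    if not reference:
        return 1.0  # assumption: nothing to recall counts as full recall
    return len(reference & generated) / len(reference)


def recall_at_k(reference: Set[str], candidates: Iterable[Set[str]]) -> float:
    """Assumed aggregation: best dependency recall among the k candidates."""
    return max(dependency_recall(reference, g) for g in candidates)
```

For example, `pass_at_k(10, 3, 1)` estimates Pass@1 when 3 of 10 samples pass, and `recall_at_k` takes one dependency set per generated candidate.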
Key Findings
The size and distribution of EvoCodeBench-2403 match those of recent real-world repositories.
LLMs perform much worse on repository tasks than on classic function benchmarks.
Including local-file context raises correctness and dependency recall substantially.
Simple retrieval of similar functions helps when full repo context is unavailable.
Main failure modes are implementation logic errors and missing contexts.
Automatic dependency extraction shows only a small bias relative to human annotation.
Results
Pass@1 (gpt-4, Local File Infilling)
Pass@1 (gpt-4, Without Context)
Pass@1 (gpt-4, Local File Completion)
RAG improvement (gpt-4 Pass@1)
Dataset composition
Who Should Care
What To Try In 7 Days
Run your model on EvoCodeBench-2403 to gauge real-repo performance quickly.
Add nearby local-file context to prompts and compare Pass@1 uplift.
Implement simple retrieval of name-similar functions from the repo as extra context (RAG).
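The retrieval step above can be prototyped with the standard library alone. The sketch below is one possible reading of "name-similar functions", not the paper's implementation. It collects function definitions from a source file with `ast`, ranks them by string similarity of their names to the target function name using `difflib.SequenceMatcher`, and prepends the best matches to the prompt. The helper names, the prompt layout, and the choice of similarity measure are all assumptions.

```python
import ast
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def list_functions(source: str) -> Dict[str, str]:
    """Map each function name defined in `source` to its source text."""
    functions = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions[node.name] = ast.get_source_segment(source, node)
    return functions


def retrieve_similar(target_name: str, functions: Dict[str, str],
                     top_n: int = 2) -> List[Tuple[str, str]]:
    """Return the top_n functions whose names are most similar to target_name."""
    candidates = [(n, c) for n, c in functions.items() if n != target_name]
    candidates.sort(
        key=lambda item: SequenceMatcher(None, target_name, item[0]).ratio(),
        reverse=True,
    )
    return candidates[:top_n]


def build_prompt(requirement: str, signature: str,
                 retrieved: List[Tuple[str, str]]) -> str:
    """Hypothetical prompt layout: retrieved code, then requirement and signature."""
    context = "\n\n".join(code for _, code in retrieved)
    return (
        "# Similar functions from the repository:\n"
        f"{context}\n\n"
        f"# Requirement: {requirement}\n"
        f"{signature}\n"
    )
```

In a real repository you would call `list_functions` on every Python file and merge the results before ranking. Comparing Pass@1 with and without the retrieved block gives the uplift this item asks for.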
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Monolingual: English requirements and Python code only.
- Auto-generated requirements sometimes miss details like hyperparameters.
- Dependency Recall (Recall@k) can slightly undercount runtime-determined dependencies due to static parsing (observed bias 0.16).
- Current context extraction is local-file based; cross-file context and broader retrieval remain unexplored.
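To see why static parsing can undercount runtime-determined dependencies, consider this small illustration. It is not the benchmark's parser. It walks the syntax tree with `ast` and records the name of every call target it can read directly from the source. A method looked up through `getattr` with a computed name never appears in the result. The function name and the sample snippet are hypothetical.

```python
import ast
from typing import Set


def static_call_names(source: str) -> Set[str]:
    """Collect names of call targets that are visible in the source text."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):         # plain call: foo(...)
                names.add(func.id)
            elif isinstance(func, ast.Attribute):  # attribute call: obj.foo(...)
                names.add(func.attr)
    return names


SAMPLE = '''
def run(obj, op):
    normalize(obj)
    obj.validate()
    method = getattr(obj, "handle_" + op)  # target is known only at runtime
    return method()
'''

# The parser sees the local variable `method`, not the handle_<op> method
# that is actually invoked when the code runs.
found = static_call_names(SAMPLE)
```

Here `found` contains `normalize`, `validate`, `getattr`, and `method`, but no `handle_...` entry, so a recall metric built on such a parser would miss that dependency.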
When Not To Use
- If you need multilingual requirements or non-Python languages today.
- If your use case requires full-repo cross-file reasoning beyond local-file contexts without a retrieval strategy.
- When you require perfect dependency recall for dynamic runtime behaviors.
Failure Modes
- Logic/implementation errors in generated code.
- Incomplete or missing cross-file contexts leading to wrong dependencies.
- Parser undercounting of runtime dependencies reduces measured Recall@k.
- Model sensitivity to prompt templates and sampling hyperparameters.
Core Entities
Models
- gpt-4
- gpt-3.5
- DeepSeek Coder
- StarCoder 2
- CodeLLaMa
- Gemma
- Qwen 1.5
Metrics
- Pass@k
- Recall@k
Datasets
- EvoCodeBench-2403
Benchmarks
- HumanEval
- MBPP
- APPS
- CoderEval
- ClassEval
Context Entities
Datasets
- 500 real repositories (analysis set)

