A realistic, evolving benchmark for repository-level code generation drawn from recent GitHub projects

March 31, 2024 | 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark is a practical, reproducible dataset that reveals real-repo gaps in LLMs; evidence comes from experiments on 10 models and manual error analysis.

Citations: 5

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 30%

Novelty: 60%

Authors

Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, Zhi Jin

Links

Abstract / PDF / Code / Data

Why It Matters For Business

EvoCodeBench reveals that state-of-the-art LLMs often fail on real repository tasks; test on repo-aligned data and include local contexts to avoid bad deployment surprises.

Summary TLDR

EvoCodeBench is a new, evolving Python code-generation benchmark built from recent open-source repositories. The first release (EvoCodeBench-2403) has 275 function-level samples from 25 repos, each annotated with a natural-language requirement, the full repository, ground-truth code, dependencies (with file paths), and executable tests. It measures functional correctness (Pass@k) and dependency recall (Recall@k). Evaluations of 10 popular LLMs show much lower performance on real repositories than on classic benchmarks (e.g., gpt-4 reaches a Pass@1 of 20.73% on EvoCodeBench vs ~80% on HumanEval), and adding local-file contexts or simple retrieval of similar functions substantially improves results. The benchmark is periodically updated to cut data leakage.
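Pass@k can be made concrete with a short sketch. The summary does not say which estimator the authors use, so the code below assumes the widely used unbiased estimator popularized alongside HumanEval; the function names `pass_at_k` and `benchmark_pass_at_k` are illustrative and not part of the benchmark's tooling.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task.

    n: number of programs generated for the task
    c: number of those programs that pass all tests
    k: sample budget being scored

    Returns the probability that at least one of k programs drawn
    without replacement from the n generated ones is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing programs: any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (n, c) pairs, one pair per task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single generation per task (n = k = 1), the benchmark-level score reduces to the fraction of tasks whose generated function passes its tests.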

Problem Statement

Existing code-generation benchmarks are not aligned with real-world repositories: they over-represent standalone functions, lack dependency and repository context, and are vulnerable to data leakage. This makes it hard to measure how LLMs perform in real development workflows.

Main Contribution

EvoCodeBench: an evolving, repository-aligned Python benchmark with requirements, full repo, reference code, dependency paths, and tests.

Repository-level code generation task: ask models to implement functions given a full repository and a natural-language requirement.
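To make the task setup concrete, here is a minimal sketch of how one sample and its prompt might be represented. The `RepoSample` class, its field names, and the prompt layout are hypothetical illustrations based on the annotation types listed above; they are not the dataset's actual schema or the authors' prompt template.

```python
from dataclasses import dataclass, field


@dataclass
class RepoSample:
    """Hypothetical record for one benchmark sample (field names are assumptions)."""
    requirement: str       # natural-language description of the function
    repo_root: str         # path to the checked-out repository
    target_file: str       # file that should contain the function
    signature: str         # function signature the model must complete
    reference_code: str    # ground-truth implementation
    dependencies: list[str] = field(default_factory=list)  # e.g. "pkg/util.py::helper"
    test_ids: list[str] = field(default_factory=list)      # tests used for Pass@k


def build_prompt(sample: RepoSample, local_context: str = "") -> str:
    """Assemble a prompt, optionally prepending code from the target file."""
    parts = []
    if local_context:
        parts.append("# Context from the target file:\n" + local_context)
    parts.append("# Requirement:\n" + sample.requirement)
    parts.append("# Complete the function:\n" + sample.signature)
    return "\n\n".join(parts)
```

Calling `build_prompt(sample)` with and without `local_context` yields the two prompt variants needed to compare a no-context setting against a local-file setting.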

Key Findings

EvoCodeBench-2403 size and distribution match recent repositories.

Numbers: 275 samples, 25 repos; standalone 27% / non-standalone 73%; avg dependencies 3.46

Practical Use: Use EvoCodeBench when you need a small, realistic sample set that reflects multi-file dependency patterns in modern Python projects.

Evidence Ref: Table 2; Section 2.4

LLMs perform much worse on repository tasks than on classic function benchmarks.

Numbers: gpt-4 Pass@1: 20.73% (local infilling) vs ~80% on HumanEval (cited)

Practical Use: Don't trust high HumanEval-like scores to predict real-repo performance; evaluate models on repo-aligned data before deployment.

Evidence Ref: Intro and Table 4

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Pass@1 (gpt-4, Local File Infilling) | 20.73% | HumanEval gpt-4 ~80% (cited) | -59.27 pp | EvoCodeBench-2403 | Table 4 (Local File Infilling row for gpt-4) | Table 4
Pass@1 (gpt-4, Without Context) | 7.27% | n/a | n/a | EvoCodeBench-2403 | Table 4 (Without Context row for gpt-4) | Table 4
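The deltas implied by these two rows can be checked with simple arithmetic. The snippet below only uses the reported figures; because the HumanEval baseline is quoted as roughly 80%, the first delta is approximate, and the context uplift is a derived figure rather than one reported in the table.

```python
# Reported Pass@1 values for gpt-4, in percent.
pass1_local_infilling = 20.73
pass1_without_context = 7.27
humaneval_baseline = 80.0  # approximate ("~80%") in the source

# Gap to the classic benchmark, in percentage points.
delta_vs_humaneval = round(pass1_local_infilling - humaneval_baseline, 2)

# Derived (not reported): gain from adding local-file context.
context_uplift_pp = round(pass1_local_infilling - pass1_without_context, 2)
context_uplift_ratio = round(pass1_local_infilling / pass1_without_context, 2)

print(delta_vs_humaneval, context_uplift_pp, context_uplift_ratio)
```

Under these figures, local-file context adds about 13.46 percentage points, or roughly 2.85 times the no-context Pass@1.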

What To Try In 7 Days

Run your model on EvoCodeBench-2403 to gauge real-repo performance quickly.

Add nearby local-file context to prompts and compare Pass@1 uplift.

Implement simple retrieval of name-similar functions from the repo as extra context (RAG).
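The retrieval step above could be prototyped as follows. This is a sketch under stated assumptions: it scores function names with `difflib.SequenceMatcher`, which is one possible similarity measure and not necessarily the one used in the paper, and the helper names `collect_functions` and `retrieve_similar` are hypothetical.

```python
import ast
from difflib import SequenceMatcher


def collect_functions(sources: dict[str, str]) -> list[tuple[str, str, str]]:
    """Return (path, function_name, source_text) for every function in the files.

    sources maps a file path to that file's Python source code.
    """
    found = []
    for path, text in sources.items():
        try:
            tree = ast.parse(text)
        except SyntaxError:
            continue  # skip files that do not parse
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                snippet = ast.get_source_segment(text, node) or ""
                found.append((path, node.name, snippet))
    return found


def retrieve_similar(target_name: str, sources: dict[str, str],
                     target_path: str = "", top_k: int = 3):
    """Rank repository functions by name similarity to the target function.

    Returns up to top_k tuples of (score, path, name, source_text),
    highest score first. The target itself is excluded so its reference
    implementation cannot leak into the prompt.
    """
    scored = []
    for path, name, snippet in collect_functions(sources):
        if path == target_path and name == target_name:
            continue
        score = SequenceMatcher(None, target_name, name).ratio()
        scored.append((score, path, name, snippet))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]
```

The returned snippets can be prepended to the prompt as extra context; comparing Pass@1 with and without them gives a quick read on whether retrieval helps for a given model.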

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Monolingual: English requirements and Python code only.

Auto-generated requirements sometimes miss details like hyperparameters.

When Not To Use

If you need multilingual requirements or non-Python languages today.

If your use case requires full-repo cross-file reasoning beyond local-file contexts without a retrieval strategy.

Failure Modes

Logic/implementation errors in generated code.

Incomplete or missing cross-file contexts leading to wrong dependencies.

Core Entities

Models

gpt-4, gpt-3.5, DeepSeek Coder, StarCoder 2, CodeLLaMa, Gemma, Qwen 1.5

Metrics

Pass@k, Recall@k
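Recall@k, the dependency metric, can be sketched in a few lines. This summary does not give the exact formula, so the code assumes one plausible reading: for each task, take the best dependency recall among the first k generated programs, then average over tasks. Skipping tasks with no reference dependencies is also an assumption, and the function names are illustrative.

```python
def dependency_recall(reference: set[str], predicted: set[str]) -> float:
    """Fraction of reference dependencies that a generated program invokes."""
    if not reference:
        raise ValueError("reference must contain at least one dependency")
    return len(reference & predicted) / len(reference)


def recall_at_k(tasks: list[tuple[set[str], list[set[str]]]], k: int) -> float:
    """Average best-of-k dependency recall across tasks.

    Each task is (reference_dependencies, [dependencies of each generation]).
    Tasks without reference dependencies are skipped (an assumption).
    """
    scores = []
    for reference, generations in tasks:
        if not reference:
            continue
        best = max(
            (dependency_recall(reference, predicted) for predicted in generations[:k]),
            default=0.0,
        )
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

Extracting the set of dependencies from generated code (for example by static analysis) is a separate step that this sketch leaves out.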

Datasets

EvoCodeBench-2403

Benchmarks

HumanEval, MBPP, APPS, CoderEval, ClassEval

Context Entities

Datasets

500 real repositories (analysis set)