A realistic, evolving benchmark for repository-level code generation drawn from recent GitHub projects

Overview

Decision Snapshot: Needs Validation

The benchmark is a practical, reproducible dataset that reveals real-repo gaps in LLMs; evidence comes from experiments on 10 models and manual error analysis.

Citations: 5

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 30%

Novelty: 60%

Authors

Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, Zhi Jin

Links

Abstract / PDF / Code / Data

Why It Matters For Business

EvoCodeBench reveals that state-of-the-art LLMs often fail on real repository tasks; test on repo-aligned data and include local contexts to avoid bad deployment surprises.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist, Founder

Summary TLDR

EvoCodeBench is a new, evolving Python code-generation benchmark built from recent open-source repositories. The first release (EvoCodeBench-2403) has 275 function-level samples from 25 repos, each annotated with a natural-language requirement, the full repository, ground-truth code, dependencies (with file paths), and executable tests. It measures functional correctness (Pass@k) and dependency recall (Recall@k). Evaluations of 10 popular LLMs show much lower performance on real repositories than on classic benchmarks (e.g., gpt-4 Pass@1 is 20.73% on EvoCodeBench vs ~80% on HumanEval), and adding local-file context or simple retrieval of similar functions substantially improves results. The benchmark is periodically updated to cut
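
The two metrics named above can be sketched in a few lines. This is a minimal sketch, not the benchmark's own evaluation code: `pass_at_k` uses the widely used unbiased estimator 1 - C(n-c, k)/C(n, k), and `recall_at_k` assumes that Recall@k means the best dependency recall among k sampled candidates. The function names and the dotted dependency names in the usage example are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples, c of which pass the tests."""
    if not 0 < k <= n or not 0 <= c <= n:
        raise ValueError("need 0 < k <= n and 0 <= c <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def dependency_recall(reference: set[str], generated: set[str]) -> float:
    """Fraction of reference dependencies that a generated function also uses."""
    if not reference:
        return 1.0  # assumption: nothing to recall counts as full recall
    return len(reference & generated) / len(reference)


def recall_at_k(reference: set[str], candidates: list[set[str]]) -> float:
    """Best dependency recall among the sampled candidates (assumed definition)."""
    return max((dependency_recall(reference, g) for g in candidates), default=0.0)


if __name__ == "__main__":
    # 10 samples for one task, 2 of them pass the tests
    print(pass_at_k(n=10, c=2, k=1))
    # hypothetical dependency names; the second candidate recalls nothing
    print(recall_at_k({"pkg.io.load", "pkg.utils.clip"}, [{"pkg.io.load"}, set()]))
```

Per-task values would then be averaged over all samples in the benchmark to get the reported score.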

Problem Statement

Existing code-generation benchmarks are not aligned with real-world repositories: they over-represent standalone functions, lack dependency and repository context, and are vulnerable to data leakage. This makes it hard to measure how LLMs perform in real development workflows.

Main Contribution

EvoCodeBench: an evolving, repository-aligned Python benchmark with requirements, full repo, reference code, dependency paths, and tests.

Repository-level code generation task: ask models to implement functions given a full repository and a natural-language requirement.
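
To make the task inputs concrete, one sample could be modeled roughly as below. This is an illustrative sketch only: the class name, every field name, and the `path::name` dependency format are hypothetical and do not reflect the released data schema. The `is_standalone` rule (no repository-defined dependencies) is likewise an assumption about how the standalone / non-standalone split is drawn.

```python
from dataclasses import dataclass, field


@dataclass
class RepoSample:
    """One function-level task; all field names here are hypothetical."""

    requirement: str      # natural-language description of the function
    repo_root: str        # path to the checked-out repository
    target_file: str      # file (relative to repo_root) that holds the function
    signature: str        # signature the model is asked to implement
    reference_code: str   # ground-truth implementation
    # dependencies defined in the repository, e.g. "pkg/utils.py::clip"
    dependencies: list[str] = field(default_factory=list)
    # commands that run the executable tests for this sample
    test_commands: list[str] = field(default_factory=list)

    @property
    def is_standalone(self) -> bool:
        """Assumed rule: standalone means no repository-defined dependencies."""
        return not self.dependencies


if __name__ == "__main__":
    sample = RepoSample(
        requirement="Clip each value of the input list to the range [lo, hi].",
        repo_root="/tmp/example_repo",
        target_file="pkg/transforms.py",
        signature="def clip_all(values, lo, hi):",
        reference_code="    return [clip(v, lo, hi) for v in values]",
        dependencies=["pkg/utils.py::clip"],
        test_commands=["pytest tests/test_transforms.py"],
    )
    print(sample.is_standalone)
```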

Key Findings

EvoCodeBench-2403 size and distribution match recent repositories.

Numbers: 275 samples, 25 repos; standalone 27% / non-standalone 73%; avg dependencies 3.46

Practical Use: Use EvoCodeBench when you need a small, realistic sample set that reflects multi-file dependency patterns in modern Python projects.

Evidence Ref: Table 2; Section 2.4

LLMs perform much worse on repository tasks than on classic function benchmarks.

Numbers: gpt-4 Pass@1: 20.73% (local infilling) vs ~80% on HumanEval (cited)

Practical Use: Don't trust high HumanEval-like scores to predict real-repo performance; evaluate models on repo-aligned data before deployment.

Evidence Ref: Intro and Table 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Pass@1 (gpt-4, Local File Infilling)	20.73%	HumanEval gpt-4 ~80% (cited)	-59.27 pp	EvoCodeBench-2403	Table 4 (Local File Infilling row for gpt-4)	Table 4
Pass@1 (gpt-4, Without Context)	7.27%	—	—	EvoCodeBench-2403	Table 4 (Without Context row for gpt-4)	Table 4
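
The two rows also imply a context uplift that the Delta column does not state. The short calculation below reproduces the listed delta and derives that uplift from the table values only; the variable names are illustrative, and the HumanEval figure is the approximate cited value, so the first result is approximate too.

```python
# Values copied from the Results table (percent).
pass1_local_infilling = 20.73
pass1_without_context = 7.27
humaneval_cited = 80.0  # approximate, as cited

# Gap to the cited HumanEval score, in percentage points.
delta_vs_humaneval = round(pass1_local_infilling - humaneval_cited, 2)

# Gain from adding local-file context, absolute and relative.
uplift_pp = round(pass1_local_infilling - pass1_without_context, 2)
uplift_ratio = round(pass1_local_infilling / pass1_without_context, 2)

print(delta_vs_humaneval)  # -59.27
print(uplift_pp)           # 13.46
print(uplift_ratio)        # 2.85
```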

What To Try In 7 Days

Run your model on EvoCodeBench-2403 to gauge real-repo performance quickly.

Add nearby local-file context to prompts and compare Pass@1 uplift.

Implement simple retrieval of name-similar functions from the repo as extra context (RAG).
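
The retrieval step in the last item could look roughly like the sketch below. It assumes that "name-similar" means string similarity between function names, scored here with the standard library's `difflib.SequenceMatcher`; the paper's actual retrieval method may differ. The helper names and the `path::name` key format are hypothetical.

```python
import ast
import difflib
from pathlib import Path


def collect_functions(repo_root: str) -> dict[str, str]:
    """Map 'relative/path.py::name' to the source of each function in the repo."""
    functions: dict[str, str] = {}
    root = Path(repo_root)
    for path in sorted(root.rglob("*.py")):
        try:
            source = path.read_text(encoding="utf-8")
            tree = ast.parse(source)
        except (SyntaxError, ValueError, OSError):
            continue  # skip files that cannot be read or parsed
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # Note: same-named functions in one file overwrite each other.
                key = f"{path.relative_to(root).as_posix()}::{node.name}"
                functions[key] = ast.get_source_segment(source, node) or ""
    return functions


def retrieve_similar(target_name: str, functions: dict[str, str], top_n: int = 3) -> list[str]:
    """Return keys of the functions whose names are most similar to target_name."""

    def name_of(key: str) -> str:
        return key.rsplit("::", 1)[1]

    def score(key: str) -> float:
        matcher = difflib.SequenceMatcher(None, target_name.lower(), name_of(key).lower())
        return matcher.ratio()

    # Exclude exact name matches so the target itself is never returned as context.
    candidates = [k for k in functions if name_of(k) != target_name]
    return sorted(candidates, key=score, reverse=True)[:top_n]


def build_prompt(requirement: str, signature: str, context_snippets: list[str]) -> str:
    """Assemble a plain-text prompt: retrieved functions, requirement, signature."""
    context = "\n\n".join(context_snippets)
    return (
        f"# Similar functions from the repository:\n{context}\n\n"
        f"# Requirement:\n# {requirement}\n\n{signature}\n"
    )
```

A typical run would call `collect_functions` once per repository, pass the target function's name to `retrieve_similar`, and feed the matching source snippets to `build_prompt`. Comparing Pass@1 with and without the retrieved snippets gives the uplift described in the second item.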

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/seketeam/EvoCodeBench

Data URLs

https://github.com/seketeam/EvoCodeBench

Risks & Boundaries

Limitations

Monolingual: English requirements and Python code only.

Auto-generated requirements sometimes miss details like hyperparameters.

When Not To Use

If you need multilingual requirements or non-Python languages today.

If your use case requires full-repo cross-file reasoning beyond local-file contexts without a retrieval strategy.

Failure Modes

Logic/implementation errors in generated code.

Incomplete or missing cross-file contexts leading to wrong dependencies.

Core Entities

Models

gpt-4, gpt-3.5, DeepSeek Coder, StarCoder 2, CodeLLaMa, Gemma, Qwen 1.5

Metrics

Pass@k, Recall@k

Datasets

EvoCodeBench-2403

Benchmarks

HumanEval, MBPP, APPS, CoderEval, ClassEval

Context Entities

Datasets

500 real repositories (analysis set)

