A realistic, evolving benchmark for repository-level code generation drawn from recent GitHub projects

March 31, 2024 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

5

Authors

Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, Zhi Jin

Why It Matters For Business

EvoCodeBench shows that state-of-the-art LLMs often fail on real repository tasks. Evaluating on repository-aligned data and supplying local-file context in prompts helps avoid unpleasant surprises after deployment.

Summary TLDR

EvoCodeBench is a new, evolving Python code-generation benchmark built from recent open-source repositories. The first release (EvoCodeBench-2403) has 275 function-level samples from 25 repos, each annotated with a natural-language requirement, the full repository, ground-truth code, dependencies (with file paths), and executable tests. It measures functional correctness (Pass@k) and dependency recall (Recall@k). Evaluations of 10 popular LLMs show much lower performance on real repositories than on classic benchmarks (e.g., gpt-4 reaches a Pass@1 of 20.73% on EvoCodeBench vs ~80% on HumanEval), and adding local-file context or simple retrieval of similar functions substantially improves results. The benchmark is periodically updated to cut

Problem Statement

Existing code-generation benchmarks are not aligned with real-world repositories: they over-represent standalone functions, lack dependency and repository context, and are vulnerable to data leakage. This makes it hard to measure how LLMs perform in real development workflows.

Main Contribution

EvoCodeBench: an evolving, repository-aligned Python benchmark with requirements, full repo, reference code, dependency paths, and tests.

Repository-level code generation task: ask models to implement functions given a full repository and a natural-language requirement.

Metrics: functional Pass@k plus Recall@k to measure whether generated code invokes the same dependencies as reference code.

Public release (EvoCodeBench-2403): 275 samples from 25 repositories created 2023-10 to 2024-02 and an automatic pipeline to update the benchmark.
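
The two metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own evaluation code: `pass_at_k` uses the widely used unbiased estimator popularized alongside HumanEval, and `recall_at_k` assumes that Recall@k takes the best dependency recall among the k generated programs for a task. The function names, the dependency-string format, and the handling of functions with no dependencies are all assumptions made for this example.

```python
from math import comb
from typing import Sequence, Set


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one task: n samples drawn, c of them pass the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def recall_at_k(reference: Set[str], generated: Sequence[Set[str]]) -> float:
    """Best dependency recall among the generated programs for one task.

    `reference` holds the dependencies of the ground-truth code; each entry of
    `generated` holds the dependencies extracted from one generated program.
    """
    if not reference:
        # Assumption: a standalone function (no dependencies) counts as fully recalled.
        return 1.0
    return max(
        (len(reference & deps) / len(reference) for deps in generated),
        default=0.0,
    )


# Hypothetical dependency identifiers in a "path::name" style.
ref = {"utils/io.py::load_config", "core/model.py::Model"}
outs = [{"utils/io.py::load_config"}, {"utils/io.py::load_config", "core/model.py::Model"}]
best_recall = recall_at_k(ref, outs)
```

Benchmark-level scores would then be the average of these per-task values over all samples.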

Key Findings

The share of standalone versus non-standalone functions in EvoCodeBench-2403 matches that of recent real-world repositories.

Numbers: 275 samples, 25 repos; standalone 27% / non-standalone 73%; avg. dependencies 3.46

LLMs perform much worse on repository tasks than on classic function benchmarks.

Numbers: gpt-4 Pass@1: 20.73% (local infilling) vs ~80% on HumanEval (cited)

Including local-file context raises correctness and dependency recall substantially.

Numbers: gpt-4 Pass@1: 7.27% (no context) → 17.45% (local completion) → 20.73% (local infill)

Simple retrieval of similar functions helps when full repo context is unavailable.

Numbers: gpt-4 Pass@1: 8.31% (no context) → 12.29% (similar-functions retrieval)

Main failure modes are implementation logic errors and missing contexts.

Numbers: manual analysis of 50 gpt-4 errors: 29 logic errors, 20 missing-context errors, 1 vague requirement

Automatic dependency extraction introduces only a small bias relative to human annotation.

Numbers: parser bias on Recall@1 is 0.16 (compared to 7.77 avg. variation across LLMs)
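
The three prompt settings compared in the findings above (no context, local completion, local infill) can be pictured with a small sketch. It assumes that "local completion" means the code above the target function in the same file and "local infill" means the code both above and below it. The function name `build_prompt`, the `mode` values, and the placeholder comment are hypothetical and are not taken from the benchmark.

```python
from typing import List


def build_prompt(file_lines: List[str], start: int, end: int,
                 signature: str, requirement: str, mode: str) -> str:
    """Assemble a prompt for the function that occupies file_lines[start:end].

    mode is "none" (requirement and signature only), "completion" (adds the
    code above the target), or "infill" (adds the code above and below it).
    """
    task = f'{signature}\n    """{requirement}"""\n'
    above = "".join(file_lines[:start])
    below = "".join(file_lines[end:])
    if mode == "none":
        return task
    if mode == "completion":
        return above + task
    if mode == "infill":
        # Hypothetical marker showing where the model should write the body.
        return above + task + "    # <body to generate>\n" + below
    raise ValueError(f"unknown mode: {mode}")


# Toy file: the target function sits between an import and a constant.
lines = ["import os\n", "def f():\n", "    pass\n", "X = 1\n"]
infill_prompt = build_prompt(lines, 1, 3, "def f():", "Return the answer.", "infill")
```

Running the same model under each mode and comparing Pass@1 is one way to see how much of the gap comes from missing context rather than from the model itself.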

Results

Pass@1 (gpt-4, Local File Infilling)

Value: 20.73%

Baseline: HumanEval gpt-4 ~80% (cited)

Pass@1 (gpt-4, Without Context)

Value: 7.27%

Pass@1 (gpt-4, Local File Completion)

Value: 17.45%

RAG improvement (gpt-4 Pass@1)

Value: 8.31% → 12.29%

Baseline: without context

Dataset composition

Value: standalone 27% / non-standalone 73%

Baseline: 500 real repositories show the same 27% / 73% split

Who Should Care

What To Try In 7 Days

Run your model on EvoCodeBench-2403 to gauge real-repo performance quickly.

Add nearby local-file context to prompts and compare Pass@1 uplift.

Implement simple retrieval of name-similar functions from the repo as extra context (RAG).
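
The retrieval step in the last item can be prototyped with the standard library alone. The sketch below collects every function in a set of Python sources and ranks them by name similarity to the target function. The helper names `collect_functions` and `top_similar` are hypothetical, and using `difflib.SequenceMatcher` on function names is an assumption chosen for simplicity, not necessarily the similarity measure used in the paper.

```python
import ast
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

FunctionInfo = Tuple[str, str, str]  # (file path, function name, source text)


def collect_functions(sources: Dict[str, str]) -> List[FunctionInfo]:
    """Return every function definition found in the given {path: source} map."""
    found: List[FunctionInfo] = []
    for path, text in sources.items():
        for node in ast.walk(ast.parse(text)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                segment = ast.get_source_segment(text, node) or ""
                found.append((path, node.name, segment))
    return found


def top_similar(target_name: str, functions: List[FunctionInfo], n: int = 3) -> List[FunctionInfo]:
    """Rank functions by name similarity to target_name and keep the top n."""
    ranked = sorted(
        functions,
        key=lambda f: SequenceMatcher(None, target_name, f[1]).ratio(),
        reverse=True,
    )
    return ranked[:n]


# Toy repository contents (hypothetical).
repo = {
    "a.py": "def load_config(p):\n    return p\n\ndef save_config(p):\n    return p\n",
    "b.py": "def render(x):\n    return x\n",
}
hits = top_similar("load_configs", collect_functions(repo), n=2)
context = "\n\n".join(src for _, _, src in hits)  # text to prepend to the prompt
```

In a real run, the target function itself should be excluded from the candidates so that the reference implementation does not leak into the prompt.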

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Monolingual: English requirements and Python code only.
  • Auto-generated requirements sometimes miss details like hyperparameters.
  • Dependency Recall (Recall@k) can slightly undercount runtime-determined dependencies due to static parsing (observed bias 0.16).
  • Current context extraction is local-file based; cross-file context and broader retrieval remain unexplored.
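
The undercounting of runtime-determined dependencies noted above is a general property of static parsing. The sketch below is a deliberately simple call extractor built on Python's `ast` module; it is not the benchmark's parser, and `static_call_names` is a hypothetical helper. It shows that a call resolved through `getattr` at runtime leaves no callee name for a static pass to record.

```python
import ast
from typing import Set


def static_call_names(source: str) -> Set[str]:
    """Collect the names of callees that are visible without running the code."""
    names: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                names.add(func.id)        # plain call: helper()
            elif isinstance(func, ast.Attribute):
                names.add(func.attr)      # attribute call: obj.method()
            # Any other callee expression (e.g. the result of getattr) has no static name.
    return names


code = (
    "def run(obj, name):\n"
    "    helper()\n"
    "    return getattr(obj, name)()\n"
)
seen = static_call_names(code)
```

Here `seen` contains `helper` and `getattr`, but not whichever method `name` selects at runtime, so a reference dependency reached this way could not be matched by static extraction.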

When Not To Use

  • If you need multilingual requirements or non-Python languages today.
  • If your use case requires full-repo, cross-file reasoning and you have no retrieval strategy beyond local-file context.
  • When you require perfect dependency recall for dynamic runtime behaviors.

Failure Modes

  • Logic/implementation errors in generated code.
  • Incomplete or missing cross-file contexts leading to wrong dependencies.
  • Parser undercounting of runtime dependencies reduces measured Recall@k.
  • Model sensitivity to prompt templates and sampling hyperparameters.

Core Entities

Models

  • gpt-4
  • gpt-3.5
  • DeepSeek Coder
  • StarCoder 2
  • CodeLLaMa
  • Gemma
  • Qwen 1.5

Metrics

  • Pass@k
  • Recall@k

Datasets

  • EvoCodeBench-2403

Benchmarks

  • HumanEval
  • MBPP
  • APPS
  • CoderEval
  • ClassEval

Context Entities

Datasets

  • 500 real repositories (analysis set)