RepoBench: an evaluation suite for retrieval, next-line completion, and full pipelines on multi-file code

June 5, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

11

Authors

Tianyang Liu, Canwen Xu, Julian McAuley

Why It Matters For Business

RepoBench measures retrieval plus completion across multiple files, which reflects real engineering workflows and helps teams pick retrievers, prompt formats, and models that actually improve developer productivity.

Summary TLDR

RepoBench is a new benchmark for repository-level code auto-completion. It measures three linked capabilities: retrieving cross-file code snippets (RepoBench-R), predicting the next line with both in-file and cross-file context (RepoBench-C), and an end-to-end pipeline that retrieves then completes (RepoBench-P). The dataset covers Python and Java, includes long-context splits (2k and 8k tokens), and provides baselines showing that targeted retrievers and prompt designs materially improve performance but models still struggle with very long, multi-file contexts.

Problem Statement

Existing code completion benchmarks use single-file contexts and miss multi-file, repo-level patterns developers use daily. That leaves open questions about retrieval quality, long-context handling, and end-to-end pipeline behavior in realistic projects. RepoBench fills that gap by providing retrieval, completion, and pipeline tasks with realistic candidate pools and long prompts.

Main Contribution

RepoBench dataset and public task split for repository-level code auto-completion. It covers Python and Java and provides retrieval, completion, and pipeline tasks.

Task design that isolates retrieval (RepoBench-R), next-line completion with cross-file context (RepoBench-C) including 2k and 8k token variants, and an end-to-end pipeline that chains retrieval and completion (RepoBench-P).

A large baseline suite and ablations showing: (1) semantic retrievers (UniXcoder) beat lexical and random methods for retrieval, (2) prompt construction matters (include import statements + cross-file snippets), and (3) retrieval improves pipeline completion versus no cross-file context.

Key Findings

Semantic retriever UniXcoder substantially improves retrieval accuracy over random and lexical baselines.

Numbers: UniXcoder acc@1 27.02 vs random 15.72 (Easy XF-F, Python)

Including cross-file context improves next-line completion even if retrieved snippets are imperfect.

Numbers: Pipeline All EM: UniXcoder-L2H 37.11 vs Baseline 33.15 (Codex, Python)

Prompt construction that combines cross-file snippets, import statements, and a short in-file window gives the best completion scores.

Numbers: XFC+IS+IFC-Short All EM 37.64 vs IFC-Short 26.64 (Python ablation)
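The prompt layout in this finding (cross-file snippets, then import statements, then a short in-file window) can be sketched as follows. This is a minimal illustration, not the benchmark's own code: the function name `build_prompt`, the `# Path:` header, the commenting-out of snippets, and the 30-line default window are all assumptions made for the example.

```python
def build_prompt(cross_file_snippets, import_statements, in_file_lines, window=30):
    """Assemble a prompt: cross-file snippets, imports, then a short in-file window.

    cross_file_snippets: list of (path, source_text) pairs (hypothetical shape).
    import_statements:   list of import lines from the current file.
    in_file_lines:       all lines preceding the line to predict.
    """
    parts = []
    for path, snippet in cross_file_snippets:
        # Assumption: snippets are commented out so the model treats them
        # as reference material rather than code to continue.
        parts.append(f"# Path: {path}")
        parts.extend(f"# {line}" for line in snippet.splitlines())
    parts.extend(import_statements)
    # Keep only the most recent `window` lines of in-file context.
    recent = in_file_lines[-window:] if window > 0 else []
    parts.extend(recent)
    return "\n".join(parts) + "\n"


prompt = build_prompt(
    cross_file_snippets=[("utils/math.py", "def add(a, b):\n    return a + b")],
    import_statements=["from utils.math import add"],
    in_file_lines=[f"x{i} = {i}" for i in range(50)],
    window=30,
)
```

Dropping `cross_file_snippets` and `import_statements` from the call gives an in-file-only prompt, which is a convenient way to reproduce the comparison on your own test cases.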

Retrieval accuracy degrades clearly when the retrieval query includes too much unrelated context; short query windows work best.

Numbers: UniXcoder acc@1 drops 27.02 (3 lines) → 16.09 (120 lines) in Easy XF-F
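To make the window effect concrete, here is a small lexical retriever that builds its query from only the last few in-file lines. It is a sketch under stated assumptions: Jaccard similarity over identifier tokens is chosen here as a simple lexical scorer, and the names `tokens`, `jaccard`, and `rank_candidates` are illustrative rather than taken from the benchmark.

```python
import re


def tokens(text):
    """Set of identifier-like tokens in a piece of code."""
    return set(re.findall(r"[A-Za-z_]\w*", text))


def jaccard(a, b):
    """Jaccard similarity between the token sets of two code strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def rank_candidates(in_file_lines, candidates, window=3):
    """Rank candidate snippets against the last `window` in-file lines.

    Returns candidate indices, best match first. `window` must be >= 1.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    query = "\n".join(in_file_lines[-window:])
    scored = [(jaccard(query, c), i) for i, c in enumerate(candidates)]
    # Sort by descending score; ties keep original candidate order.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored]


in_file = [
    "import os",
    "from shapes import Circle",
    "c = Circle(2.0)",
    "area = c.area()",
]
candidates = [
    "def read_config(path):\n    return open(path).read()",
    "class Circle:\n    def area(self):\n        return 3.14 * self.r ** 2",
    "def log(msg):\n    print(msg)",
]
ranking = rank_candidates(in_file, candidates, window=3)
```

Raising `window` pulls older, less relevant lines into the query, which is one plausible reason accuracy falls as the window grows.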

Model rankings vary by language and context length; Codex tends to lead on long-context Java tasks.

Numbers: Codex All EM (Java) 43.14 vs StarCoder All EM 31.67 in comparable rows

Results

RepoBench-R retrieval acc@1 (Easy XF-F, Python)

Value: UniXcoder 27.02 vs random 15.72

Baseline: random 15.72

RepoBench-C All EM (Java)

Value: Codex (175B) All EM 43.14

Baseline: CodeGen variants lower (see Table 3)

RepoBench-P pipeline All EM (Python)

Value: UniXcoder-L2H All EM 37.11 vs Baseline 33.15

Baseline: Baseline (no cross-file context) 33.15

Prompt ablation All EM (Python)

Value: XFC+IS+IFC-Short All EM 37.64 vs IFC-Short 26.64

Baseline: IFC-Short 26.64

Who Should Care

What To Try In 7 Days

Run UniXcoder (semantic retriever) on your repo retrieval task and compare acc@1 vs lexical baselines.

Adopt prompt layout: cross-file snippets + import statements + ~30 preceding lines and measure EM/ES on your test cases.

Prototype a pipeline that retrieves top-k snippets then completes with a strong code LLM and compare against in-file-only completion.
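The pipeline comparison in the last item can be prototyped with a small harness like the one below. It is a sketch only: `exact_match_rate`, the example dictionary keys, and both stub functions are hypothetical stand-ins. In practice `retrieve` would wrap your retriever and `complete` would call your code LLM.

```python
def exact_match_rate(examples, complete, retrieve=None, top_k=2):
    """Fraction of examples whose predicted next line exactly matches the reference.

    examples: dicts with "in_file", "candidates", "next_line" (hypothetical shape).
    complete: callable mapping a prompt string to a predicted line.
    retrieve: optional callable (in_file, candidates) -> ranked snippets;
              None means in-file-only completion.
    """
    if not examples:
        raise ValueError("examples must be non-empty")
    hits = 0
    for ex in examples:
        prompt = ex["in_file"]
        if retrieve is not None:
            snippets = retrieve(ex["in_file"], ex["candidates"])[:top_k]
            # Prepend retrieved snippets as comments ahead of the in-file code.
            commented = ["# " + line for s in snippets for line in s.splitlines()]
            prompt = "\n".join(commented) + "\n" + prompt
        prediction = complete(prompt)
        hits += prediction.strip() == ex["next_line"].strip()
    return hits / len(examples)


# Stand-ins for illustration only; replace with a real retriever and model call.
def stub_retrieve(in_file, candidates):
    return list(candidates)


def stub_complete(prompt):
    return "    return add(a, b)" if "def add" in prompt else "    pass"


examples = [
    {
        "in_file": "from m import add\ndef total(a, b):",
        "candidates": ["def add(x, y):\n    return x + y"],
        "next_line": "    return add(a, b)",
    }
]
with_retrieval = exact_match_rate(examples, stub_complete, retrieve=stub_retrieve)
in_file_only = exact_match_rate(examples, stub_complete)
```

Running both settings over the same examples keeps the comparison fair: only the presence of retrieved context changes between the two scores.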

Optimization Features

Token Efficiency

  • prompt cropping strategy: reserve tokens for in-file context then fill with cross-file snippets
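A minimal sketch of that cropping strategy, assuming a fixed total budget and a reserved in-file share. The whitespace-based `count_tokens` is a stand-in for the model's real tokenizer, and the names `crop_prompt`, `max_tokens`, and `in_file_budget` are illustrative.

```python
def count_tokens(text):
    # Assumption: whitespace splitting approximates token count.
    # Swap in the target model's tokenizer for real use.
    return len(text.split())


def crop_prompt(in_file_lines, cross_file_snippets, max_tokens, in_file_budget):
    """Reserve budget for in-file context, then fill the rest with snippets.

    cross_file_snippets is assumed to be ordered best-first.
    Returns (kept_snippets, kept_in_file_lines).
    """
    # Step 1: keep the most recent in-file lines that fit the reserved budget.
    kept_in_file = []
    used = 0
    for line in reversed(in_file_lines):
        cost = count_tokens(line)
        if used + cost > in_file_budget:
            break
        kept_in_file.append(line)
        used += cost
    kept_in_file.reverse()

    # Step 2: spend whatever remains on cross-file snippets, in ranked order.
    remaining = max_tokens - used
    kept_snippets = []
    for snippet in cross_file_snippets:
        cost = count_tokens(snippet)
        if cost > remaining:
            break
        kept_snippets.append(snippet)
        remaining -= cost
    return kept_snippets, kept_in_file


snippets, in_file = crop_prompt(
    in_file_lines=["a = 1", "b = 2", "c = a + b"],
    cross_file_snippets=["def f(): pass", "def g(x): return x"],
    max_tokens=12,
    in_file_budget=8,
)
```

Filling in-file context first means the code nearest the cursor is never displaced by retrieved snippets; unused in-file budget rolls over to the cross-file portion.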

Infra Optimization

  • used DeepSpeed for fine-tuning large CodeGen models

Inference Optimization

  • Accuracy

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Training-data overlap risk: github-code is widely used in model pretraining and may leak into model knowledge.
  • Some experiments used quantized models or CTranslate2, which can change model behavior versus the original weights.
  • Codex API rate limits prevented repeated randomness checks for some retrieval strategies, so random-retrieval results are less stable.

When Not To Use

  • If your codebase language is not Python or Java.
  • If you need evaluation of multi-line or function-level synthesis rather than single next-line prediction.
  • If you cannot tolerate potential training-data leakage from common public corpora.

Failure Modes

  • Retriever returns irrelevant snippets and pollutes prompt, leading to worse completions.
  • Very long prompts cause models to ignore early context or degrade unpredictably.
  • Model rankings change by language and prompt length, so a chosen model may underperform outside tested settings.

Core Entities

Models

  • Codex (code-davinci-002)
  • StarCoder
  • CodeGen (350M, 2.7B, 6.1B, 16.1B)
  • UniXcoder
  • CodeBERT

Metrics

  • Exact Match (EM)
  • Edit Similarity (ES)
  • Accuracy
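For readers who want to compute these metrics locally, here is a hedged sketch. Exact Match is shown as string equality after whitespace stripping, Edit Similarity as one common definition (1 minus Levenshtein distance over the longer string's length), and accuracy as acc@k over a ranked candidate list. The stripping and the exact normalization are assumptions; the benchmark's own scoring code may differ.

```python
def exact_match(prediction, reference):
    """True when the predicted line equals the reference, ignoring outer whitespace."""
    return prediction.strip() == reference.strip()


def levenshtein(a, b):
    """Character-level edit distance (insert, delete, substitute), row-by-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def edit_similarity(prediction, reference):
    """Similarity in [0, 1]: 1 - distance / length of the longer string."""
    p, r = prediction.strip(), reference.strip()
    longest = max(len(p), len(r))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(p, r) / longest


def accuracy_at_k(ranked_ids, gold_id, k):
    """True when the gold candidate appears in the retriever's top k."""
    return gold_id in ranked_ids[:k]
```

Averaging each function over a test set gives the aggregate EM, ES, and acc@k figures for that set.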

Datasets

  • github-code (codeparrot dataset)
  • newly crawled GitHub test set (Python/Java post-Feb 2023)

Benchmarks

  • RepoBench-R
  • RepoBench-C (2k/8k)
  • RepoBench-P