RepoBench: an evaluation suite for retrieval, next-line completion, and full pipelines on multi-file code

Overview

Decision Snapshot: Ready For Pilot

The benchmark and baselines are well scoped and supported by many ablations. Results rely on public datasets and large closed models, and some runs used quantized inference and API-limited queries, which slightly reduces confidence in reproducibility.

Citations: 11

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Tianyang Liu, Canwen Xu, Julian McAuley

Links

Abstract / PDF / Data

Why It Matters For Business

RepoBench measures retrieval plus completion across multiple files, which reflects real engineering workflows and helps teams pick retrievers, prompt formats, and models that actually improve developer productivity.

Who Should Care

ML Engineer, Product Manager, Engineering Lead, Data Scientist

Summary TLDR

RepoBench is a new benchmark for repository-level code auto-completion. It measures three linked capabilities: retrieving cross-file code snippets (RepoBench-R), predicting the next line with both in-file and cross-file context (RepoBench-C), and an end-to-end pipeline that retrieves then completes (RepoBench-P). The dataset covers Python and Java, includes long-context splits (2k and 8k tokens), and provides baselines showing that targeted retrievers and prompt designs materially improve performance but models still struggle with very long, multi-file contexts.

Problem Statement

Existing code completion benchmarks use single-file contexts and miss multi-file, repo-level patterns developers use daily. That leaves open questions about retrieval quality, long-context handling, and end-to-end pipeline behavior in realistic projects. RepoBench fills that gap by providing retrieval, completion, and pipeline tasks with realistic candidate pools and long prompts.

Main Contribution

RepoBench dataset and public task split for repository-level code auto-completion. It covers Python and Java and provides retrieval, completion, and pipeline tasks.

Task design that isolates retrieval (RepoBench-R), next-line completion with cross-file context (RepoBench-C) including 2k and 8k token variants, and an end-to-end pipeline that chains retrieval and completion (RepoBench-P).

Key Findings

Semantic retriever UniXcoder substantially improves retrieval accuracy over random and lexical baselines.

Numbers: UniXcoder acc@1 27.02 vs random 15.72 (Easy XF-F, Python)

Practical Use: Use a semantic retriever like UniXcoder for cross-file lookup; expect ~10–12 percentage-point gains over naive methods on easy retrieval samples.

Evidence Ref: Table 9; RepoBench-R results

Including cross-file context improves next-line completion even if retrieved snippets are imperfect.

Numbers: Pipeline All EM: UniXcoder-L2H 37.11 vs Baseline 33.15 (Codex, Python)

Practical Use: Add cross-file snippets into the prompt for completion systems to get multi-point gains in Exact Match; retrieval quality still matters.

Evidence Ref: Table 4; RepoBench-P results
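The acc@1 figures in the first finding count how often the correct cross-file snippet is ranked first among the candidates. A minimal sketch of that metric follows; the function name `accuracy_at_k` and the toy rankings are hypothetical and are not taken from the RepoBench code.

```python
from typing import Sequence


def accuracy_at_k(ranked: Sequence[Sequence[int]], gold: Sequence[int], k: int = 1) -> float:
    """Percentage of queries whose gold candidate index is in the top-k ranking."""
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for ranking, g in zip(ranked, gold) if g in ranking[:k])
    return 100.0 * hits / len(gold)


# Toy data, made up for illustration: each inner list holds candidate
# indices ordered from most to least similar for one query.
rankings = [[2, 0, 1], [1, 2, 0], [0, 1, 2]]
gold_indices = [2, 0, 0]

print(round(accuracy_at_k(rankings, gold_indices, k=1), 2))
print(round(accuracy_at_k(rankings, gold_indices, k=3), 2))
```

Running the same function over the rankings produced by two retrievers gives directly comparable acc@k values.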

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
RepoBench-R retrieval acc@1 (Easy XF-F, Python)	UniXcoder 27.02 vs random 15.72	random 15.72	+11.30	RepoBench-R Easy XF-F (Python)	Table 9 UniXcoder vs Random	Table 9
RepoBench-C All EM (Java)	Codex (175B) All EM 43.14	CodeGen variants lower (see Table 3)	Codex leads by ~10+ EM vs some open models on Java long prompts	RepoBench-C (Java, mixture weighted)	Table 3 Codex All EM 43.14	Table 3

What To Try In 7 Days

Run UniXcoder (semantic retriever) on your repo retrieval task and compare acc@1 vs lexical baselines.

Adopt prompt layout: cross-file snippets + import statements + ~30 preceding lines and measure EM/ES on your test cases.

Prototype a pipeline that retrieves top-k snippets then completes with a strong code LLM and compare against in-file-only completion.
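The three steps above can be prototyped in a few lines. The sketch below is illustrative only: it swaps in a simple lexical retriever (Jaccard similarity over identifier tokens) rather than UniXcoder, it assumes the last three in-file lines serve as the retrieval query, and `fake_model` is a hypothetical stand-in for a real code LLM call. All function names are made up for this example.

```python
import re
from typing import Callable, List


def tokens(code: str) -> set:
    """Return the set of identifier-like tokens in a piece of code."""
    return set(re.findall(r"[A-Za-z_]\w*", code))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two code strings."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve_top_k(query: str, candidates: List[str], k: int = 2) -> List[str]:
    """Return the k candidates most similar to the query, best first."""
    return sorted(candidates, key=lambda c: jaccard(query, c), reverse=True)[:k]


def build_prompt(snippets: List[str], imports: List[str],
                 preceding: List[str], max_lines: int = 30) -> str:
    """Lay out the prompt: commented snippets, then imports, then recent lines."""
    commented = ["# " + line for s in snippets for line in s.splitlines()]
    return "\n".join(commented + imports + preceding[-max_lines:]) + "\n"


def complete_next_line(in_file: List[str], candidates: List[str],
                       complete: Callable[[str], str], k: int = 2) -> str:
    """Retrieve top-k snippets, build the prompt, and ask the model for a line."""
    imports = [l for l in in_file if l.startswith(("import ", "from "))]
    body = [l for l in in_file if l not in imports]
    query = "\n".join(body[-3:])  # assumption: last 3 lines act as the query
    snippets = retrieve_top_k(query, candidates, k)
    return complete(build_prompt(snippets, imports, body))


# Toy inputs, made up for illustration.
candidates = [
    "def load_config(path):\n    return parse(path)",
    "class Cache:\n    def get(self, key): ...",
    "def save_report(report, path): ...",
]
in_file = [
    "from app.config import load_config",
    "",
    "def main(path):",
    "    config = load_config(path)",
]


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a code LLM; always returns the same line."""
    return "    return config"


print(complete_next_line(in_file, candidates, fake_model))
```

To compare against in-file-only completion, call the same model with `build_prompt([], imports, body)` and score both outputs on the same test cases.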

Optimization Features

Token Efficiency

prompt cropping strategy: reserve tokens for in-file context then fill with cross-file snippets
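A minimal sketch of such a cropping strategy, under stated assumptions: tokens are counted by whitespace splitting as a stand-in for a real model tokenizer, the most recent in-file lines are kept first, and snippet filling stops at the first snippet that does not fit. The names `crop_prompt`, `in_file_budget`, and `max_tokens` are hypothetical.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Whitespace token count; a stand-in for a real tokenizer (assumption)."""
    return len(text.split())


def crop_prompt(in_file_lines: List[str], snippets: List[str],
                max_tokens: int, in_file_budget: int) -> str:
    """Reserve budget for in-file context, then fill the rest with snippets."""
    # Step 1: keep the most recent in-file lines that fit the reserved budget.
    kept_in_file: List[str] = []
    used = 0
    for line in reversed(in_file_lines):
        cost = count_tokens(line)
        if used + cost > in_file_budget:
            break
        kept_in_file.append(line)
        used += cost
    kept_in_file.reverse()

    # Step 2: spend the remaining tokens on snippets, in ranked order.
    remaining = max_tokens - used
    kept_snippets: List[str] = []
    for snippet in snippets:
        cost = count_tokens(snippet)
        if cost > remaining:
            break  # assumption: stop rather than skip, to respect ranking order
        kept_snippets.append(snippet)
        remaining -= cost

    # Cross-file snippets go first, followed by the in-file context.
    return "\n".join(kept_snippets + kept_in_file)


# Toy inputs, made up for illustration.
in_file_lines = ["a = 1", "b = 2", "c = a + b"]
snippets = ["# helper one", "# helper number two here"]
print(crop_prompt(in_file_lines, snippets, max_tokens=12, in_file_budget=8))
```

With these toy inputs the oldest in-file line and the second snippet are dropped, because neither fits its budget.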

Infra Optimization

used DeepSpeed for fine-tuning large CodeGen models

Inference Optimization

Accuracy

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://huggingface.co/datasets/codeparrot/github-code

Risks & Boundaries

Limitations

Training-data overlap risk: github-code is widely used in model pretraining and may leak into model knowledge.

Some experiments used quantized models or CTranslate2, which can change model behavior relative to the original weights.

When Not To Use

If your codebase language is not Python or Java.

If you need evaluation of multi-line or function-level synthesis rather than single next-line prediction.

Failure Modes

Retriever returns irrelevant snippets and pollutes prompt, leading to worse completions.

Very long prompts cause models to ignore early context or degrade unpredictably.

Core Entities

Models

Codex (code-davinci-002), StarCoder, CodeGen (350M, 2.7B, 6.1B, 16.1B), UniXcoder, CodeBERT

Metrics

Exact Match (EM), Edit Similarity (ES), Accuracy
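For next-line completion, EM and ES can be computed per prediction roughly as sketched below. This uses one common definition of edit similarity (one minus the Levenshtein distance normalized by the longer string, scaled to 0–100); the benchmark's own implementation may differ in normalization or whitespace handling, so treat the function names and details as assumptions.

```python
def exact_match(pred: str, target: str) -> bool:
    """True when prediction and target are identical after trimming whitespace."""
    return pred.strip() == target.strip()


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(pred: str, target: str) -> float:
    """Similarity in [0, 100]; 100 means the trimmed strings are identical."""
    p, t = pred.strip(), target.strip()
    longest = max(len(p), len(t))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(p, t) / longest)


# Toy example, made up for illustration: a one-character miss.
pred, target = "return x + 1", "return x + 2"
print(exact_match(pred, target))
print(round(edit_similarity(pred, target), 2))
```

A near miss like this one scores zero on EM but high on ES, which is why the two metrics are usually reported together.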

Datasets

github-code (codeparrot dataset), newly crawled GitHub test set (Python/Java, post-Feb 2023)

Benchmarks

RepoBench-R, RepoBench-C (2k/8k), RepoBench-P

