RepoBench: an evaluation suite for retrieval, next-line completion, and full pipelines on multi-file code

June 5, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The benchmark and baselines are well scoped and supported by many ablations. However, results rely on public datasets and large closed models, and some runs used quantized inference and API-limited queries, which slightly reduces confidence in reproducibility.

Citations: 11

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Tianyang Liu, Canwen Xu, Julian McAuley

Links

Abstract / PDF / Data

Why It Matters For Business

RepoBench measures retrieval plus completion across multiple files, which reflects real engineering workflows and helps teams pick retrievers, prompt formats, and models that actually improve developer productivity.

Who Should Care

Summary (TL;DR)

RepoBench is a new benchmark for repository-level code auto-completion. It measures three linked capabilities: retrieving cross-file code snippets (RepoBench-R), predicting the next line with both in-file and cross-file context (RepoBench-C), and an end-to-end pipeline that retrieves then completes (RepoBench-P). The dataset covers Python and Java, includes long-context splits (2k and 8k tokens), and provides baselines showing that targeted retrievers and prompt designs materially improve performance but models still struggle with very long, multi-file contexts.

Problem Statement

Existing code completion benchmarks use single-file contexts and miss multi-file, repo-level patterns developers use daily. That leaves open questions about retrieval quality, long-context handling, and end-to-end pipeline behavior in realistic projects. RepoBench fills that gap by providing retrieval, completion, and pipeline tasks with realistic candidate pools and long prompts.

Main Contribution

RepoBench dataset and public task split for repository-level code auto-completion. It covers Python and Java and provides retrieval, completion, and pipeline tasks.

Task design that isolates retrieval (RepoBench-R), next-line completion with cross-file context (RepoBench-C) including 2k and 8k token variants, and an end-to-end pipeline that chains retrieval and completion (RepoBench-P).

Key Findings

Semantic retriever UniXcoder substantially improves retrieval accuracy over random and lexical baselines.

Numbers: UniXcoder acc@1 27.02 vs random 15.72 (Easy XF-F, Python)

Practical Use: Use a semantic retriever like UniXcoder for cross-file lookup; expect ~10–12 percentage-point gains over naive methods on easy retrieval samples.

Evidence Ref: Table 9; RepoBench-R results
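To make the acc@1 comparison concrete, here is a minimal sketch of how a lexical baseline could be scored on a retrieval task. The Jaccard token-overlap ranker and the helper names (`tokenize`, `rank_candidates`, `acc_at_k`) are illustrative assumptions, not the benchmark's own code; a semantic retriever such as UniXcoder would replace `rank_candidates` with an embedding-based ranking.

```python
import re
from typing import List, Sequence, Set


def tokenize(code: str) -> Set[str]:
    """Split code into a set of identifier and number tokens (a crude lexical view)."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", code))


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two code strings."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def rank_candidates(query: str, candidates: Sequence[str]) -> List[int]:
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [jaccard(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


def acc_at_k(rankings: Sequence[Sequence[int]], gold: Sequence[int], k: int = 1) -> float:
    """Percentage of samples whose gold candidate index appears in the top-k ranking."""
    hits = sum(1 for ranking, g in zip(rankings, gold) if g in ranking[:k])
    return 100.0 * hits / len(gold)
```

Running the same `acc_at_k` over rankings from each retriever on an identical candidate pool gives directly comparable numbers, which is the kind of comparison the figures above summarize.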

Including cross-file context improves next-line completion even if retrieved snippets are imperfect.

Numbers: Pipeline All EM: UniXcoder-L2H 37.11 vs Baseline 33.15 (Codex, Python)

Practical Use: Add cross-file snippets into the prompt for completion systems to get multi-point gains in Exact Match; retrieval quality still matters.

Evidence Ref: Table 4; RepoBench-P results

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| RepoBench-R retrieval acc@1 (Easy XF-F, Python) | UniXcoder 27.02 vs random 15.72 | random 15.72 | +11.30 | RepoBench-R Easy XF-F (Python) | Table 9 UniXcoder vs Random | Table 9 |
| RepoBench-C All EM (Java) | Codex (175B) All EM 43.14 | CodeGen variants lower (see Table 3) | Codex leads by ~10+ EM vs some open models on Java long prompts | RepoBench-C (Java, mixture weighted) | Table 3 Codex All EM 43.14 | Table 3 |

What To Try In 7 Days

Run UniXcoder (semantic retriever) on your repo retrieval task and compare acc@1 vs lexical baselines.

Adopt prompt layout: cross-file snippets + import statements + ~30 preceding lines and measure EM/ES on your test cases.

Prototype a pipeline that retrieves top-k snippets then completes with a strong code LLM and compare against in-file-only completion.
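The prompt layout in the steps above can be sketched as follows: reserve the token budget for in-file context first (imports plus roughly the last 30 lines), then fill what remains with ranked cross-file snippets. This is a sketch under stated assumptions, not the benchmark's implementation: `build_prompt` and `count_tokens` are hypothetical names, and the whitespace token counter is only a stand-in for your model's real tokenizer.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude whitespace token count (assumption); swap in your model's tokenizer."""
    return len(text.split())


def build_prompt(
    cross_file_snippets: List[str],
    import_lines: List[str],
    preceding_lines: List[str],
    max_tokens: int = 2048,
    n_preceding: int = 30,
    count: Callable[[str], int] = count_tokens,
) -> str:
    """Assemble a completion prompt: cross-file snippets, then imports, then recent lines."""
    # In-file context is reserved first: imports plus the last n_preceding lines.
    recent = preceding_lines[-n_preceding:] if n_preceding > 0 else []
    in_file = "\n".join(import_lines + recent)
    remaining = max_tokens - count(in_file)

    # Fill the leftover budget with cross-file snippets in ranked order,
    # stopping at the first snippet that no longer fits.
    kept: List[str] = []
    for snippet in cross_file_snippets:
        cost = count(snippet)
        if cost > remaining:
            break
        kept.append(snippet)
        remaining -= cost

    # Snippets come first so the prompt ends right before the line to complete.
    # Note: in-file context is not truncated here even if it alone exceeds max_tokens.
    return "\n".join(kept + [in_file])
```

Passing an empty `cross_file_snippets` list yields the in-file-only prompt, so the same function can produce both arms of the comparison suggested in the last step.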

Optimization Features

Token Efficiency
Prompt cropping strategy: reserve tokens for in-file context first, then fill the remaining budget with cross-file snippets.
Infra Optimization
Used DeepSpeed for fine-tuning large CodeGen models.
Inference Optimization
Accuracy

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Training-data overlap risk: github-code is widely used in model pretraining and may leak into model knowledge.

Some experiments used quantized models or CTranslate2 which can change model behavior versus original weights.

When Not To Use

If your codebase language is not Python or Java.

If you need evaluation of multi-line or function-level synthesis rather than single next-line prediction.

Failure Modes

Retriever returns irrelevant snippets and pollutes prompt, leading to worse completions.

Very long prompts cause models to ignore early context or degrade unpredictably.

Core Entities

Models

Codex (code-davinci-002), StarCoder, CodeGen (350M, 2.7B, 6.1B, 16.1B), UniXcoder, CodeBERT

Metrics

Exact Match (EM), Edit Similarity (ES), Accuracy
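For teams measuring EM and ES on their own test cases, the two completion metrics can be approximated as below. This is a minimal sketch that assumes ES is a Levenshtein-based similarity normalized by the longer string and scaled to 0–100; the benchmark's own implementation may normalize differently, so treat absolute ES values from this code as indicative only. All function names are hypothetical.

```python
from typing import Dict, Sequence


def exact_match(prediction: str, reference: str) -> bool:
    """True when the predicted line equals the reference, ignoring surrounding whitespace."""
    return prediction.strip() == reference.strip()


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance computed with a rolling dynamic-programming row."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(prediction: str, reference: str) -> float:
    """Similarity in [0, 100]: 100 means identical after stripping whitespace."""
    p, r = prediction.strip(), reference.strip()
    longest = max(len(p), len(r))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(p, r) / longest)


def corpus_scores(predictions: Sequence[str], references: Sequence[str]) -> Dict[str, float]:
    """Average EM (as a percentage) and ES over paired predictions and references."""
    n = len(references)
    em = 100.0 * sum(exact_match(p, r) for p, r in zip(predictions, references)) / n
    es = sum(edit_similarity(p, r) for p, r in zip(predictions, references)) / n
    return {"EM": em, "ES": es}
```

EM is strict and rewards only fully correct lines, while ES gives partial credit for near misses, so reporting both helps separate "almost right" completions from unrelated ones.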

Datasets

github-code (codeparrot dataset), newly crawled GitHub test set (Python/Java post-Feb 2023)

Benchmarks

RepoBench-R, RepoBench-C (2k/8k), RepoBench-P