Overview
The benchmark and baselines are well scoped and supported by many ablations. Results rely on public datasets and large closed models, and some runs used quantized inference and API-limited queries, which slightly reduces confidence in reproducibility.
Citations: 11
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
RepoBench measures retrieval plus completion across multiple files, which reflects real engineering workflows and helps teams pick retrievers, prompt formats, and models that actually improve developer productivity.
Summary TLDR
RepoBench is a new benchmark for repository-level code auto-completion. It measures three linked capabilities: retrieving cross-file code snippets (RepoBench-R), predicting the next line with both in-file and cross-file context (RepoBench-C), and an end-to-end pipeline that retrieves then completes (RepoBench-P). The dataset covers Python and Java, includes long-context splits (2k and 8k tokens), and provides baselines showing that targeted retrievers and prompt designs materially improve performance but models still struggle with very long, multi-file contexts.
Problem Statement
Existing code completion benchmarks use single-file contexts and miss multi-file, repo-level patterns developers use daily. That leaves open questions about retrieval quality, long-context handling, and end-to-end pipeline behavior in realistic projects. RepoBench fills that gap by providing retrieval, completion, and pipeline tasks with realistic candidate pools and long prompts.
Main Contribution
RepoBench dataset and public task split for repository-level code auto-completion. It covers Python and Java and provides retrieval, completion, and pipeline tasks.
Task design that isolates retrieval (RepoBench-R), next-line completion with cross-file context (RepoBench-C) including 2k and 8k token variants, and an end-to-end pipeline that chains retrieval and completion (RepoBench-P).
Key Findings
Semantic retriever UniXcoder substantially improves retrieval accuracy over random and lexical baselines.
Including cross-file context improves next-line completion even if retrieved snippets are imperfect.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| RepoBench-R retrieval acc@1 (Easy XF-F, Python) | UniXcoder 27.02 vs random 15.72 | random 15.72 | +11.30 | RepoBench-R Easy XF-F (Python) | Table 9 UniXcoder vs Random | Table 9 |
| RepoBench-C All EM (Java) | Codex (175B) All EM 43.14 | CodeGen variants lower (see Table 3) | Codex leads by ~10+ EM vs some open models on Java long prompts | RepoBench-C (Java, mixture weighted) | Table 3 Codex All EM 43.14 | Table 3 |
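The metrics in the table can be sketched in a few lines. The following is a minimal illustration, not the benchmark's scoring script: `acc_at_k`, `exact_match`, and `edit_similarity` are hypothetical helper names, the toy rankings are made up, and `difflib.SequenceMatcher` stands in for whatever string-similarity routine the official evaluation uses, so absolute ES values may differ.

```python
from difflib import SequenceMatcher


def acc_at_k(ranked_candidates, gold_index, k=1):
    """Fraction of examples whose gold snippet index appears in the top-k ranking."""
    hits = sum(
        1 for ranking, gold in zip(ranked_candidates, gold_index) if gold in ranking[:k]
    )
    return hits / len(gold_index)


def exact_match(prediction, reference):
    """1.0 if the predicted line equals the reference after stripping whitespace."""
    return float(prediction.strip() == reference.strip())


def edit_similarity(prediction, reference):
    """Character-level similarity in [0, 1]; difflib is an assumed stand-in."""
    return SequenceMatcher(None, prediction.strip(), reference.strip()).ratio()


# Toy data (hypothetical): each ranking lists candidate indices, best first.
rankings = [[2, 0, 1], [1, 2, 0], [0, 1, 2], [2, 1, 0]]
gold = [2, 0, 0, 1]
acc1 = acc_at_k(rankings, gold, k=1)
acc3 = acc_at_k(rankings, gold, k=3)

# The reported delta is a plain difference of the two acc@1 scores.
delta = round(27.02 - 15.72, 2)
```

On the toy data, two of the four rankings place the gold snippet first, so `acc1` is 0.5, and `delta` reproduces the +11.30 shown in the first table row.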
What To Try In 7 Days
Run UniXcoder (semantic retriever) on your repo retrieval task and compare acc@1 vs lexical baselines.
Adopt the prompt layout of cross-file snippets plus import statements plus roughly 30 preceding lines, and measure exact match (EM) and edit similarity (ES) on your test cases.
Prototype a pipeline that retrieves top-k snippets then completes with a strong code LLM and compare against in-file-only completion.
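The steps above can be prototyped without any model dependency before a semantic retriever is wired in. The sketch below is an assumption-laden starting point: it uses Jaccard token overlap as the lexical baseline, takes the last few in-file lines as the retrieval query, and comments out retrieved snippets at the top of the prompt. All function names (`tokens`, `jaccard`, `retrieve_top_k`, `build_prompt`) and the sample snippets are hypothetical, and the exact layout used in the paper may differ.

```python
import re


def tokens(text):
    """Lowercased identifier-like tokens used for lexical overlap."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between the token sets of two code strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve_top_k(query, candidates, k=2):
    """Rank candidate snippets by overlap with the query, best first."""
    ranked = sorted(candidates, key=lambda c: jaccard(query, c), reverse=True)
    return ranked[:k]


def build_prompt(snippets, imports, in_file_lines, n_preceding=30):
    """Lay out cross-file snippets (as comments), imports, then recent in-file lines."""
    commented = ["# " + line for s in snippets for line in s.splitlines()]
    parts = commented + list(imports) + list(in_file_lines[-n_preceding:])
    return "\n".join(parts)


# Hypothetical cross-file candidates and in-file context.
candidates = [
    "def load_config(path):\n    return json.load(open(path))",
    "class UserRepo:\n    def find_user(self, user_id): ...",
    "def render_template(name, ctx): ...",
]
in_file = ["repo = UserRepo()", "user = repo.find_user(user_id)"]

query = "\n".join(in_file[-3:])  # assumed query: the last few in-file lines
top = retrieve_top_k(query, candidates, k=1)
prompt = build_prompt(top, ["from repos import UserRepo"], in_file)
```

To compare against a semantic retriever such as UniXcoder, one would replace `jaccard` with a cosine similarity over snippet embeddings and keep the rest of the pipeline unchanged. Passing an empty snippet list to `build_prompt` gives the in-file-only baseline.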
Risks & Boundaries
Limitations
Training-data overlap risk: github-code is widely used in model pretraining and may leak into model knowledge.
Some experiments used quantized models or CTranslate2 which can change model behavior versus original weights.
When Not To Use
If your codebase language is not Python or Java.
If you need evaluation of multi-line or function-level synthesis rather than single next-line prediction.
Failure Modes
The retriever returns irrelevant snippets that pollute the prompt, leading to worse completions.
Very long prompts cause models to ignore early context or degrade unpredictably.

