Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
11
Why It Matters For Business
RepoBench measures retrieval plus completion across multiple files, which reflects real engineering workflows and helps teams pick retrievers, prompt formats, and models that actually improve developer productivity.
Summary TLDR
RepoBench is a new benchmark for repository-level code auto-completion. It measures three linked capabilities: retrieving cross-file code snippets (RepoBench-R), predicting the next line with both in-file and cross-file context (RepoBench-C), and an end-to-end pipeline that retrieves then completes (RepoBench-P). The dataset covers Python and Java, includes long-context splits (2k and 8k tokens), and provides baselines showing that targeted retrievers and prompt designs materially improve performance but models still struggle with very long, multi-file contexts.
Problem Statement
Existing code completion benchmarks use single-file contexts and miss multi-file, repo-level patterns developers use daily. That leaves open questions about retrieval quality, long-context handling, and end-to-end pipeline behavior in realistic projects. RepoBench fills that gap by providing retrieval, completion, and pipeline tasks with realistic candidate pools and long prompts.
Main Contribution
RepoBench dataset and public task split for repository-level code auto-completion. It covers Python and Java and provides retrieval, completion, and pipeline tasks.
Task design that isolates retrieval (RepoBench-R), next-line completion with cross-file context (RepoBench-C) including 2k and 8k token variants, and an end-to-end pipeline that chains retrieval and completion (RepoBench-P).
A large baseline suite and ablations showing: (1) semantic retrievers (UniXcoder) beat lexical and random methods for retrieval, (2) prompt construction matters (include import statements + cross-file snippets), and (3) retrieval improves pipeline completion versus no cross-file context.
Key Findings
Semantic retriever UniXcoder substantially improves retrieval accuracy over random and lexical baselines.
Including cross-file context improves next-line completion even if retrieved snippets are imperfect.
Prompt construction that combines cross-file snippets, import statements, and a short in-file window gives the best completion scores.
Models show clear degradation when prompts include too much unrelated context; small retrieval windows work best.
Model rankings vary by language and context length; Codex tends to lead on long-context Java tasks.
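The retrieval findings above are stated in terms of acc@k against lexical and random baselines. As a minimal sketch of what a lexical baseline and the acc@k computation could look like, the snippet below ranks candidates by Jaccard overlap of identifier tokens. The choice of Jaccard, the regex tokenizer, and all function names are assumptions made for illustration; this summary does not specify the benchmark's exact lexical scoring.

```python
import re


def tokenize(code: str) -> set[str]:
    # Assumption: identifiers and keywords are a good enough lexical unit.
    return set(re.findall(r"[A-Za-z_]\w*", code))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two code strings, in [0, 1]."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def rank_candidates(query: str, candidates: list[str]) -> list[int]:
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [jaccard(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


def acc_at_k(rankings: list[list[int]], gold: list[int], k: int) -> float:
    """Fraction of queries whose gold candidate index appears in the top k."""
    hits = sum(1 for ranking, g in zip(rankings, gold) if g in ranking[:k])
    return hits / len(gold)
```

Swapping `jaccard` for cosine similarity over embeddings from a semantic encoder would give the semantic counterpart while leaving `acc_at_k` unchanged, which keeps the comparison like for like.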
Results
RepoBench-R retrieval acc@1 (Easy XF-F, Python)
RepoBench-C All EM (Java)
RepoBench-P pipeline All EM (Python)
Prompt ablation All EM (Python)
Who Should Care
What To Try In 7 Days
Run UniXcoder (semantic retriever) on your repo retrieval task and compare acc@1 vs lexical baselines.
Adopt this prompt layout: cross-file snippets + import statements + ~30 preceding in-file lines, then measure EM/ES on your own test cases.
Prototype a pipeline that retrieves top-k snippets then completes with a strong code LLM and compare against in-file-only completion.
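The pipeline prototype in the last step can be sketched as a small harness that takes the retriever and the completion model as plain callables, so any retriever or LLM client can be plugged in. Everything here is hypothetical scaffolding: `run_pipeline`, `compare`, the callable signatures, and the simple "snippets first, then in-file code" prompt are assumptions for illustration, not the benchmark's own implementation.

```python
from typing import Callable, Sequence

# Hypothetical interfaces: a retriever returns the top-k snippets for a query,
# and a completer maps a prompt string to a predicted next line.
Retriever = Callable[[str, Sequence[str], int], list[str]]
Completer = Callable[[str], str]
Example = tuple[str, Sequence[str], str]  # (in-file context, candidates, gold line)


def run_pipeline(
    in_file: str,
    candidates: Sequence[str],
    retrieve: Retriever,
    complete: Completer,
    k: int = 3,
) -> str:
    """Retrieve top-k cross-file snippets, prepend them, and complete."""
    snippets = retrieve(in_file, candidates, k)
    prompt = "\n".join([*snippets, in_file])
    return complete(prompt).strip()


def compare(
    examples: Sequence[Example],
    retrieve: Retriever,
    complete: Completer,
    k: int = 3,
) -> dict[str, float]:
    """Exact-match rate with retrieval versus in-file-only completion."""
    with_retrieval = 0
    in_file_only = 0
    for in_file, candidates, gold in examples:
        target = gold.strip()
        with_retrieval += run_pipeline(in_file, candidates, retrieve, complete, k) == target
        in_file_only += complete(in_file).strip() == target
    n = len(examples)
    return {"pipeline_em": with_retrieval / n, "in_file_only_em": in_file_only / n}
```

Running `compare` on the same examples with different `retrieve` callables gives a direct read on how much each retriever contributes beyond the in-file-only baseline.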
Optimization Features
Token Efficiency
- prompt cropping strategy: reserve tokens for in-file context then fill with cross-file snippets
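A minimal sketch of such a cropping strategy follows, assuming a whitespace token count as a stand-in for the model's tokenizer and assuming snippets arrive sorted by relevance. The function names, the `reserved_in_file` parameter, and the decision to hand unused in-file budget to cross-file snippets are illustrative choices, not details taken from the paper.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting approximates the real tokenizer.
    # Replace with the target model's tokenizer for actual budgets.
    return len(text.split())


def build_prompt(
    in_file_lines: list[str],
    snippets: list[str],
    max_tokens: int,
    reserved_in_file: int,
) -> str:
    """Keep the most recent in-file lines first, then fill with snippets."""
    kept: list[str] = []
    used = 0
    # Walk backwards so the lines closest to the completion point survive.
    for line in reversed(in_file_lines):
        cost = count_tokens(line)
        if used + cost > reserved_in_file:
            break
        kept.append(line)
        used += cost
    kept.reverse()

    # Whatever the in-file window did not use goes to cross-file snippets.
    remaining = max_tokens - used
    chosen: list[str] = []
    for snippet in snippets:  # assumed ordered from most to least relevant
        cost = count_tokens(snippet)
        if cost > remaining:
            continue  # skip oversized snippets; a smaller one may still fit
        chosen.append(snippet)
        remaining -= cost

    # Cross-file snippets come first, in-file context last.
    return "\n".join(chosen + kept)
```

Placing the in-file window last keeps the text immediately before the completion point adjacent to the position where the model starts generating.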
Infra Optimization
- used DeepSpeed for fine-tuning large CodeGen models
Inference Optimization
- Accuracy
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Training-data overlap risk: github-code is widely used in model pretraining and may leak into model knowledge.
- Some experiments used quantized models or CTranslate2, which can change model behavior versus the original weights.
- Codex API rate limits prevented repeating runs with different random seeds for some retrieval strategies, so random-retrieval results are less stable.
When Not To Use
- If your codebase language is not Python or Java.
- If you need evaluation of multi-line or function-level synthesis rather than single next-line prediction.
- If you cannot tolerate potential training-data leakage from common public corpora.
Failure Modes
- Retriever returns irrelevant snippets that pollute the prompt, leading to worse completions.
- Very long prompts cause models to ignore early context or degrade unpredictably.
- Model rankings change by language and prompt length, so a chosen model may underperform outside tested settings.
Core Entities
Models
- Codex (code-davinci-002)
- StarCoder
- CodeGen (350M, 2.7B, 6.1B, 16.1B)
- UniXcoder
- CodeBERT
Metrics
- Exact Match (EM)
- Edit Similarity (ES)
- Accuracy
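For next-line completion, Exact Match and Edit Similarity can be computed per prediction and then averaged. The sketch below uses one common definition of edit similarity, 1 minus the Levenshtein distance divided by the longer string's length. This summary does not give the benchmark's exact normalization, so treat the formula, the whitespace stripping, and the function names as assumptions.

```python
def exact_match(pred: str, gold: str) -> bool:
    """True when the predicted line equals the gold line, ignoring outer whitespace."""
    return pred.strip() == gold.strip()


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # delete ca
                    curr[j - 1] + 1,           # insert cb
                    prev[j - 1] + (ca != cb),  # substitute, free if equal
                )
            )
        prev = curr
    return prev[-1]


def edit_similarity(pred: str, gold: str) -> float:
    """Assumed definition: 1 - distance / max length, in [0, 1]."""
    pred, gold = pred.strip(), gold.strip()
    longest = max(len(pred), len(gold))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, gold) / longest
```

Edit similarity gives partial credit to near misses, such as a wrong argument name, that Exact Match scores as complete failures, so reporting both helps separate "almost right" from "unrelated".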
Datasets
- github-code (codeparrot dataset)
- newly crawled GitHub test set (Python/Java post-Feb 2023)
Benchmarks
- RepoBench-R
- RepoBench-C (2k/8k)
- RepoBench-P

