Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
10
Why It Matters For Business
Real-world code completions often need other files; adding a retrieval step can roughly double correct completions and should be part of any practical code-assist product pipeline.
Summary TLDR
CrossCodeEval is a new benchmark of ~10k code-completion examples from ~1k permissively licensed GitHub repos in Python, Java, TypeScript, and C#. Each example is chosen so the correct completion cannot be inferred from the current file alone. The authors use a static-analysis trick (replace imports with empty classes, detect resulting undefined-name errors) to find cross-file-required completions. State-of-the-art models (CodeGen, StarCoder, GPT-3.5) perform poorly with only in-file context (single-digit exact-match rates) and roughly double performance when helpful cross-file snippets are retrieved and prepended, but peak exact-match remains low, showing room for better retrieval and model
Problem Statement
Most code-completion benchmarks give only a single-file context. Real projects span many files and require cross-file knowledge (APIs, classes, helpers). Current benchmarks understate models' need to fetch and use other files, so we need a benchmark that forces cross-file reasoning and measures retrieval+generation performance.
Main Contribution
CrossCodeEval: a multilingual benchmark with ~10k cross-file completion examples drawn from ~1k permissively licensed GitHub repos in Python, Java, TypeScript, and C#.
A simple, automated static-analysis method to find completions that strictly require cross-file context: replace intra-project imports with empty classes and detect undefined-name errors.
A careful quality pipeline: rule-based filters, model-based filtering (StarCoderBase-1B), and human annotation showing nearly all samples require cross-file lookup.
Comprehensive evaluation: zero-shot results for multiple code LMs and a comparison of sparse and neural retrieval methods (BM25, UniXCoder, OpenAI ada embeddings).
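The import-stubbing idea in the second contribution can be approximated for Python with the standard `ast` module. The sketch below is a simplification, not the authors' tooling: rather than rewriting imports into empty classes and running a static analyzer, it directly flags attribute accesses on names imported from the project, which are the accesses an empty stub class would leave undefined. The function name `find_cross_file_uses` and the `project_packages` argument are hypothetical.

```python
from __future__ import annotations

import ast


def find_cross_file_uses(source: str, project_packages: set[str]) -> list[tuple[int, str]]:
    """Return (line, expression) pairs that touch members of intra-project imports.

    Assumption: an import is "intra-project" if it is relative or if its
    top-level package name appears in project_packages.
    """
    tree = ast.parse(source)

    # Step 1: collect names that would be replaced by empty stub classes.
    stubbed: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom):
            root = (node.module or "").split(".")[0]
            if node.level > 0 or root in project_packages:
                for alias in node.names:
                    stubbed.add(alias.asname or alias.name)

    # Step 2: any attribute access on a stubbed name would be an
    # undefined-member error, so the surrounding code needs cross-file context.
    hits: list[tuple[int, str]] = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id in stubbed
        ):
            hits.append((node.lineno, f"{node.value.id}.{node.attr}"))
    return sorted(hits)


example = """\
import os
from myproj.models import User
from myproj.utils import slugify

def handle(name):
    path = os.path.join("a", name)
    user = User.from_name(name)
    return slugify(user.name), path
"""

cross_file_lines = find_cross_file_uses(example, {"myproj"})
```

In this example only the `User.from_name` call is flagged; `os.path.join` comes from the standard library, and `user.name` is an access on a local variable, which this simplified check does not trace back to the imported class.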
Key Findings
Off-the-shelf models fail on cross-file examples when only given the current file.
Prepending retrieved cross-file context roughly doubles exact-match rates for many models.
Even with retrieval, including an oracle-style setting in which the retriever is allowed to use the reference completion, peak exact-match remains far from perfect.
Dataset quality checks show examples truly need cross-file info.
Results
Code Match EM (Python, StarCoder-15.5B)
Code Match EM (Python, CodeGen25-7B)
Retrieval comparison (CodeGen25-7B, Python EM)
Human annotation: need for cross-file
Who Should Care
What To Try In 7 Days
Add a repository retrieval step (BM25 or embeddings) that returns top-5 10-line chunks and prepend them to the model prompt.
Measure both exact-match and identifier-match for your codebase; identifier overlap strongly correlates with success.
Run a small human check: replace an import with a dummy class to find cross-file-required completions in your repo.
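A minimal sketch of the first step, using only the Python standard library: split repository files into 10-line chunks, rank them against the code near the cursor with a hand-written BM25 score, and prepend the top results as comments. The helper names (`chunk_file`, `bm25_top_k`, `build_prompt`), the BM25 parameters `k1=1.5` and `b=0.75`, and the comment-based prompt layout are assumptions for illustration, not the benchmark's exact setup.

```python
from __future__ import annotations

import math
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokenize(text: str) -> list[str]:
    """Lowercased identifier-like tokens; a deliberately simple tokenizer."""
    return [t.lower() for t in TOKEN.findall(text)]


def chunk_file(path: str, text: str, size: int = 10) -> list[tuple[str, str]]:
    """Split one file into consecutive, non-overlapping chunks of `size` lines."""
    lines = text.splitlines()
    return [(path, "\n".join(lines[i:i + size])) for i in range(0, len(lines), size)]


def bm25_top_k(query: str, chunks: list[tuple[str, str]], k: int = 5,
               k1: float = 1.5, b: float = 0.75) -> list[tuple[float, str, str]]:
    """Rank chunks by Okapi BM25 against the query; return up to k (score, path, code)."""
    docs = [tokenize(code) for _, code in chunks]
    if not docs:
        return []
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n or 1.0
    doc_freq = Counter(t for d in docs for t in set(d))
    query_terms = set(tokenize(query))

    scored = []
    for (path, code), doc in zip(chunks, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:  # skip chunks that share no tokens with the query
            scored.append((score, path, code))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


def build_prompt(in_file_context: str, retrieved: list[tuple[float, str, str]]) -> str:
    """Prepend retrieved chunks as '#' comments (Python-style; adjust per language)."""
    header: list[str] = []
    for _, path, code in retrieved:
        header.append(f"# Retrieved from {path}:")
        header.extend("# " + line for line in code.splitlines())
    return "\n".join(header + [in_file_context])
```

A typical call would build `chunks` from every file except the one being completed, use the last few lines before the cursor as `query`, and pass `build_prompt(...)` to the model. Swapping `bm25_top_k` for an embedding-based ranker leaves the rest of the pipeline unchanged.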
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Evaluation is zero-shot; few-shot or finetuning effects are not tested and may change results.
- Retrieval quality is a bottleneck: fixed 10-line chunks and token-based similarity can return unhelpful snippets.
- Potential memorization: authors filtered many repositories but cannot fully guarantee models never saw similar content during pretraining.
When Not To Use
- If you only need single-file completion benchmarks, a simpler dataset suffices.
- When evaluating models that are few-shot prompted or fine-tuned with repository-level context, unless you adjust prompt length limits accordingly.
- For closed-source private repos where licensing or privacy prevents sharing retrieval chunks.
Failure Modes
- Retrieval returns irrelevant code and degrades the completion (for example, the model hallucinates new APIs).
- Truncation of retrieved context can cut out the crucial lines and mislead the model.
- Dataset still contains some long strings or hard-to-predict literals that cause annotator disagreement.
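One way to reduce the truncation failure mode is to never cut a retrieved chunk in the middle: keep the in-file lines closest to the cursor, then add whole chunks in rank order and drop any chunk that does not fit. The sketch below illustrates this under stated assumptions: `fit_prompt` is a hypothetical helper, tokens are approximated by whitespace splitting (a real tokenizer would replace `count_tokens`), and half of the budget is reserved for in-file context.

```python
from __future__ import annotations

from typing import Callable


def fit_prompt(
    retrieved_chunks: list[str],
    in_file_context: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> str:
    """Assemble a prompt within max_tokens without splitting any retrieved chunk.

    Assumes retrieved_chunks is ordered best-first.
    """
    in_file_budget = max_tokens // 2  # assumption: reserve half for the current file
    used = 0

    # Keep the in-file lines nearest the cursor (the end of the context).
    kept_lines: list[str] = []
    for line in reversed(in_file_context.splitlines()):
        cost = count_tokens(line)
        # Always keep at least the final line, even if it alone exceeds the budget.
        if kept_lines and used + cost > in_file_budget:
            break
        kept_lines.append(line)
        used += cost
    kept_lines.reverse()

    # Add whole chunks in rank order; skip any chunk that would overflow.
    kept_chunks: list[str] = []
    for chunk in retrieved_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            continue
        kept_chunks.append(chunk)
        used += cost

    return "\n".join(kept_chunks + kept_lines)
```

Skipping an oversized chunk and continuing lets a smaller, lower-ranked chunk still use the remaining budget.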
Core Entities
Models
- CodeGen
- CodeGen-350M
- CodeGen-2.7B
- CodeGen-6.1B
- CodeGen-16.1B
- CodeGen25-7B
- StarCoder
- StarCoder-15.5B
- StarCoderBase-1B
- StarCoderBase-3B
- StarCoderBase-7B
- GPT-3.5-turbo
- UniXCoder
Metrics
- Exact Match (EM)
- Edit Similarity (ES)
- Identifier EM
- Identifier F1
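The four metrics can be approximated in a few lines of Python for a quick local check. This is a sketch under stated assumptions rather than the benchmark's official scorer: edit similarity uses `difflib.SequenceMatcher.ratio()` as a stand-in for a Levenshtein-based score, identifiers are extracted with a regular expression (so names inside string literals and comments are also counted), only Python keywords are excluded, and identifier F1 is computed over multiset overlap.

```python
from __future__ import annotations

import difflib
import keyword
import re
from collections import Counter

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def exact_match(pred: str, ref: str) -> bool:
    """Code-match EM: strings are equal after trimming surrounding whitespace."""
    return pred.strip() == ref.strip()


def edit_similarity(pred: str, ref: str) -> float:
    """Similarity in [0, 1]; difflib ratio stands in for a Levenshtein-based score."""
    return difflib.SequenceMatcher(None, pred.strip(), ref.strip()).ratio()


def identifiers(code: str) -> list[str]:
    """Identifier-like tokens in order of appearance, excluding Python keywords."""
    return [t for t in IDENT.findall(code) if not keyword.iskeyword(t)]


def identifier_em(pred: str, ref: str) -> bool:
    """Identifier EM: the ordered identifier sequences are identical."""
    return identifiers(pred) == identifiers(ref)


def identifier_f1(pred: str, ref: str) -> float:
    """Identifier F1 over multiset overlap of predicted and reference identifiers."""
    p, r = Counter(identifiers(pred)), Counter(identifiers(ref))
    overlap = sum((p & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)
```

For example, the prediction `user = User.from_name(name)` against the reference `user = User.from_name(username)` fails both exact-match checks but shares three of four identifiers, giving an identifier F1 of 0.75 under these definitions.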
Datasets
- CrossCodeEval (this paper)
- The Stack (excluded during collection)
Benchmarks
- Code Match EM/ES
- Identifier Match EM/F1

