Overview
The dataset is ready for research and prototype evaluation: experiments use standard models and metrics, and human checks confirm quality. Retrieval and zero-shot limits mean production systems need further engineering.
Citations: 10
Evidence Strength: 0.80
Confidence: 0.84
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
Real-world code completions often need other files; adding a retrieval step can roughly double correct completions and should be part of any practical code-assist product pipeline.
Summary TLDR
CrossCodeEval is a new benchmark of ~10k code-completion examples from ~1k permissively licensed GitHub repos in Python, Java, TypeScript, and C#. Each example is chosen so the correct completion cannot be inferred from the current file alone. The authors use a static-analysis trick (replace imports with empty classes, detect resulting undefined-name errors) to find cross-file-required completions. State-of-the-art models (CodeGen, StarCoder, GPT-3.5) perform poorly with only in-file context (single-digit exact-match rates) and roughly double performance when helpful cross-file snippets are retrieved and prepended, but peak exact-match remains low, showing room for better retrieval and model
Problem Statement
Most code-completion benchmarks give only a single-file context. Real projects span many files and require cross-file knowledge (APIs, classes, helpers). Current benchmarks understate models' need to fetch and use other files, so we need a benchmark that forces cross-file reasoning and measures retrieval+generation performance.
Main Contribution
CrossCodeEval: a multilingual benchmark with ~10k cross-file completion examples drawn from ~1k permissively licensed GitHub repos in Python, Java, TypeScript, and C#.
A simple, automated static-analysis method to find completions that strictly require cross-file context: replace intra-project imports with empty classes and detect undefined-name errors.
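The import-stubbing idea can be illustrated with a minimal Python sketch. This is not the authors' tooling: it assumes Python sources only, uses the standard `ast` module in place of a full static analyzer, and treats any attribute access on a stubbed name as the "undefined name" signal. The helper names `stub_project_imports` and `undefined_members`, and the example package `myproj`, are hypothetical.

```python
import ast


def _empty_class(name: str) -> ast.stmt:
    # Build `class <name>: pass` by parsing text, which avoids hand-filling AST fields.
    return ast.parse(f"class {name}:\n    pass").body[0]


def stub_project_imports(source: str, project_modules: set[str]) -> tuple[str, set[str]]:
    """Replace top-level imports of project-local modules with empty classes.

    Returns the rewritten source and the set of stubbed names.
    Assumption: a module is project-local if its root package is in
    `project_modules` or the import is relative.
    """
    tree = ast.parse(source)
    stubbed: set[str] = set()
    new_body: list[ast.stmt] = []
    for node in tree.body:
        if isinstance(node, ast.ImportFrom):
            root = (node.module or "").split(".")[0]
            if node.level > 0 or root in project_modules:
                for alias in node.names:
                    if alias.name == "*":
                        continue  # star imports cannot be stubbed by name; dropped here
                    name = alias.asname or alias.name
                    stubbed.add(name)
                    new_body.append(_empty_class(name))
                continue
        elif isinstance(node, ast.Import):
            kept = []
            for alias in node.names:
                root = alias.name.split(".")[0]
                if root in project_modules:
                    name = alias.asname or root
                    stubbed.add(name)
                    new_body.append(_empty_class(name))
                else:
                    kept.append(alias)
            if kept:
                node.names = kept
                new_body.append(node)
            continue
        new_body.append(node)
    tree.body = new_body
    return ast.unparse(tree), stubbed


def undefined_members(source: str, project_modules: set[str]) -> list[str]:
    """List `name.attr` uses that can no longer resolve once imports are stubbed."""
    rewritten, stubbed = stub_project_imports(source, project_modules)
    hits = set()
    for node in ast.walk(ast.parse(rewritten)):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id in stubbed
        ):
            hits.add(f"{node.value.id}.{node.attr}")
    return sorted(hits)


SOURCE = '''
import os
from myproj.utils import Helper
import myproj.config as cfg

def run():
    path = os.getcwd()
    return Helper.load(path, cfg.DEFAULT)
'''

if __name__ == "__main__":
    print(undefined_members(SOURCE, {"myproj"}))
```

Each reported member marks a position whose completion depends on another file in the project, so it is a candidate cross-file example. A real pipeline would run a proper static analyzer over the rewritten file rather than this attribute check.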
Key Findings
Off-the-shelf models fail on cross-file examples when only given the current file.
Prepending retrieved cross-file context roughly doubles exact-match rates for many models.
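The exact-match metric behind these findings, together with an identifier-overlap score, can be sketched as follows. This is a minimal illustration, not the benchmark's scoring code: whitespace normalization, the regex-based identifier extraction, and the function names are all assumptions.

```python
import keyword
import re
from collections import Counter

# Assumption: identifiers are approximated with a regex, so words inside
# string literals or comments are counted too.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def normalize(code: str) -> str:
    """Collapse all runs of whitespace so formatting differences are ignored."""
    return " ".join(code.split())


def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)


def identifiers(code: str) -> list[str]:
    """Return identifier-like tokens, skipping Python keywords."""
    return [tok for tok in IDENT.findall(code) if not keyword.iskeyword(tok)]


def identifier_f1(prediction: str, reference: str) -> float:
    """F1 over the multisets of identifiers in prediction and reference."""
    pred, ref = Counter(identifiers(prediction)), Counter(identifiers(reference))
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("client.send( msg )", "client.send( msg )"))
    print(identifier_f1("client.send(msg)", "client.send(payload)"))
```

Exact match is strict: a completion that picks the right API but a different argument name scores zero, while the identifier score still gives partial credit.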
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Code Match EM (Python, StarCoder-15.5B) | 8.82% (in-file) → 15.72% (retrieval) → 21.01% (retrieval w/ ref) | 8.82% (in-file only) | +12.19 pp (w/ retrieval w/ ref) | CROSSCODEEVAL (Python) | Table 2 (Code Match EM, StarCoder-15.5B, Python) | Table 2 |
| Code Match EM (Python, CodeGen25-7B) | 7.73% (in-file) → 14.52% (BM25 retrieval) → 19.17% (retrieval w/ ref) | 7.73% (in-file only) | +11.44 pp (w/ retrieval w/ ref) | CROSSCODEEVAL (Python) | Table 2 (CodeGen25-7B, Python EM) | Table 2 |
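
The deltas in the table can be rechecked with a few lines of Python. The sketch below uses only the exact-match values from the two rows above; the dictionary layout and the `gains` helper are illustrative names, not part of the benchmark.

```python
# Exact-match percentages copied from the two Python rows of the results table.
RESULTS = {
    "StarCoder-15.5B": {"in_file": 8.82, "retrieval": 15.72, "retrieval_w_ref": 21.01},
    "CodeGen25-7B": {"in_file": 7.73, "retrieval": 14.52, "retrieval_w_ref": 19.17},
}


def gains(row: dict[str, float]) -> dict[str, tuple[float, float]]:
    """Return (absolute gain in percentage points, ratio to in-file) per setting."""
    base = row["in_file"]
    return {
        setting: (round(value - base, 2), round(value / base, 2))
        for setting, value in row.items()
        if setting != "in_file"
    }


if __name__ == "__main__":
    for model, row in RESULTS.items():
        print(model, gains(row))
```

For both models, plain retrieval lands at about 1.8x the in-file score and retrieval with the reference at about 2.4x, which is the sense in which retrieval "roughly doubles" exact match.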
What To Try In 7 Days
Add a repository retrieval step (BM25 or embeddings) that returns top-5 10-line chunks and prepend them to the model prompt.
Measure both exact-match and identifier-match for your codebase; identifier overlap strongly correlates with success.
Run a small human check: replace an import with a dummy class to find cross-file-required completions in your repo.
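The retrieval step in the first item could look like the following dependency-free sketch. It is an assumption-heavy illustration rather than the paper's pipeline: BM25 is reimplemented inline with the common defaults `k1=1.5` and `b=0.75`, the query is taken to be the last 10 lines of the current file, retrieved chunks are prepended as `#` comments, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokenize(text: str) -> list[str]:
    return [tok.lower() for tok in TOKEN.findall(text)]


def chunk_file(path: str, text: str, size: int = 10) -> list[tuple[str, str]]:
    """Split a file into fixed-size, non-overlapping chunks of `size` lines."""
    lines = text.splitlines()
    return [(path, "\n".join(lines[i:i + size])) for i in range(0, len(lines), size)]


def bm25_top_k(query: str, chunks: list[tuple[str, str]], k: int = 5,
               k1: float = 1.5, b: float = 0.75) -> list[tuple[str, str]]:
    """Rank chunks against the query with BM25 and return the k best matches."""
    docs = [tokenize(body) for _, body in chunks]
    if not docs:
        return []
    n = len(docs)
    avg_len = (sum(len(d) for d in docs) / n) or 1.0
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scored = []
    for chunk, doc in zip(chunks, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Chunks with no term overlap are dropped instead of padding the prompt.
    return [chunk for score, chunk in scored[:k] if score > 0]


def build_prompt(in_file_context: str, repo_files: dict[str, str], k: int = 5) -> str:
    """Prepend the top-k retrieved chunks, as comments, to the in-file context."""
    chunks = [c for path, text in repo_files.items() for c in chunk_file(path, text)]
    query = "\n".join(in_file_context.splitlines()[-10:])
    parts = []
    for path, body in bm25_top_k(query, chunks, k=k):
        parts.append(f"# Retrieved from {path}:\n")
        parts.extend(f"# {line}\n" for line in body.splitlines())
    return "".join(parts) + in_file_context


if __name__ == "__main__":
    repo = {
        "utils.py": "def load_config(path):\n    return open(path).read()\n",
        "math_ops.py": "def add(a, b):\n    return a + b\n",
    }
    print(build_prompt("from utils import load_config\ncfg = load_config(", repo))
```

In practice `repo_files` should exclude the file being completed, and an embedding-based retriever can be swapped in behind the same `build_prompt` interface.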
Risks & Boundaries
Limitations
Evaluation is zero-shot; few-shot or finetuning effects are not tested and may change results.
Retrieval quality is a bottleneck: fixed 10-line chunks and token-based similarity can return unhelpful snippets.
When Not To Use
If you only need single-file completion benchmarks, a simpler dataset suffices.
If you are evaluating models that were fine-tuned on, or prompted few-shot with, repository-level context and you have not adjusted prompt length limits.
Failure Modes
Retrieval returns irrelevant code and degrades completion (the model hallucinates new APIs).
Truncation of retrieved context can cut out the crucial lines and mislead the model.
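One way to soften the truncation failure mode is to drop whole retrieved chunks that do not fit the prompt budget instead of cutting a chunk partway through. The sketch below is an illustrative mitigation, not something evaluated in the source; the whitespace-based token count and the `pack_chunks` name are assumptions, and a real system would count tokens with the model's own tokenizer.

```python
from typing import Callable


def pack_chunks(
    ranked_chunks: list[str],
    budget: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> list[str]:
    """Keep whole chunks, in rank order, while they fit within `budget` tokens.

    A chunk that would overflow the budget is skipped entirely, so no kept
    chunk is ever cut in the middle.
    """
    kept: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            continue  # try the next, possibly shorter, chunk
        kept.append(chunk)
        used += cost
    return kept


if __name__ == "__main__":
    ranked = [
        "def load(path): return open(path).read()",
        "class Config: DEFAULT = 'prod' ; TIMEOUT = 30 ; RETRIES = 3",
        "def close(handle): handle.close()",
    ]
    print(pack_chunks(ranked, budget=12))
```

Skipping rather than stopping at the first overflow lets a shorter, lower-ranked chunk still use the remaining budget.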

