CrossCodeEval: 10k multilingual examples that force models to read other files to complete code

Overview

Decision Snapshot: Ready For Pilot

The dataset is ready for research and prototype evaluation. Experiments use standard models and metrics, and human checks confirm quality, but retrieval and zero-shot limits mean production systems need further engineering.

Citations: 10

Evidence Strength: 0.80

Confidence: 0.84

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, Bing Xiang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Real-world code completions often need other files; adding a retrieval step can roughly double correct completions and should be part of any practical code-assist product pipeline.

Who Should Care

ML Engineer, Engineering Lead, CTO, Product Manager, Data Scientist

Summary TLDR

CrossCodeEval is a new benchmark of ~10k code-completion examples from ~1k permissively licensed GitHub repos in Python, Java, TypeScript, and C#. Each example is chosen so the correct completion cannot be inferred from the current file alone. The authors use a static-analysis trick (replace imports with empty classes, detect resulting undefined-name errors) to find cross-file-required completions. State-of-the-art models (CodeGen, StarCoder, GPT-3.5) perform poorly with only in-file context (single-digit exact-match rates) and roughly double performance when helpful cross-file snippets are retrieved and prepended, but peak exact-match remains low, showing room for better retrieval and model

Problem Statement

Most code-completion benchmarks give only a single-file context. Real projects span many files and require cross-file knowledge (APIs, classes, helpers). Current benchmarks understate models' need to fetch and use other files, so we need a benchmark that forces cross-file reasoning and measures retrieval+generation performance.

Main Contribution

CrossCodeEval: a multilingual benchmark with ~10k cross-file completion examples drawn from ~1k permissively licensed GitHub repos in Python, Java, TypeScript, and C#.

A simple, automated static-analysis method to find completions that strictly require cross-file context: replace intra-project imports with empty classes and detect undefined-name errors.
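The import-stubbing idea above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' tooling: instead of rewriting the file and running a full static analyzer, it uses the standard `ast` module to treat every intra-project import as an empty class and then lists attribute accesses on those names, which are the uses an analyzer would report as undefined. The function name `find_cross_file_uses` and the `project_packages` parameter are hypothetical, and only `from ... import ...` statements are handled.

```python
import ast


def find_cross_file_uses(source, project_packages):
    """Return sorted (line, "Name.attr") pairs that would become undefined
    if every intra-project import were replaced by an empty class."""
    tree = ast.parse(source)

    # Step 1: collect names imported from the project itself.
    # Relative imports (level > 0) are always intra-project.
    stubbed = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom):
            root = (node.module or "").split(".")[0]
            if node.level > 0 or root in project_packages:
                for alias in node.names:
                    stubbed.add(alias.asname or alias.name)

    # Step 2: an empty class has no members, so any attribute access on a
    # stubbed name can only be resolved by reading another file.
    hits = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id in stubbed
        ):
            hits.append((node.lineno, f"{node.value.id}.{node.attr}"))
    return sorted(hits)


# Hypothetical file: `myproj` is assumed to be the repository's own package.
EXAMPLE = '''
import os
from myproj.utils import Config
from json import dumps

def run():
    cfg = Config.load("settings.yaml")
    return dumps(os.getcwd()), cfg
'''

print(find_cross_file_uses(EXAMPLE, {"myproj"}))
```

Only the `Config.load` call is flagged; `os.getcwd` and `dumps` come from the standard library and stay resolvable without repository context.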

Key Findings

Off-the-shelf models fail on cross-file examples when only given the current file.

Numbers: StarCoder-15.5B Python EM 8.82% (in-file only)

Practical Use: Do not trust single-file completion scores; add repository-level context or retrieval to evaluate real-world code completion.

Evidence Ref: Table 2 (Code Match EM, StarCoder-15.5B, Python)

Prepending retrieved cross-file context roughly doubles exact-match rates for many models.

Numbers: CodeGen25-7B Python EM: 7.73% → 14.52% (~1.9×) with BM25 retrieval

Practical Use: Add a retrieval step (e.g., BM25/embeddings) before generation to get substantial gains in completion accuracy.

Evidence Ref: Table 2 (CodeGen25-7B, Python EM)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Code Match EM (Python, StarCoder-15.5B)	8.82% (in-file) → 15.72% (retrieval) → 21.01% (retrieval w/ ref)	8.82% (in-file only)	+12.19 pp (w/ retrieval w/ ref)	CROSSCODEEVAL (Python)	Table 2 (Code Match EM, StarCoder-15.5B, Python)	Table 2
Code Match EM (Python, CodeGen25-7B)	7.73% (in-file) → 14.52% (BM25 retrieval) → 19.17% (retrieval w/ ref)	7.73% (in-file only)	+11.44 pp (w/ retrieval w/ ref)	CROSSCODEEVAL (Python)	Table 2 (CodeGen25-7B, Python EM)	Table 2
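
The deltas in the table can be rechecked with a short calculation. The sketch below uses only the exact-match figures quoted in the two rows above; the dictionary layout and the `gains` helper are illustrative names, not part of the benchmark.

```python
# Exact-match percentages copied from the two table rows above.
RESULTS = {
    "StarCoder-15.5B": {"in_file": 8.82, "retrieval": 15.72, "retrieval_w_ref": 21.01},
    "CodeGen25-7B": {"in_file": 7.73, "retrieval": 14.52, "retrieval_w_ref": 19.17},
}


def gains(row):
    """Map each retrieval setting to (gain in percentage points, ratio vs in-file)."""
    base = row["in_file"]
    return {
        setting: (round(value - base, 2), round(value / base, 2))
        for setting, value in row.items()
        if setting != "in_file"
    }


for model, row in RESULTS.items():
    for setting, (delta_pp, ratio) in gains(row).items():
        print(f"{model:16s} {setting:16s} +{delta_pp:.2f} pp  {ratio:.2f}x")
```

The percentage-point gains for the "retrieval w/ ref" setting match the Delta column (+12.19 pp and +11.44 pp), and the ratios show how close each setting comes to doubling the in-file baseline.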

What To Try In 7 Days

Add a repository retrieval step (BM25 or embeddings) that returns top-5 10-line chunks and prepend them to the model prompt.

Measure both exact-match and identifier-match for your codebase; identifier overlap strongly correlates with success.

Run a small human check: replace an import with a dummy class to find cross-file-required completions in your repo.
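The first step in the list above can be prototyped without any dependencies. The following is a minimal sketch under stated assumptions, not the paper's pipeline: it implements a small BM25 scorer by hand, splits repository files into fixed 10-line chunks, uses the last 10 lines of the current file as the query, and prepends the top-5 chunks as comments. The function names, the `k1` and `b` defaults, the query choice, and the comment-style prompt format are all illustrative choices.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercased identifier-like tokens; a deliberately crude tokenizer."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower())


def chunk_lines(source, size=10):
    """Split a file into consecutive chunks of `size` lines."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def bm25_top_k(query, chunks, k=5, k1=1.5, b=0.75):
    """Return indices of the k best-scoring chunks (zero-score chunks dropped)."""
    docs = [Counter(tokenize(c)) for c in chunks]
    lengths = [sum(d.values()) for d in docs]
    n = len(docs)
    avg_len = sum(lengths) / n if n else 0.0
    doc_freq = Counter(term for d in docs for term in d)
    query_terms = set(tokenize(query))

    scored = []
    for idx, (doc, length) in enumerate(zip(docs, lengths)):
        score = 0.0
        for term in query_terms:
            tf = doc.get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * length / avg_len))
        scored.append((score, idx))

    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [idx for score, idx in ranked[:k] if score > 0]


def build_prompt(in_file_context, repo_files, k=5, chunk_size=10):
    """Prepend the top-k retrieved chunks, as comments, to the in-file context."""
    labelled = [
        (path, chunk)
        for path, source in repo_files.items()
        for chunk in chunk_lines(source, chunk_size)
    ]
    # Assumption: the tail of the current file is a reasonable retrieval query.
    query = "\n".join(in_file_context.splitlines()[-chunk_size:])
    best = bm25_top_k(query, [chunk for _, chunk in labelled], k=k)

    blocks = []
    for idx in best:
        path, chunk = labelled[idx]
        commented = "\n".join("# " + line for line in chunk.splitlines())
        blocks.append(f"# Retrieved from {path}:\n{commented}")
    return "\n\n".join(blocks + [in_file_context])


# Hypothetical two-file repository and an unfinished line to complete.
REPO = {
    "utils/config.py": "class Config:\n    def load(path):\n        return path\n",
    "geo/area.py": "def area(r):\n    return 3.14 * r * r\n",
}
CURRENT = "from utils.config import Config\ncfg = Config.load("
print(build_prompt(CURRENT, REPO))
```

Swapping the scorer for an embedding model would only change `bm25_top_k`; the chunking and prompt assembly stay the same.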

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://crosscodeeval.github.io

Data URLs

https://crosscodeeval.github.io

Risks & Boundaries

Limitations

Evaluation is zero-shot; few-shot or finetuning effects are not tested and may change results.

Retrieval quality is a bottleneck: fixed 10-line chunks and token-based similarity can return unhelpful snippets.

When Not To Use

If you only need single-file completion benchmarks, a simpler dataset suffices.

If you are evaluating models that are few-shot prompted or fine-tuned with repository-level context and you have not adjusted prompt length limits.

Failure Modes

Retrieval returns irrelevant code and degrades completion (the model hallucinates new APIs).

Truncation of retrieved context can cut out the crucial lines and mislead the model.
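One way to reduce the truncation failure mode is to drop whole chunks that do not fit the prompt budget instead of cutting text mid-chunk. The sketch below is an illustration only: `fit_chunks` is a hypothetical helper, and counting whitespace-separated words is a stand-in for the model's real tokenizer.

```python
def fit_chunks(ranked_chunks, budget_tokens, count_tokens=lambda s: len(s.split())):
    """Keep chunks in rank order, skipping any chunk that would not fit whole.

    `count_tokens` defaults to a whitespace word count, which only
    approximates a real tokenizer; pass the model's own counter in practice.
    """
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # never emit a partial chunk
        kept.append(chunk)
        used += cost
    return kept


# Three ranked chunks with 3, 5, and 2 word-tokens and a budget of 5.
print(fit_chunks(["a b c", "d e f g h", "i j"], budget_tokens=5))
```

The design choice here is to skip an oversized chunk and keep trying lower-ranked ones, so the budget is filled with complete snippets rather than a cut-off fragment.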

Core Entities

Models

CodeGen, CodeGen-350M, CodeGen-2.7B, CodeGen-6.1B, CodeGen-16.1B, CodeGen25-7B, StarCoder, StarCoder-15.5B, StarCoderBase-1B, StarCoderBase-3B, StarCoderBase-7B, GPT-3.5-turbo, UniXCoder

Metrics

Exact Match (EM), Edit Similarity (ES), Identifier EM, Identifier F1
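To make the four metrics concrete, here is a rough sketch of how they could be computed for a single prediction. It is an approximation under stated assumptions rather than the benchmark's scoring code: edit similarity uses `difflib.SequenceMatcher` from the standard library as a stand-in, identifiers are extracted with a regular expression minus Python keywords instead of a language parser, and identifier F1 is computed over multisets.

```python
import difflib
import keyword
import re
from collections import Counter


def exact_match(pred, ref):
    """True when prediction and reference are identical after trimming."""
    return pred.strip() == ref.strip()


def edit_similarity(pred, ref):
    """Character-level similarity in [0, 1]; a stand-in for the paper's ES."""
    return difflib.SequenceMatcher(None, pred.strip(), ref.strip()).ratio()


def identifiers(code):
    """Identifier-like tokens in order, excluding Python keywords.
    Assumption: a regex is good enough for a sketch; it also matches
    words inside strings and comments."""
    names = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code)
    return [name for name in names if not keyword.iskeyword(name)]


def identifier_em(pred, ref):
    """True when both sides use the same identifiers in the same order."""
    return identifiers(pred) == identifiers(ref)


def identifier_f1(pred, ref):
    """F1 over the multiset overlap of identifiers."""
    p, r = Counter(identifiers(pred)), Counter(identifiers(ref))
    overlap = sum((p & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical prediction that gets the API right but one argument wrong.
PRED = "cfg = Config.load(path)"
REF = "cfg = Config.load(file_path)"
print(exact_match(PRED, REF), identifier_em(PRED, REF))
print(round(edit_similarity(PRED, REF), 2), identifier_f1(PRED, REF))
```

In this example three of four identifiers agree on each side, so the prediction earns partial identifier credit even though both exact-match checks fail.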

Datasets

CrossCodeEval (this paper), The Stack (excluded during collection)

Benchmarks

Code Match EM/ES, Identifier Match EM/F1
