CrossCodeEval: 10k multilingual examples that force models to read other files to complete code

October 17, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The dataset is ready for research and prototype evaluation. Experiments use standard models and metrics, and human checks confirm quality, but retrieval and zero-shot limits mean production systems need further engineering.

Citations: 10

Evidence Strength: 0.80

Confidence: 0.84

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, Bing Xiang

Why It Matters For Business

Real-world code completions often need other files; adding a retrieval step can roughly double correct completions and should be part of any practical code-assist product pipeline.

Summary TLDR

CrossCodeEval is a new benchmark of ~10k code-completion examples from ~1k permissively licensed GitHub repos in Python, Java, TypeScript, and C#. Each example is chosen so the correct completion cannot be inferred from the current file alone. The authors use a static-analysis trick (replace imports with empty classes, detect resulting undefined-name errors) to find cross-file-required completions. State-of-the-art models (CodeGen, StarCoder, GPT-3.5) perform poorly with only in-file context (single-digit exact-match rates) and roughly double performance when helpful cross-file snippets are retrieved and prepended, but peak exact-match remains low, showing room for better retrieval and model

Problem Statement

Most code-completion benchmarks give only a single-file context. Real projects span many files and require cross-file knowledge (APIs, classes, helpers). Current benchmarks understate models' need to fetch and use other files, so we need a benchmark that forces cross-file reasoning and measures retrieval+generation performance.

Main Contribution

CrossCodeEval: a multilingual benchmark with ~10k cross-file completion examples drawn from ~1k permissively licensed GitHub repos in Python, Java, TypeScript, and C#.

A simple, automated static-analysis method to find completions that strictly require cross-file context: replace intra-project imports with empty classes and detect undefined-name errors.
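
The import-stubbing idea can be approximated in a few lines of Python. The sketch below is not the authors' tooling: it assumes Python source, treats the hypothetical `project_packages` set as the list of intra-project top-level packages, and flags attribute accesses on names that an empty stub class could not resolve. A real static analyzer would catch more cases than this.

```python
import ast


def cross_file_usages(source: str, project_packages: set[str]) -> list[tuple[int, str]]:
    """Return (line, "name.attr") pairs that would break if every
    intra-project import were replaced by an empty class."""
    tree = ast.parse(source)

    # Step 1: collect names bound by intra-project imports (the ones we "stub").
    stubbed: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom):
            root = (node.module or "").split(".")[0]
            # Relative imports (level > 0) are always intra-project.
            if node.level > 0 or root in project_packages:
                for alias in node.names:
                    stubbed.add(alias.asname or alias.name)
        elif isinstance(node, ast.Import):
            for alias in node.names:
                root = alias.name.split(".")[0]
                if root in project_packages:
                    stubbed.add(alias.asname or root)

    # Step 2: any attribute access on a stubbed name needs cross-file knowledge,
    # because an empty class defines no members.
    hits = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id in stubbed
        ):
            hits.append((node.lineno, f"{node.value.id}.{node.attr}"))
    return sorted(hits)


if __name__ == "__main__":
    sample = (
        "import os\n"
        "from myproj.utils import Helper\n"
        "from . import config\n"
        "\n"
        "path = os.path.join('a', 'b')\n"
        "h = Helper.build(path)\n"
        "print(config.DEBUG)\n"
    )
    for line, usage in cross_file_usages(sample, {"myproj"}):
        print(line, usage)
```

Lines flagged this way are candidate completion targets: the text after the flagged name cannot be predicted from the current file alone.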

Key Findings

Off-the-shelf models fail on cross-file examples when only given the current file.

Numbers: StarCoder-15.5B Python EM 8.82% (in-file only)

Practical Use: Do not trust single-file completion scores; add repository-level context or retrieval to evaluate real-world code completion.

Evidence Ref: Table 2 (Code Match EM, StarCoder-15.5B, Python)

Prepending retrieved cross-file context roughly doubles exact-match rates for many models.

Numbers: CodeGen25-7B Python EM: 7.73% → 14.52% (~1.9×) with BM25 retrieval

Practical Use: Add a retrieval step (e.g., BM25/embeddings) before generation to get substantial gains in completion accuracy.

Evidence Ref: Table 2 (CodeGen25-7B, Python EM)
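
A retrieve-then-prepend step of the kind described above might look like the following sketch. It uses a small pure-Python Okapi BM25 scorer so it runs without extra dependencies. The tokenizer, the `k1`/`b` defaults, the `# Snippet from ...` comment format, and the choice to use the whole in-file context as the query are all assumptions for illustration, not details taken from the paper.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokenize(text: str) -> list[str]:
    """Lowercased identifier-like tokens (an assumed, simplistic tokenizer)."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def chunk_lines(text: str, size: int = 10) -> list[str]:
    """Split a file into fixed-size, non-overlapping line chunks."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def bm25_rank(query: str, chunks: list[str], k1: float = 1.5, b: float = 0.75) -> list[int]:
    """Return indices of chunks with a positive BM25 score, best first."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0
    df = Counter(term for d in docs for term in set(d))
    q_terms = set(tokenize(query))
    scored = []
    for idx, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in q_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scored.append((score, idx))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for score, idx in scored if score > 0]


def build_prompt(in_file_context: str, repo_files: dict[str, str],
                 top_k: int = 5, chunk_size: int = 10) -> str:
    """Prepend the top-k retrieved chunks, as comments, to the in-file context."""
    chunks = [(path, c) for path, text in repo_files.items()
              for c in chunk_lines(text, chunk_size)]
    ranked = bm25_rank(in_file_context, [c for _, c in chunks])[:top_k]
    parts = []
    for idx in ranked:
        path, chunk = chunks[idx]
        commented = "\n".join("# " + line for line in chunk.splitlines())
        parts.append(f"# Snippet from {path}:\n{commented}")
    return "\n".join(parts + [in_file_context])


if __name__ == "__main__":
    repo = {
        "utils.py": "def load_config(path):\n    return open(path).read()\n",
        "math_ops.py": "def add(a, b):\n    return a + b\n",
    }
    print(build_prompt("cfg = load_config(", repo))
```

Swapping `bm25_rank` for an embedding-based ranker leaves `build_prompt` unchanged, which makes it easy to compare retrievers on the same prompts.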

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Code Match EM (Python, StarCoder-15.5B) | 8.82% (in-file) → 15.72% (retrieval) → 21.01% (retrieval w/ ref) | 8.82% (in-file only) | +12.19 pp (w/ retrieval w/ ref) | CROSSCODEEVAL (Python) | Table 2 (Code Match EM, StarCoder-15.5B, Python) | Table 2
Code Match EM (Python, CodeGen25-7B) | 7.73% (in-file) → 14.52% (BM25 retrieval) → 19.17% (retrieval w/ ref) | 7.73% (in-file only) | +11.44 pp (w/ retrieval w/ ref) | CROSSCODEEVAL (Python) | Table 2 (CodeGen25-7B, Python EM) | Table 2
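
The deltas in these rows can be checked with a few lines of arithmetic. The snippet below only recomputes figures already listed above; the dictionary keys are illustrative names, not identifiers from the paper.

```python
# Python exact-match percentages copied from the two result rows above.
results = {
    "StarCoder-15.5B": {"in_file": 8.82, "retrieval": 15.72, "retrieval_w_ref": 21.01},
    "CodeGen25-7B": {"in_file": 7.73, "retrieval": 14.52, "retrieval_w_ref": 19.17},
}


def summarize(row: dict[str, float]) -> dict[str, float]:
    """Absolute gain in percentage points and relative gain from retrieval."""
    return {
        "delta_pp_w_ref": round(row["retrieval_w_ref"] - row["in_file"], 2),
        "retrieval_ratio": round(row["retrieval"] / row["in_file"], 2),
    }


if __name__ == "__main__":
    for model, row in results.items():
        print(model, summarize(row))
```

The ratios come out at about 1.78× for StarCoder-15.5B and 1.88× for CodeGen25-7B, which is what "roughly doubles" refers to.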

What To Try In 7 Days

Add a repository retrieval step (BM25 or embeddings) that returns top-5 10-line chunks and prepend them to the model prompt.

Measure both exact-match and identifier-match for your codebase; identifier overlap strongly correlates with success.

Run a small check on your own repo: replace an intra-project import with an empty dummy class and look for undefined-name errors to find completions that require cross-file context.
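
For the measurement step in the list above, a minimal scoring sketch could look like this. The definitions are assumptions for illustration: edit similarity is taken as one minus normalized Levenshtein distance, and identifiers are extracted with a regex that ignores Python keywords (it will also pick up words inside string literals). The paper's exact metric implementations may differ.

```python
import keyword
import re

IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def exact_match(pred: str, ref: str) -> bool:
    """Code-match EM after trimming surrounding whitespace."""
    return pred.strip() == ref.strip()


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def edit_similarity(pred: str, ref: str) -> float:
    """1.0 for identical strings, 0.0 for completely different ones."""
    pred, ref = pred.strip(), ref.strip()
    longest = max(len(pred), len(ref))
    return 1.0 if longest == 0 else 1 - levenshtein(pred, ref) / longest


def identifiers(code: str) -> list[str]:
    """Identifier-like tokens, excluding Python keywords (rough approximation)."""
    return [t for t in IDENT_RE.findall(code) if not keyword.iskeyword(t)]


def identifier_f1(pred: str, ref: str) -> float:
    """Set-based F1 over the identifiers in prediction and reference."""
    p, r = set(identifiers(pred)), set(identifiers(ref))
    if not p or not r:
        return float(p == r)
    overlap = len(p & r)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    pred, ref = "client.send(msg)", "client.send(payload)"
    print(exact_match(pred, ref), edit_similarity(pred, ref), identifier_f1(pred, ref))
```

Reporting identifier F1 alongside exact match helps separate "wrong API" failures from near misses that differ only in formatting or argument names.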

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Evaluation is zero-shot; few-shot or finetuning effects are not tested and may change results.

Retrieval quality is a bottleneck: fixed 10-line chunks and token-based similarity can return unhelpful snippets.

When Not To Use

If you only need single-file completion benchmarks, a simpler dataset suffices.

If you are evaluating models that were few-shot prompted or fine-tuned with repository-level context and have not adjusted prompt length limits.

Failure Modes

Retrieval returns irrelevant code and degrades completion (the model hallucinates new APIs).

Truncation of retrieved context can cut out the crucial lines and mislead the model.
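
One possible mitigation for the truncation failure mode, offered as a sketch and not taken from the paper, is to drop whole low-ranked chunks until the prompt fits, so that no kept snippet is cut in the middle. The `fit_chunks` helper and its whitespace-based token counter are hypothetical; a real pipeline would pass in the model's own tokenizer.

```python
from typing import Callable


def fit_chunks(
    ranked_chunks: list[str],
    in_file_context: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> list[str]:
    """Keep the highest-ranked chunks that fit in the budget left over
    after reserving room for the in-file context. Chunks are kept whole."""
    budget = max_tokens - count_tokens(in_file_context)
    kept = []
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if cost > budget:
            break  # stop at the first chunk that does not fit, preserving rank order
        kept.append(chunk)
        budget -= cost
    return kept


if __name__ == "__main__":
    chunks = ["def a(): return 1", "def b(): return 2", "def c(): return 3"]
    print(fit_chunks(chunks, "x = a(", max_tokens=12))
```

Stopping at the first chunk that does not fit keeps the prompt consistent with the retriever's ranking; skipping it and trying smaller chunks is an equally reasonable variant.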

Core Entities

Models

CodeGen, CodeGen-350M, CodeGen-2.7B, CodeGen-6.1B, CodeGen-16.1B, CodeGen25-7B, StarCoder, StarCoder-15.5B, StarCoderBase-1B, StarCoderBase-3B, StarCoderBase-7B, GPT-3.5-turbo, UniXCoder

Metrics

Exact Match (EM), Edit Similarity (ES), Identifier EM, Identifier F1

Datasets

CrossCodeEval (this paper), The Stack (excluded during collection)

Benchmarks

Code Match EM/ES, Identifier Match EM/F1