CrossCodeEval: 10k multilingual examples that force models to read other files to complete code

October 17, 2023 · 8 min read

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

10

Authors

Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, Bing Xiang

Why It Matters For Business

Real-world code completion often depends on code in other files. On this benchmark, adding a retrieval step roughly doubled exact-match rates, so retrieval belongs in any practical code-assist pipeline.

Summary TLDR

CrossCodeEval is a new benchmark of ~10k code-completion examples from ~1k permissively licensed GitHub repos in Python, Java, TypeScript, and C#. Each example is chosen so the correct completion cannot be inferred from the current file alone. The authors use a static-analysis trick (replace imports with empty classes, detect resulting undefined-name errors) to find cross-file-required completions. State-of-the-art models (CodeGen, StarCoder, GPT-3.5) perform poorly with only in-file context (single-digit exact-match rates) and roughly double performance when helpful cross-file snippets are retrieved and prepended, but peak exact-match remains low, showing room for better retrieval and model

Problem Statement

Most code-completion benchmarks give only a single-file context. Real projects span many files and require cross-file knowledge (APIs, classes, helpers). Current benchmarks understate models' need to fetch and use other files, so we need a benchmark that forces cross-file reasoning and measures retrieval+generation performance.

Main Contribution

CrossCodeEval: a multilingual benchmark with ~10k cross-file completion examples drawn from ~1k permissively licensed GitHub repos in Python, Java, TypeScript, and C#.

A simple, automated static-analysis method to find completions that strictly require cross-file context: replace intra-project imports with empty classes and detect undefined-name errors.

A careful quality pipeline: rule-based filters, model-based filtering (StarCoderBase-1B), and human annotation showing nearly all samples require cross-file lookup.

Comprehensive evaluation: zero-shot results for multiple code LMs and a comparison of sparse and neural retrieval methods (BM25, UniXCoder, OpenAI ada embeddings).
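
The static-analysis idea in the second contribution can be illustrated with a minimal sketch. It is not the authors' tooling: it assumes Python sources, uses only the standard `ast` module in place of a real static analyzer, and the helper name `cross_file_lines` and the package name `myproj` are hypothetical. Instead of literally rewriting imports into empty classes, it records which names come from intra-project imports and flags every line that accesses a member of one, since those are the accesses that would turn into undefined-member errors once the import is stubbed.

```python
import ast


def cross_file_lines(source, project_modules):
    """Return sorted line numbers that access a member of an intra-project import.

    Hypothetical helper: mimics "stub the import with an empty class, then
    report what breaks" without running a real static analyzer.
    """
    tree = ast.parse(source)

    # Step 1: collect names bound by intra-project imports. Each of these
    # would become an empty class, so any member access on it is undefined.
    stubbed = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom):
            root = (node.module or "").split(".")[0]
            # Relative imports (level > 0) always point inside the project.
            if node.level > 0 or root in project_modules:
                stubbed.update(a.asname or a.name for a in node.names)
        elif isinstance(node, ast.Import):
            for a in node.names:
                if a.name.split(".")[0] in project_modules:
                    stubbed.add(a.asname or a.name.split(".")[0])

    # Step 2: flag lines with attribute access on a stubbed name.
    flagged = set()
    for node in ast.walk(tree):
        if (isinstance(node, ast.Attribute)
                and isinstance(node.value, ast.Name)
                and node.value.id in stubbed):
            flagged.add(node.lineno)
    return sorted(flagged)


EXAMPLE = '''import os
from myproj.utils import Helper

def run():
    path = os.getcwd()
    return Helper.build(path)
'''

print(cross_file_lines(EXAMPLE, {"myproj"}))  # [6]
```

Only line 6 is flagged: `os.getcwd()` resolves through the standard library, while `Helper.build` cannot be known without reading `myproj/utils.py`. Lines flagged this way are candidates for cross-file completion examples.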

Key Findings

Off-the-shelf models fail on cross-file examples when only given the current file.

Numbers: StarCoder-15.5B Python EM 8.82% (in-file only)

Prepending retrieved cross-file context roughly doubles exact-match rates for many models.

Numbers: CodeGen25-7B Python EM: 7.73% → 14.52% (~1.9×) with BM25 retrieval

Even in an oracle-style retrieval setting that uses the reference completion, peak exact match remains far from perfect.

Numbers: StarCoder-15.5B Python EM: 8.82% → 15.72% (retrieval) → 21.01% (retrieval w/ ref)

Dataset quality checks show examples truly need cross-file info.

Numbers: Human annotation: Q1 (needs cross-file) Python 98%, Java 100%; Q2 (predictable from file) only ~2% of cases

Results

Code Match EM (Python, StarCoder-15.5B)

Value: 8.82% (in-file) → 15.72% (retrieval) → 21.01% (retrieval w/ ref)

Baseline: 8.82% (in-file only)

Code Match EM (Python, CodeGen25-7B)

Value: 7.73% (in-file) → 14.52% (BM25 retrieval) → 19.17% (retrieval w/ ref)

Baseline: 7.73% (in-file only)

Retrieval comparison (CodeGen25-7B, Python EM)

Value: BM25 14.52%, UniXCoder 13.73%, OpenAI ada 14.82% (retrieval)

Baseline: 7.73% (in-file only)

Human annotation: need for cross-file

Value: Python: 98% of annotators said the reference requires cross-file context; Java: 100%
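
To make "roughly doubles" concrete, the exact-match figures above can be turned into ratios over the in-file baseline. The snippet below is only arithmetic on the numbers quoted in this section; the dictionary layout and the helper name `relative_gain` are illustrative choices.

```python
# Python exact-match (%) figures quoted in the Results section.
RESULTS = {
    "StarCoder-15.5B": {"in_file": 8.82, "retrieval": 15.72, "retrieval_w_ref": 21.01},
    "CodeGen25-7B": {"in_file": 7.73, "retrieval": 14.52, "retrieval_w_ref": 19.17},
}


def relative_gain(scores):
    """Ratio of each retrieval setting to the in-file baseline, rounded to 2 places."""
    base = scores["in_file"]
    return {name: round(value / base, 2)
            for name, value in scores.items() if name != "in_file"}


for model, scores in RESULTS.items():
    print(model, relative_gain(scores))
# StarCoder-15.5B {'retrieval': 1.78, 'retrieval_w_ref': 2.38}
# CodeGen25-7B {'retrieval': 1.88, 'retrieval_w_ref': 2.48}
```

Plain retrieval therefore gives about 1.8× to 1.9× the baseline, and the setting that uses the reference gives about 2.4× to 2.5×, while absolute exact match stays near or below 21%.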

What To Try In 7 Days

Add a repository retrieval step (BM25 or embeddings) that returns top-5 10-line chunks and prepend them to the model prompt.

Measure both exact-match and identifier-match for your codebase; identifier overlap strongly correlates with success.

Run a small human check: replace an import with a dummy class to find cross-file-required completions in your repo.
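
The first suggestion above (retrieve top-5 10-line chunks and prepend them) can be prototyped without external dependencies. The following is a minimal sketch, not the paper's pipeline: the BM25 scoring is a small self-written Okapi-style variant, the tokenizer is a simple identifier regex, the query is the whole in-file context, and the function names and the `# Retrieved from` comment header are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercased identifier-like tokens (an assumed, deliberately simple tokenizer)."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower())


def chunk_lines(text, size=10):
    """Split a file into non-overlapping chunks of `size` lines."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def bm25_top_k(query, chunks, k=5, k1=1.5, b=0.75):
    """Return indices of the k best-scoring chunks, skipping zero-score chunks."""
    docs = [Counter(tokenize(c)) for c in chunks]
    lengths = [sum(d.values()) for d in docs]
    avg_len = sum(lengths) / len(docs) if docs else 0.0
    doc_freq = Counter(term for d in docs for term in d)
    query_terms = set(tokenize(query))

    scored = []
    for idx, (doc, length) in enumerate(zip(docs, lengths)):
        score = 0.0
        for term in query_terms:
            tf = doc.get(term, 0)
            if tf == 0:
                continue
            n = doc_freq[term]
            idf = math.log((len(docs) - n + 0.5) / (n + 0.5) + 1.0)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * length / avg_len))
        if score > 0:
            scored.append((score, idx))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


def build_prompt(in_file_context, repo_files, k=5):
    """Prepend the top-k retrieved chunks, as comments-plus-code, to the in-file context."""
    chunks = [(path, chunk)
              for path, text in repo_files.items()
              for chunk in chunk_lines(text)]
    top = bm25_top_k(in_file_context, [chunk for _, chunk in chunks], k=k)
    header = "".join(f"# Retrieved from {chunks[i][0]}\n{chunks[i][1]}\n\n" for i in top)
    return header + in_file_context


REPO = {
    "helper.py": "class Helper:\n    @staticmethod\n    def build(path):\n        return path.upper()",
    "math_utils.py": "def add(a, b):\n    return a + b",
}
print(build_prompt("result = Helper.build(path)", REPO))
```

In this toy repository only `helper.py` shares tokens with the query, so it is the only chunk prepended. Swapping `bm25_top_k` for an embedding-based ranker would keep `build_prompt` unchanged, which makes the two retrieval styles easy to compare on your own codebase.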

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Evaluation is zero-shot; few-shot or finetuning effects are not tested and may change results.
  • Retrieval quality is a bottleneck: fixed 10-line chunks and token-based similarity can return unhelpful snippets.
  • Potential memorization: authors filtered many repositories but cannot fully guarantee models never saw similar content during pretraining.

When Not To Use

  • If you only need single-file completion benchmarks, a simpler dataset suffices.
  • If you are evaluating models that were fine-tuned or few-shot prompted with repository-level context and you have not adjusted the prompt length limits.
  • For closed-source private repos where licensing or privacy prevents sharing retrieval chunks.

Failure Modes

  • Retrieval returns irrelevant code and degrades completion (the model hallucinates new APIs).
  • Truncation of retrieved context can cut out the crucial lines and mislead the model.
  • Dataset still contains some long strings or hard-to-predict literals that cause annotator disagreement.

Core Entities

Models

  • CodeGen
  • CodeGen-350M
  • CodeGen-2.7B
  • CodeGen-6.1B
  • CodeGen-16.1B
  • CodeGen25-7B
  • StarCoder
  • StarCoder-15.5B
  • StarCoderBase-1B
  • StarCoderBase-3B
  • StarCoderBase-7B
  • GPT-3.5-turbo
  • UniXCoder

Metrics

  • Exact Match (EM)
  • Edit Similarity (ES)
  • Identifier EM
  • Identifier F1

Datasets

  • CrossCodeEval (this paper)
  • The Stack (excluded during collection)

Benchmarks

  • Code Match EM/ES
  • Identifier Match EM/F1