xCodeEval — a 7-task, execution-first benchmark with millions of runnable, multilingual code examples

March 6, 2023 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

10

Authors

Mohammad Abdullah Matin Khan, M Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, Shafiq Joty

Links

Abstract / PDF

Why It Matters For Business

xCodeEval measures real functional correctness across many languages. Use it to benchmark developer-facing code assistants, choose runtimes for evaluation, and avoid over-relying on lexical metrics that miss runtime failures.

Summary TLDR

xCodeEval is a large execution-based benchmark for code models. It collects tens of millions of document-level solutions (reported as 25M samples and 16.5B tokens, though the introduction says 20M) across ~7.5K algorithmic problems and up to 17 programming languages. The benchmark defines 7 tasks (tag classification, compilation prediction, program synthesis, program repair, code translation, NL→code retrieval, code→code retrieval) and evaluates correctness by running unit tests via ExecEval, a distributed, Dockerized multi-language execution engine. Baselines (gpt-3.5-turbo, StarEncoder, fine-tuned starcoderbase) show the suite is challenging: synthesis pass@5 for ChatGPT averages ~27.8% on xCodeEval.

Problem Statement

Current code benchmarks are fragmented: they cover few languages, rely on lexical metrics rather than execution, focus on small scopes (statements or functions), and often lack balanced, leakage-free splits. This prevents robust, multilingual, execution-level evaluation of models. xCodeEval provides a large, runnable, multilingual, multitask testbed and a secure execution engine to measure real functional correctness.

Main Contribution

A very large executable code corpus: reported as 25M document-level samples (16.5B tokens) from ~7.5K unique algorithmic problems, spanning many languages and tasks.

Seven tasks covering classification, generation, translation, and retrieval — all evaluated at execution level where applicable.

ExecEval: a Dockerized, distributed execution engine that supports 44 compiler/interpreter versions across 11 languages and returns detailed execution outcomes.

A graph-theoretic, flow-based data selection and splitting method to balance tags/problems/outcomes between train/validation/test and reduce sampling skew.
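To make the idea of flow-based selection concrete, here is a minimal sketch that picks problems for a held-out split so that per-tag quotas are respected. It is an illustration under stated assumptions, not the paper's formulation: the graph layout (source → problem → tag → sink), the function names `max_flow` and `select_problems`, and the quota values are all hypothetical, and a plain Edmonds-Karp max flow stands in for whatever solver the authors used.

```python
from collections import defaultdict, deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp max flow. `capacity` maps u -> {v: integer capacity}."""
    residual = defaultdict(lambda: defaultdict(int))
    for u, neighbours in capacity.items():
        for v, cap in neighbours.items():
            residual[u][v] += cap
            residual[v][u] += 0  # make sure the reverse edge exists
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total, residual
        # Collect the path edges, then push the bottleneck along them.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        total += bottleneck


def select_problems(problem_tags, tag_quota):
    """Pick each problem at most once while filling per-tag quotas (hypothetical setup)."""
    capacity = {"source": {}}
    for problem, tags in problem_tags.items():
        capacity["source"]["P:" + problem] = 1  # a problem is selected at most once
        capacity["P:" + problem] = {"T:" + tag: 1 for tag in tags}
    for tag, quota in tag_quota.items():
        capacity["T:" + tag] = {"sink": quota}  # at most `quota` problems per tag
    total, residual = max_flow(capacity, "source", "sink")
    chosen = {}
    for problem, tags in problem_tags.items():
        for tag in tags:
            if residual["T:" + tag]["P:" + problem] > 0:  # reverse edge carries flow
                chosen[problem] = tag
    return total, chosen


# Toy input: four problems, two tags, made-up quotas.
toy_tags = {"p1": ["dp", "graphs"], "p2": ["dp"], "p3": ["graphs"], "p4": ["dp"]}
total, chosen = select_problems(toy_tags, {"dp": 2, "graphs": 1})
print(total, chosen)
```

The appeal of a flow formulation is that quotas become edge capacities, so the solver balances tags globally instead of sampling each tag independently.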

Key Findings

xCodeEval is large and multilingual.

Numbers: 25M samples; 16.5B tokens; ~7.5K problems; up to 17 languages (Table 8, Abstract)

Execution engine supports many runtimes and explicit outcomes.

Numbers: ExecEval supports 44 compiler/interpreter versions across 11 languages (Table 10, Sec 2.2)

Strong LLMs find xCodeEval substantially harder than prior execution benchmarks.

Numbers: ChatGPT avg pass@5 ≈ 27.8% on program synthesis vs. a reported 65.8% pass@1 on HumanEval (paper cites OpenAI figures)

Retrieval with executability is feasible and effective with dense encoders.

Numbers: StarEncoder NL→Code Acc@k = 83.83%; Code→Code α = 56.43%, γ = 68.66% (Table 4)

Temperature strongly affects executable output.

Numbers: best pass rates observed at temperature ≈ 0.32 in a sweep over 0.0–2.0 (20 temperatures) (Sec 3.2)
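The reported optimum of about 0.32 is consistent with an evenly spaced grid of 20 temperatures between 0.0 and 2.0, whose step is 2/19 ≈ 0.105. The sketch below builds such a grid and picks the best setting from measured pass rates. The even spacing is an assumption (the summary only gives the range and the count), and `temperature_grid` and `best_temperature` are hypothetical helper names.

```python
def temperature_grid(low=0.0, high=2.0, steps=20):
    """Evenly spaced sampling temperatures, rounded to two decimals (assumed spacing)."""
    step = (high - low) / (steps - 1)
    return [round(low + i * step, 2) for i in range(steps)]


def best_temperature(pass_rate_by_temp):
    """Return the temperature with the highest measured pass rate."""
    return max(pass_rate_by_temp, key=pass_rate_by_temp.get)


grid = temperature_grid()
print(grid[:5])  # the fourth value is 6/19 rounded to two decimals
```

To reproduce a sweep, generate completions at each grid value, score them by execution, and pass the resulting `{temperature: pass_rate}` mapping to `best_temperature`.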

Results

Tag Classification (Desc+Code) macro-F1 average

Value: 33.6

Baseline: gpt-3.5-turbo zero-shot

Accuracy

Value: 63.27

Baseline: gpt-3.5-turbo zero-shot

Program Synthesis pass@5 (ChatGPT) average

Value: 27.8

Baseline: gpt-3.5-turbo zero-shot (n/T settings)

Automatic Program Repair pass@5 (ChatGPT) average

Value: 55.07

Baseline: gpt-3.5-turbo zero-shot

Code Translation pass@5 (target-language average)

Value: 45.08

Baseline: gpt-3.5-turbo zero-shot

NL→Code retrieval Acc@k (StarEncoder)

Value: 83.83

Baseline: StarEncoder fine-tuned

Code→Code retrieval Acc@k (StarEncoder) averages

Value: α = 56.43, γ = 68.66

Baseline: StarEncoder fine-tuned

Smaller model fine-tune effect (starcoderbase-3B)

Value: pass@5 avg = 2.25

Baseline: CodeLlama-13b-Instruct pass@5 avg = 3.81

Who Should Care

What To Try In 7 Days

Run ExecEval on a small set of model outputs to compare functional pass rates instead of BLEU/CodeBLEU.

Fine-tune a 3B model on xCodeEval training split for your target language and measure pass@k improvements.

Add an execution-based retrieval step to your code search pipeline to filter candidates that fail unit tests.
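As a starting point for the first and third items, the sketch below scores a candidate program by running it against input/output unit tests. It is a local stand-in, not ExecEval and not its API: it handles Python candidates only, runs them unsandboxed through `subprocess`, and uses a hypothetical `run_candidate` helper. Only the PASSED label comes from the benchmark as described here; the other verdict names are illustrative.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(source_code, tests, time_limit=5.0):
    """Run a Python candidate on (stdin, expected stdout) pairs and return a verdict.

    WARNING: this executes untrusted code without a sandbox; use a container
    or a proper execution engine for real model outputs.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "solution.py"
        path.write_text(source_code)
        for stdin_text, expected in tests:
            try:
                proc = subprocess.run(
                    [sys.executable, str(path)],
                    input=stdin_text,
                    capture_output=True,
                    text=True,
                    timeout=time_limit,
                )
            except subprocess.TimeoutExpired:
                return "TIME_LIMIT_EXCEEDED"  # illustrative label
            if proc.returncode != 0:
                return "RUNTIME_ERROR"  # illustrative label
            # Compare whitespace-separated tokens so trailing newlines do not matter.
            if proc.stdout.split() != expected.split():
                return "WRONG_ANSWER"  # illustrative label
    return "PASSED"


# Toy unit tests for an "add two integers" task.
tests = [("1 2\n", "3\n"), ("10 -4\n", "6\n")]
good = "a, b = map(int, input().split())\nprint(a + b)\n"
print(run_candidate(good, tests))
```

For retrieval, the same helper can filter a ranked candidate list: keep only the entries whose verdict is PASSED before showing results to the user.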

Reproducibility

License

  • CC BY-NC 4.0

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Single source: all data comes from Codeforces, so domain diversity is limited (Sec 5).
  • Language imbalance: some languages have many more samples than others (Appendix, Sec E).
  • Document-level, often non-modular solutions without docstrings reduce realism for some tooling.
  • Possible pretraining data leakage: models with unknown cutoffs may have seen parts of this corpus (Sec 3.2, K).
  • Not human audited: automated filters removed ~2M samples, but sensitive data or vulnerabilities may remain (Sec 6).

When Not To Use

  • If you need human-audited, privacy-scrubbed code for production security reviews.
  • For benchmarks focusing only on short snippets, local reasoning, or API usage (xCodeEval is document/global-level).
  • If you require a fully leakage-free test set without knowledge of pretraining cutoffs.

Failure Modes

  • Unit tests may not capture all corner cases; a PASSED label only means the supplied tests were satisfied.
  • ExecEval environment differences (time limit, memory, compiler flags) can change outcomes across setups.
  • Large models might memorize training data; reported improvements could reflect leakage, not generalization.
  • Retrieval is sensitive to corpus size and hard negatives, lowering monolingual accuracy for large corpora.

Core Entities

Models

  • gpt-3.5-turbo-0301 (ChatGPT)
  • StarEncoder (fine-tuned retriever)
  • starcoderbase-3B (fine-tuned)
  • CodeLlama-7b-Instruct
  • CodeLlama-13b-Instruct

Metrics

  • pass@k (functional correctness)
  • macro-F1 (tag classification)
  • Accuracy
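For reference, the two less familiar metrics can be computed as follows. The pass@k function uses the standard unbiased estimator, 1 − C(n−c, k)/C(n, k), for n samples of which c pass; whether xCodeEval applies exactly this estimator is an assumption here. The macro-F1 function treats tag classification as multi-label (each sample has a set of tags), which is also an assumption about the task setup. Function names are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k draws from n samples passes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def macro_f1(gold, pred):
    """Multi-label macro-F1: per-tag F1 averaged over all tags seen in gold or pred."""
    labels = sorted(set().union(*gold, *pred))
    scores = []
    for label in labels:
        tp = sum(label in g and label in p for g, p in zip(gold, pred))
        fp = sum(label not in g and label in p for g, p in zip(gold, pred))
        fn = sum(label in g and label not in p for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


print(pass_at_k(n=5, c=1, k=1))
print(macro_f1([{"dp"}, {"graphs"}], [{"dp"}, {"dp"}]))
```

Macro averaging weights every tag equally, so rare tags count as much as frequent ones.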

Datasets

  • xCodeEval (NTU-NLP-sg/xCodeEval on Hugging Face)
  • ExecEval (execution engine; GitHub)

Benchmarks

  • HumanEval
  • APPS
  • CodeContests
  • MBPP
  • HumanEval-X
  • TransCoder(-ST)