xCodeEval — a 7-task, execution-first benchmark with millions of runnable, multilingual code examples

March 6, 2023 · 9 min

Overview

Decision Snapshot: Needs Validation

The dataset and ExecEval provide a practical way to measure functional correctness across languages; expect extra engineering to run heavy evaluation and to check for pretraining leakage.

Citations: 10

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/8

Reproducibility

Status: Code + data available

Open source: Partial

License: CC BY-NC 4.0

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Mohammad Abdullah Matin Khan, M Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, Shafiq Joty

Links

Abstract / PDF / Code / Data

Why It Matters For Business

xCodeEval measures real functional correctness across many languages. Use it to benchmark developer-facing code assistants, choose runtimes for evaluation, and avoid over-relying on lexical metrics that miss runtime failures.

Who Should Care

Summary TLDR

xCodeEval is a large execution-based benchmark for code models. It collects tens of millions of document-level solutions (reported as 25M samples and 16.5B tokens, though the intro says 20M) across ~7.5K algorithmic problems and up to 17 programming languages. The benchmark defines 7 tasks (tag classification, compilation prediction, program synthesis, program repair, code translation, NL→code retrieval, code→code retrieval) and evaluates correctness by running unit tests via ExecEval, a distributed, Dockerized multi-language execution engine. Baselines (gpt-3.5-turbo, StarEncoder, fine-tuned StarCoderBase) show the suite is challenging: synthesis pass@5 for ChatGPT averages ~27.8% on xCodeEval.

Problem Statement

Current code benchmarks are fragmented: they cover few languages, rely on lexical metrics rather than execution, focus on small scopes (statements or functions), and often lack balanced, leakage-free splits. These gaps prevent robust, multilingual, execution-level evaluation of models. xCodeEval provides a large, runnable, multilingual, multitask testbed and a secure execution engine to measure real functional correctness.

Main Contribution

A very large executable code corpus: reported as 25M document-level samples (16.5B tokens) from ~7.5K unique algorithmic problems, spanning many languages and tasks.

Seven tasks covering classification, generation, translation, and retrieval — all evaluated at execution level where applicable.

Key Findings

xCodeEval is large and multilingual.

Numbers: 25M samples; 16.5B tokens; ~7.5K problems; up to 17 languages (Table 8, Abstract)

Practical Use: Use this dataset when you need many runnable examples across languages for pretraining, fine-tuning, or robust multilingual evaluation.

Evidence Ref: Abstract, Table 8

Execution engine supports many runtimes and explicit outcomes.

Numbers: ExecEval supports 44 compiler/interpreter versions across 11 languages (Table 10, Sec 2.2)

Practical Use: You can run unit tests in a reproducible, sandboxed environment to measure true correctness rather than code-text overlap.

Evidence Ref: Section 2.2, Table 10
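To make the idea of execution-level verdicts concrete, here is a minimal local sketch of running a program against input/output unit tests. It is not ExecEval and does not use its API: the function names `run_case` and `judge` are hypothetical, only the `PASSED` label comes from the text above, the other verdict labels are illustrative, and the sketch runs Python programs only, with no sandboxing or memory limit.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_case(source: str, stdin_text: str, expected: str, time_limit: float = 2.0) -> str:
    """Run one Python program on one test case and return a verdict label.

    Assumption: labels other than PASSED are illustrative names for this sketch.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "solution.py"
        path.write_text(source)
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=time_limit,
            )
        except subprocess.TimeoutExpired:
            return "TIME_LIMIT_EXCEEDED"
    if proc.returncode != 0:
        # For an interpreted language, syntax errors also land here.
        return "RUNTIME_ERROR"
    # Compare token by token so trailing whitespace does not matter.
    if proc.stdout.split() == expected.split():
        return "PASSED"
    return "WRONG_ANSWER"


def judge(source: str, cases, time_limit: float = 2.0) -> str:
    """Return PASSED only if every case passes, else the first failing verdict."""
    for stdin_text, expected in cases:
        verdict = run_case(source, stdin_text, expected, time_limit)
        if verdict != "PASSED":
            return verdict
    return "PASSED"


if __name__ == "__main__":
    cases = [("1 2\n", "3\n"), ("10 -4\n", "6\n")]
    good = "a, b = map(int, input().split())\nprint(a + b)\n"
    bad = "a, b = map(int, input().split())\nprint(a - b)\n"
    print(judge(good, cases))
    print(judge(bad, cases))
```

A real setup would add compilation, memory limits, and container isolation, which is the engineering ExecEval is described as providing.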

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Tag Classification (Desc+Code) macro-F1 average | 33.6 | gpt-3.5-turbo zero-shot | - | validation (per-language averages, Table 3) | DesCode2Tag macro-F1 avg = 33.6 (Table 3) | Table 3
Accuracy | 63.27 | gpt-3.5-turbo zero-shot | - | validation (11 languages, Table 3) | Compilation accuracy average = 63.27% (Table 3) | Table 3
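For readers unfamiliar with the macro-F1 figure reported for tag classification, the sketch below shows how macro-F1 is commonly computed for multi-label predictions: F1 is computed per tag and then averaged with equal weight. The function name `macro_f1`, the example tags, and the choice to take the label set as the union of gold and predicted tags are assumptions for illustration; the paper's exact evaluation script may differ.

```python
def macro_f1(gold, pred):
    """Macro-averaged F1 for multi-label tagging.

    gold, pred: lists of sets of tags, one set per example (same length).
    Assumption: the label set is the union of all gold and predicted tags.
    """
    tags = set().union(*gold, *pred)
    scores = []
    for tag in sorted(tags):
        tp = sum(1 for g, p in zip(gold, pred) if tag in g and tag in p)
        fp = sum(1 for g, p in zip(gold, pred) if tag not in g and tag in p)
        fn = sum(1 for g, p in zip(gold, pred) if tag in g and tag not in p)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the tag never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Illustrative tags only.
    gold = [{"dp", "math"}, {"greedy"}, {"math"}]
    pred = [{"dp"}, {"greedy", "math"}, {"math"}]
    print(round(macro_f1(gold, pred), 4))
```

Because every tag counts equally, rare tags that a model never predicts pull the macro average down sharply, which is one reason scores on this task can look low.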

What To Try In 7 Days

Run ExecEval on a small set of model outputs to compare functional pass rates instead of BLEU/CodeBLEU.

Fine-tune a 3B model on xCodeEval training split for your target language and measure pass@k improvements.

Add an execution-based retrieval step to your code search pipeline to filter candidates that fail unit tests.
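The pass@k measurement mentioned in the steps above can be sketched with the widely used unbiased estimator introduced with HumanEval (Chen et al., 2021): for a problem with n generated samples of which c pass all unit tests, pass@k = 1 - C(n-c, k) / C(n, k). Whether xCodeEval's reported numbers use exactly this estimator is an assumption here, and the helper names `pass_at_k` and `mean_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical per-problem counts: (samples generated, samples passing).
    results = [(20, 3), (20, 0), (20, 20)]
    print(mean_pass_at_k(results, k=5))
```

Comparing this number before and after fine-tuning, on the same problems and with the same n, gives the pass@k improvement the second step asks for.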

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: CC BY-NC 4.0

Risks & Boundaries

Limitations

Single source: all data comes from Codeforces, so domain diversity is limited (Sec 5).

Language imbalance: some languages have many more samples than others (Appendix, Sec E).

When Not To Use

If you need human-audited, privacy-scrubbed code for production security reviews.

For benchmarks focusing only on short snippets, local reasoning, or API usage (xCodeEval is document/global-level).

Failure Modes

Unit tests may not capture all corner cases; a PASSED label only means tests supplied were satisfied.

ExecEval environment differences (time limit, memory limit, compiler flags) can change outcomes across setups.

Core Entities

Models

gpt-3.5-turbo-0301 (ChatGPT), StarEncoder (fine-tuned retriever), starcoderbase-3B (fine-tuned), CodeLlama-7b-Instruct, CodeLlama-13b-Instruct

Metrics

pass@k (functional correctness), macro-F1 (tag classification), Accuracy

Datasets

xCodeEval (NTU-NLP-sg/xCodeEval on Hugging Face), ExecEval (execution engine; GitHub)

Benchmarks

HumanEval, APPS, CodeContests, MBPP, HumanEval-X, TransCoder(-ST)