xCodeEval — a 7-task, execution-first benchmark with millions of runnable, multilingual code examples

Overview

Decision Snapshot: Needs Validation

The dataset and ExecEval provide a practical way to measure functional correctness across languages; expect extra engineering to run heavy evaluation and to check for pretraining leakage.

Citations: 10

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/8

Reproducibility

Status: Code + data available

Open source: Partial

License: CC BY-NC 4.0

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Mohammad Abdullah Matin Khan, M Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, Shafiq Joty

Links

Abstract / PDF / Code / Data

Why It Matters For Business

xCodeEval measures real functional correctness across many languages. Use it to benchmark developer-facing code assistants, choose runtimes for evaluation, and avoid over-relying on lexical metrics that miss runtime failures.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead, Data Scientist

Summary TLDR

xCodeEval is a large execution-based benchmark for code models. It collects tens of millions of document-level solutions (reported as 25M samples and 16.5B tokens, though the intro describes it as 20M) across ~7.5K algorithmic problems and up to 17 programming languages. The benchmark defines 7 tasks (tag classification, compilation prediction, program synthesis, program repair, code translation, NL→code retrieval, code→code retrieval) and evaluates correctness by running unit tests via ExecEval, a distributed, Dockerized multi-language execution engine. Baselines (gpt-3.5-turbo, StarEncoder, and a fine-tuned starcoderbase-3B) show the suite is challenging: synthesis pass@5 for ChatGPT averages ~27.8% on xCodeEval.
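To make the pass@5 figure concrete, here is a minimal sketch of the commonly used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), where n samples are generated per problem and c of them pass every unit test. Whether xCodeEval computes its reported numbers with exactly this estimator is an assumption; the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which passed all tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # contains no passing sample; math.comb returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each item is (n_samples, n_passed)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A benchmark-level score is then the mean over problems, for example `mean_pass_at_k([(20, 3), (20, 0)], k=5)`.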

Problem Statement

Current code benchmarks are fragmented: they cover few languages, use lexical metrics (not execution), focus on small scopes (statements/functions), and often lack balanced, leakage-free splits. This prevents robust, multilingual, execution-level evaluation of models. xCodeEval creates a large, runnable, multilingual, multitask testbed and a secure execution engine to measure real functional correctness.

Main Contribution

A very large executable code corpus: reported as 25M document-level samples (16.5B tokens) from ~7.5K unique algorithmic problems, spanning many languages and tasks.

Seven tasks covering classification, generation, translation, and retrieval — all evaluated at execution level where applicable.

Key Findings

xCodeEval is large and multilingual.

Numbers: 25M samples; 16.5B tokens; ~7.5K problems; up to 17 languages (Table 8, Abstract)

Practical Use: Use this dataset when you need many runnable examples across languages for pretraining, fine-tuning, or robust multilingual evaluation.

Evidence Ref: Abstract, Table 8

Execution engine supports many runtimes and explicit outcomes.

Numbers: ExecEval supports 44 compiler/interpreter versions across 11 languages (Table 10, Sec 2.2)

Practical Use: You can run unit tests in a reproducible, sandboxed environment to measure true correctness rather than code-text overlap.

Evidence Ref: Section 2.2, Table 10
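As a rough illustration of what "explicit outcomes" means, the sketch below runs a program against stdin/stdout unit tests and maps each run to an outcome label. It does not use ExecEval's API and provides no sandboxing: it relies only on Python's standard `subprocess` module, and the label names other than PASSED, as well as the function names, are assumptions made for this example.

```python
import subprocess
import sys


def run_case(cmd: list[str], stdin_text: str, expected: str,
             time_limit_s: float = 2.0) -> str:
    """Run one unit test and return an outcome label (label names are illustrative)."""
    try:
        proc = subprocess.run(cmd, input=stdin_text, capture_output=True,
                              text=True, timeout=time_limit_s)
    except subprocess.TimeoutExpired:
        return "TIME_LIMIT_EXCEEDED"
    if proc.returncode != 0:
        return "RUNTIME_ERROR"
    # Compare whitespace-separated tokens so trailing newlines do not matter.
    if proc.stdout.split() == expected.split():
        return "PASSED"
    return "WRONG_ANSWER"


def run_tests(cmd: list[str], tests: list[tuple[str, str]],
              time_limit_s: float = 2.0) -> str:
    """Return the first non-PASSED outcome, or PASSED if every test succeeds."""
    for stdin_text, expected in tests:
        outcome = run_case(cmd, stdin_text, expected, time_limit_s)
        if outcome != "PASSED":
            return outcome
    return "PASSED"


if __name__ == "__main__":
    # Hypothetical candidate: a one-line program that sums the integers on stdin.
    adder = [sys.executable, "-c", "print(sum(map(int, input().split())))"]
    print(run_tests(adder, [("1 2\n", "3\n"), ("10 -4\n", "6\n")]))
```

Reporting the first failing outcome rather than a bare pass/fail flag is what lets an evaluation separate wrong answers from crashes and timeouts.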

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Tag Classification (Desc+Code) macro-F1 average	33.6	gpt-3.5-turbo zero-shot	—	validation (per-language averages, Table 3)	DesCode2Tag macro-F1 avg = 33.6 (Table 3)	Table 3
Compilation prediction accuracy (%)	63.27	gpt-3.5-turbo zero-shot	—	validation (11 languages, Table 3)	Compilation accuracy average = 63.27% (Table 3)	Table 3
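For readers unfamiliar with the tag-classification metric, the following is a minimal sketch of macro-F1 for multi-label tags: F1 is computed per tag and then averaged with equal weight, so rare tags count as much as frequent ones. It returns a fraction in [0, 1], whereas the table reports values on a 0-100 scale. The exact averaging protocol used in the paper is an assumption here, and the function name is illustrative.

```python
def macro_f1(gold: list[set[str]], pred: list[set[str]]) -> float:
    """Macro-F1 for multi-label tags: per-tag F1, averaged with equal weight."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tags = sorted(set().union(*gold, *pred))
    if not tags:
        return 0.0
    scores = []
    for tag in tags:
        tp = sum(tag in g and tag in p for g, p in zip(gold, pred))
        fp = sum(tag not in g and tag in p for g, p in zip(gold, pred))
        fn = sum(tag in g and tag not in p for g, p in zip(gold, pred))
        # Every tag here occurs in gold or pred, so the denominator is >= 1.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)
```

For example, `macro_f1([{"dp", "math"}, {"greedy"}], [{"dp"}, {"greedy", "math"}])` scores `dp` and `greedy` perfectly and `math` at zero, so one badly predicted tag pulls the average down by a third.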

What To Try In 7 Days

Run ExecEval on a small set of model outputs to compare functional pass rates instead of BLEU/CodeBLEU.

Fine-tune a 3B model on xCodeEval training split for your target language and measure pass@k improvements.

Add an execution-based retrieval step to your code search pipeline to filter candidates that fail unit tests.
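The last suggestion above can be sketched as a post-retrieval filter: keep the retriever's ranking order but drop any candidate that fails a unit test. This is an illustrative sketch only. `passes_tests` is a hypothetical in-process checker that assumes each candidate is Python source defining `solve(x)`; `exec` offers no isolation, so real candidates would need to run in a sandbox such as ExecEval instead.

```python
from typing import Callable


def passes_tests(source: str, tests: list[tuple[int, int]]) -> bool:
    """Hypothetical checker: exec a candidate defining solve(x), compare outputs.

    exec() provides no isolation; untrusted code belongs in a sandbox.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        return all(namespace["solve"](x) == y for x, y in tests)
    except Exception:
        # Syntax errors, missing solve(), and runtime errors all count as failures.
        return False


def filter_by_execution(
    ranked: list[str],
    tests: list[tuple[int, int]],
    checker: Callable[[str, list[tuple[int, int]]], bool] = passes_tests,
    top_k: int = 3,
) -> list[str]:
    """Return up to top_k candidates, in ranked order, that pass every test."""
    kept: list[str] = []
    for source in ranked:
        if checker(source, tests):
            kept.append(source)
            if len(kept) == top_k:
                break
    return kept
```

Passing the checker in as a parameter keeps the ranking logic independent of how code is executed, so the in-process stub can be swapped for a sandboxed runner without touching the filter.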

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: CC BY-NC 4.0

Code URLs

https://github.com/ntunlp/xCodeEval

https://github.com/ntunlp/ExecEval

Data URLs

https://huggingface.co/datasets/NTU-NLP-sg/xCodeEval

Risks & Boundaries

Limitations

Single source: all data comes from Codeforces, so domain diversity is limited (Sec 5).

Language imbalance: some languages have many more samples than others (Appendix, Sec E).

When Not To Use

If you need human-audited, privacy-scrubbed code for production security reviews.

For benchmarks focusing only on short snippets, local reasoning, or API usage (xCodeEval is document/global-level).

Failure Modes

Unit tests may not capture all corner cases; a PASSED label only means tests supplied were satisfied.

ExecEval environment differences (time limit, memory limit, compiler flags) can change outcomes across setups.

Core Entities

Models

gpt-3.5-turbo-0301 (ChatGPT), StarEncoder (fine-tuned retriever), starcoderbase-3B (fine-tuned), CodeLlama-7b-Instruct, CodeLlama-13b-Instruct

Metrics

pass@k (functional correctness), macro-F1 (tag classification), Accuracy

Datasets

xCodeEval (NTU-NLP-sg/xCodeEval on Hugging Face), ExecEval (execution engine; GitHub)

Benchmarks

HumanEval, APPS, CodeContests, MBPP, HumanEval-X, TransCoder(-ST)
