Overview
The dataset and ExecEval provide a practical way to measure functional correctness across languages; expect extra engineering to run heavy evaluation and to check for pretraining leakage.
Citations: 10
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/8
Reproducibility
Status: Code + data available
Open source: Partial
License: CC BY-NC 4.0
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
xCodeEval measures real functional correctness across many languages. Use it to benchmark developer-facing code assistants, choose runtimes for evaluation, and avoid over-relying on lexical metrics that miss runtime failures.
Summary TLDR
xCodeEval is a large execution-based benchmark for code models. It collects tens of millions of document-level solutions (the paper reports 25M samples and 16.5B tokens, although the introduction says 20M) across ~7.5K algorithmic problems and up to 17 programming languages. The benchmark defines 7 tasks (tag classification, compilation prediction, program synthesis, program repair, code translation, NL→code retrieval, code→code retrieval) and evaluates correctness by running unit tests via ExecEval, a distributed, Dockerized multi-language execution engine. Baselines (gpt-3.5-turbo, StarEncoder, fine-tuned StarCoderBase) show the suite is challenging: program synthesis pass@5 for gpt-3.5-turbo (ChatGPT) averages ~27.8% on xCodeEval.
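The pass@5 figure above is a pass@k score. A minimal sketch of how such a score is typically computed follows, assuming the widely used unbiased estimator 1 - C(n-c, k) / C(n, k); whether the paper uses exactly this estimator is an assumption, and the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of generated programs, c: how many passed all unit tests,
    k: sample budget. Returns the probability that at least one of k
    programs drawn without replacement from the n generations passes.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k programs failed,
    # so the estimate becomes 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With n equal to k, the estimate for a problem is simply 1.0 if any generation passed and 0.0 otherwise.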
Problem Statement
Current code benchmarks are fragmented: they cover few languages, use lexical metrics (not execution), focus on small scopes (statements/functions), and often lack balanced, leakage-free splits. These gaps prevent robust, multilingual, execution-level evaluation of models. xCodeEval provides a large, runnable, multilingual, multitask testbed and a secure execution engine to measure real functional correctness.
Main Contribution
A very large executable code corpus: reported as 25M document-level samples (16.5B tokens) from ~7.5K unique algorithmic problems, spanning many languages and tasks.
Seven tasks covering classification, generation, translation, and retrieval — all evaluated at execution level where applicable.
Key Findings
xCodeEval is large and multilingual.
Execution engine supports many runtimes and explicit outcomes.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Tag Classification (Desc+Code) macro-F1 average | 33.6 | gpt-3.5-turbo zero-shot | — | validation (per-language averages, Table 3) | DesCode2Tag macro-F1 avg = 33.6 (Table 3) | Table 3 |
| Compilation prediction accuracy (%) | 63.27 | gpt-3.5-turbo zero-shot | — | validation (11 languages, Table 3) | Compilation accuracy average = 63.27% (Table 3) | Table 3 |
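
The tag classification row reports macro-F1. As a rough illustration of that metric for multi-label tags, the sketch below averages per-tag F1 with equal weight per tag. The function name and the sample tags are hypothetical, and the paper's exact averaging scheme may differ.

```python
def macro_f1(gold, pred) -> float:
    """Macro-averaged F1 for multi-label predictions.

    gold, pred: equal-length lists of tag sets, one pair per example.
    Each tag gets its own F1 score; tags are then averaged with equal
    weight, so rare tags count as much as frequent ones.
    """
    tags = sorted(set().union(*gold, *pred))
    scores = []
    for tag in tags:
        tp = sum(1 for g, p in zip(gold, pred) if tag in g and tag in p)
        fp = sum(1 for g, p in zip(gold, pred) if tag not in g and tag in p)
        fn = sum(1 for g, p in zip(gold, pred) if tag in g and tag not in p)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical example: "math" is predicted on the wrong problem.
gold_tags = [{"dp", "math"}, {"greedy"}]
pred_tags = [{"dp"}, {"greedy", "math"}]
score = macro_f1(gold_tags, pred_tags)
```

Because each tag is weighted equally, one consistently missed tag lowers the score noticeably even when most individual predictions are correct.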
What To Try In 7 Days
Run ExecEval on a small set of model outputs to compare functional pass rates instead of BLEU/CodeBLEU.
Fine-tune a 3B model on xCodeEval training split for your target language and measure pass@k improvements.
Add an execution-based retrieval step to your code search pipeline to filter candidates that fail unit tests.
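The execution-based filtering idea above can be sketched as follows. This is a local stand-in that runs Python candidates with `subprocess`; it does not use ExecEval's actual API, and the helper names and test-case format (stdin text, expected stdout) are assumptions made for illustration.

```python
import subprocess
import sys


def passes_all(source: str, tests, timeout_s: float = 5.0) -> bool:
    """Return True if the program passes every (stdin, expected stdout) case.

    WARNING: this runs the candidate directly on the host. For untrusted
    code, use an isolated environment such as the Dockerized ExecEval setup.
    """
    for stdin_text, expected in tests:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", source],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        if proc.returncode != 0 or proc.stdout.strip() != expected.strip():
            return False
    return True


def filter_candidates(candidates, tests):
    """Keep only the retrieved programs that pass all unit tests."""
    return [src for src in candidates if passes_all(src, tests)]


def pass_rate(candidates, tests) -> float:
    """Fraction of candidates that pass all unit tests."""
    if not candidates:
        return 0.0
    return len(filter_candidates(candidates, tests)) / len(candidates)
```

The same `pass_rate` helper gives a functional number to set beside BLEU/CodeBLEU for a small batch of model outputs.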
Risks & Boundaries
Limitations
Single source: all data comes from Codeforces, so domain diversity is limited (Sec 5).
Language imbalance: some languages have many more samples than others (Appendix, Sec E).
When Not To Use
If you need human-audited, privacy-scrubbed code for production security reviews.
For benchmarks focusing only on short snippets, local reasoning, or API usage (xCodeEval is document/global-level).
Failure Modes
Unit tests may not capture all corner cases; a PASSED label only means tests supplied were satisfied.
ExecEval environment differences (time limit, memory limit, compiler flags) can change outcomes across setups.
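To illustrate how a limit alone can flip a verdict, the sketch below classifies one program under two different time limits. It is a simplified local harness, not ExecEval itself. Only the PASSED label appears in the text above; the other outcome names are illustrative assumptions.

```python
import subprocess
import sys


def classify(source: str, stdin_text: str, expected: str, time_limit_s: float) -> str:
    """Run a Python program on one test case and return an outcome label."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=time_limit_s,
        )
    except subprocess.TimeoutExpired:
        return "TIME_LIMIT_EXCEEDED"  # illustrative label
    if proc.returncode != 0:
        return "RUNTIME_ERROR"  # illustrative label
    if proc.stdout.strip() != expected.strip():
        return "WRONG_ANSWER"  # illustrative label
    return "PASSED"


# A correct but slow program: the verdict depends only on the limit chosen.
slow_program = "import time; time.sleep(1.0); print(42)"
strict_verdict = classify(slow_program, "", "42", time_limit_s=0.2)
lenient_verdict = classify(slow_program, "", "42", time_limit_s=10.0)
```

When comparing results across setups, recording the limits and compiler flags used alongside each score helps explain differences like this one.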

