Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.7
Citation Count
10
Why It Matters For Business
xCodeEval measures real functional correctness across many languages. Use it to benchmark developer-facing code assistants, choose runtimes for evaluation, and avoid over-relying on lexical metrics that miss runtime failures.
Summary TLDR
xCodeEval is a large execution-based benchmark for code models. It collects tens of millions of document-level solutions (reported as 25M samples and 16.5B tokens, although the intro describes 20M) across ~7.5K algorithmic problems and up to 17 programming languages. The benchmark defines 7 tasks (tag classification, compilation prediction, program synthesis, program repair, code translation, NL→code retrieval, code→code retrieval) and evaluates correctness by running unit tests via ExecEval, a distributed, Dockerized multi-language execution engine. Baselines (gpt-3.5-turbo, StarEncoder, fine-tuned starcoderbase) show the suite is challenging: synthesis pass@5 for ChatGPT averages ~27.8% on xCodeEval.
Problem Statement
Current code benchmarks are fragmented: they cover few languages, rely on lexical metrics rather than execution, focus on small scopes (statements or functions), and often lack balanced, leakage-free splits. This prevents robust, multilingual, execution-level evaluation of models. xCodeEval provides a large, runnable, multilingual, multitask testbed and a secure execution engine to measure real functional correctness.
Main Contribution
A very large executable code corpus: reported as 25M document-level samples (16.5B tokens) from ~7.5K unique algorithmic problems, spanning many languages and tasks.
Seven tasks covering classification, generation, translation, and retrieval — all evaluated at execution level where applicable.
ExecEval: a Dockerized, distributed execution engine that supports 44 compiler/interpreter versions across 11 languages and returns detailed execution outcomes.
A graph-theoretic, flow-based data selection and splitting method to balance tags/problems/outcomes between train/validation/test and reduce sampling skew.
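The source describes the splitting method only as graph-theoretic and flow-based, so the following is a simplified illustration rather than the authors' formulation. It assumes a plain max-flow network (source → problem → tag → sink) in which hypothetical per-problem and per-tag capacities cap how many samples any single problem or tag can contribute to a split. The node names, capacities, and the `max_flow` helper are all invented for this sketch.

```python
from collections import defaultdict, deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp max flow over a dict-of-dicts capacity graph.

    Returns (total_flow, flow) where flow maps (u, v) to the units sent.
    """
    residual = defaultdict(lambda: defaultdict(int))
    for u, edges in capacity.items():
        for v, cap in edges.items():
            residual[u][v] += cap

    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            break

        # Find the bottleneck along the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck

    flow = {}
    for u, edges in capacity.items():
        for v, cap in edges.items():
            sent = cap - residual[u][v]
            if sent > 0:
                flow[(u, v)] = sent
    return total, flow

# Hypothetical toy instance: each problem may contribute at most 2 samples,
# and each tag may receive at most 2 samples in the split being built.
capacity = {
    "source": {"p1": 2, "p2": 2, "p3": 2},
    "p1": {"dp": 2, "graphs": 2},
    "p2": {"dp": 2},
    "p3": {"graphs": 2},
    "dp": {"sink": 2},
    "graphs": {"sink": 2},
}
total, flow = max_flow(capacity, "source", "sink")
per_tag = {tag: flow.get((tag, "sink"), 0) for tag in ("dp", "graphs")}
```

The flow on each problem → tag edge can be read as the number of samples to draw for that pair. A formulation that also enforces minimum counts per problem or tag would need lower bounds on edges, which this sketch does not model.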
Key Findings
xCodeEval is large and multilingual.
Execution engine supports many runtimes and explicit outcomes.
Strong LLMs find xCodeEval substantially harder than prior execution benchmarks.
Retrieval with executability is feasible and effective with dense encoders.
Temperature strongly affects executable output.
Results
Tag Classification (Desc+Code) macro-F1 average
Accuracy
Program Synthesis pass@5 (ChatGPT) average
Automatic Program Repair pass@5 (ChatGPT) average
Code Translation pass@5 (target-language average)
NL→Code retrieval Acc@k (StarEncoder)
Code→Code retrieval Acc@k (StarEncoder) averages
Smaller model fine-tune effect (starcoderbase-3B)
Who Should Care
What To Try In 7 Days
Run ExecEval on a small set of model outputs to compare functional pass rates instead of BLEU/CodeBLEU.
Fine-tune a 3B model on xCodeEval training split for your target language and measure pass@k improvements.
Add an execution-based retrieval step to your code search pipeline to filter candidates that fail unit tests.
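As a starting point for the first and third items above, here is a minimal local sketch of execution-based scoring. It does not use ExecEval or its API: it simply runs Python candidates with `subprocess`, with no sandboxing, and the function names (`run_case`, `functional_pass_rate`), the outcome labels, and the whitespace-insensitive output comparison are assumptions made for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_case(source_path, stdin_text, expected, time_limit=5.0):
    """Run one Python program on one test case and return an outcome label."""
    try:
        proc = subprocess.run(
            [sys.executable, str(source_path)],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=time_limit,
        )
    except subprocess.TimeoutExpired:
        return "TIME_LIMIT_EXCEEDED"
    if proc.returncode != 0:
        return "RUNTIME_ERROR"
    # Compare token by token so trailing whitespace does not matter.
    if proc.stdout.split() == expected.split():
        return "PASSED"
    return "WRONG_ANSWER"

def functional_pass_rate(candidates, tests):
    """Fraction of candidate programs that pass every (stdin, expected) test."""
    if not candidates:
        return 0.0
    passed = 0
    with tempfile.TemporaryDirectory() as tmp:
        for i, code in enumerate(candidates):
            path = Path(tmp) / f"cand_{i}.py"
            path.write_text(code)
            if all(run_case(path, t_in, t_out) == "PASSED" for t_in, t_out in tests):
                passed += 1
    return passed / len(candidates)

# Hypothetical candidates for "print the sum of two integers".
candidates = [
    "print(sum(map(int, input().split())))",  # correct
    "print(0)",                               # wrong answer
    "raise SystemExit(1)",                    # non-zero exit
]
tests = [("1 2\n", "3\n"), ("5 5\n", "10\n")]
rate = functional_pass_rate(candidates, tests)
```

The same `run_case` check can act as a filter on retrieved code: keep only candidates whose outcome is `PASSED` on every test. Untrusted code should be run inside an isolated environment such as the Dockerized ExecEval setup, not directly on a workstation.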
Reproducibility
License
- CC BY-NC 4.0
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Single source: all data comes from Codeforces, so domain diversity is limited (Sec 5).
- Language imbalance: some languages have many more samples than others (Appendix, Sec E).
- Document-level, often non-modular solutions without docstrings reduce realism for some tooling.
- Possible pretraining data leakage: models with unknown cutoffs may have seen parts of this corpus (Sec 3.2, K).
- Not human audited: automated filters removed ~2M samples, but sensitive data or vulnerabilities may remain (Sec 6).
When Not To Use
- If you need human-audited, privacy-scrubbed code for production security reviews.
- For benchmarks focusing only on short snippets, local reasoning, or API usage (xCodeEval is document/global-level).
- If you require a fully leakage-free test set without knowledge of pretraining cutoffs.
Failure Modes
- Unit tests may not capture all corner cases; a PASSED label only means tests supplied were satisfied.
- ExecEval environment differences (time limit, memory limit, compiler flags) can change outcomes across setups.
- Large models might memorize training data; reported improvements could reflect leakage, not generalization.
- Retrieval is sensitive to corpus size and hard negatives, lowering monolingual accuracy for large corpora.
Core Entities
Models
- gpt-3.5-turbo-0301 (ChatGPT)
- StarEncoder (fine-tuned retriever)
- starcoderbase-3B (fine-tuned)
- CodeLlama-7b-Instruct
- CodeLlama-13b-Instruct
Metrics
- pass@k (functional correctness)
- macro-F1 (tag classification)
- Accuracy
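For reference, the first two metrics above can be computed as in the sketch below. `pass_at_k` uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of generated samples and c the number that pass all tests; the source does not state which estimator the paper uses, so that choice is an assumption. `macro_f1` assumes multi-label tag prediction, with each example's tags given as a set.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c pass all tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def macro_f1(gold, pred):
    """Unweighted mean of per-tag F1 over multi-label predictions.

    gold and pred are equal-length lists of tag sets, one set per example.
    """
    labels = set().union(*gold, *pred) if gold or pred else set()
    scores = []
    for label in sorted(labels):
        tp = sum(label in g and label in p for g, p in zip(gold, pred))
        fp = sum(label not in g and label in p for g, p in zip(gold, pred))
        fn = sum(label in g and label not in p for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical inputs: 10 samples with 3 passing, evaluated at k=5.
estimate = pass_at_k(10, 3, 5)

gold = [{"dp"}, {"dp", "graphs"}, {"graphs"}]
pred = [{"dp"}, {"dp"}, {"dp"}]
score = macro_f1(gold, pred)
```

Per-problem pass@k values are typically averaged across problems to produce the headline number.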
Datasets
- xCodeEval (NTU-NLP-sg/xCodeEval on Hugging Face)
- ExecEval (execution engine; GitHub)
Benchmarks
- HumanEval
- APPS
- CodeContests
- MBPP
- HumanEval-X
- TransCoder(-ST)

