Overview
The paper offers a practical live benchmark with strong empirical evidence of contamination and cross-scenario behavior; it is useful now for realistic model comparisons but has dataset-size and domain limits.
Citations: 22
Evidence Strength: 0.90
Confidence: 0.88
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/6
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 55%
Production readiness: 70%
Novelty: 65%
Why It Matters For Business
LiveCodeBench reveals real gaps between closed and open models and the presence of training-set leakage; use it to benchmark models on realistic, recent contest problems and avoid inflated performance claims from contaminated or small benchmarks.
Summary TLDR
LiveCodeBench is a continuously updated code benchmark built from recent contest problems (LeetCode, AtCoder, CodeForces) to avoid dataset contamination and to evaluate multiple coding skills: code generation, self-repair, code execution, and test-output prediction. The dataset contains 511 problems (May 2023–May 2024) with ~17 tests per problem on average. The authors evaluate 52 models (18 base, 34 instruction-tuned) and show clear evidence of contamination in some models, strong correlations across scenarios (but meaningful differences), and that closed-access models still lead open models on this harder, live collection.
Problem Statement
Existing code benchmarks (HumanEval, MBPP, APPS) are limited: they focus mostly on natural-language-to-code, are small, and are at risk of being in model pretraining data (contamination). This makes comparison and generalization claims unreliable. LiveCodeBench builds a growing, time-tagged benchmark from contest problems and adds scenarios beyond simple generation to measure broader coding capabilities and avoid contamination.
Main Contribution
Live, time-stamped benchmark of contest problems (511 problems from May'23–May'24) to detect and avoid contamination.
Four evaluation scenarios: code generation, self-repair (debug from error feedback), code execution (predict program output), and test output prediction (predict expected outputs from problem statements).
Key Findings
Some models show clear signs of contamination: the performance of DeepSeek and GPT-4O drops on problems released after their stated cutoff dates.
Model performance is highly correlated across the four scenarios, but relative gaps change by task.
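The contamination finding rests on comparing a model's pass rate on problems released before and after its stated cutoff date. A minimal sketch of that comparison follows; the `Result` record layout, the function names, and the sample data are all hypothetical and are not taken from the paper's code or results.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Result:
    """One model attempt on one problem (hypothetical record layout)."""
    release_date: date  # when the contest problem was published
    passed: bool        # True if the generated solution passed all tests


def pass_rate(results):
    """Fraction of results that passed; None when there are no results."""
    if not results:
        return None
    return sum(r.passed for r in results) / len(results)


def split_by_cutoff(results, cutoff):
    """Return (pre-cutoff rate, post-cutoff rate) for a model's stated cutoff."""
    before = [r for r in results if r.release_date <= cutoff]
    after = [r for r in results if r.release_date > cutoff]
    return pass_rate(before), pass_rate(after)


# Made-up illustrative data, not results from the paper.
results = [
    Result(date(2023, 6, 1), True),
    Result(date(2023, 7, 15), True),
    Result(date(2023, 8, 20), True),
    Result(date(2023, 8, 30), False),
    Result(date(2023, 10, 5), True),
    Result(date(2023, 11, 12), False),
    Result(date(2023, 12, 1), False),
    Result(date(2024, 1, 9), False),
]
pre, post = split_by_cutoff(results, cutoff=date(2023, 9, 1))
print(f"pre-cutoff pass rate: {pre:.2f}, post-cutoff pass rate: {post:.2f}")
```

A large gap between the two rates is a warning sign rather than proof: later problems may also simply be harder, so the comparison is most informative when several models with different cutoffs are checked on the same time windows.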
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Code generation Pass@1 (total across difficulties) | GPT-4O-2024-05-13: 41.9% | n/a | n/a | LiveCodeBench (May'23–May'24) | Table 3: GPT-4O-2024-05-13 total Pass@1 = 41.9 | Table 3 |
| Code generation Pass@1 (total across difficulties) | GPT-4-Turbo-2024-04-09: 41.1% | n/a | n/a | LiveCodeBench (May'23–May'24) | Table 3: GPT-4-Turbo-2024-04-09 total Pass@1 = 41.1 | Table 3 |
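For readers unfamiliar with the metric in the table, Pass@1 is commonly computed with the unbiased pass@k estimator introduced alongside HumanEval: for a problem with n sampled solutions of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below assumes that estimator; whether the paper computes its numbers exactly this way, and how many samples it draws per problem, are assumptions here.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(per_problem, k=1):
    """Average pass@k over problems; per_problem is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)


# Hypothetical counts: 10 samples per problem, with 5, 0, and 10 correct.
print(benchmark_pass_at_k([(10, 5), (10, 0), (10, 10)], k=1))
```

For k = 1 the estimator reduces to c / n per problem, so the benchmark score is the mean fraction of correct samples across problems.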
What To Try In 7 Days
Run your model on LiveCodeBench post-cutoff problems to check for contamination.
Compare performance on generation, repair, and execution to find tooling weaknesses to prioritize.
Add generator-based hidden tests (few adversarial cases) to existing internal benchmarks.
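The last suggestion, generator-based hidden tests, can be sketched as a seeded input generator plus a trusted reference solution that supplies the expected outputs. Everything below is illustrative: the maximum-subarray task, the function names, and the buggy candidate are hypothetical stand-ins, not the paper's test-generation pipeline.

```python
import random


def reference_max_subarray(nums):
    """Trusted reference: maximum sum of a non-empty contiguous subarray (Kadane)."""
    best = current = nums[0]
    for x in nums[1:]:
        current = max(x, current + x)
        best = max(best, current)
    return best


def generate_inputs(count, seed=0):
    """Yield a few hand-written adversarial cases, then seeded random ones."""
    yield [-5]          # single negative element
    yield [-3, -1, -2]  # all negative: answer is the largest element
    yield [0, 0, 0]     # all zeros
    rng = random.Random(seed)
    for _ in range(count):
        size = rng.randint(1, 20)
        yield [rng.randint(-50, 50) for _ in range(size)]


def find_failures(candidate, count=200, seed=0):
    """Return inputs where the candidate disagrees with the reference."""
    return [nums for nums in generate_inputs(count, seed)
            if candidate(list(nums)) != reference_max_subarray(nums)]


def buggy_candidate(nums):
    """Hypothetical model output: wrongly allows the empty subarray (sum 0)."""
    best = current = 0
    for x in nums:
        current = max(0, current + x)
        best = max(best, current)
    return best


failures = find_failures(buggy_candidate)
print(f"{len(failures)} failing inputs, first: {failures[0]}")
```

Fixing the seed keeps the hidden test set reproducible across runs, while the hand-written cases target edge conditions that uniform random sampling rarely hits.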
Risks & Boundaries
Limitations
The post-cutoff evaluation set is small (349 problems in the post-September window), which leads to an estimated 1–1.5% variance in Pass@1.
Currently Python-only; does not measure multi-language ability.
When Not To Use
To evaluate non-Python language capabilities.
To judge performance on open-ended engineering tasks or large multi-file repos.
Failure Modes
Contamination can inflate model scores if post-cutoff filtering is not used.
Models that overfit to small benchmarks (e.g., HumanEval) may not generalize to more diverse problems.

