Overview
Production Readiness
0.7
Novelty Score
0.65
Cost Impact Score
0.55
Citation Count
22
Why It Matters For Business
LiveCodeBench reveals real gaps between closed and open models and the presence of training-set leakage; use it to benchmark models on realistic, recent contest problems and avoid inflated performance claims from contaminated or small benchmarks.
Summary TLDR
LiveCodeBench is a continuously updated code benchmark built from recent contest problems (LeetCode, AtCoder, CodeForces) to avoid dataset contamination and to evaluate multiple coding skills: code generation, self-repair, code execution, and test-output prediction. The dataset contains 511 problems (May 2023–May 2024) with ~17 tests per problem on average. The authors evaluate 52 models (18 base, 34 instruction-tuned) and show clear evidence of contamination in some models, strong correlations across scenarios (but meaningful differences), and that closed-access models still lead open models on this harder, live collection.
Problem Statement
Existing code benchmarks (HumanEval, MBPP, APPS) are limited: they focus mostly on natural-language-to-code, are small, and are at risk of being in model pretraining data (contamination). This makes comparison and generalization claims unreliable. LiveCodeBench builds a growing, time-tagged benchmark from contest problems and adds scenarios beyond simple generation to measure broader coding capabilities and avoid contamination.
Main Contribution
Live, time-stamped benchmark of contest problems (511 problems from May'23–May'24) to detect and avoid contamination.
Four evaluation scenarios: code generation, self-repair (debug from error feedback), code execution (predict program output), and test output prediction (predict expected outputs from problem statements).
High-quality curation: problems from LeetCode/AtCoder/CodeForces, ~17 tests per problem on average, generator-based test creation for hidden tests.
Large-scale evaluation: 52 models (18 base + 34 instruction-tuned), with public prompts, completions, and a toolkit promised for community use.
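The generator-based test creation mentioned above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the toy problem (maximum subarray sum), the function names `reference_solution`, `generate_input`, `build_hidden_tests`, and `passes_all`, and the size limits are all assumptions made for the example. The idea is that a random input generator plus a trusted reference solution yields hidden input/output pairs that a candidate program must match.

```python
import random

def reference_solution(nums):
    """Trusted solution for a toy problem: maximum subarray sum (Kadane)."""
    best = current = nums[0]
    for x in nums[1:]:
        current = max(x, current + x)
        best = max(best, current)
    return best

def generate_input(rng, max_len=8, max_abs=10):
    """Hypothetical input generator: a random non-empty list of integers."""
    n = rng.randint(1, max_len)
    return [rng.randint(-max_abs, max_abs) for _ in range(n)]

def build_hidden_tests(num_random, seed=0):
    """Combine a few hand-picked edge cases with randomly generated inputs.

    Expected outputs always come from the reference solution.
    """
    rng = random.Random(seed)  # fixed seed keeps the hidden tests reproducible
    edge_cases = [[-5], [0], [3, -1, 4]]
    inputs = edge_cases + [generate_input(rng) for _ in range(num_random)]
    return [(inp, reference_solution(inp)) for inp in inputs]

def passes_all(candidate, tests):
    """A candidate passes only if it matches every expected output."""
    return all(candidate(list(inp)) == expected for inp, expected in tests)

hidden_tests = build_hidden_tests(num_random=14)
buggy_candidate = max  # wrong: returns the largest element, not the best subarray sum
print(len(hidden_tests), passes_all(reference_solution, hidden_tests),
      passes_all(buggy_candidate, hidden_tests))
```

Fixing the seed makes the hidden test set stable across runs, so scores stay comparable between models.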
Key Findings
Some models show clear contamination: DeepSeek and GPT-4O performance drops on problems released after their stated training cutoff dates.
Model performance is highly correlated across the four scenarios, but relative gaps change by task.
Many fine-tuned open models overfit older, small benchmarks like HumanEval.
Closed-access models lead open models on LiveCodeBench; a few large instruction-tuned open models narrow the gap.
Post-training (instruction tuning or SFT) improves LiveCodeBench performance.
Results
Code generation Pass@1 (total across difficulties)
Self-repair Pass@1 (total)
Test output prediction Pass@1
Code execution Pass@1 with Chain-of-Thought (CoT)
Who Should Care
What To Try In 7 Days
Run your model on LiveCodeBench post-cutoff problems to check for contamination.
Compare performance on generation, repair, and execution to find tooling weaknesses to prioritize.
Add generator-based hidden tests (including a few adversarial cases) to existing internal benchmarks.
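The contamination check in the first item above amounts to splitting per-problem results by release date and comparing Pass@1 on each side of the model's stated cutoff. The sketch below assumes you already have one `(release_date, passed)` record per problem. The record layout, the helper names, and the toy data are hypothetical and are not part of the LiveCodeBench toolkit.

```python
from datetime import date

def pass_at_1(results):
    """Mean of per-problem pass indicators (True/False)."""
    if not results:
        raise ValueError("no problems in this window")
    return sum(results) / len(results)

def split_by_cutoff(records, cutoff):
    """Return (Pass@1 on or before cutoff, Pass@1 after cutoff).

    records: iterable of (release_date, passed) pairs, one per problem.
    """
    before = [passed for released, passed in records if released <= cutoff]
    after = [passed for released, passed in records if released > cutoff]
    return pass_at_1(before), pass_at_1(after)

# Toy records for illustration only; these are not real benchmark results.
records = [
    (date(2023, 6, 1), True), (date(2023, 7, 1), True),
    (date(2023, 8, 1), True), (date(2023, 8, 15), False),
    (date(2023, 10, 1), True), (date(2023, 11, 1), False),
    (date(2023, 12, 1), False), (date(2024, 1, 1), False),
]
before, after = split_by_cutoff(records, cutoff=date(2023, 9, 1))
print(f"pre-cutoff Pass@1={before:.2f}, post-cutoff Pass@1={after:.2f}")
```

A large drop after the cutoff is a signal worth investigating rather than proof of contamination, since problem difficulty can also shift over time.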
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- The problem set available for post-cutoff evaluation is limited (349 problems in the post-September window), which leads to an estimated 1–1.5% variation in Pass@1.
- Currently Python-only; does not measure multi-language ability.
- Prompting differences can affect results; prompts were not exhaustively tuned for every open model.
- Problem domain focuses on contest programming, not open-ended real-world codebases.
When Not To Use
- To evaluate non-Python language capabilities.
- To judge performance on open-ended engineering tasks or large multi-file repos.
- To resolve small performance differences (on the order of 1–2%) without statistical testing.
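One common way to add the statistical testing mentioned above is a paired bootstrap over problems. The sketch below is an assumption about how such a check could look and is not a procedure described in the source. `paired_bootstrap` is a hypothetical helper that takes per-problem 0/1 pass indicators for two models on the same problems.

```python
import random

def paired_bootstrap(a, b, num_resamples=2000, seed=0):
    """Compare two models evaluated on the same problems.

    a, b: per-problem pass indicators (0 or 1), aligned by problem.
    Returns (observed Pass@1 difference a - b,
             fraction of resamples in which a is NOT better than b).
    """
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(a)
    diffs = [x - y for x, y in zip(a, b)]
    observed = sum(diffs) / n
    not_better = 0
    for _ in range(num_resamples):
        # Resample problems with replacement, keeping each (a, b) pair together.
        sample = rng.choices(diffs, k=n)
        if sum(sample) <= 0:
            not_better += 1
    return observed, not_better / num_resamples
```

If the returned fraction is not small, a 1–2% gap between the two models should be treated as within noise.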
Failure Modes
- Contamination can inflate model scores if post-cutoff filtering is not used.
- Overfitting to small benchmarks (e.g., HumanEval) may not generalize to diverse problems.
- Prompt sensitivity can change rankings, especially for open models.
Core Entities
Models
- GPT-4-Turbo-2024-04-09
- GPT-4O-2024-05-13
- GPT-4-Turbo-1106
- Claude-3-Opus
- DS-Ins-33B
- L3-Ins-70B
- Mixtral
- Codestral
- StarCoder2
- CodeLLaMa
- Phind-34B
Metrics
- Pass@1
Datasets
- LiveCodeBench (511 problems May'23–May'24)
- HumanEval+
- CRUXEval
Benchmarks
- HumanEval
- HumanEval+
- APPS
- MBPP
Context Entities
Models
- DeepSeek (DSCoder family)
- Gemini-Pro-1.5
- Mistral-Large
- CodeQwen
- Llama 3
Metrics
- Pass@k (sampling-based evaluation)
Datasets
- LeetCode
- AtCoder
- CodeForces
Benchmarks
- CRUXEval
- CodeContests

