Live, contamination-aware benchmark for code LLMs that tests generation, repair, execution, and test-output prediction

Overview

Decision Snapshot: Ready For Pilot

The paper offers a practical live benchmark with strong empirical evidence of contamination and cross-scenario behavior; it is useful now for realistic model comparisons but has dataset-size and domain limits.

Citations: 22

Evidence Strength: 0.90

Confidence: 0.88

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 55%

Production readiness: 70%

Novelty: 65%

Authors

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, Ion Stoica

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LiveCodeBench reveals real gaps between closed and open models and the presence of training-set leakage; use it to benchmark models on realistic, recent contest problems and avoid inflated performance claims from contaminated or small benchmarks.

Who Should Care

ML Engineer, Engineering Lead, Product Manager, CTO, Data Scientist

Summary TLDR

LiveCodeBench is a continuously updated code benchmark built from recent contest problems (LeetCode, AtCoder, CodeForces) to avoid dataset contamination and to evaluate multiple coding skills: code generation, self-repair, code execution, and test-output prediction. The dataset contains 511 problems (May 2023–May 2024) with ~17 tests per problem on average. The authors evaluate 52 models (18 base, 34 instruction-tuned) and show clear evidence of contamination in some models, strong correlations across scenarios (but meaningful differences), and that closed-access models still lead open models on this harder, live collection.

Problem Statement

Existing code benchmarks (HumanEval, MBPP, APPS) are limited: they focus mostly on natural-language-to-code, are small, and are at risk of being in model pretraining data (contamination). This makes comparison and generalization claims unreliable. LiveCodeBench builds a growing, time-tagged benchmark from contest problems and adds scenarios beyond simple generation to measure broader coding capabilities and avoid contamination.

Main Contribution

Live, time-stamped benchmark of contest problems (511 problems from May'23–May'24) to detect and avoid contamination.

Four evaluation scenarios: code generation, self-repair (debug from error feedback), code execution (predict program output), and test output prediction (predict expected outputs from problem statements).

Key Findings

Some models show clear contamination: DeepSeek and GPT-4O score markedly lower on problems released after their stated cutoff dates than on earlier problems.

Numbers: DS-Base-33B Pass@1 ~60 (May) → ~0 (Sep) on LeetCode

Practical Use: Filter evaluation problems by model cutoff date or use post-cutoff time windows to avoid overstating model ability.

Evidence Ref: Section 5.1, Figure 1
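The date-based filtering suggested above can be sketched in a few lines. This is a minimal illustration, not the benchmark's own tooling: the `Problem` record, its field names, and the example cutoff date are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; field names are illustrative, not the benchmark's schema.
    problem_id: str
    platform: str
    release_date: date


def post_cutoff(problems, model_cutoff):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.release_date > model_cutoff]


problems = [
    Problem("a", "LeetCode", date(2023, 5, 14)),
    Problem("b", "AtCoder", date(2023, 9, 2)),
    Problem("c", "CodeForces", date(2024, 1, 20)),
]

# Assumed cutoff for illustration only; use the date the model provider states.
clean = post_cutoff(problems, model_cutoff=date(2023, 8, 31))
print([p.problem_id for p in clean])  # ['b', 'c']
```

Comparing Pass@1 on the filtered set against Pass@1 on the full set gives a quick signal: a large gap suggests the earlier problems may have been seen during training.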

Model performance is highly correlated across the four scenarios, but relative gaps change by task.

Numbers: Pairwise Pass@1 correlations >0.88; generation↔repair 0.98; execution↔test-output 0.96

Practical Use: Evaluate models on multiple scenarios (not only generation) to surface strengths like execution or self-repair.

Evidence Ref: Section 5.2, Figure 13
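To reproduce this kind of cross-scenario comparison on your own model set, a Pearson correlation over per-model Pass@1 scores is a reasonable starting point. The sketch below assumes Pearson correlation (the summary does not say which coefficient the authors use), and the scores are placeholders, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Placeholder Pass@1 scores for four hypothetical models (not from the paper).
generation = [10.0, 20.0, 30.0, 40.0]
repair = [12.0, 25.0, 33.0, 47.0]

r = pearson(generation, repair)
print(round(r, 3))
```

A high correlation only says that model rankings move together; it does not show that the gaps between models are the same in each scenario, which is why the per-scenario scores are still worth inspecting.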

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Code generation Pass@1 (total across difficulties)	GPT-4O 41.9%	—	—	LiveCodeBench (May'23–May'24)	Table 3: GPT-4O-2024-05-13 total Pass@1 = 41.9	Table 3
Code generation Pass@1 (total across difficulties)	GPT-4-Turbo-2024 41.1%	—	—	LiveCodeBench (May'23–May'24)	Table 3: GPT-4-Turbo-2024-04-09 total Pass@1 = 41.1	Table 3
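For readers who want to compute comparable numbers, the widely used unbiased pass@k estimator is 1 - C(n - c, k) / C(n, k) per problem, averaged over problems; for k = 1 it reduces to the fraction of samples that pass. Whether the authors use exactly this estimator and how many samples they draw are assumptions here, and the per-problem outcomes below are made up for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that passed all tests, k: sample budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average per-problem pass@k; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem outcomes: (samples drawn, samples passing).
results = [(10, 10), (10, 5), (10, 0)]
print(benchmark_pass_at_k(results, k=1))  # 0.5
```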

What To Try In 7 Days

Run your model on LiveCodeBench post-cutoff problems to check for contamination.

Compare performance on generation, repair, and execution to find tooling weaknesses to prioritize.

Add generator-based hidden tests (few adversarial cases) to existing internal benchmarks.
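The last item, generator-based hidden tests, can be prototyped as below. This is a sketch on a toy task (maximum subarray sum), not the authors' test-generation pipeline; all function names are hypothetical. The idea is to pair randomly generated and hand-picked adversarial inputs with the output of a trusted reference solution.

```python
import random


def reference_solution(nums):
    """Trusted solution for a toy task: maximum subarray sum (Kadane's algorithm)."""
    best = current = nums[0]
    for x in nums[1:]:
        current = max(x, current + x)
        best = max(best, current)
    return best


def generate_inputs(count, seed=0):
    """Seeded random cases plus a few hand-picked adversarial ones."""
    rng = random.Random(seed)
    cases = [
        [rng.randint(-50, 50) for _ in range(rng.randint(1, 20))]
        for _ in range(count)
    ]
    cases += [[-5], [-3, -1, -2], [0] * 10]  # all-negative and degenerate inputs
    return cases


def build_hidden_tests(count, seed=0):
    """Pair each generated input with the reference solution's output."""
    return [(case, reference_solution(case)) for case in generate_inputs(count, seed)]


def passes_all(candidate, tests):
    """Run a candidate on a copy of each input and compare with the expected output."""
    return all(candidate(list(case)) == expected for case, expected in tests)


def buggy_candidate(nums):
    # Common bug: allows the empty subarray, so all-negative input yields 0.
    best = current = 0
    for x in nums:
        current = max(0, current + x)
        best = max(best, current)
    return best


tests = build_hidden_tests(count=20)
print(passes_all(reference_solution, tests))  # True
print(passes_all(buggy_candidate, tests))     # False
```

Fixing the seed keeps the hidden test set reproducible across runs, and the adversarial cases catch bugs that random inputs alone might miss.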

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://livecodebench.github.io/

Data URLs

https://livecodebench.github.io/

Risks & Boundaries

Limitations

The benchmark is small for post-cutoff evaluations (349 problems in the post-September window), which leads to an estimated 1–1.5% variance in Pass@1.

Currently Python-only; does not measure multi-language ability.

When Not To Use

To evaluate non-Python language capabilities.

To judge performance on open-ended engineering tasks or large multi-file repos.

Failure Modes

Contamination can inflate model scores if post-cutoff filtering is not used.

Overfitting to small benchmarks (e.g., HumanEval) may not generalize to diverse problems.

Core Entities

Models

GPT-4-Turbo-2024-04-09, GPT-4O-2024-05-13, GPT-4-Turbo-1106, Claude-3-Opus, DS-Ins-33B, L3-Ins-70B, Mixtral, Codestral, StarCoder2, CodeLLaMa, Phind-34B

Metrics

Pass@1

Datasets

LiveCodeBench (511 problems, May'23–May'24), HumanEval+, CRUXEval

Benchmarks

HumanEval, HumanEval+, APPS, MBPP

Context Entities

Models

DeepSeek (DSCoder family), Gemini-Pro-1.5, Mistral-Large, CodeQwen, LLama3

Metrics

Pass@k (sampling-based evaluation)

Datasets

LeetCode, AtCoder, CodeForces

Benchmarks

CRUXEval, CodeContests

