Live, contamination-aware benchmark for code LLMs that tests generation, repair, execution, and test-output prediction

March 12, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The paper offers a practical live benchmark with strong empirical evidence of contamination and cross-scenario behavior; it is useful now for realistic model comparisons but has dataset-size and domain limits.

Citations: 22

Evidence Strength: 0.90

Confidence: 0.88

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 55%

Production readiness: 70%

Novelty: 65%

Authors

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, Ion Stoica

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LiveCodeBench reveals real gaps between closed and open models and the presence of training-set leakage; use it to benchmark models on realistic, recent contest problems and avoid inflated performance claims from contaminated or small benchmarks.

Who Should Care

Summary TLDR

LiveCodeBench is a continuously updated code benchmark built from recent contest problems (LeetCode, AtCoder, CodeForces) to avoid dataset contamination and to evaluate multiple coding skills: code generation, self-repair, code execution, and test-output prediction. The dataset contains 511 problems (May 2023–May 2024) with ~17 tests per problem on average. The authors evaluate 52 models (18 base, 34 instruction-tuned) and show clear evidence of contamination in some models, strong correlations across scenarios (but meaningful differences), and that closed-access models still lead open models on this harder, live collection.

Problem Statement

Existing code benchmarks (HumanEval, MBPP, APPS) are limited: they focus mostly on natural-language-to-code, are small, and are at risk of being in model pretraining data (contamination). This makes comparison and generalization claims unreliable. LiveCodeBench builds a growing, time-tagged benchmark from contest problems and adds scenarios beyond simple generation to measure broader coding capabilities and avoid contamination.

Main Contribution

Live, time-stamped benchmark of contest problems (511 problems from May'23–May'24) to detect and avoid contamination.

Four evaluation scenarios: code generation, self-repair (debug from error feedback), code execution (predict program output), and test output prediction (predict expected outputs from problem statements).

Key Findings

Some models show clear contamination: the performance of DeepSeek and GPT-4O drops on problems released after their stated cutoff dates.

Numbers: DS-Base-33B Pass@1 ~60 (May) → ~0 (Sep) on LeetCode

Practical Use: Filter evaluation problems by model cutoff date or use post-cutoff time windows to avoid overstating model ability.

Evidence Ref: Section 5.1, Figure 1
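The cutoff-date filtering described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own tooling: the `ProblemResult` record, the helper names, and the toy data are all hypothetical. The idea is to compare Pass@1 on problems released before and after a model's stated cutoff; a large drop after the cutoff hints at contamination.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ProblemResult:
    """One evaluated problem (hypothetical record layout)."""
    problem_id: str
    release_date: date  # date the contest problem was published
    passed: bool        # did the model's single sample pass all tests?


def pass_at_1(results):
    """Percentage of problems solved; None when there are no problems."""
    if not results:
        return None
    return 100.0 * sum(r.passed for r in results) / len(results)


def split_by_cutoff(results, cutoff):
    """Split results into (pre-cutoff, post-cutoff) lists by release date."""
    before = [r for r in results if r.release_date <= cutoff]
    after = [r for r in results if r.release_date > cutoff]
    return before, after


# Toy data, invented purely for illustration.
results = [
    ProblemResult("p1", date(2023, 5, 14), True),
    ProblemResult("p2", date(2023, 6, 3), True),
    ProblemResult("p3", date(2023, 10, 1), False),
    ProblemResult("p4", date(2023, 11, 20), True),
]

before, after = split_by_cutoff(results, cutoff=date(2023, 8, 31))
print("pre-cutoff Pass@1:", pass_at_1(before))
print("post-cutoff Pass@1:", pass_at_1(after))
```

Reporting only the post-cutoff number is the conservative choice when a model's training data is not fully documented.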

Model performance is highly correlated across the four scenarios, but relative gaps change by task.

Numbers: Pairwise Pass@1 correlations >0.88; generation↔repair 0.98; execution↔test-output 0.96

Practical Use: Evaluate models on multiple scenarios (not only generation) to surface strengths like execution or self-repair.

Evidence Ref: Section 5.2, Figure 13
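A cross-scenario comparison of this kind can be reproduced on your own results with a plain Pearson correlation over per-model scores. The sketch below assumes Pearson correlation (the summary does not state which coefficient the paper reports), and the scenario scores are invented placeholders, one entry per model.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical Pass@1 scores: one list per scenario, one entry per model.
scores = {
    "generation": [41.0, 35.5, 22.0, 15.0],
    "repair": [45.0, 38.0, 24.5, 16.0],
    "execution": [60.0, 48.0, 30.0, 28.0],
    "test_output": [58.0, 50.0, 27.0, 25.0],
}

# Correlate every pair of scenarios.
for a, b in combinations(scores, 2):
    print(f"{a} vs {b}: {pearson(scores[a], scores[b]):.3f}")
```

High correlations mean the scenarios rank models similarly; the per-scenario gaps still need to be read from the raw scores, not from the coefficient.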

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Code generation Pass@1 (total across difficulties) | GPT-4O: 41.9% | - | - | LiveCodeBench (May'23–May'24) | Table 3: GPT-4O-2024-05-13 total Pass@1 = 41.9 | Table 3
Code generation Pass@1 (total across difficulties) | GPT-4-Turbo-2024: 41.1% | - | - | LiveCodeBench (May'23–May'24) | Table 3: GPT-4-Turbo-2024-04-09 total Pass@1 = 41.1 | Table 3

What To Try In 7 Days

Run your model on LiveCodeBench post-cutoff problems to check for contamination.

Compare performance on generation, repair, and execution to find tooling weaknesses to prioritize.

Add generator-based hidden tests (few adversarial cases) to existing internal benchmarks.
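The last item, generator-based hidden tests, could look roughly like the sketch below. It is an illustration under assumptions, not the paper's pipeline: the example problem (maximum subarray sum), the function names, and the input ranges are all hypothetical. A seeded input generator produces random cases plus a few hand-picked adversarial ones, and a trusted brute-force reference solution supplies the expected outputs.

```python
import random


def reference_solution(nums):
    """Trusted O(n^2) brute force: maximum sum of a non-empty subarray."""
    best = nums[0]
    for i in range(len(nums)):
        total = 0
        for j in range(i, len(nums)):
            total += nums[j]
            best = max(best, total)
    return best


def candidate_solution(nums):
    """Solution under test (Kadane's algorithm)."""
    best = cur = nums[0]
    for x in nums[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best


def generate_inputs(rng, n_random=20):
    """Hand-picked adversarial cases followed by random ones."""
    cases = [[5], [-3], [-1, -2, -3], [0, 0, 0]]
    for _ in range(n_random):
        size = rng.randint(1, 30)
        cases.append([rng.randint(-50, 50) for _ in range(size)])
    return cases


def build_hidden_tests(seed=0):
    """Pair each generated input with the reference solution's output."""
    rng = random.Random(seed)  # fixed seed keeps the hidden set reproducible
    return [(inp, reference_solution(inp)) for inp in generate_inputs(rng)]


def passes_all(solution, tests):
    """True if the solution matches the expected output on every test."""
    return all(solution(list(inp)) == expected for inp, expected in tests)


tests = build_hidden_tests()
print(len(tests), "hidden tests; candidate passes:",
      passes_all(candidate_solution, tests))
```

The adversarial cases matter most: an all-negative input such as `[-3]` catches solutions that wrongly allow an empty subarray with sum 0.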

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Benchmark size for post-cutoff evaluations is limited (349 problems in the post-Sep window), causing an estimated 1–1.5% variance in Pass@1.

Currently Python-only; does not measure multi-language ability.

When Not To Use

To evaluate non-Python language capabilities.

To judge performance on open-ended engineering tasks or large multi-file repos.

Failure Modes

Contamination can inflate model scores if post-cutoff filtering is not used.

Overfitting to small benchmarks (e.g., HumanEval) may not generalize to diverse problems.

Core Entities

Models

GPT-4-Turbo-2024-04-09, GPT-4O-2024-05-13, GPT-4-Turbo-1106, Claude-3-Opus, DS-Ins-33B, L3-Ins-70B, Mixtral, Codestral, StarCoder2, CodeLLaMa, Phind-34B

Metrics

Pass@1

Datasets

LiveCodeBench (511 problems, May'23–May'24), HumanEval+, CRUXEval

Benchmarks

HumanEval, HumanEval+, APPS, MBPP

Context Entities

Models

DeepSeek (DSCoder family), Gemini-Pro-1.5, Mistral-Large, CodeQwen, LLama3

Metrics

Pass@k (sampling-based evaluation)
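For reference, Pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c the number that pass all tests. The sketch below implements that standard formula; whether LiveCodeBench computes its reported Pass@1 exactly this way is an assumption here, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: sampling budget.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k = 1 the estimator reduces to the plain pass rate c / n.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

The benchmark-level score is the mean of this per-problem value over all problems in the evaluation window.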

Datasets

LeetCode, AtCoder, CodeForces

Benchmarks

CRUXEval, CodeContests