Live, contamination-aware benchmark for code LLMs that tests generation, repair, execution, and test-output prediction

March 12, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.65

Cost Impact Score

0.55

Citation Count

22

Authors

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, Ion Stoica

Links

Abstract / PDF

Why It Matters For Business

LiveCodeBench reveals real gaps between closed and open models and the presence of training-set leakage; use it to benchmark models on realistic, recent contest problems and avoid inflated performance claims from contaminated or small benchmarks.

Summary TLDR

LiveCodeBench is a continuously updated code benchmark built from recent contest problems (LeetCode, AtCoder, CodeForces) to avoid dataset contamination and to evaluate multiple coding skills: code generation, self-repair, code execution, and test-output prediction. The dataset contains 511 problems (May 2023–May 2024) with ~17 tests per problem on average. The authors evaluate 52 models (18 base, 34 instruction-tuned) and show clear evidence of contamination in some models, strong correlations across scenarios (but meaningful differences), and that closed-access models still lead open models on this harder, live collection.

Problem Statement

Existing code benchmarks (HumanEval, MBPP, APPS) are limited: they focus mostly on natural-language-to-code generation, are small, and risk appearing in model pretraining data (contamination). This makes model comparisons and generalization claims unreliable. LiveCodeBench builds a growing, time-tagged benchmark from contest problems and adds scenarios beyond simple generation, so that it measures broader coding capabilities while avoiding contamination.

Main Contribution

Live, time-stamped benchmark of contest problems (511 problems from May'23–May'24) to detect and avoid contamination.

Four evaluation scenarios: code generation, self-repair (debug from error feedback), code execution (predict program output), and test output prediction (predict expected outputs from problem statements).

High-quality curation: problems from LeetCode/AtCoder/CodeForces, ~17 tests per problem on average, generator-based test creation for hidden tests.

Large-scale evaluation: 52 models (18 base + 34 instruction-tuned), with public prompts, completions, and a toolkit promised for community use.
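As a rough illustration of how the code execution scenario above could be graded, the sketch below compares a model's predicted output with the value the program actually returns. The function names, the toy program, and the choice to parse predictions as Python literals are assumptions made for this example, not the authors' harness.

```python
import ast


def grade_execution_prediction(func, args, predicted_text):
    """Return True if the predicted output equals the real return value.

    Assumption: the model answers with a Python literal such as "[1, 2]".
    Anything that does not parse as a literal is graded as incorrect.
    """
    try:
        predicted = ast.literal_eval(predicted_text.strip())
    except (ValueError, SyntaxError, TypeError):
        return False
    return func(*args) == predicted


def running_max(nums):
    """Toy program under test: running maximum of a list."""
    out, best = [], float("-inf")
    for x in nums:
        best = max(best, x)
        out.append(best)
    return out


# Hypothetical model answers for the input [3, 1, 4, 1, 5].
print(grade_execution_prediction(running_max, ([3, 1, 4, 1, 5],), "[3, 3, 4, 4, 5]"))
print(grade_execution_prediction(running_max, ([3, 1, 4, 1, 5],), "[3, 1, 4, 1, 5]"))
```

The first call prints True and the second prints False. Test-output prediction could be graded the same way, with the expected value taken from the problem's reference tests instead of from running a program.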

Key Findings

Some models show clear contamination: the performance of DeepSeek and GPT-4O models drops on problems released after their stated cutoff dates.

Numbers: DS-Base-33B Pass@1 ~60 (May) → ~0 (Sep) on LeetCode
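The time-window check behind this finding can be sketched as follows: group graded problems by release date and compare Pass@1 before and after a model's stated cutoff. The `Result` record, the function names, and the toy data are invented for illustration; they are not part of the LiveCodeBench toolkit.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Result:
    """One graded problem: its release date and whether the model solved it."""
    released: date
    passed: bool


def pass_at_1(results):
    """Percentage of problems solved; None if the window is empty."""
    if not results:
        return None
    return 100.0 * sum(r.passed for r in results) / len(results)


def split_by_cutoff(results, cutoff):
    """Return (Pass@1 on or before the cutoff, Pass@1 after the cutoff)."""
    before = [r for r in results if r.released <= cutoff]
    after = [r for r in results if r.released > cutoff]
    return pass_at_1(before), pass_at_1(after)


# Toy data, invented for illustration only.
toy = [
    Result(date(2023, 5, 10), True),
    Result(date(2023, 6, 2), True),
    Result(date(2023, 7, 15), False),
    Result(date(2023, 10, 1), False),
    Result(date(2023, 11, 20), False),
]
before, after = split_by_cutoff(toy, cutoff=date(2023, 8, 31))
print(f"before cutoff: {before:.1f}%, after cutoff: {after:.1f}%")
```

A sharp drop after the cutoff is a warning sign rather than proof of leakage, since newer problems may also simply be harder. Comparing against models whose scores stay flat across the same windows helps separate the two explanations.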

Model performance is highly correlated across the four scenarios, but relative gaps change by task.

Numbers: pairwise Pass@1 correlations >0.88; generation↔repair 0.98; execution↔test-output 0.96
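A pairwise correlation of this kind can be computed from per-model Pass@1 scores in each scenario. The sketch below assumes a Pearson correlation, which the summary does not specify, and uses made-up scores for four hypothetical models.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy Pass@1 scores (one entry per hypothetical model), not from the paper.
scores = {
    "generation": [41.0, 30.0, 22.0, 12.0],
    "repair": [48.0, 36.0, 25.0, 15.0],
    "execution": [85.0, 60.0, 55.0, 30.0],
}

for a, b in combinations(scores, 2):
    print(f"{a} vs {b}: r = {pearson(scores[a], scores[b]):.3f}")
```

High correlation means the scenarios rank models similarly. It does not mean the absolute gaps between models are the same in every scenario, which is the distinction the finding draws.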

Many fine-tuned open models overfit older, small benchmarks like HumanEval.

Numbers: DS-Ins-1.3B scores 59.8% Pass@1 on HumanEval+ vs 26.3% on LiveCodeBench-Easy

Closed-access models lead open models on LiveCodeBench; a few large instruction-tuned open models narrow the gap.

Numbers: on LiveCodeBench code generation, DS-Ins-33B trails GPT-4-Turbo by ~16.2 points, a larger gap than on HumanEval

Post-training (instruction tuning or SFT) improves LiveCodeBench performance.

Numbers: L3-Ins-70B +8.2 pts; DS-Ins-33B +7.3 pts; Phind-34B +9.5 pts (improvement over the base model)

Results

  • Code generation Pass@1 (total across difficulties): GPT-4O 41.9%
  • Code generation Pass@1 (total across difficulties): GPT-4-Turbo-2024 41.1%
  • Code generation Pass@1 (total across difficulties): DSCoder-33b-Ins 21.8% (baseline: GPT-4-Turbo-2024, 41.1%)
  • Self-repair Pass@1 (total): GPT-4O 49.1%
  • Test output prediction Pass@1: GPT-4O 68.9%
  • Code execution Pass@1 with Chain-of-Thought (CoT): GPT-4O 91.0% (baseline: GPT-4-Turbo-2024, 83.8%)

Who Should Care

What To Try In 7 Days

Run your model on LiveCodeBench post-cutoff problems to check for contamination.

Compare performance on generation, repair, and execution to find tooling weaknesses to prioritize.

Add generator-based hidden tests, including a few adversarial cases, to existing internal benchmarks.
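One way to approach the last item is sketched below: pair a trusted reference solution with a seeded random input generator and a handful of hand-written edge cases, then record the expected outputs as hidden tests. The toy problem (maximum subarray sum) and every function name here are assumptions for illustration; the summary does not describe how the authors' generators are built.

```python
import random


def reference_solution(nums):
    """Trusted answer: maximum sum of a non-empty contiguous subarray."""
    best = current = nums[0]
    for x in nums[1:]:
        current = max(x, current + x)
        best = max(best, current)
    return best


def random_case(rng, max_len=20, max_abs=100):
    """Random input within the (assumed) problem constraints."""
    return [rng.randint(-max_abs, max_abs) for _ in range(rng.randint(1, max_len))]


def adversarial_cases():
    """Hand-picked edge cases: single element, all negative, zeros, max size."""
    return [[-5], [-3, -1, -2], [0, 0, 0], [100] * 20]


def build_hidden_tests(num_random=10, seed=0):
    """Return (input, expected_output) pairs; seeded so runs are repeatable."""
    rng = random.Random(seed)
    inputs = adversarial_cases() + [random_case(rng) for _ in range(num_random)]
    return [(nums, reference_solution(nums)) for nums in inputs]


def passes_all(candidate, tests):
    """Run a candidate on copies of the inputs and compare with expectations."""
    return all(candidate(list(nums)) == expected for nums, expected in tests)


def buggy_candidate(nums):
    """Common bug: treats the empty subarray as valid, so negatives give 0."""
    best = current = 0
    for x in nums:
        current = max(0, current + x)
        best = max(best, current)
    return best


tests = build_hidden_tests()
print(passes_all(reference_solution, tests), passes_all(buggy_candidate, tests))
```

This prints `True False`: the all-negative edge cases catch the buggy candidate even if the random cases happen not to.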

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • The post-cutoff evaluation set is small (349 problems in the post-September window), which gives an estimated 1–1.5% variance in Pass@1.
  • Currently Python-only; does not measure multi-language ability.
  • Prompting differences can affect results; prompts were not exhaustively tuned for every open model.
  • Problem domain focuses on contest programming, not open-ended real-world codebases.

When Not To Use

  • To evaluate non-Python language capabilities.
  • To judge performance on open-ended engineering tasks or large multi-file repos.
  • To separate models whose Pass@1 differs by only 1–2 points, unless the comparison is backed by statistical testing.
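For close comparisons, one option is a paired bootstrap over problems, sketched below. This is a generic resampling technique, not a procedure taken from the paper, and the outcome lists are synthetic; the size of 349 only echoes the post-cutoff problem count mentioned under Limitations.

```python
import random


def paired_bootstrap_ci(passed_a, passed_b, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Pass@1(A) - Pass@1(B), in percentage points.

    Problems are resampled with replacement, and each problem keeps its two
    outcomes paired.
    """
    if len(passed_a) != len(passed_b) or not passed_a:
        raise ValueError("need two equal-length, non-empty outcome lists")
    rng = random.Random(seed)
    n = len(passed_a)
    # Per-problem difference: +1 if only A solved it, -1 if only B, else 0.
    per_problem = [int(a) - int(b) for a, b in zip(passed_a, passed_b)]
    diffs = sorted(
        100.0 * sum(rng.choices(per_problem, k=n)) / n
        for _ in range(num_resamples)
    )
    lower = diffs[int(num_resamples * alpha / 2)]
    upper = diffs[min(num_resamples - 1, int(num_resamples * (1 - alpha / 2)))]
    return lower, upper


# Synthetic outcomes: both toy models solve about 40% of 349 problems,
# but on partly different problems.
model_a = [i % 5 in (0, 1) for i in range(349)]
model_b = [i % 5 in (4, 0) for i in range(349)]

lower, upper = paired_bootstrap_ci(model_a, model_b)
print(f"95% CI for the Pass@1 gap: [{lower:.1f}, {upper:.1f}] points")
```

For this synthetic data the interval straddles zero, so the small observed gap would not justify ranking one model above the other.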

Failure Modes

  • Contamination can inflate model scores if post-cutoff filtering is not used.
  • Overfitting to small benchmarks (e.g., HumanEval) may not generalize to diverse problems.
  • Prompt sensitivity can change rankings, especially for open models.

Core Entities

Models

  • GPT-4-Turbo-2024-04-09
  • GPT-4O-2024-05-13
  • GPT-4-Turbo-1106
  • Claude-3-Opus
  • DS-Ins-33B
  • L3-Ins-70B
  • Mixtral
  • Codestral
  • StarCoder2
  • CodeLlama
  • Phind-34B

Metrics

  • Pass@1

Datasets

  • LiveCodeBench (511 problems May'23–May'24)
  • HumanEval+
  • CRUXEval

Benchmarks

  • HumanEval
  • HumanEval+
  • APPS
  • MBPP

Context Entities

Models

  • DeepSeek (DSCoder family)
  • Gemini-Pro-1.5
  • Mistral-Large
  • CodeQwen
  • Llama3

Metrics

  • Pass@k (sampling-based evaluation)
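For reference, Pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021): draw n samples per problem, count the c that pass, and estimate the chance that at least one of k samples is correct as 1 - C(n-c, k) / C(n, k). The sketch below assumes that estimator; the summary itself does not say which formula or sample count was used.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    Equals 1 - C(n - c, k) / C(n, k); for k = 1 this reduces to c / n.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(per_problem_counts, k):
    """Average Pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem_counts) / len(per_problem_counts)


# Toy counts for three hypothetical problems, 10 samples each.
counts = [(10, 3), (10, 0), (10, 10)]
print(f"Pass@1 = {mean_pass_at_k(counts, 1):.3f}")
print(f"Pass@5 = {mean_pass_at_k(counts, 5):.3f}")
```

Averaging over several samples per problem gives a less noisy Pass@1 than grading a single completion per problem.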

Datasets

  • LeetCode
  • AtCoder
  • CodeForces

Benchmarks

  • CRUXEval
  • CodeContests