LiveBench: a live, ground-truth-scored benchmark that resists test-set contamination

June 27, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

LiveBench is production-ready for benchmarking and longitudinal tracking; it is novel in combining live question sources with ground-truth scoring and monthly refreshes.

Citations: 18

Evidence Strength: 0.90

Confidence: 0.88

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Yes

License: Apache-2.0

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh-Agrawal, Sandeep Singh Sandha, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, Micah Goldblum

Links

Abstract / PDF / Code / Data

Why It Matters For Business

A frequently updated, ground-truth-scored benchmark prevents inflated claims from contaminated test data and shows real capability gaps. Use it to validate model improvements and guard against overfitting to public test sets.

Who Should Care

Summary TLDR

LiveBench is a 1,000-question benchmark that updates monthly to limit test-set contamination. It focuses on objectively verifiable tasks across six categories (math, coding, reasoning, language, instruction following, data analysis). Scoring is automatic against ground truth (no LLM judge) and tasks are drawn from recent sources (news, arXiv, competitions). Top models score below 70%, showing sustained difficulty. The authors open-source questions, code, and model outputs and plan to refresh roughly 1/6 of items each month to stay contamination-resistant.
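The refresh cadence implies a simple turnover estimate. The sketch below is illustrative only: it assumes a fixed pool of 1,000 questions, exactly one sixth replaced per month, and oldest-first replacement. The summary does not state the replacement order, and `refresh_plan` is a hypothetical helper, not part of the LiveBench code.

```python
from fractions import Fraction
import math


def refresh_plan(total_questions: int, monthly_fraction: Fraction) -> tuple[int, int]:
    """Return (questions replaced per month, months until every question is replaced).

    Assumes oldest-first replacement, which is an illustrative assumption.
    """
    per_month = round(total_questions * monthly_fraction)
    months_to_turnover = math.ceil(total_questions / per_month)
    return per_month, months_to_turnover


per_month, months = refresh_plan(1000, Fraction(1, 6))
print(per_month, months)
```

Under these assumptions, about 167 questions change each month and the whole pool turns over in roughly six months.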

Problem Statement

Test-set contamination and judge bias make many published LLM benchmarks unreliable. The paper builds a frequently-updated benchmark with only objectively scorable questions and automated grading to reduce contamination and judging bias while remaining challenging.

Main Contribution

Design and release of LiveBench: a 1,000-question, contamination-limited benchmark across six task categories.

Automatic, ground-truth scoring (no LLM judges) to avoid judge bias and highlight real capability gaps.
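As a rough illustration of ground-truth scoring without a judge model, the sketch below compares each answer to a stored reference and averages per-category accuracy into one overall score. The exact-match rule, the record layout, and the name `livebench_style_score` are assumptions made for this example; the real benchmark uses task-specific scoring functions.

```python
from collections import defaultdict


def livebench_style_score(records):
    """Score (category, model_answer, ground_truth) tuples.

    Returns (per-category accuracy in percent, unweighted mean across categories).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, answer, truth in records:
        total[category] += 1
        # Exact match after light normalization; no LLM judge is involved.
        if answer.strip().lower() == truth.strip().lower():
            correct[category] += 1
    per_category = {c: 100.0 * correct[c] / total[c] for c in total}
    overall = sum(per_category.values()) / len(per_category)
    return per_category, overall


# Made-up records for illustration only.
records = [
    ("math", "42", "42"),
    ("math", "7", "8"),
    ("coding", "True", "true"),
    ("reasoning", "B", "A"),
]
per_category, overall = livebench_style_score(records)
print(per_category, overall)
```

Averaging categories rather than individual questions keeps a large category from dominating the overall number.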

Key Findings

Top models remain well short of saturating LiveBench.

Numbers: Top LiveBench score 64.7% (o1-preview-2024-09-12).

Practical Use: Expect hard, real-world-styled tasks to expose gaps; don't trust single-benchmark claims of near-perfect LLM ability.

Evidence Ref: Table 1

LLM judges make many mistakes on hard math and reasoning.

Numbers: LLM-as-judge error rates of 21% to 46% on AMC/AIME/SMC/Zebra (e.g., 38% on AMC12).

Practical Use: Avoid using LLM judges for hard verification tasks; prefer ground-truth or human experts for correctness checks.

Evidence Ref: Table 8 and Table 9
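A judge error rate of this kind can be read as the share of items where the judge's verdict disagrees with the ground-truth verdict. The sketch below is a minimal version of that calculation; the function name and the sample verdicts are made up for illustration and are not taken from Table 8 or Table 9.

```python
def judge_error_rate(judge_verdicts, ground_truth_verdicts):
    """Fraction of items where the judge's correct/incorrect call is wrong."""
    if len(judge_verdicts) != len(ground_truth_verdicts):
        raise ValueError("verdict lists must have the same length")
    if not judge_verdicts:
        raise ValueError("need at least one verdict")
    disagreements = sum(
        judge != truth for judge, truth in zip(judge_verdicts, ground_truth_verdicts)
    )
    return disagreements / len(judge_verdicts)


# Hypothetical verdicts: True means "the model's answer was marked correct".
judge = [True, True, False, True, False]
truth = [True, False, False, False, False]
rate = judge_error_rate(judge, truth)
print(f"{rate:.0%}")
```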

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Top LiveBench score (overall) | 64.7% | | | LiveBench (all tasks) | Table 1 shows o1-preview-2024-09-12 at 64.7% | Table 1 |
| Per-category top scores (examples) | Instruction following 80.1%, Language 68.7%, Coding 67.1% (maxes shown across models) | | | LiveBench categories | Table 1 category columns (examples: instruction following, language, coding) | Table 1 |

What To Try In 7 Days

Run your models on LiveBench to get a contamination-resistant baseline and compare to public leaderboards.

Add a math/reasoning sample from LiveBench when validating releases; scores in these categories correlate strongly with overall capability.

Avoid relying on LLM-based automatic judges for hard correctness checks; use ground-truth scoring or human experts for verification.

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Apache-2.0

Risks & Boundaries

Limitations

Not all tasks are contamination-free; some coding and AMC items have limited residual contamination (A.7).

Ground-truth scoring cannot evaluate open-ended subjective tasks (e.g., travel guides).

When Not To Use

For subjective, preference-driven evaluation (tone, style, creativity).

When you need long-term private test sets without any public release of questions (LiveBench keeps new questions private only briefly before releasing them).

Failure Modes

Automated regex scoring may miss unconventional but correct answer formats (authors mitigate with permissive parsing and manual checks).

Residual contamination can still bias some models on lightly modified public problems.
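To make the regex-scoring failure mode concrete, the sketch below tries several answer formats in order before giving up. The patterns and the names `extract_answer` and `is_correct` are hypothetical; they illustrate the idea of permissive parsing and are not the parsing rules LiveBench actually uses.

```python
import re

# Hypothetical patterns, tried from most to least specific.
_PATTERNS = [
    re.compile(r"\\boxed\{([^{}]+)\}"),             # LaTeX: \boxed{42}
    re.compile(r"\*\*([^*]+)\*\*"),                 # bold: **C**
    re.compile(r"(?i)answer\s*(?:is|:)\s*(\S+)"),   # "answer is 3/4" or "Answer: 3/4"
]


def extract_answer(response):
    """Return the last answer-like span found, or None if no pattern matches."""
    for pattern in _PATTERNS:
        matches = pattern.findall(response)
        if matches:
            # Drop trailing sentence punctuation picked up by the loose pattern.
            return matches[-1].strip().rstrip(".,;")
    return None


def is_correct(response, ground_truth):
    """Case-insensitive comparison of the extracted answer with the reference."""
    answer = extract_answer(response)
    return answer is not None and answer.lower() == ground_truth.strip().lower()


print(extract_answer("So the result is \\boxed{42}."))
print(extract_answer("I am not sure."))
```

A response that matches none of the patterns returns `None` and is scored as wrong, which is exactly the case where a manual check is needed.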

Core Entities

Models

o1-preview-2024-09-12, claude-3-5-sonnet-20240620, o1-mini-2024-09-12, gemini-1.5-pro-002, meta-llama-3.1-405b-instruct, qwen2.5-72b-instruct, gpt-4o-2024-08-06, gpt-4-turbo-2024-04-09, phi-3.5-moe-instruct, mixtral-8x22b-instruct-v0.1

Metrics

LiveBench Score (average across 6 categories), Pass@1 (coding generation/completion), Accuracy, F1 (table join prediction), Levenshtein-based ordering score (plot unscramble), LLM judge error rate (ablation)
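One of the less familiar metrics here is the Levenshtein-based ordering score. The sketch below shows one plausible form: the edit distance between the predicted and the true sentence order, normalized to the range 0 to 1. The normalization by the longer sequence and the name `ordering_score` are assumptions for illustration; the paper's exact formula may differ.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def ordering_score(predicted, truth):
    """1.0 for a perfect ordering, lower as more edits are needed."""
    if not truth:
        raise ValueError("truth ordering must not be empty")
    distance = levenshtein(predicted, truth)
    return max(0.0, 1.0 - distance / max(len(predicted), len(truth)))


# Sentence indices: the model swapped the first two sentences.
print(ordering_score([2, 1, 3, 4], [1, 2, 3, 4]))
```

Unlike exact match, this gives partial credit when most of the sequence is in the right order.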

Datasets

AMC12 2023, AIME 2024, SMC 2023, USAMO/IMO 2024, AMPS_Hard (synthetic), LeetCode/AtCoder (LiveCodeBench), Kaggle datasets, Socrata datasets, ArXiv abstracts, The Guardian articles, IMDb/Wikipedia plot synopses

Benchmarks

Big-Bench Hard, IFEval, LiveCodeBench, Chatbot Arena, Arena-Hard