LiveBench: a live, ground-truth-scored benchmark that resists test-set contamination

Overview

Decision Snapshot: Ready For Pilot

LiveBench is production-ready for benchmarking and longitudinal tracking; it is novel in combining live question sources with ground-truth scoring and monthly refreshes.

Citations: 18

Evidence Strength: 0.90

Confidence: 0.88

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Yes

License: Apache-2.0

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh-Agrawal, Sandeep Singh Sandha, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, Micah Goldblum

Links

Abstract / PDF / Code / Data

Why It Matters For Business

A frequently updated, ground-truth-scored benchmark prevents inflated claims from contaminated test data and shows real capability gaps. Use it to validate model improvements and guard against overfitting to public test sets.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

LiveBench is a 1,000-question benchmark that updates monthly to limit test-set contamination. It focuses on objectively verifiable tasks across six categories (math, coding, reasoning, language, instruction following, data analysis). Scoring is automatic against ground truth (no LLM judge) and tasks are drawn from recent sources (news, arXiv, competitions). Top models score below 70%, showing sustained difficulty. The authors open-source questions, code, and model outputs and plan to refresh roughly 1/6 of items each month to stay contamination-resistant.
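The refresh cadence above implies a simple turnover estimate. The sketch below is only back-of-the-envelope arithmetic: it assumes each monthly refresh replaces a distinct set of items, which the summary does not state, and the constant names are illustrative.

```python
import math

# Figures taken from the summary: 1,000 questions, roughly 1/6 refreshed per month.
TOTAL_QUESTIONS = 1000
REFRESH_FRACTION = 1 / 6

# Approximate number of questions replaced in one monthly refresh.
per_month = round(TOTAL_QUESTIONS * REFRESH_FRACTION)

# Assumption (not from the source): every refresh replaces items that have
# not been replaced yet, so the whole set turns over in this many months.
months_for_full_turnover = math.ceil(TOTAL_QUESTIONS / per_month)

print(per_month, months_for_full_turnover)
```

Under that assumption, a model trained on a snapshot of the benchmark would see its leaked questions age out within about half a year.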

Problem Statement

Test-set contamination and judge bias make many published LLM benchmarks unreliable. The paper builds a frequently-updated benchmark with only objectively scorable questions and automated grading to reduce contamination and judging bias while remaining challenging.

Main Contribution

Design and release of LiveBench: a 1,000-question, contamination-limited benchmark across six task categories.

Automatic, ground-truth scoring (no LLM judges) to avoid judge bias and highlight real capability gaps.
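To make the idea of ground-truth scoring concrete, here is a minimal sketch of exact-match grading against reference answers. It is not LiveBench's actual scorer: the function names, the normalization rules, and the toy question IDs and answers are all assumptions made for illustration.

```python
import re


def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse whitespace so trivial formatting
    differences are not counted as errors."""
    return re.sub(r"\s+", " ", answer.strip().lower())


def score_exact_match(predictions: dict, ground_truth: dict) -> float:
    """Return the fraction of questions whose normalized prediction
    equals the normalized reference answer. Missing predictions count
    as wrong."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    correct = sum(
        normalize(predictions.get(qid, "")) == normalize(reference)
        for qid, reference in ground_truth.items()
    )
    return correct / len(ground_truth)


# Toy data, invented for this example.
ground_truth = {"q1": "42", "q2": "B", "q3": "x = 3"}
predictions = {"q1": " 42 ", "q2": "b", "q3": "x = 4"}

accuracy = score_exact_match(predictions, ground_truth)
print(f"accuracy = {accuracy:.3f}")
```

Because the grade depends only on a deterministic comparison, two runs over the same outputs always agree, which is the property an LLM judge cannot guarantee.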

Key Findings

Even the strongest models remain far from saturating LiveBench.

Numbers: Top LiveBench score 64.7% (o1-preview-2024-09-12).

Practical Use: Expect hard, real-world-styled tasks to expose gaps; don't trust single-benchmark claims of near-perfect LLM ability.

Evidence Ref: Table 1

LLM judges make many mistakes on hard math and reasoning.

Numbers: LLM-as-judge error rates 21–46% on AMC/AIME/SMC/Zebra (e.g., 38% on AMC12).

Practical Use: Avoid using LLM judges for hard verification tasks; prefer ground-truth or human experts for correctness checks.

Evidence Ref: Table 8 and Table 9
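A judge error rate of this kind can be measured by comparing a judge's verdicts with ground-truth correctness. The sketch below shows one plausible way to compute it; the function name and the toy verdicts are assumptions for illustration and do not reproduce the paper's data or procedure.

```python
def judge_error_rate(judge_verdicts: list, true_correctness: list) -> float:
    """Fraction of answers where the judge's verdict (True = 'correct')
    disagrees with the ground-truth correctness label."""
    if len(judge_verdicts) != len(true_correctness):
        raise ValueError("inputs must have the same length")
    if not judge_verdicts:
        raise ValueError("inputs must not be empty")
    disagreements = sum(
        judged != actual
        for judged, actual in zip(judge_verdicts, true_correctness)
    )
    return disagreements / len(judge_verdicts)


# Toy labels, invented for this example (not results from the paper).
true_correctness = [True, False, True, True, False]
judge_verdicts = [True, True, True, False, False]

rate = judge_error_rate(judge_verdicts, true_correctness)
print(f"judge error rate = {rate:.0%}")
```

Running the same check on a sample of your own hard questions is a cheap way to decide whether an LLM judge is trustworthy enough for a given task.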

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Top LiveBench score (overall)	64.7%	—	—	LiveBench (all tasks)	Table 1 shows o1-preview-2024-09-12 at 64.7%	Table 1
Per-category top scores (examples)	Instruction following 80.1%, Language 68.7%, Coding 67.1% (maxes shown across models)	—	—	LiveBench categories	Table 1 category columns (examples: instruction following, language, coding).	Table 1

What To Try In 7 Days

Run your models on LiveBench to get a contamination-resistant baseline and compare to public leaderboards.

Add a math/reasoning sample from LiveBench when validating releases; scores on these categories correlate strongly with overall capability.

Avoid relying on LLM-based automatic judges for hard correctness checks; use ground-truth scoring or human experts for verification.

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Apache-2.0

Code URLs

https://github.com/LiveBench/LiveBench https://livebench.ai/

Data URLs

https://github.com/LiveBench/LiveBench https://livebench.ai/

Risks & Boundaries

Limitations

Not all tasks are contamination-free; some coding and AMC items have limited residual contamination (A.7).

Ground-truth scoring cannot evaluate open-ended subjective tasks (e.g., travel guides).

When Not To Use

For subjective, preference-driven evaluation (tone, style, creativity).

When you need long-term private test sets with no public release of questions (LiveBench keeps new questions private only briefly before releasing them).

Failure Modes

Automated regex scoring may miss unconventional but correct answer formats (authors mitigate with permissive parsing and manual checks).

Residual contamination can still bias some models on lightly modified public problems.

Core Entities

Models

o1-preview-2024-09-12, claude-3-5-sonnet-20240620, o1-mini-2024-09-12, gemini-1.5-pro-002, meta-llama-3.1-405b-instruct, qwen2.5-72b-instruct, gpt-4o-2024-08-06, gpt-4-turbo-2024-04-09, phi-3.5-moe-instruct, mixtral-8x22b-instruct-v0.1

Metrics

LiveBench Score (average across 6 categories), Pass@1 (coding generation/completion), Accuracy, F1 (table join prediction), Levenshtein-based ordering score (plot unscramble), LLM judge error rate (ablation)
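Since the headline metric is described as an average across the six categories, it can be sketched as an unweighted mean of per-category scores. The category keys, the function name, and the example numbers below are placeholders chosen for illustration; they are not values from the paper, and the official aggregation may differ in detail.

```python
from statistics import mean

# The six categories named in the summary; the key spellings are assumptions.
CATEGORIES = (
    "math",
    "coding",
    "reasoning",
    "language",
    "instruction_following",
    "data_analysis",
)


def overall_score(category_scores: dict) -> float:
    """Unweighted mean of the six per-category scores (in percent)."""
    missing = [c for c in CATEGORIES if c not in category_scores]
    if missing:
        raise ValueError(f"missing categories: {missing}")
    return mean(category_scores[c] for c in CATEGORIES)


# Placeholder scores, invented for this example.
example = {
    "math": 50.0,
    "coding": 60.0,
    "reasoning": 40.0,
    "language": 55.0,
    "instruction_following": 80.0,
    "data_analysis": 45.0,
}
overall = overall_score(example)
print(f"overall = {overall:.1f}%")
```

An unweighted mean gives each category equal influence regardless of how many questions it contains, so a strong instruction-following score cannot hide a weak reasoning score for long.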

Datasets

AMC12 2023, AIME 2024, SMC 2023, USAMO/IMO 2024, AMPS_Hard (synthetic), LeetCode/AtCoder (LiveCodeBench), Kaggle datasets, Socrata datasets, ArXiv abstracts, The Guardian articles, IMDb/Wikipedia plot synopses

Benchmarks

Big-Bench Hard, IFEval, LiveCodeBench, ChatBot Arena, Arena-Hard
