Overview
LiveBench is production-ready for benchmarking and longitudinal tracking; it is novel in combining live question sources with ground-truth scoring and monthly refreshes.
Citations: 18
Evidence Strength: 0.90
Confidence: 0.88
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 0/4
Reproducibility
Status: Code + data available
Open source: Yes
License: Apache-2.0
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
A frequently updated, ground-truth-scored benchmark prevents inflated claims from contaminated test data and shows real capability gaps. Use it to validate model improvements and to guard against overfitting to public test sets.
Who Should Care
Summary TLDR
LiveBench is a 1,000-question benchmark that updates monthly to limit test-set contamination. It focuses on objectively verifiable tasks across six categories (math, coding, reasoning, language, instruction following, data analysis). Scoring is automatic against ground truth (no LLM judge) and tasks are drawn from recent sources (news, arXiv, competitions). Top models score below 70%, showing sustained difficulty. The authors open-source questions, code, and model outputs and plan to refresh roughly 1/6 of items each month to stay contamination-resistant.
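The refresh cadence above implies a simple turnover estimate. The sketch below works through the arithmetic, assuming exactly 1,000 questions and a fixed 1/6 replacement fraction per month; the function name and its parameters are illustrative and do not come from the LiveBench code.

```python
import math

def refresh_plan(total_questions=1000, fraction=1 / 6):
    """Estimate how many questions are replaced per month and how many
    months it takes to replace the whole set (illustrative assumption:
    the same fraction is refreshed every month, with no overlap)."""
    per_month = round(total_questions * fraction)
    months_to_full_turnover = math.ceil(total_questions / per_month)
    return per_month, months_to_full_turnover

per_month, months = refresh_plan()
print(f"about {per_month} questions per month, full turnover in about {months} months")
```

Under these assumptions, a model trained on a snapshot of the benchmark would see its advantage shrink steadily and largely disappear after about half a year.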
Problem Statement
Test-set contamination and judge bias make many published LLM benchmarks unreliable. The paper builds a frequently-updated benchmark with only objectively scorable questions and automated grading to reduce contamination and judging bias while remaining challenging.
Main Contribution
Design and release of LiveBench: a 1,000-question, contamination-limited benchmark across six task categories.
Automatic, ground-truth scoring (no LLM judges) to avoid judge bias and highlight real capability gaps.
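To make the idea of judge-free scoring concrete, here is a minimal sketch of grading predictions against ground truth and aggregating by category. It is not the authors' scorer: the record keys (`category`, `prediction`, `ground_truth`), the normalization rule, and the unweighted mean across categories are all assumptions chosen for illustration.

```python
from collections import defaultdict

def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse internal whitespace (assumed rule)."""
    return " ".join(answer.strip().lower().split())

def score_by_category(records):
    """Return per-category accuracy in [0, 1].

    `records` is an iterable of dicts with the hypothetical keys
    'category', 'prediction', and 'ground_truth'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["category"]] += 1
        if normalize(r["prediction"]) == normalize(r["ground_truth"]):
            correct[r["category"]] += 1
    return {c: correct[c] / total[c] for c in total}

def overall_score(per_category):
    """Unweighted mean of the category scores (assumed aggregation)."""
    return sum(per_category.values()) / len(per_category)

# Toy records, invented for this example only.
records = [
    {"category": "math", "prediction": " 42 ", "ground_truth": "42"},
    {"category": "math", "prediction": "41", "ground_truth": "42"},
    {"category": "language", "prediction": "Paris", "ground_truth": "paris"},
]
per_cat = score_by_category(records)
print(per_cat, overall_score(per_cat))
```

Because every comparison is deterministic, rerunning the scorer on the same outputs always gives the same number, which is the property an LLM judge cannot guarantee.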
Key Findings
Top models remain far from saturating LiveBench: the best overall score is 64.7% (o1-preview-2024-09-12, Table 1).
LLM judges make many mistakes on hard math and reasoning.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Top LiveBench score (overall) | 64.7% | — | — | LiveBench (all tasks) | Table 1 shows o1-preview-2024-09-12 at 64.7% | Table 1 |
| Per-category top scores (examples) | Instruction following 80.1%, Language 68.7%, Coding 67.1% (maxes shown across models) | — | — | LiveBench categories | Table 1 category columns (examples: instruction following, language, coding). | Table 1 |
What To Try In 7 Days
Run your models on LiveBench to get a contamination-resistant baseline and compare to public leaderboards.
Add a math/reasoning sample from LiveBench when validating releases—these correlate strongly with overall capability.
Avoid relying on LLM-based automatic judges for hard correctness checks; use ground-truth scoring or human experts for verification.
Risks & Boundaries
Limitations
Not all tasks are contamination-free; some coding and AMC items have limited residual contamination (A.7).
Ground-truth scoring cannot evaluate open-ended subjective tasks (e.g., travel guides).
When Not To Use
For subjective, preference-driven evaluation (tone, style, creativity).
When you need long-term private test sets without any public release of questions (LiveBench keeps new questions private only briefly before releasing them).
Failure Modes
Automated regex scoring may miss unconventional but correct answer formats (authors mitigate with permissive parsing and manual checks).
Residual contamination can still bias some models on lightly modified public problems.
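The regex failure mode can be illustrated with a small, permissive answer extractor. This is a sketch under stated assumptions, not LiveBench's parser: the three patterns and the `extract_answer` helper are hypothetical, and a real scorer would need task-specific rules.

```python
import re

# Hypothetical patterns, tried in order from most to least specific.
_PATTERNS = [
    re.compile(r"\\boxed\{([^{}]*)\}"),
    re.compile(r"final answer\s*(?:is|:)?\s*(.+)", re.IGNORECASE),
    re.compile(r"answer\s*(?:is|:)\s*(.+)", re.IGNORECASE),
]

def extract_answer(text):
    """Return the last answer-like span found in `text`, or None.

    Returning None (rather than guessing) lets the caller route the
    response to a manual check instead of silently marking it wrong.
    """
    for pattern in _PATTERNS:
        matches = pattern.findall(text)
        if matches:
            # Drop a trailing period and markdown bold markers.
            return matches[-1].strip().rstrip(".").strip("* ")
    return None

print(extract_answer("So the result is \\boxed{12}."))
print(extract_answer("Final answer: **B**"))
print(extract_answer("I am not sure."))
```

Accepting several formats reduces false negatives, but each added pattern also raises the chance of extracting the wrong span, so unmatched or ambiguous responses are better flagged for review than scored automatically.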

