Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
18
Why It Matters For Business
A frequently updated benchmark scored against ground truth prevents inflated claims from contaminated test data and exposes real capability gaps. Use it to validate model improvements and guard against overfitting to public test sets.
Summary TLDR
LiveBench is a 1,000-question benchmark that updates monthly to limit test-set contamination. It focuses on objectively verifiable tasks across six categories (math, coding, reasoning, language, instruction following, data analysis). Scoring is automatic against ground truth (no LLM judge) and tasks are drawn from recent sources (news, arXiv, competitions). Top models score below 70%, showing sustained difficulty. The authors open-source questions, code, and model outputs and plan to refresh roughly 1/6 of items each month to stay contamination-resistant.
Problem Statement
Test-set contamination and judge bias make many published LLM benchmarks unreliable. The paper builds a frequently-updated benchmark with only objectively scorable questions and automated grading to reduce contamination and judging bias while remaining challenging.
Main Contribution
Design and release of LiveBench: a 1,000-question, contamination-limited benchmark across six task categories.
Automatic, ground-truth scoring (no LLM judges) to avoid judge bias and highlight real capability gaps.
Monthly update policy to replace about one-sixth of questions and keep test data fresh.
Public release of questions, scoring code, and model outputs for reproducibility and community contributions.
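The monthly update policy can be pictured as a rolling pool. The sketch below is illustrative only: the `Question` type, the `refresh_pool` helper, and the oldest-first retirement rule are assumptions made for this example, since the summary states only that about one-sixth of the questions are replaced each month.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Question:
    """Hypothetical record for one benchmark item."""
    qid: str
    added: date  # date the question entered the pool


def refresh_pool(pool, new_questions, fraction=1 / 6):
    """Retire the oldest `fraction` of the pool and add new questions.

    Assumption: questions are retired oldest-first. The pool size is
    kept constant by adding exactly as many questions as are retired.
    """
    n_replace = round(len(pool) * fraction)
    if len(new_questions) < n_replace:
        raise ValueError("not enough new questions for this refresh")
    # Sort by entry date and drop the oldest n_replace items.
    kept = sorted(pool, key=lambda q: q.added)[n_replace:]
    return kept + list(new_questions[:n_replace])
```

With a 1,000-question pool and the default fraction, this rule would swap out 167 questions per refresh.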
Key Findings
Even top models score below 70% on LiveBench, so the benchmark is far from saturated.
LLM judges make many mistakes on hard math and reasoning.
LiveBench keeps content fresh by replacing about one-sixth of its questions each month.
Math, coding, and reasoning tasks correlate strongly with overall performance.
Results
Top LiveBench score (overall)
Per-category top scores (examples)
LLM judge error rate on hard tasks
Correlation of math task with overall score
Who Should Care
What To Try In 7 Days
Run your models on LiveBench to get a contamination-resistant baseline and compare to public leaderboards.
Add a math/reasoning sample from LiveBench when validating releases; these categories correlate strongly with overall capability.
Avoid relying on LLM-based automatic judges for hard correctness checks; use ground-truth scoring or human experts for verification.
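To check whether a category tracks overall capability on your own set of models, one option is a Pearson correlation between per-model category scores and per-model overall scores. This is a minimal sketch; the `pearson` helper is written for this example and is not taken from the LiveBench code, and you would supply your own measured scores as the two lists.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)
```

Pass one entry per model in each list, in the same model order. A value near 1 suggests the category is a reasonable proxy for the overall score on that model set.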
Reproducibility
License
- Apache-2.0
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Not all tasks are contamination-free; some coding and AMC items have limited residual contamination (A.7).
- Ground-truth scoring cannot evaluate open-ended subjective tasks (e.g., travel guides).
- Benchmark currently focuses on English; non-English coverage is planned but absent now.
- Monthly maintenance requires sustained effort and compute from maintainers.
When Not To Use
- For subjective, preference-driven evaluation (tone, style, creativity).
- When you need long-term private test sets without any public release of questions (LiveBench keeps new questions private only briefly before releasing them).
- For non-English model evaluation (LiveBench is English-centric currently).
Failure Modes
- Automated regex scoring may miss unconventional but correct answer formats (authors mitigate with permissive parsing and manual checks).
- Residual contamination can still bias some models on lightly modified public problems.
- LLM judge comparisons can overestimate performance on hard tasks due to judge errors or bias.
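The first failure mode above can be illustrated with a small permissive parser. This is a sketch under stated assumptions, not the authors' scoring code: the answer formats it looks for (a LaTeX `\boxed{...}` span, a `**bold**` span, or the last non-empty line) and the normalization rules are choices made for this example.

```python
import re

# Assumed answer formats; the real scorer may accept others.
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")
_BOLD = re.compile(r"\*\*(.+?)\*\*")


def extract_answer(response):
    """Return the last boxed or bold span, else the last non-empty line."""
    for pattern in (_BOXED, _BOLD):
        matches = pattern.findall(response)
        if matches:
            return matches[-1]
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def normalize(text):
    """Lowercase, drop a trailing period, commas, dollar signs, extra spaces."""
    text = text.strip().lower().rstrip(".")
    text = text.replace(",", "").replace("$", "")
    return re.sub(r"\s+", " ", text)


def is_correct(response, ground_truth):
    """Exact match after extraction and normalization."""
    return normalize(extract_answer(response)) == normalize(ground_truth)
```

A parser like this still misses correct answers in formats it does not anticipate, for example "one thousand twenty-four" against a ground truth of "1024", which is why manual spot checks remain useful.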
Core Entities
Models
- o1-preview-2024-09-12
- claude-3-5-sonnet-20240620
- o1-mini-2024-09-12
- gemini-1.5-pro-002
- meta-llama-3.1-405b-instruct
- qwen2.5-72b-instruct
- gpt-4o-2024-08-06
- gpt-4-turbo-2024-04-09
- phi-3.5-moe-instruct
- mixtral-8x22b-instruct-v0.1
Metrics
- LiveBench Score (average across 6 categories)
- Pass@1 (coding generation/completion)
- Accuracy
- F1 (table join prediction)
- Levenshtein-based ordering score (plot unscramble)
- LLM judge error rate (ablation)
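Three of the metrics above can be sketched in a few lines. The code is illustrative: it assumes the LiveBench Score is an unweighted mean of category scores, treats table join prediction as a set comparison for F1, and normalizes edit distance by the longer sequence for the ordering score. The exact formulas in the released scoring code may differ.

```python
def livebench_score(category_scores):
    """Unweighted mean of per-category scores (assumed weighting)."""
    if not category_scores:
        raise ValueError("no category scores given")
    return sum(category_scores.values()) / len(category_scores)


def set_f1(predicted, gold):
    """F1 between a predicted set and a gold set of items."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def ordering_score(predicted, gold):
    """1 minus edit distance normalized by the longer sequence (assumed)."""
    longest = max(len(predicted), len(gold))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(predicted, gold) / longest
```

For plot unscramble, `predicted` and `gold` would be lists of sentence indices, so a perfectly ordered output scores 1.0 and each misplaced sentence lowers the score.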
Datasets
- AMC12 2023
- AIME 2024
- SMC 2023
- USAMO/IMO 2024
- AMPS_Hard (synthetic)
- LeetCode/AtCoder (LiveCodeBench)
- Kaggle datasets
- Socrata datasets
- ArXiv abstracts
- The Guardian articles
- IMDb/Wikipedia plot synopses
Benchmarks
- Big-Bench Hard
- IFEval
- LiveCodeBench
- ChatBot Arena
- Arena-Hard

