Overview
Production Readiness
0.4
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
42
Why It Matters For Business
Benchmark scores can mislead product decisions if they reflect memorization, prompt sensitivity, or English-only tests; firms should test models under realistic prompts, languages, and safety scenarios.
Summary TLDR
This paper reviews 23 prominent LLM benchmarks and finds widespread weaknesses in how models are tested. Key problems include sensitivity to prompt formatting, benchmarks that reward memorization rather than reasoning, English-centricity and cultural blind spots, slow and inconsistent implementations, and reliance on LLMs to generate or judge evaluations. The authors propose a unified evaluation framework (people, process, technology) and recommend moving from static tests to ongoing behavioral profiling and regular audits to better capture real-world risks.
Problem Statement
Current LLM benchmarks often fail to measure real-world behavior and safety. Many are static, English-centric, inconsistent to run, and easy to game, and they cannot distinguish genuine reasoning from superficial optimization.
Main Contribution
A unified evaluation framework for LLM benchmarks based on People, Process, Technology (PPT), aimed at assessing both functionality and integrity
A systematic critique of 23 state-of-the-art LLM benchmarks, identifying common inadequacies across technological, processual, and human dimensions
A proposal to extend benchmarking with dynamic behavioral profiling and regular post-deployment audits to capture evolving risks and behaviors
Key Findings
Response variability breaks standardized tests
Benchmarks often reward optimization, not reasoning
Helpfulness vs harmlessness is unresolved
Major language and cultural blind spots
Installation and scaling are barriers to fair comparison
Using LLMs to build or judge benchmarks adds bias
Results
Benchmarks with response variability issues
Benchmarks relying on human or mixed evaluation
Who Should Care
What To Try In 7 Days
Run top candidate models across 5 prompt variants to check sensitivity
Add one unseen or adversarial example per feature to detect memorization
Audit localization by testing critical flows in target user languages
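The first item above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's method: `prompt_sensitivity`, `toy_model`, `TEMPLATES`, and `ITEMS` are hypothetical names, the exact-match scoring is an assumption, and `toy_model` is a stand-in that should be replaced with a call to the model under test.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, Sequence, Tuple


def prompt_sensitivity(
    model: Callable[[str], str],
    templates: Sequence[str],
    items: Sequence[Tuple[str, str]],
) -> Dict[str, object]:
    """Score one model under several prompt templates and report the spread."""
    accuracies = []
    for template in templates:
        correct = 0
        for question, expected in items:
            answer = model(template.format(question=question))
            # Assumed scoring rule: case-insensitive exact match.
            correct += answer.strip().lower() == expected.strip().lower()
        accuracies.append(correct / len(items))
    return {
        "per_template": accuracies,
        "mean": mean(accuracies),
        "spread": max(accuracies) - min(accuracies),
        "stdev": pstdev(accuracies),
    }


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in that only answers prompts starting with 'Q:'."""
    answers = {"2+2": "4", "capital of France": "Paris"}
    if not prompt.startswith("Q:"):
        return "unsure"
    for key, value in answers.items():
        if key in prompt:
            return value
    return "unsure"


# Five variants of the same request, differing only in formatting.
TEMPLATES = [
    "Q: {question}\nA:",
    "Question: {question}\nAnswer:",
    "Please answer briefly: {question}",
    "Q: What is {question}?",
    "{question}",
]
ITEMS = [("2+2", "4"), ("capital of France", "Paris")]

report = prompt_sensitivity(toy_model, TEMPLATES, ITEMS)
print(report["per_template"], report["spread"])
```

A large spread between the best and worst template suggests that a single-prompt benchmark score may not carry over to the prompts real users write. The same harness can serve the second item by running it once on familiar items and once on unseen or adversarial ones and comparing the mean accuracy.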
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Authors did not reproduce benchmark results; analysis is literature-based and partly subjective
- The literature search and review cut off in October 2023; rapidly evolving models may limit how well the findings still apply
- Language analysis mainly contrasts English and Simplified Chinese; other languages/dialects are underexplored
When Not To Use
- As a direct pass/fail certification for deployed safety-critical systems
- To justify a single leaderboard ranking without further robustness checks
- When you need precise quantitative model-to-model performance claims
Failure Modes
- Benchmark gaming: models memorize test formats or leaked test data
- Non-repeatability: vendor updates make results transient
- Cultural bias: English-centric rubrics misrepresent global users
Core Entities
Models
- GPT-4
- ChatGPT
- GPT-3
- Codex
- Flan-PaLM
- Mixtral 8x7B
- LLaMA
Metrics
- Accuracy
- Perplexity
- F1-score
- ROUGE-L
- Unit-test pass rate
Datasets
- MedQA
- MedMCQA
- PubMedQA
- Financial PhraseBank
- FiQA 2018
- HealthSearchQA
Benchmarks
- MMLU
- HumanEval
- LegalBench
- FLUE
- MultiMedQA
- M3KE
- T-Bench
- Chain-of-Thought Hub
- KoLA
- SciBench
- ARB
- Xiezhi
- BIG-bench
- AGIEval
- ToolAlpaca
- HELM
- ToolBench
- PromptBench
- AgentBench
- API-Bank
- C-Eval
- BOLAA
- HaluEval

