Survey of 23 LLM benchmarks finds widespread blind spots; recommends behavioral profiling and audits

Overview

Decision Snapshot: Needs Validation

The paper provides a structured critique and clear prevalence counts, but its conclusions are based on a literature review rather than fresh experiments, so apply recommendations cautiously and validate on your own models.

Citations: 42

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/2

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 40%

Authors

Timothy R. McIntosh, Teo Susnjak, Nalin Arachchilage, Tong Liu, Paul Watters, Malka N. Halgamuge

Why It Matters For Business

Benchmark scores can mislead product decisions if they reflect memorization, prompt sensitivity, or English-only tests; firms should test models under realistic prompts, languages, and safety scenarios.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder, Engineering Lead

Summary TLDR

This paper reviews 23 prominent LLM benchmarks and finds widespread weaknesses in how models are tested. Key problems include sensitivity to prompt formatting, benchmarks that reward memorization rather than reasoning, English centricity and cultural blind spots, slow and inconsistent implementations, and reliance on LLMs to generate evaluations. The authors propose a unified evaluation framework (people, process, technology) and recommend moving from static tests to ongoing behavioral profiling and regular audits to better capture real-world risks.

Problem Statement

Current LLM benchmarks often fail to measure real-world behavior and safety. Benchmarks are often static, English-centric, inconsistent to run, easy to game, and unable to distinguish genuine reasoning from superficial optimization.

Main Contribution

A unified evaluation framework for LLM benchmarks based on People, Process, Technology (PPT), aimed at assessing both functionality and integrity

A systematic critique of 23 state-of-the-art LLM benchmarks, identifying common inadequacies across technological, processual, and human dimensions

Key Findings

Response variability breaks standardized tests

Numbers: 22/23 benchmarks showed sensitivity

Practical Use: Don't trust single-shot benchmark scores; test models under multiple prompts and slight format changes before deployment

Evidence Ref: Sec V-A; Table II
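
One way to act on this finding is to score the same items under several prompt templates and look at the spread. The sketch below is illustrative only and is not taken from the paper: `accuracy_by_template`, `sensitivity_report`, the templates, and `toy_model` (a stand-in for a real model call) are all hypothetical names, and exact-match scoring is an assumption.

```python
from typing import Callable, Sequence, Tuple


def accuracy_by_template(
    model_fn: Callable[[str], str],
    templates: Sequence[str],
    items: Sequence[Tuple[str, str]],
) -> list:
    """Return exact-match accuracy for each prompt template."""
    scores = []
    for template in templates:
        hits = 0
        for question, expected in items:
            reply = model_fn(template.format(question=question))
            hits += reply.strip().lower() == expected.strip().lower()
        scores.append(hits / len(items))
    return scores


def sensitivity_report(scores: Sequence[float]) -> dict:
    """Summarize how much accuracy moves across templates."""
    return {"min": min(scores), "max": max(scores), "spread": max(scores) - min(scores)}


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model: it only answers correctly
    # when the prompt ends with "Answer:", mimicking format sensitivity.
    answers = {"2 + 2": "4", "3 * 3": "9"}
    if not prompt.rstrip().endswith("Answer:"):
        return "unsure"
    for question, answer in answers.items():
        if question in prompt:
            return answer
    return "unsure"


TEMPLATES = [
    "Q: {question}\nAnswer:",
    "Question: {question}\nA:",
    "{question} = ?\nAnswer:",
]
ITEMS = [("2 + 2", "4"), ("3 * 3", "9")]

scores = accuracy_by_template(toy_model, TEMPLATES, ITEMS)
report = sensitivity_report(scores)
print(scores, report)
```

A large spread suggests that a single reported score mostly reflects the prompt format chosen, not the model. In practice `toy_model` would be replaced by a call to the model under test.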

Benchmarks often reward optimization, not reasoning

Numbers: 22/23 benchmarks flagged for reasoning vs optimization

Practical Use: Add tasks that require on-the-spot reasoning or unseen problem variants to detect memorization and overfitting

Evidence Ref: Sec V-B; Table II
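
A simple check for memorization is to pair each known item with a freshly altered variant and compare accuracy on the two sets. This is a minimal sketch under stated assumptions, not the authors' method: `memorization_gap`, the case format, and `lookup_model` (a stand-in that has memorized the original items) are hypothetical.

```python
from typing import Callable, Sequence


def memorization_gap(model_fn: Callable[[str], str], cases: Sequence[dict]) -> dict:
    """Compare accuracy on original items against accuracy on unseen variants."""
    n = len(cases)
    original_hits = sum(model_fn(c["original"][0]) == c["original"][1] for c in cases)
    variant_hits = sum(model_fn(c["variant"][0]) == c["variant"][1] for c in cases)
    return {
        "original_acc": original_hits / n,
        "variant_acc": variant_hits / n,
        "gap": (original_hits - variant_hits) / n,
    }


# Hypothetical stand-in: a "model" that only knows answers it has seen before.
MEMORIZED = {"What is 12 + 7?": "19", "What is 9 * 6?": "54"}


def lookup_model(prompt: str) -> str:
    return MEMORIZED.get(prompt, "unknown")


CASES = [
    {"original": ("What is 12 + 7?", "19"), "variant": ("What is 13 + 7?", "20")},
    {"original": ("What is 9 * 6?", "54"), "variant": ("What is 9 * 7?", "63")},
]

result = memorization_gap(lookup_model, CASES)
print(result)
```

A gap near zero is consistent with genuine problem solving on these items; a large positive gap hints that the original items were memorized or leaked.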

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Benchmarks with response variability issues	22/23	—	—	survey of 23 benchmarks	Paper's prevalence counts in Sec V-A and Table II	Sec V-A; Table II
Benchmarks relying on human or mixed evaluation	6/23 peer-reviewed at time of writing	—	—	surveyed benchmarks	Section IV, Preliminary Findings	Sec IV

What To Try In 7 Days

Run top candidate models across 5 prompt variants to check sensitivity

Add one unseen or adversarial example per feature to detect memorization

Audit localization by testing critical flows in target user languages
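
The localization audit in the list above could start as a per-language pass-rate table. The following is a rough sketch only: `audit_languages`, the 0.8 threshold, the substring check, and `english_only_model` are assumptions made for illustration, and real flows would need native-speaker review of the expected outputs.

```python
from typing import Callable, Dict, List, Tuple


def audit_languages(
    model_fn: Callable[[str], str],
    flows: Dict[str, List[Tuple[str, str]]],
    threshold: float = 0.8,  # assumed cut-off, tune per product
) -> Dict[str, dict]:
    """Run each language's critical prompts and flag languages below threshold."""
    report = {}
    for lang, cases in flows.items():
        passed = sum(
            1 for prompt, expected in cases
            if expected.lower() in model_fn(prompt).lower()
        )
        rate = passed / len(cases)
        report[lang] = {"pass_rate": rate, "ok": rate >= threshold}
    return report


def english_only_model(prompt: str) -> str:
    # Hypothetical stand-in: handles the English flow and nothing else.
    if "cancel my order" in prompt.lower():
        return "Your order has been cancelled."
    return "Sorry, I did not understand."


FLOWS = {
    "en": [("Cancel my order", "cancelled")],
    "es": [("Cancela mi pedido", "cancelado")],
}

language_report = audit_languages(english_only_model, FLOWS)
print(language_report)
```

Languages marked `ok: False` are candidates for deeper manual testing before launch in that market.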

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Authors did not reproduce benchmark results; analysis is literature-based and partly subjective

Search and review cut off at Oct 2023; rapidly evolving models may change applicability

When Not To Use

As a direct pass/fail certification for deployed safety-critical systems

To justify a single leaderboard ranking without further robustness checks

Failure Modes

Benchmark gaming: models memorize test formats or leaked test data

Non-repeatability: vendor updates make results transient
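
To notice the non-repeatability failure mode early, one option is to fingerprint model outputs for a fixed prompt set and diff the fingerprints between runs. This sketch assumes deterministic decoding (for example temperature 0), otherwise ordinary sampling noise will also show up as changes; `snapshot`, `changed_prompts`, and the two stand-in model versions are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List, Sequence


def snapshot(model_fn: Callable[[str], str], prompts: Sequence[str]) -> Dict[str, str]:
    """Map each prompt to a SHA-256 fingerprint of the model's reply."""
    return {
        p: hashlib.sha256(model_fn(p).encode("utf-8")).hexdigest()
        for p in prompts
    }


def changed_prompts(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """List prompts whose reply fingerprint differs between two snapshots."""
    return sorted(p for p in old if new.get(p) != old[p])


# Hypothetical stand-ins for the same vendor model before and after an update.
def model_v1(prompt: str) -> str:
    return {"capital of France?": "Paris", "2 + 2?": "4"}.get(prompt, "")


def model_v2(prompt: str) -> str:
    return {"capital of France?": "Paris", "2 + 2?": "The answer is 4."}.get(prompt, "")


PROMPTS = ["capital of France?", "2 + 2?"]
before = snapshot(model_v1, PROMPTS)
after = snapshot(model_v2, PROMPTS)
drifted = changed_prompts(before, after)
print(drifted)
```

Any prompt in `drifted` is a signal to re-run the relevant benchmark before relying on older scores.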

Core Entities

Models

GPT-4, ChatGPT, GPT-3, Codex, Flan-PaLM, Mistral 8x7B, LLaMA

Metrics

Accuracy, perplexity, F1-score, ROUGE-L, unit-test pass rate

Datasets

MedQA, MedMCQA, PubMedQA, Financial PhraseBank, FiQA 2018, HealthSearchQA

Benchmarks

MMLU, HumanEval, LegalBench, FLUE, MultiMedQA, M3KE, T-Bench, Chain-of-Thought Hub, KoLA, SciBench, ARB, Xiezhi, BIG-bench, AGIEval, ToolAlpaca, HELM, ToolBench, PromptBench, AgentBench, APIBank, C-Eval, BOLAA, HaluEval
