Survey of 23 LLM benchmarks finds widespread blind spots; recommends behavioral profiling and audits

February 15, 2024 · 6 min

Overview

Decision Snapshot: Needs Validation

The paper provides a structured critique and clear prevalence counts, but its conclusions are based on a literature review rather than fresh experiments, so apply recommendations cautiously and validate on your own models.

Citations: 42

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/2

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 40%

Authors

Timothy R. McIntosh, Teo Susnjak, Nalin Arachchilage, Tong Liu, Paul Watters, Malka N. Halgamuge

Why It Matters For Business

Benchmark scores can mislead product decisions if they reflect memorization, prompt sensitivity, or English-only tests; firms should test models under realistic prompts, languages, and safety scenarios.

Summary (TL;DR)

This paper reviews 23 prominent LLM benchmarks and finds widespread weaknesses in how models are tested. Key problems include sensitivity to prompt formatting, benchmarks that reward memorization rather than reasoning, English-centricity and cultural blind spots, slow and inconsistent implementations, and reliance on LLMs to generate evaluations. The authors propose a unified evaluation framework (people, process, technology) and recommend moving from static tests to ongoing behavioral profiling and regular audits to better capture real-world risks.

Problem Statement

Current LLM benchmarks often fail to measure real-world behavior and safety. Benchmarks are often static, English-centric, inconsistent to run, easy to game, and unable to distinguish genuine reasoning from superficial optimization.

Main Contribution

A unified evaluation framework for LLM benchmarks based on People, Process, Technology (PPT), aimed at assessing both functionality and integrity

A systematic critique of 23 state-of-the-art LLM benchmarks, identifying common inadequacies across technological, processual, and human dimensions

Key Findings

Response variability breaks standardized tests

Numbers: 22/23 benchmarks showed sensitivity

Practical Use: Don't trust single-shot benchmark scores; test models under multiple prompts and slight format changes before deployment

Evidence Ref: Sec V-A; Table II
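
One way to act on this finding is to score the same items under several surface-level prompt rewrites and look at the spread. The sketch below is illustrative only: the `Model` interface, the five templates in `prompt_variants`, and the toy `brittle_model` are all assumptions made for this example, not anything specified in the paper.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical interface: a model is any function from prompt text to answer text.
Model = Callable[[str], str]

def prompt_variants(question: str) -> list[str]:
    """Five assumed surface-level rewrites of the same question."""
    return [
        question,
        f"Q: {question}\nA:",
        f"Question: {question}\nAnswer:",
        f"{question}\n",
        f"Please answer concisely. {question}",
    ]

def sensitivity_report(model: Model, items: Sequence[tuple[str, str]]) -> dict:
    """Exact-match accuracy per prompt variant, plus the max-min spread."""
    n_variants = len(prompt_variants(""))
    per_variant = []
    for v in range(n_variants):
        correct = sum(
            model(prompt_variants(q)[v]).strip().lower() == a.strip().lower()
            for q, a in items
        )
        per_variant.append(correct / len(items))
    return {
        "per_variant": per_variant,
        "mean": mean(per_variant),
        "spread": max(per_variant) - min(per_variant),
    }

def brittle_model(prompt: str) -> str:
    """Toy stand-in that only answers when the prompt ends with 'A:' or 'Answer:'."""
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    if not prompt.rstrip().endswith(("A:", "Answer:")):
        return "I am not sure."
    return next((a for k, a in answers.items() if k in prompt), "unknown")

ITEMS = [("What is 2 + 2?", "4"), ("What is the capital of France?", "Paris")]
report = sensitivity_report(brittle_model, ITEMS)
print(report)
```

A large `spread` suggests the headline score depends on formatting rather than capability. In practice you would replace `brittle_model` with a call to the model under test and use your own items and scoring rule.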

Benchmarks often reward optimization, not reasoning

Numbers: 22/23 benchmarks flagged for reasoning vs optimization

Practical Use: Add tasks that require on-the-spot reasoning or unseen problem variants to detect memorization and overfitting

Evidence Ref: Sec V-B; Table II
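
A rough way to probe for memorization is to compare accuracy on items the model may have seen against freshly generated variants of the same task. The following is a minimal sketch under stated assumptions: the multiplication template, the `memorization_gap` helper, and the toy `memorizing_model` are hypothetical and chosen only to make the idea concrete.

```python
import random
from typing import Callable, Sequence

# Hypothetical interface: a model is any function from prompt text to answer text.
Model = Callable[[str], str]
Item = tuple[str, str]

def make_item(a: int, b: int) -> Item:
    """Build one templated question with its reference answer."""
    return f"What is {a} * {b}?", str(a * b)

def fresh_variants(n: int, seed: int = 0) -> list[Item]:
    """Resample the numbers so the items are unlikely to appear in any training set."""
    rng = random.Random(seed)
    return [make_item(rng.randint(100, 999), rng.randint(100, 999)) for _ in range(n)]

def accuracy(model: Model, items: Sequence[Item]) -> float:
    return sum(model(q).strip() == a for q, a in items) / len(items)

def memorization_gap(model: Model, seen: Sequence[Item], unseen: Sequence[Item]) -> float:
    """Accuracy on possibly-seen items minus accuracy on fresh variants."""
    return accuracy(model, seen) - accuracy(model, unseen)

# Toy stand-in: a "model" that has memorized the public items and nothing else.
SEEN = [make_item(12, 12), make_item(7, 8), make_item(25, 4)]
LEAKED = dict(SEEN)

def memorizing_model(prompt: str) -> str:
    return LEAKED.get(prompt, "0")

gap = memorization_gap(memorizing_model, SEEN, fresh_variants(20))
print(f"memorization gap: {gap:.2f}")
```

A gap near zero is what you would hope for from a model that actually solves the task; a large positive gap is a warning sign, not proof, since the fresh variants may also differ in difficulty.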

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Benchmarks with response variability issues | 22/23 | - | - | survey of 23 benchmarks | Paper's prevalence counts in Sec V-A and Table II | Sec V-A; Table II
Benchmarks relying on human or mixed evaluation | 6/23 peer-reviewed at time of writing | - | - | surveyed benchmarks | Section IV, Preliminary Findings | Sec IV

What To Try In 7 Days

Run top candidate models across 5 prompt variants to check sensitivity

Add one unseen or adversarial example per feature to detect memorization

Audit localization by testing critical flows in target user languages
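
The localization step in the list above can be sketched as a per-language pass rate over a handful of critical flows. Everything here is an assumption made for illustration: the `CASES` schema, the keyword-based pass check, the 0.8 threshold, and the toy `english_only_model`.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical interface: a model is any function from prompt text to answer text.
Model = Callable[[str], str]

# Assumed test cases: one critical flow expressed in each target language,
# with a keyword the reply should contain in that language.
CASES = [
    {"lang": "en", "prompt": "How do I request a refund?", "must_contain": "refund"},
    {"lang": "es", "prompt": "¿Cómo solicito un reembolso?", "must_contain": "reembolso"},
    {"lang": "de", "prompt": "Wie beantrage ich eine Rückerstattung?", "must_contain": "Rückerstattung"},
]

def audit_by_language(model: Model, cases: list[dict], threshold: float = 0.8):
    """Return per-language pass rates and the languages falling below the threshold."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for case in cases:
        total[case["lang"]] += 1
        if case["must_contain"].lower() in model(case["prompt"]).lower():
            passed[case["lang"]] += 1
    rates = {lang: passed[lang] / total[lang] for lang in total}
    flagged = sorted(lang for lang, rate in rates.items() if rate < threshold)
    return rates, flagged

def english_only_model(prompt: str) -> str:
    """Toy stand-in that always replies in English, whatever the input language."""
    return "You can request a refund from the Orders page."

rates, flagged = audit_by_language(english_only_model, CASES)
print(rates, flagged)
```

Keyword matching is a crude pass criterion; a real audit would more likely use native-speaker review or task-specific checks for each flow.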

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Authors did not reproduce benchmark results; analysis is literature-based and partly subjective

Search and review cut off at Oct 2023; rapidly evolving models may change applicability

When Not To Use

As a direct pass/fail certification for deployed safety-critical systems

To justify a single leaderboard ranking without further robustness checks

Failure Modes

Benchmark gaming: models memorize test formats or leaked test data

Non-repeatability: vendor updates make results transient

Core Entities

Models

GPT-4, ChatGPT, GPT-3, Codex, Flan-PaLM, Mistral 8x7B, LLaMA

Metrics

Accuracy, perplexity, F1-score, ROUGE-L, unit-test pass rate

Datasets

MedQA, MedMCQA, PubMedQA, Financial PhraseBank, FiQA 2018, HealthSearchQA

Benchmarks

MMLU, HumanEval, LegalBench, FLUE, MultiMedQA, M3KE, T-Bench, Chain-of-Thought Hub, KoLA, SciBench, ARB, Xiezhi, BIG-bench, AGIEval, ToolAlpaca, HELM, ToolBench, PromptBench, AgentBench, APIBank, C-Eval, BOLAA, HaluEval