Survey of 23 LLM benchmarks finds widespread blind spots; recommends behavioral profiling and audits

February 15, 2024 (6 min read)

Overview

Production Readiness

0.4

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

42

Authors

Timothy R. McIntosh, Teo Susnjak, Nalin Arachchilage, Tong Liu, Paul Watters, Malka N. Halgamuge

Why It Matters For Business

Benchmark scores can mislead product decisions if they reflect memorization, prompt sensitivity, or English-only tests; firms should test models under realistic prompts, languages, and safety scenarios.

Summary TLDR

This paper reviews 23 prominent LLM benchmarks and finds widespread weaknesses in how models are tested. Key problems include sensitivity to prompt formatting, benchmarks that reward memorization rather than reasoning, English-centricity and cultural blind spots, slow and inconsistent implementations, and reliance on LLMs to generate evaluations. The authors propose a unified evaluation framework (people, process, technology) and recommend moving from static tests to ongoing behavioral profiling and regular audits to better capture real-world risks.

Problem Statement

Current LLM benchmarks often fail to measure real-world behavior and safety. Many are static, English-centric, inconsistent to run, and easy to game, and they cannot distinguish genuine reasoning from superficial optimization.

Main Contribution

A unified evaluation framework for LLM benchmarks based on People, Process, Technology (PPT), aimed at assessing both functionality and integrity

A systematic critique of 23 state-of-the-art LLM benchmarks, identifying common inadequacies across technological, processual, and human dimensions

A proposal to extend benchmarking with dynamic behavioral profiling and regular post-deployment audits to capture evolving risks and behaviors

Key Findings

Response variability breaks standardized tests

Numbers: 22/23 benchmarks showed sensitivity

Benchmarks often reward optimization, not reasoning

Numbers: 22/23 benchmarks flagged for reasoning vs optimization

Helpfulness vs harmlessness is unresolved

Numbers: 19/23 benchmarks exhibit this tension

Major language and cultural blind spots

Numbers: 17/23 benchmarks ignore linguistic logic diversity

Installation and scaling are barriers to fair comparison

Numbers: 16/23 benchmarks hard to deploy or scale

Using LLMs to build or judge benchmarks adds bias

Numbers: 9/23 benchmarks used model-generated evaluations

Results

Benchmarks with response variability issues

Value: 22/23

Benchmarks relying on human or mixed evaluation

Value: 6/23 peer-reviewed at time of writing

Who Should Care

What To Try In 7 Days

Run top candidate models across 5 prompt variants to check sensitivity

Add one unseen or adversarial example per feature to detect memorization

Audit localization by testing critical flows in target user languages
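The first check above (running candidate models across several prompt variants) can be sketched as follows. This is a minimal illustration, not the paper's method: the five templates, the `prompt_sensitivity` helper, and the `toy_model` stub are all hypothetical, and exact string matching is an assumed scoring rule that a real evaluation would likely replace. The idea is to report accuracy per template, the spread between the best and worst template, and the share of examples whose answer changes with the wording.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical prompt templates; any five rewordings of the same task would do.
PROMPT_VARIANTS: List[str] = [
    "{question}",
    "Question: {question}\nAnswer:",
    "Answer the following question.\n{question}",
    "Q: {question}\nA:",
    "Please answer concisely: {question}",
]


def _normalize(text: str) -> str:
    """Lowercase and trim so trivial formatting differences do not count."""
    return text.strip().lower()


def prompt_sensitivity(
    model: Callable[[str], str],
    examples: Sequence[Tuple[str, str]],
    variants: Sequence[str] = PROMPT_VARIANTS,
) -> Dict[str, object]:
    """Score one model on (question, expected_answer) pairs under each template."""
    if not examples:
        raise ValueError("examples must not be empty")

    correct_per_variant = [0] * len(variants)
    unstable = 0  # examples where the answer differs across templates

    for question, expected in examples:
        answers = [
            _normalize(model(template.format(question=question)))
            for template in variants
        ]
        for i, answer in enumerate(answers):
            if answer == _normalize(expected):
                correct_per_variant[i] += 1
        if len(set(answers)) > 1:
            unstable += 1

    n = len(examples)
    accuracies = [count / n for count in correct_per_variant]
    return {
        "accuracy_per_variant": accuracies,
        "accuracy_spread": max(accuracies) - min(accuracies),
        "unstable_fraction": unstable / n,
    }


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call: it gives a wrong answer for one template."""
    return "5" if prompt.startswith("Q:") else "4"


report = prompt_sensitivity(toy_model, [("What is 2 + 2?", "4")])
```

In practice `toy_model` would be swapped for a wrapper around the model under test. A large `accuracy_spread` or `unstable_fraction` suggests that a single-template benchmark score may not carry over to the prompts real users write.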

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Authors did not reproduce benchmark results; analysis is literature-based and partly subjective
  • Search and review cut off at Oct 2023; rapidly evolving models may change applicability
  • Language analysis mainly contrasts English and Simplified Chinese; other languages/dialects are underexplored

When Not To Use

  • As a direct pass/fail certification for deployed safety-critical systems
  • To justify a single leaderboard ranking without further robustness checks
  • When you need precise quantitative model-to-model performance claims

Failure Modes

  • Benchmark gaming: models memorize test formats or leaked test data
  • Non-repeatability: vendor updates make results transient
  • Cultural bias: English-centric rubrics misrepresent global users

Core Entities

Models

  • GPT-4
  • ChatGPT
  • GPT-3
  • Codex
  • Flan-PaLM
  • Mixtral 8x7B
  • LLaMA

Metrics

  • Accuracy
  • Perplexity
  • F1-score
  • ROUGE-L
  • Unit-test pass rate

Datasets

  • MedQA
  • MedMCQA
  • PubMedQA
  • Financial PhraseBank
  • FiQA 2018
  • HealthSearchQA

Benchmarks

  • MMLU
  • HumanEval
  • LegalBench
  • FLUE
  • MultiMedQA
  • M3KE
  • T-Bench
  • Chain-of-Thought Hub
  • KoLA
  • SciBench
  • ARB
  • Xiezhi
  • BIG-bench
  • AGIEval
  • ToolAlpaca
  • HELM
  • ToolBench
  • PromptBench
  • AgentBench
  • API-Bank
  • C-Eval
  • BOLAA
  • HaluEval