A single-source survey of how we test LLMs: benchmarks, gaps, and practical directions

October 30, 2023 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

61

Authors

Zishan Guo, Renren Jin, Chuang Liu, Yufei Huang, Dan Shi, Supryadi, Linhao Yu, Yan Liu, Jiaxuan Li, Bojian Xiong, Deyi Xiong

Links

Abstract / PDF

Why It Matters For Business

Accuracy alone is insufficient: businesses must also test truthfulness, bias, tool use, and robustness to avoid legal risk, poor UX, and harmful outputs.

Summary TLDR

This 111-page survey organizes LLM evaluation into three big areas—knowledge & capability, alignment, and safety—and catalogs major benchmarks, datasets, evaluation methods, and platforms. It summarizes how we test question answering, reasoning, tool use, bias, toxicity, truthfulness, robustness and agent behavior. It also highlights key weaknesses: static benchmarks that leak into training, fragile judge methods, limited real-world tool & agent tests, and the need for dynamic, risk-aware evaluations.

Problem Statement

We lack a unified, up-to-date practice for measuring both the capabilities and the risks of large language models. Existing benchmarks often focus on narrow tasks, are static (so they leak into training), or ignore safety and agent-style behaviors. This survey maps current evaluations and points out where practitioners should add tests before deploying LLMs.

Main Contribution

A clean taxonomy: knowledge/capability, alignment, safety, specialized domains, and evaluation organization.

A broad catalog of datasets and benchmarks across QA, reasoning, tool-use, toxicity, truthfulness, robustness, and domain tests.

A review of evaluation methods (automatic, human, LLM-as-judge) and a discussion of blind spots and future directions (dynamic evaluation, agent/risk testing).

Key Findings

Public adoption exploded: ChatGPT reached 100 million users within two months of launch.

Numbers: 100M users in two months

Tool-augmented robotic planning can achieve high success in simulation, but execution success still lags behind planning success.

Numbers: PaLM-SayCan 84% planning / 74% execution (simulated kitchen)

Tool-invocation benchmarks show that open-source models can match or beat GPT-4 on some multi-tool tasks.

Numbers: Open-source ≥ GPT-4 on 4/8 ToolBench tasks

Mathematical ability varies by model and prompting; GPT-4 reaches strong but imperfect scores on grade-level math.

Numbers: GPT-4 >60% accuracy across grades (CMATH)

Benchmarks and diagnostic datasets can be unreliable or noisy.

Numbers: only 0%–58% of WinoBias/WinoGender tests are unaffected by dataset pitfalls

Static benchmark contamination is real and undermines evaluation validity.

Agent benchmarks show current autonomous LLM agents still struggle on realistic tasks.

Numbers: WebArena GPT-4 agent success 10.59%

Simple prompt/noise perturbations and translations can dramatically degrade performance.

Human evaluation remains necessary but costly and subjective.

Age-related sentiment bias is measurable and large in sentiment models.

Numbers: "young" 66% more likely to be rated positive than "old"

Results

public adoption

Value: ChatGPT reached 100M users in two months

robotic planning success (simulated)

Value: 84% planning success

robotic execution success (simulated)

Value: 74% execution success

open-source multi-tool performance

Value: open-source models ≥ GPT-4 on 4 out of 8 ToolBench tasks

Baseline: GPT-4

agent success on WebArena

Value: 10.59% (best GPT-4 agent success rate)

Who Should Care

What To Try In 7 Days

Run your core prompts through a small test suite covering accuracy, toxicity (Perspective API), and factuality (QAQG).

Add prompt-typo and adversarial-prompt checks to the CI tests for critical flows.

Benchmark every tool-integrated flow end to end, tracking plan pass rate and execution pass rate separately.
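A prompt-typo check of the kind suggested above could look roughly like the sketch below. It is only an illustration: `add_typos` and `robustness_gap` are hypothetical helper names, `model` is assumed to be any callable that maps a prompt string to an answer string, and adjacent-letter swaps are just one simple noise model among many.

```python
import random


def add_typos(prompt: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap adjacent letters with probability `rate` per eligible position."""
    rng = random.Random(seed)  # seeded so CI runs are reproducible
    chars = list(prompt)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair we just swapped
        else:
            i += 1
    return "".join(chars)


def robustness_gap(model, cases, rate: float = 0.1, seed: int = 0):
    """Return (clean_accuracy, perturbed_accuracy) over (prompt, expected) pairs."""
    if not cases:
        raise ValueError("need at least one test case")
    clean = sum(model(p) == expected for p, expected in cases)
    noisy = sum(model(add_typos(p, rate, seed)) == expected for p, expected in cases)
    return clean / len(cases), noisy / len(cases)
```

In a CI job, one option is to fail the build when the perturbed accuracy drops more than an agreed margin below the clean accuracy; the margin itself is a product decision, not something the survey specifies.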

Agent Features

Memory

  • short-term context (prompt history)
  • retrieval-augmented knowledge (search/web APIs)

Planning

  • multi-step planning as a separate measure (plan pass rate)
  • Chain-of-Thought and Plan-and-Solve prompting
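Treating planning as a separate measure means logging two outcomes per episode rather than one. The sketch below assumes a hypothetical `Episode` record with a flag for "the plan was judged valid" and a flag for "the task was completed"; how each flag is judged depends on the benchmark and is not defined here.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    plan_ok: bool      # the generated plan was judged valid
    executed_ok: bool  # the task was actually completed end to end


def pass_rates(episodes):
    """Compute plan pass rate, execution pass rate, and execution rate given a valid plan."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    planned = [e for e in episodes if e.plan_ok]
    return {
        "plan": sum(e.plan_ok for e in episodes) / n,
        "execution": sum(e.executed_ok for e in episodes) / n,
        # Isolates execution failures from planning failures.
        "execution_given_plan": (
            sum(e.executed_ok for e in planned) / len(planned) if planned else 0.0
        ),
    }
```

Reporting the conditional rate alongside the two raw rates helps show whether failures come from bad plans or from good plans that break during execution.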

Tool Use

  • API calls
  • web search / browser actions
  • code execution and runtime tools
  • database and REST API interactions

Frameworks

  • ReAct
  • Toolformer
  • WebGPT
  • PaLM-SayCan
  • ToolLLM / API-Bank connectors

Architectures

  • seq2seq LLMs
  • browser-assisted (WebGPT style)
  • tool-augmented LLM connectors

Collaboration

  • human-in-the-loop preference comparisons
  • arena-style pairwise preference evaluations (Elo)
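For readers unfamiliar with Elo, the following is a minimal sketch of the standard Elo update applied to pairwise preference votes. The starting rating of 1000 and the K-factor of 32 are conventional illustrative values, not settings taken from the survey or from any particular arena, and `rate_models` is a hypothetical helper name.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Update both ratings; score_a is 1.0 (A wins), 0.5 (tie), or 0.0 (B wins)."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


def rate_models(comparisons, initial: float = 1000.0, k: float = 32.0):
    """Process (model_a, model_b, score_a) votes in order and return final ratings."""
    ratings = {}
    for a, b, score_a in comparisons:
        r_a = ratings.get(a, initial)
        r_b = ratings.get(b, initial)
        ratings[a], ratings[b] = update_elo(r_a, r_b, score_a, k)
    return ratings
```

Because each update depends on the current ratings, the final numbers depend on the order in which votes are processed; shuffling the votes and averaging over several passes is one way to reduce that sensitivity.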

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Survey summarizes existing literature; it does not run unified new experiments.
  • Benchmark quality varies; some benchmarks have known pitfalls and noise.
  • Human evaluation recommendations are expensive to scale.
  • Some areas (long-term agentic risk, dynamic online evaluation) are nascent and lack standardized datasets.

When Not To Use

  • Don't use this survey as a replacement for domain-specific validation or certification.
  • Don't assume benchmark averages reflect real-world end-to-end safety.
  • Don't rely on a single automatic metric to certify truthfulness or fairness.

Failure Modes

  • Hallucination: fluent but false outputs on domain facts
  • Benchmark leakage: test data appearing in training data
  • Judge bias: LLM-as-evaluator reflects its own biases
  • Overfitting to benchmark quirks rather than robust reasoning
  • Multilingual blind spots and poor low-resource coverage
  • Tool invocation hallucination when calling APIs
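Benchmark leakage can be screened for, at least crudely, by measuring n-gram overlap between a test item and a sample of training text. The sketch below is one such heuristic, not a method prescribed by the survey: whitespace tokenization, the default window of 8 tokens, and the function names are all assumptions chosen for illustration.

```python
def ngrams(text: str, n: int):
    """Return the set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(test_item: str, corpus_docs, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in the corpus sample."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n tokens, nothing to compare
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```

A high ratio suggests the item deserves a manual look; a low ratio does not prove the item is clean, since paraphrased or translated copies would go undetected by exact matching.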

Core Entities

Models

  • GPT-3
  • text-davinci-003
  • GPT-3.5
  • ChatGPT
  • GPT-4
  • Codex
  • PaLM
  • PaLM-SayCan
  • Vicuna
  • Claude
  • GPT-NeoX
  • OPT
  • BART
  • mBART
  • C-BART

Metrics

  • Accuracy
  • exact match (EM)
  • F1
  • ROUGE-L
  • BLEU
  • pass rate
  • execution success rate
  • Perspective API toxicity score
  • Elo (arena)
  • log perplexity
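Exact match and F1 are usually computed on normalized answer strings in extractive QA. The sketch below follows the common SQuAD-style recipe (lowercasing, stripping punctuation and English articles, token-level overlap); it is a simplified re-implementation for illustration, not the official scoring script of any benchmark listed here.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        return float(p == g)  # both empty counts as a match
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision = common / len(p)
    recall = common / len(g)
    return 2 * precision * recall / (precision + recall)
```

F1 gives partial credit where exact match gives none: an answer of "in Paris France" against the reference "Paris" fails exact match but still earns an F1 of 0.5.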

Datasets

  • SQuAD
  • SQuAD 2.0
  • HotpotQA
  • GSM8K
  • SVAMP
  • LAMA
  • KoLA
  • WikiFact
  • MMLU
  • C-Eval
  • CMATH
  • TruthfulQA
  • RealToxicityPrompts
  • NewsQA
  • BIG-bench
  • HumanEval
  • MBPP
  • GeneTuring
  • WebArena

Benchmarks

  • GLUE
  • SuperGLUE
  • HELM
  • BIG-bench
  • MMLU
  • C-Eval
  • OpenCompass
  • OpenAI Evals
  • Dynabench
  • ToolBench
  • API-Bank
  • ToolAlpaca
  • RestBench

Context Entities

Models

  • BERT
  • RoBERTa
  • T5
  • mT5
  • GPT-Neo
  • Codex-family

Metrics

  • human preference comparisons
  • self-consistency
  • QAQG pipeline scores
  • entailment-based factuality
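Self-consistency is commonly implemented by sampling several answers to the same prompt and taking a majority vote over the final answers. The sketch below assumes the final answers have already been extracted as strings; `self_consistent_answer` is a hypothetical helper, and the agreement ratio it returns is an informal confidence signal rather than a calibrated probability.

```python
from collections import Counter


def self_consistent_answer(samples):
    """Return (majority_answer, agreement) for final answers sampled from one prompt."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    # Light normalization so "42" and "42 " count as the same vote.
    counts = Counter(s.strip().lower() for s in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)
```

Low agreement across samples can be used as a trigger for human review, which fits the survey's point that human evaluation remains necessary but should be spent where it matters most.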

Datasets

  • ARC
  • CommonsenseQA
  • PIQA
  • HellaSwag
  • MultiRC
  • NarrativeQA
  • WikiHop
  • HotpotQA (multi-hop)
  • ReClor
  • LogiQA
  • LogicInference
  • FOLIO

Benchmarks

  • LongBench
  • HumanEval+ (EvalPlus)
  • AggreFact
  • SummEval