A single-source survey of how we test LLMs: benchmarks, gaps, and practical directions

October 30, 2023, 9 min read

Overview

Decision Snapshot: Needs Validation

The survey compiles evidence from many benchmarks and papers and recommends broader, dynamic, safety-aware evaluation. Use it as a map for picking targeted tests for your application.

Citations: 61

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 13

Trust Signals

Findings with numeric evidence: 7/10

Findings with evidence refs: 10/10

Results with explicit delta: 0/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 40%

Authors

Zishan Guo, Renren Jin, Chuang Liu, Yufei Huang, Dan Shi, Supryadi, Linhao Yu, Yan Liu, Jiaxuan Li, Bojian Xiong, Deyi Xiong

Why It Matters For Business

LLM evaluations show accuracy alone is insufficient: businesses must test truthfulness, bias, tool use, and robustness to avoid legal risks, bad UX, or harmful outputs.

Who Should Care

Summary TLDR

This 111-page survey organizes LLM evaluation into three major areas (knowledge and capability, alignment, and safety) and catalogs major benchmarks, datasets, evaluation methods, and platforms. It summarizes how the field tests question answering, reasoning, tool use, bias, toxicity, truthfulness, robustness, and agent behavior. It also highlights key weaknesses: static benchmarks that leak into training data, fragile judge methods, limited real-world tool and agent tests, and a lack of dynamic, risk-aware evaluations.

Problem Statement

We lack a unified, up-to-date practice for measuring both the capabilities and the risks of large language models. Existing benchmarks often focus on narrow tasks, are static (so they can leak into training data), or ignore safety and agent-style behaviors. This survey maps current evaluations and points out where practitioners should add tests before deploying LLMs.

Main Contribution

A clean taxonomy: knowledge/capability, alignment, safety, specialized domains, and evaluation organization.

A broad catalog of datasets and benchmarks across QA, reasoning, tool-use, toxicity, truthfulness, robustness, and domain tests.

Key Findings

Public adoption exploded: ChatGPT reached 100 million users within two months of launch.

Numbers: 100M users in two months

Practical Use: High adoption raises urgency: production deployments must add safety and truthfulness checks beyond standard accuracy tests.

Evidence Ref: Introduction

Tool-augmented robotic planning can reach high planning success in simulation, yet execution success is noticeably lower.

Numbers: PaLM-SayCan 84% planning / 74% execution (simulated kitchen)

Practical Use: Measure both planning correctness and real-world execution; a high plan pass rate doesn't guarantee successful actuation.

Evidence Ref: 3.4.1 Tool Manipulation
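One way to keep the two numbers apart is to log each trial with separate planning and execution outcomes. The sketch below is a minimal illustration, not the survey's or PaLM-SayCan's scoring code: the `Episode` record, the rule that execution only counts when the plan was also correct, and the toy data are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One planning trial (hypothetical record format, assumed for this sketch)."""
    plan_ok: bool  # the generated plan was judged correct
    exec_ok: bool  # the plan also ran to a successful end state


def pass_rates(episodes):
    """Return (plan pass rate, execution pass rate) over all episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    plan_rate = sum(e.plan_ok for e in episodes) / n
    # Assumption: an execution only counts as a success if its plan was correct too.
    exec_rate = sum(e.plan_ok and e.exec_ok for e in episodes) / n
    return plan_rate, exec_rate


# Toy outcomes, made up for illustration (not data from the survey).
episodes = (
    [Episode(True, True)] * 3
    + [Episode(True, False)]
    + [Episode(False, False)]
)
plan_rate, exec_rate = pass_rates(episodes)
print(f"plan pass rate: {plan_rate:.2f}, execution pass rate: {exec_rate:.2f}")
```

Reporting both rates side by side makes the gap between a correct plan and a completed task visible instead of hiding it in one aggregate score.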

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
public adoption | ChatGPT reached 100M users in two months | - | - | Introduction | ChatGPT amassed over 100 million users within two months of its launch | Introduction
robotic planning success (simulated) | 84% planning success | - | - | PaLM-SayCan on simulation | PaLM-SayCan achieves an 84% planning success rate in simulated kitchen | 3.4.1 Tool Manipulation

What To Try In 7 Days

Run your core prompts through a small suite: accuracy, toxicity (Perspective API), and factuality (QAQG) tests.

Add prompt-typo and adversarial-prompt checks to the CI tests for critical flows.

Benchmark any tool-integrated flows end-to-end (plan pass rate + execution pass rate).
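The prompt-typo check in the list above could look roughly like the following. This is a sketch under stated assumptions: `add_typos`, `robustness_rate`, and `toy_model` are hypothetical names, adjacent-letter swaps are just one simple perturbation, and `toy_model` stands in for whatever function actually calls your LLM.

```python
import random


def add_typos(prompt, rate=0.1, seed=0):
    """Swap adjacent letters at random to simulate typing errors."""
    rng = random.Random(seed)
    chars = list(prompt)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)


def robustness_rate(model, cases, rate=0.1, seed=0):
    """Fraction of cases still answered as expected on the perturbed prompt."""
    if not cases:
        raise ValueError("need at least one test case")
    kept = 0
    for idx, (prompt, expected) in enumerate(cases):
        noisy = add_typos(prompt, rate=rate, seed=seed + idx)
        if model(noisy).strip().lower() == expected.strip().lower():
            kept += 1
    return kept / len(cases)


def toy_model(prompt):
    # Stand-in for a real LLM call; it keys on one exact word, so typos can break it.
    return "Paris" if "capital" in prompt.lower() else "unknown"


cases = [("What is the capital of France?", "Paris")]
score = robustness_rate(toy_model, cases, rate=0.2)
print(f"answers preserved under typos: {score:.0%}")
```

In CI, a check like this would typically fail the build when the preserved-answer rate for a critical flow drops below a threshold you choose.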

Agent Features

Memory: short-term context (prompt history); retrieval-augmented knowledge (search/web APIs)
Planning: multi-step planning as a separate measure (plan pass rate); Chain-of-Thought and Plan-and-Solve prompting
Tool Use: API calls; web search / browser actions; code execution and runtime tools; database and REST API interactions
Frameworks: ReAct; Toolformer; WebGPT; PaLM-SayCan; ToolLLM / API-Bank connectors
Architectures: seq2seq LLMs; browser-assisted (WebGPT style); tool-augmented LLM connectors
Collaboration: human-in-the-loop preference comparisons; arena-style pairwise preference evaluations (Elo)
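For the arena-style pairwise evaluations listed under Collaboration, a rating can be maintained with the standard Elo update. The sketch below uses the conventional 400-point scale and K = 32; those constants, the model names, and the vote list are assumptions for illustration, and any particular arena may use different settings or a different rating model.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Update both ratings after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Whatever A gains, B loses, so the rating total is conserved.
    return rating_a + delta, rating_b - delta


# Hypothetical models and human votes, made up for illustration.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [
    ("model_x", "model_y", 1.0),  # model_x preferred
    ("model_x", "model_y", 0.5),  # tie
    ("model_y", "model_x", 1.0),  # model_y preferred
]
for a, b, score_a in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)
print(ratings)
```

Because each update depends on the order of votes, ratings from a small number of comparisons are noisy; more votes per pair give steadier rankings.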

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Survey summarizes existing literature; it does not run unified new experiments.

Benchmark quality varies; some benchmarks have known pitfalls and noise.

When Not To Use

Don't use this survey as a replacement for domain-specific validation or certification.

Don't assume benchmark averages reflect real-world end-to-end safety.

Failure Modes

Hallucination: fluent but false outputs on domain facts

Benchmark leakage: test data appearing in training data
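A rough first check for benchmark leakage is to measure how many of a test item's word n-grams also appear in a sample of training text. The sketch below is an assumption-laden illustration rather than a method prescribed by the survey: the function names, whitespace tokenization, and the choice of n are all hypothetical, and high overlap only flags an item for closer inspection.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(test_item, corpus_docs, n=8):
    """Share of the test item's n-grams that also occur in the corpus sample."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Toy strings, made up for illustration.
corpus_sample = ["forum post: what is the tallest mountain on earth answer everest"]
leaked = "What is the tallest mountain on Earth"
fresh = "Which river flows through the city of Vienna"
print(overlap_ratio(leaked, corpus_sample, n=3))
print(overlap_ratio(fresh, corpus_sample, n=3))
```

Exact n-gram matching misses paraphrased or translated copies, so a low ratio does not prove an item is uncontaminated.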

Core Entities

Models

GPT-3, text-davinci-003, GPT-3.5, ChatGPT, GPT-4, Codex, PaLM, PaLM-SayCan, Vicuna, Claude, GPT-NeoX, OPT, BART, mBART, C-BART

Metrics

Accuracy, exact match (EM), F1, ROUGE-L, BLEU, pass rate, execution success rate, Perspective API toxicity score, Elo (arena), log perplexity
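Exact match and F1 are often computed in the style popularized by SQuAD: normalize both strings, then compare them whole (EM) or token by token (F1). The sketch below follows that common recipe as an assumption; the function names are hypothetical, and individual benchmarks may normalize differently.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(round(token_f1("Eiffel Tower in Paris", "Eiffel Tower"), 3))
```

EM rewards only fully correct answers, while F1 gives partial credit for overlapping tokens, which is why the two are usually reported together.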

Datasets

SQuAD, SQuAD 2.0, HotpotQA, GSM8K, SVAMP, LAMA, KoLA, WikiFact, MMLU, C-Eval, CMATH, TruthfulQA, RealToxicityPrompts, NewsQA, BIG-bench, HumanEval, MBPP, GeneTuring, WebArena

Benchmarks

GLUE, SuperGLUE, HELM, BIG-bench, MMLU, C-Eval, OpenCompass, OpenAI Evals, Dynabench, ToolBench, API-Bank, ToolAlpaca, RestBench

Context Entities

Models

BERT, RoBERTa, T5, mT5, GPT-Neo, Codex-family

Metrics

human preference comparisons, self-consistency, QAQG pipeline scores, entailment-based factuality

Datasets

ARC, CommonsenseQA, PIQA, HellaSwag, MultiRC, NarrativeQA, WikiHop, HotpotQA (multi-hop), ReClor, LogiQA, LogicInference, FOLIO

Benchmarks

LongBench, HUMANEVAL+ (EvalPlus), AGGREFACT, SummEval