A single-source survey of how we test LLMs: benchmarks, gaps, and practical directions

Overview

Decision Snapshot: Needs Validation

The survey compiles wide evidence from many benchmarks and papers to recommend broader, dynamic, and safety-aware evaluation; use this map to pick targeted tests for your app.

Citations: 61

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 13

Trust Signals

Findings with numeric evidence: 7/10

Findings with evidence refs: 10/10

Results with explicit delta: 0/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 40%

Authors

Zishan Guo, Renren Jin, Chuang Liu, Yufei Huang, Dan Shi, Supryadi, Linhao Yu, Yan Liu, Jiaxuan Li, Bojian Xiong, Deyi Xiong

Links

Abstract / PDF / Code

Why It Matters For Business

LLM evaluations show accuracy alone is insufficient: businesses must test truthfulness, bias, tool use, and robustness to avoid legal risks, bad UX, or harmful outputs.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead, Founder

Summary TLDR

This 111-page survey organizes LLM evaluation into three big areas—knowledge & capability, alignment, and safety—and catalogs major benchmarks, datasets, evaluation methods, and platforms. It summarizes how we test question answering, reasoning, tool use, bias, toxicity, truthfulness, robustness and agent behavior. It also highlights key weaknesses: static benchmarks that leak into training, fragile judge methods, limited real-world tool & agent tests, and the need for dynamic, risk-aware evaluations.

Problem Statement

We lack a unified, updated practice to measure both capabilities and risks of large language models. Existing benchmarks often focus on narrow tasks, are static (so they leak into training), or ignore safety and agent-style behaviors. This survey maps current evaluations and points out where practitioners should add tests before deploying LLMs.

Main Contribution

A clean taxonomy: knowledge/capability, alignment, safety, specialized domains, and evaluation organization.

A broad catalog of datasets and benchmarks across QA, reasoning, tool-use, toxicity, truthfulness, robustness, and domain tests.

Key Findings

Public adoption exploded: ChatGPT reached 100 million users within two months of launch.

Numbers: 100M users in two months

Practical Use: High adoption raises urgency: production deployments must add safety and truthfulness checks beyond standard accuracy tests.

Evidence Ref: Introduction

Tool-augmented robotic planning can score well on planning in simulation yet lose ground at execution: a valid plan does not guarantee the task is completed.

Numbers: PaLM-SayCan 84% planning / 74% execution (simulated kitchen)

Practical Use: Measure both planning correctness and real-world execution; a high plan pass rate doesn't guarantee successful actuation.

Evidence Ref: 3.4.1 Tool Manipulation
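The gap between planning and execution in the finding above can be tracked with two separate rates. The following is a minimal sketch, not the survey's or PaLM-SayCan's evaluation code: the `Episode` record, its field names, and the toy log are assumptions made for illustration, and both rates are computed over all episodes.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One task attempt (hypothetical record layout)."""
    plan_ok: bool      # the generated plan was judged valid
    executed_ok: bool  # the plan actually completed when run


def pass_rates(episodes):
    """Return (plan pass rate, execution pass rate, gap) as fractions."""
    if not episodes:
        raise ValueError("no episodes to score")
    n = len(episodes)
    plan_rate = sum(e.plan_ok for e in episodes) / n
    exec_rate = sum(e.executed_ok for e in episodes) / n
    return plan_rate, exec_rate, plan_rate - exec_rate


# Toy log with made-up outcomes, only to show the call shape.
episodes = [
    Episode(plan_ok=True, executed_ok=True),
    Episode(plan_ok=True, executed_ok=False),
    Episode(plan_ok=True, executed_ok=True),
    Episode(plan_ok=False, executed_ok=False),
]
plan_rate, exec_rate, gap = pass_rates(episodes)
print(f"plan={plan_rate:.2f} exec={exec_rate:.2f} gap={gap:.2f}")
```

Reporting the gap alongside the two rates makes it harder for a strong planning score to hide actuation failures.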

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
public adoption	ChatGPT reached 100M users in two months	—	—	Introduction	ChatGPT amassed over 100 million users within two months of its launch	Introduction
robotic planning success (simulated)	84% planning success	—	—	PaLM-SayCan on simulation	PaLM-SayCan achieves an 84% planning success rate in simulated kitchen	3.4.1 Tool Manipulation

What To Try In 7 Days

Run your core prompts through a small suite: accuracy, toxicity (PerspectiveAPI), and factuality (QAQG) tests.

Add prompt-typo and adversarial-prompt checks to the CI tests for critical flows.

Benchmark any tool-integrated flows end-to-end (plan pass rate + execution pass rate).
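The prompt-typo check in the list above could look roughly like the sketch below. It is an illustration under stated assumptions: `add_typos`, `robustness_rate`, and `fake_model` are hypothetical names, adjacent-letter swaps are only one of many possible perturbations, and `fake_model` stands in for a real LLM call.

```python
import random


def add_typos(prompt, rate=0.1, seed=0):
    """Swap adjacent letters at roughly `rate` of eligible positions."""
    rng = random.Random(seed)
    chars = list(prompt)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)


def robustness_rate(model, cases, n_variants=5, rate=0.1):
    """Fraction of perturbed prompts whose answer still passes its check."""
    passed = total = 0
    for prompt, check in cases:
        for seed in range(n_variants):
            answer = model(add_typos(prompt, rate=rate, seed=seed))
            passed += bool(check(answer))
            total += 1
    return passed / total if total else 0.0


def fake_model(prompt):
    """Hypothetical stand-in for a real LLM call."""
    return "4" if "2" in prompt else "unknown"


# Each case pairs a prompt with a pass/fail check on the model's answer.
cases = [("What is 2 + 2?", lambda answer: answer.strip() == "4")]
print(robustness_rate(fake_model, cases))
```

In CI, a threshold on the returned fraction (chosen per flow) can gate a release; the fixed seeds keep the perturbed prompts reproducible between runs.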

Agent Features

Memory

short-term context (prompt history); retrieval-augmented knowledge (search/web APIs)

Planning

multi-step planning as a separate measure (plan pass rate); Chain-of-Thought and Plan-and-Solve prompting

Tool Use

API calls; web search / browser actions; code execution and runtime tools; database and REST API interactions

Frameworks

ReAct, Toolformer, WebGPT, PaLM-SayCan, ToolLLM / API-Bank connectors

Architectures

seq2seq LLMs; browser-assisted (WebGPT style); tool-augmented LLM connectors

Collaboration

human-in-the-loop preference comparisons; arena-style pairwise preference evaluations (Elo)
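Arena-style evaluation turns pairwise preference votes into ratings. The sketch below uses the standard Elo update; the K-factor of 32, the starting rating of 1000, the model names, and the battle log are assumptions for illustration and are not taken from the survey or from any particular arena.

```python
def expected_score(r_a, r_b):
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(r_a, r_b, score_a, k=32.0):
    """Return updated (r_a, r_b); score_a is 1.0 win, 0.5 tie, 0.0 loss for A."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


# Hypothetical battle log: (model A, model B, score for A).
ratings = {"model_x": 1000.0, "model_y": 1000.0}
battles = [
    ("model_x", "model_y", 1.0),
    ("model_x", "model_y", 0.5),
    ("model_y", "model_x", 1.0),
]
for a, b, score_a in battles:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)
print(ratings)
```

Sequential Elo depends on the order of battles, so ratings from small vote counts are best read as rough rankings rather than precise scores.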

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers

Risks & Boundaries

Limitations

Survey summarizes existing literature; it does not run unified new experiments.

Benchmark quality varies; some benchmarks have known pitfalls and noise.

When Not To Use

Don't use this survey as a replacement for domain-specific validation or certification.

Don't assume benchmark averages reflect real-world end-to-end safety.

Failure Modes

Hallucination: fluent but false outputs on domain facts

Benchmark leakage: test data appearing in training data
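One rough way to screen for the benchmark leakage listed above is to look for n-gram overlap between a test item and whatever training text is available for inspection. This is a heuristic sketch, not a method prescribed by the survey: the function names, whitespace tokenization, and the choice of `n` are all assumptions, and a low overlap does not prove the item is unseen.

```python
def ngrams(text, n):
    """Set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(test_item, corpus_docs, n=8):
    """Fraction of the test item's n-grams that also occur in the corpus."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Toy corpus and items (made up), with a small n so the example is readable.
corpus = ["the quick brown fox jumps over the lazy dog"]
print(overlap_ratio("quick brown fox jumps", corpus, n=3))
print(overlap_ratio("a totally unrelated question here", corpus, n=3))
```

Items with a high ratio are candidates for manual review or for replacement with freshly written test cases.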

Core Entities

Models

GPT-3, text-davinci-003, GPT-3.5, ChatGPT, GPT-4, Codex, PaLM, PaLM-SayCan, Vicuna, Claude, GPT-NeoX, OPT, BART, mBART, C-BART

Metrics

Accuracy, exact match (EM), F1, ROUGE-L, BLEU, pass rate, execution success rate, PerspectiveAPI toxicity score, Elo (arena), log perplexity

Datasets

SQuAD, SQuAD 2.0, HotpotQA, GSM8K, SVAMP, LAMA, KoLA, WikiFact, MMLU, C-Eval, CMATH, TruthfulQA, RealToxicityPrompts, NewsQA, BIG-bench, HumanEval, MBPP, GeneTuring, WebArena

Benchmarks

GLUE, SuperGLUE, HELM, BIG-bench, MMLU, C-Eval, OpenCompass, OpenAI Evals, Dynabench, ToolBench, API-Bank, ToolAlpaca, RestBench
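Two of the metrics above, exact match (EM) and F1, are commonly computed on normalized answer strings for QA. The sketch below follows the usual SQuAD-style recipe (lowercase, strip punctuation and English articles, compare tokens); that normalization is an assumption here, and individual benchmarks may define these metrics differently.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall."""
    p_tokens = normalize(pred).split()
    g_tokens = normalize(gold).split()
    if not p_tokens or not g_tokens:
        return float(p_tokens == g_tokens)
    common = sum((Counter(p_tokens) & Counter(g_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(p_tokens)
    recall = common / len(g_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(token_f1("in Paris France", "Paris"))
```

EM rewards only complete agreement, while token F1 gives partial credit for overlapping words, so reporting both shows whether errors are near misses or outright wrong answers.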

Context Entities

Models

BERT, RoBERTa, T5, mT5, GPT-Neo, Codex-family

Metrics

human preference comparisons, self-consistency, QAQG pipeline scores, entailment-based factuality

Datasets

ARC, CommonsenseQA, PIQA, HellaSWAG, MultiRC, NarrativeQA, Wikihop, HotpotQA (multi-hop), ReClor, LogiQA, LogicInference, FOLIO

Benchmarks

LongBench, HUMANEVAL+ (EvalPlus), AGGREFACT, SummEval
