Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
61
Why It Matters For Business
LLM evaluations show accuracy alone is insufficient: businesses must test truthfulness, bias, tool use, and robustness to avoid legal risks, bad UX, or harmful outputs.
Summary TLDR
This 111-page survey organizes LLM evaluation into three big areas—knowledge & capability, alignment, and safety—and catalogs major benchmarks, datasets, evaluation methods, and platforms. It summarizes how we test question answering, reasoning, tool use, bias, toxicity, truthfulness, robustness and agent behavior. It also highlights key weaknesses: static benchmarks that leak into training, fragile judge methods, limited real-world tool & agent tests, and the need for dynamic, risk-aware evaluations.
Problem Statement
We lack a unified, up-to-date practice for measuring both the capabilities and the risks of large language models. Existing benchmarks often focus on narrow tasks, are static (so they leak into training), or ignore safety and agent-style behaviors. This survey maps current evaluations and points out where practitioners should add tests before deploying LLMs.
Main Contribution
A clean taxonomy: knowledge/capability, alignment, safety, specialized domains, and evaluation organization.
A broad catalog of datasets and benchmarks across QA, reasoning, tool-use, toxicity, truthfulness, robustness, and domain tests.
A review of evaluation methods (automatic, human, LLM-as-judge) and a discussion of blind spots and future directions (dynamic evaluation, agent/risk testing).
Key Findings
Public adoption exploded: ChatGPT reached 100 million users within two months of launch.
Tool-augmented robotic planning can achieve high planning success in simulation, but execution success still lags behind.
Some LLM tool-invocation benchmarks show open-source models matching or beating GPT-4 on certain multi-tool tasks.
Mathematical ability varies by model and prompting; GPT-4 reaches strong but imperfect scores on grade-level math.
Benchmarks and diagnostic datasets can be unreliable or noisy.
Static benchmark contamination is real and undermines evaluation validity.
Agent benchmarks show current autonomous LLM agents still struggle on realistic tasks.
Simple prompt/noise perturbations and translations can dramatically degrade performance.
Human evaluation remains necessary but costly and subjective.
Sentiment models show measurable and large age-related bias.
Results
public adoption
robotic planning success (simulated)
robotic execution success (simulated)
open-source multi-tool performance
agent success on WebArena
Who Should Care
What To Try In 7 Days
Run your core prompts through a small suite: accuracy, toxicity (Perspective API), and factuality (QAQG) tests.
Add prompt-typo and adversarial-prompt checks to the CI tests for critical flows.
Benchmark any tool-integrated flows end-to-end (plan pass rate + execution pass rate).
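A minimal sketch of the prompt-typo check suggested above, assuming the model under test can be wrapped as a plain Python callable that maps a prompt string to an answer string. The names add_typos and robustness_gap are hypothetical helpers, not part of any library, and adjacent-letter swaps are only one simple form of noise.

```python
import random


def add_typos(prompt: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap adjacent letters at roughly `rate` of the eligible positions.

    Hypothetical helper: a seeded RNG keeps the perturbation reproducible in CI.
    """
    rng = random.Random(seed)
    chars = list(prompt)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


def robustness_gap(model, cases, rate: float = 0.1, seed: int = 0):
    """Return (clean_accuracy, perturbed_accuracy).

    `model` is any callable prompt -> answer (an assumption of this sketch);
    `cases` is a list of (prompt, expected_answer) pairs.
    """
    if not cases:
        raise ValueError("cases must not be empty")
    clean = sum(model(p).strip() == e for p, e in cases)
    noisy = sum(model(add_typos(p, rate, seed)).strip() == e for p, e in cases)
    return clean / len(cases), noisy / len(cases)
```

In a CI job, one option is to fail the build when the perturbed accuracy drops more than a chosen tolerance below the clean accuracy; the tolerance itself is a product decision.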
Agent Features
Memory
- short-term context (prompt history)
- retrieval-augmented knowledge (search/web APIs)
Planning
- multi-step planning as separate measure (plan pass rate)
- Chain-of-Thought and Plan-and-Solve prompting
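Treating planning as a separate measure can be sketched as follows. This assumes each test episode has already been labeled with two booleans (was the plan valid, did execution reach the goal); the EpisodeResult type and pass_rates function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    plan_ok: bool      # the generated plan was judged valid
    executed_ok: bool  # running the plan actually reached the goal


def pass_rates(results):
    """Return (plan_pass_rate, execution_pass_rate) over all episodes.

    Execution only counts as a pass when the plan was also valid, so the
    execution rate can never exceed the plan rate in this sketch.
    """
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    plan_rate = sum(r.plan_ok for r in results) / n
    exec_rate = sum(r.plan_ok and r.executed_ok for r in results) / n
    return plan_rate, exec_rate
```

Reporting both numbers side by side makes the gap between a plausible plan and a successful run visible, rather than hiding it in one averaged score.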
Tool Use
- API calls
- web search / browser actions
- code execution and runtime tools
- database and REST API interactions
Frameworks
- ReAct
- Toolformer
- WebGPT
- PaLM-SayCan
- ToolLLM / API-Bank connectors
Architectures
- seq2seq LLMs
- browser-assisted (WebGPT style)
- tool-augmented LLM connectors
Collaboration
- human-in-the-loop preference comparisons
- arena-style pairwise preference evaluations (Elo)
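For arena-style comparisons, a rating update in the standard Elo form can be sketched as below. The K-factor of 32 and the 400-point scale are conventional defaults assumed here, not values taken from the survey, and real arenas may use different settings or a different rating model altogether.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return the new (r_a, r_b) after one pairwise comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k=32 is an assumed default for this sketch.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b
```

Because the two adjustments are equal and opposite, the total rating in the pool stays constant, which is why Elo scores are only meaningful relative to the other models in the same arena.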
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Survey summarizes existing literature; it does not run unified new experiments.
- Benchmark quality varies; some benchmarks have known pitfalls and noise.
- Human evaluation recommendations are expensive to scale.
- Some areas (long-term agentic risk, dynamic online evaluation) are nascent and lack standardized datasets.
When Not To Use
- Don't use this survey as a replacement for domain-specific validation or certification.
- Don't assume benchmark averages reflect real-world end-to-end safety.
- Don't rely on a single automatic metric to certify truthfulness or fairness.
Failure Modes
- Hallucination: fluent but false outputs on domain facts
- Benchmark leakage: test data appearing in training data
- Judge bias: LLM-as-evaluator reflects its own biases
- Overfitting to benchmark quirks rather than robust reasoning
- Multilingual blind spots and poor low-resource coverage
- Tool invocation hallucination when calling APIs
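Benchmark leakage can be screened for with a simple n-gram overlap check, sketched below. This is one common heuristic and not a method attributed to the survey; it assumes whitespace tokenization and that a sample of the training corpus is available as plain strings. The function names and the default n of 8 are hypothetical choices.

```python
def ngrams(text: str, n: int = 8):
    """Set of lowercase, whitespace-tokenized n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_items, corpus_docs, n: int = 8):
    """Return (fraction_flagged, flagged_items).

    A test item is flagged when it shares at least one n-gram with any
    corpus document. This catches verbatim overlap only, not paraphrases.
    """
    if not test_items:
        raise ValueError("test_items must not be empty")
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = [item for item in test_items if ngrams(item, n) & corpus_grams]
    return len(flagged) / len(test_items), flagged
```

A low rate from this check does not prove a benchmark is clean, since paraphrased or translated test items would pass unnoticed.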
Core Entities
Models
- GPT-3
- text-davinci-003
- GPT-3.5
- ChatGPT
- GPT-4
- Codex
- PaLM
- PaLM-SayCan
- Vicuna
- Claude
- GPT-NeoX
- OPT
- BART
- mBART
- C-BART
Metrics
- Accuracy
- exact match (EM)
- F1
- ROUGE-L
- BLEU
- pass rate
- execution success rate
- Perspective API toxicity score
- Elo (arena)
- log perplexity
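As an illustration of two of the metrics listed above, exact match (EM) and token-level F1 can be sketched in the style commonly used for SQuAD-type QA scoring. The normalization steps here (lowercasing, stripping punctuation and English articles) are an assumption of this sketch; individual benchmarks define their own rules.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> bool:
    """True when prediction and reference are identical after normalization."""
    return normalize(pred) == normalize(gold)


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        return float(p == g)  # both empty counts as a match
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

EM is strict and penalizes any extra words, while F1 gives partial credit, so reporting both helps separate wrong answers from merely verbose ones.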
Datasets
- SQuAD
- SQuAD 2.0
- HotpotQA
- GSM8K
- SVAMP
- LAMA
- KoLA
- WikiFact
- MMLU
- C-Eval
- CMATH
- TruthfulQA
- RealToxicityPrompts
- NewsQA
- BIG-bench
- HumanEval
- MBPP
- GeneTuring
- WebArena
Benchmarks
- GLUE
- SuperGLUE
- HELM
- BIG-bench
- MMLU
- C-Eval
- OpenCompass
- OpenAI Evals
- Dynabench
- ToolBench
- API-Bank
- ToolAlpaca
- RestBench
Context Entities
Models
- BERT
- RoBERTa
- T5
- mT5
- GPT-Neo
- Codex-family
Metrics
- human preference comparisons
- self-consistency
- QAQG pipeline scores
- entailment-based factuality
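Self-consistency, listed above, is usually implemented by sampling several answers to the same prompt and keeping the most frequent one. A minimal sketch, assuming the sampled answers have already been collected as strings (the function name and the light normalization are hypothetical choices):

```python
from collections import Counter


def self_consistent_answer(answers):
    """Return (majority_answer, agreement) for a list of sampled answers.

    Agreement is the share of non-empty samples that match the majority
    answer; a low value can be treated as a signal of low confidence.
    """
    cleaned = [a.strip().lower() for a in answers if a.strip()]
    if not cleaned:
        raise ValueError("no non-empty answers to vote on")
    answer, count = Counter(cleaned).most_common(1)[0]
    return answer, count / len(cleaned)
```

The agreement value is not a calibrated probability; it only indicates how often independent samples converged on the same final answer.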
Datasets
- ARC
- CommonsenseQA
- PIQA
- HellaSwag
- MultiRC
- NarrativeQA
- WikiHop
- HotpotQA (multi-hop)
- ReClor
- LogiQA
- LogicInference
- FOLIO
Benchmarks
- LongBench
- HumanEval+ (EvalPlus)
- AggreFact
- SummEval

