A practical survey of how, where and what to test in large language models

July 6, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

195

Authors

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, Xing Xie

Why It Matters For Business

Evaluation decides whether an LLM is fit for purpose: pick task‑specific tests, measure robustness and safety, and combine automated and human checks before deployment.

Summary TLDR

This is a wide‑ranging survey of how researchers evaluate large language models (LLMs). It organizes evaluation around three questions: what to evaluate (tasks), where to evaluate (datasets and benchmarks), and how to evaluate (automatic, human, crowd and adversarial protocols). The paper compiles ~46 benchmarks, catalogs metrics, and highlights where LLMs do well (generation, many NLP tasks, some QA) and where they fail (complex reasoning, robustness, some multilingual and factual tasks). It argues that evaluation itself needs to evolve (dynamic, trustworthy, behavioral tests) and maintains a living GitHub repository of resources.

Problem Statement

LLMs are widely used, but existing evaluation methods are fragmented: different tasks, static benchmarks, and inconsistent metrics leave gaps in judging capability, robustness, safety and societal risk. The paper asks what we should test, on which datasets, and with which protocols to get fair, useful evaluations.

Main Contribution

A structured review of LLM evaluation across three dimensions: what (tasks), where (datasets/benchmarks), and how (evaluation protocols).

A compiled catalog of popular benchmarks and datasets (Table 7) and a taxonomy of evaluation methods.

A synthesis of success and failure cases for LLMs across tasks and a set of grand challenges for future evaluation work.

An open, maintained repository of collected resources: https://github.com/MLGroupJLU/LLM-eval-survey.

Key Findings

No single benchmark or protocol reliably ranks all LLM capabilities.

Numbers: 46 popular benchmarks compiled (Sec. 4, Table 7)

LLMs are strong at many generation and standard NLP tasks (summarization, sentiment, QA, classification).

Numbers: ChatGPT often >2% higher than GPT-3 on several QA sets (Sec. 3.1.3)

LLMs still struggle on complex mathematical and abstract reasoning.

Numbers: GPT-4 reaches ~60% on some high-school competition categories but scores much lower on the hardest tasks (Sec. 3.4.1)

LLMs are vulnerable to adversarial prompts and prompt variants.

Performance on multilingual and non‑Latin‑script inputs remains weaker than on English.

Automated metrics save time but do not replace human judgment for open generation, safety and nuanced quality.

Numbers: Human evaluation is still widely used across benchmarks and for QA/safety (Sec. 5.2)

Who Should Care

What To Try In 7 Days

Run your core task through two benchmarks: one standard (e.g., MMLU or GLUE) and one domain test.

Do a 1‑day human review of 50 model outputs to check for hallucinations and safety issues.

Run prompt robustness tests: perturb prompts and measure the Performance Drop Rate (PDR), the relative loss in task performance between the original prompt and its perturbed variant.
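The prompt robustness check can be sketched in a few lines. This is a minimal illustration, not the survey's or PromptBench's implementation: it takes PDR to be 1 minus the ratio of perturbed-prompt accuracy to clean-prompt accuracy, and `toy_predict`, `swap_adjacent_chars` and the four example reviews are hypothetical stand-ins for a real model call, a real attack and a real test set.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]              # (input text, gold label)
Predictor = Callable[[str, str], str]  # (prompt, input text) -> predicted label


def accuracy(predict: Predictor, prompt: str, examples: Sequence[Example]) -> float:
    """Fraction of examples the model labels correctly under the given prompt."""
    correct = sum(1 for text, gold in examples if predict(prompt, text) == gold)
    return correct / len(examples)


def performance_drop_rate(
    predict: Predictor,
    clean_prompt: str,
    perturbed_prompt: str,
    examples: Sequence[Example],
) -> float:
    """PDR = 1 - (accuracy with perturbed prompt) / (accuracy with clean prompt)."""
    clean = accuracy(predict, clean_prompt, examples)
    if clean == 0:
        raise ValueError("clean-prompt accuracy is zero, so PDR is undefined")
    return 1.0 - accuracy(predict, perturbed_prompt, examples) / clean


def swap_adjacent_chars(text: str, index: int) -> str:
    """Typo-style perturbation: swap the characters at index and index + 1."""
    chars = list(text)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)


def toy_predict(prompt: str, text: str) -> str:
    """Hypothetical stand-in for an LLM call. It only follows the task when
    the word 'sentiment' is spelled correctly in the prompt."""
    if "sentiment" not in prompt.lower():
        return "negative"
    return "positive" if "good" in text.lower() else "negative"


EXAMPLES = [
    ("good movie", "positive"),
    ("bad movie", "negative"),
    ("really good", "positive"),
    ("awful", "negative"),
]
CLEAN_PROMPT = "Classify the sentiment of this review:"
# Introduce a typo inside the word "sentiment" ("esntiment").
PERTURBED_PROMPT = swap_adjacent_chars(CLEAN_PROMPT, CLEAN_PROMPT.index("sentiment"))

PDR = performance_drop_rate(toy_predict, CLEAN_PROMPT, PERTURBED_PROMPT, EXAMPLES)
```

A PDR near 0 means the perturbation barely changed performance; a value near 1 means the perturbed prompt wiped out almost all of it. In practice one would average over several perturbation types and prompts rather than rely on a single typo.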

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Survey covers literature up to mid‑2023; online services evolve quickly.
  • Does not produce new benchmarks or experimental comparisons of all models.
  • Quantitative comparisons are drawn from many heterogeneous studies and datasets.

When Not To Use

  • When you need a single definitive leaderboard to pick one model for all tasks.
  • To replace direct, domain-specific testing and human review in regulated domains.

Failure Modes

  • Dataset leakage and memorization bias when benchmarks become public.
  • Judge bias when using LLMs themselves as automatic evaluators.
  • Human evaluation variance due to different annotator backgrounds.
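
One common way to quantify annotator variance is an agreement statistic. The sketch below computes Cohen's kappa for two annotators who labeled the same outputs. The survey does not prescribe this statistic; it is shown here as one reasonable option, and the label values are invented for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Agreement between two annotators, corrected for chance agreement."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)

    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if each annotator labeled at random with their own label rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two reviewers on the same four model outputs.
ANNOTATOR_A = ["ok", "ok", "bad", "bad"]
ANNOTATOR_B = ["ok", "bad", "bad", "bad"]
KAPPA = cohens_kappa(ANNOTATOR_A, ANNOTATOR_B)
```

Here the reviewers agree on 3 of 4 items, but half of that agreement is expected by chance, so kappa is 0.5. A low kappa on a pilot batch is a signal to tighten the annotation guidelines before scaling up the review.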

Core Entities

Models

  • ChatGPT
  • GPT-4
  • GPT-3.5
  • InstructGPT
  • PaLM
  • LLaMA
  • Claude
  • Vicuna
  • Bard
  • CodeGen
  • CodeT5

Metrics

  • Accuracy
  • Exact Match
  • F1
  • ROUGE
  • Expected Calibration Error (ECE)
  • Area Under Curve (AUC)
  • Attack Success Rate (ASR)
  • Performance Drop Rate (PDR)
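
Three of the metrics above are simple enough to sketch directly. The code below is an illustration under stated assumptions rather than a reference implementation: normalization is just lowercasing and whitespace collapsing, F1 is the token-overlap variant commonly used for QA, and ECE uses equal-width confidence bins that are closed on the left (benchmark scripts differ on all three choices).

```python
from collections import Counter
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (real QA scripts often also strip punctuation)."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, reference: str) -> bool:
    """True when prediction and reference are identical after normalization."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted average gap between mean confidence and accuracy per confidence bin."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("confidences and correct must be non-empty and the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(conf for conf, _ in bucket) / len(bucket)
        acc = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(acc - mean_conf)
    return ece
```

For example, `token_f1("the cat sat", "the cat")` has precision 2/3 and recall 1, so F1 is 0.8. ECE needs a per-answer confidence from the model, which some hosted APIs do not expose; in that case it has to be approximated, for instance from token log-probabilities where they are available.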

Datasets

  • GLUE
  • SuperGLUE
  • MMLU
  • MATH
  • Natural Questions
  • TriviaQA
  • USMLE
  • PromptBench
  • BIG-bench
  • FRESHQA

Benchmarks

  • HELM
  • MT-Bench
  • Chatbot Arena
  • DynaBench
  • PromptBench
  • MMLU
  • BIG-bench
  • MMBench
  • MME
  • PandaLM