Overview
Production Readiness
0.7
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
195
Why It Matters For Business
Evaluation decides whether an LLM is fit for purpose: pick task‑specific tests, measure robustness and safety, and combine automated and human checks before deployment.
Summary TLDR
This is a wide‑ranging survey of how researchers evaluate large language models (LLMs). It organizes evaluation around three questions: what to evaluate (tasks), where to evaluate (datasets and benchmarks), and how to evaluate (automatic, human, crowd and adversarial protocols). The paper compiles ~46 benchmarks, catalogs metrics, and highlights where LLMs do well (generation, many NLP tasks, some QA) and where they fail (complex reasoning, robustness, some multilingual and factual tasks). It argues that evaluation itself needs to evolve (dynamic, trustworthy, behavioral tests) and maintains a GitHub repository of resources.
Problem Statement
LLMs are widely used, but existing evaluation methods are fragmented: different tasks, static benchmarks, and inconsistent metrics leave gaps in judging capability, robustness, safety and societal risk. The paper asks: what should we test, on which datasets, and with which protocols, to get fair and useful evaluations?
Main Contribution
A structured review of LLM evaluation across three dimensions: what (tasks), where (datasets/benchmarks), and how (evaluation protocols).
A compiled catalog of popular benchmarks and datasets (Table 7) and a taxonomy of evaluation methods.
A synthesis of success and failure cases for LLMs across tasks and a set of grand challenges for future evaluation work.
An open, maintained repository of collected resources: https://github.com/MLGroupJLU/LLM-eval-survey.
Key Findings
No single benchmark or protocol reliably ranks all LLM capabilities.
LLMs are strong at many generation and standard NLP tasks (summarization, sentiment, QA, classification).
LLMs still struggle on complex mathematical and abstract reasoning.
LLMs are vulnerable to adversarial prompts and prompt variants.
Performance on multilingual and non‑Latin‑script tasks remains weaker than on English.
Automated metrics save time but do not replace human judgment for open generation, safety and nuanced quality.
Who Should Care
What To Try In 7 Days
Run your core task through two benchmarks: one standard (e.g., MMLU or GLUE) and one domain test.
Do a 1‑day human review of 50 model outputs to check hallucination and safety.
Run prompt robustness tests: perturb prompts and measure Performance Drop Rate (PDR).
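The prompt robustness step can be sketched in a few lines. This is a minimal illustration, assuming PDR is the relative drop in a score between a clean prompt and a perturbed one (1 minus perturbed score over clean score). The function `toy_predict`, the prompts, and the four examples are hypothetical stand-ins for a real model call and a real test set.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (input text, gold label)


def accuracy(predict: Callable[[str, str], str], prompt: str,
             examples: Sequence[Example]) -> float:
    """Fraction of examples the model labels correctly under a given prompt."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(1 for text, label in examples if predict(prompt, text) == label)
    return correct / len(examples)


def performance_drop_rate(clean_score: float, perturbed_score: float) -> float:
    """Relative drop in score caused by the perturbation (assumed definition)."""
    if clean_score == 0:
        raise ValueError("clean score is zero, so the relative drop is undefined")
    return 1.0 - perturbed_score / clean_score


def toy_predict(prompt: str, text: str) -> str:
    """Hypothetical model: it only works when the prompt spells 'sentiment' correctly."""
    if "sentiment" not in prompt:
        return "negative"
    return "positive" if "good" in text else "negative"


examples = [
    ("good service", "positive"),
    ("good value", "positive"),
    ("slow and rude", "negative"),
    ("broken on arrival", "negative"),
]

clean = accuracy(toy_predict, "Classify the sentiment: ", examples)
perturbed = accuracy(toy_predict, "Classify the sentimnet: ", examples)  # typo attack
pdr = performance_drop_rate(clean, perturbed)
print(f"clean={clean:.2f} perturbed={perturbed:.2f} PDR={pdr:.2f}")
```

In practice `toy_predict` would be replaced by a call to the model under test, and the perturbed prompts would come from several variants (typos, paraphrases, reordered instructions) so that the average and worst-case PDR can both be reported.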
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Survey covers literature up to mid‑2023; online services evolve quickly.
- Does not produce new benchmarks or experimental comparisons of all models.
- Quantitative comparisons are drawn from many heterogeneous studies and datasets.
When Not To Use
- When you need a single definitive leaderboard to pick one model for all tasks.
- To replace direct, domain-specific testing and human review in regulated domains.
Failure Modes
- Dataset leakage and memorization bias when benchmarks become public.
- Judge bias when using LLMs themselves as automatic evaluators.
- Human evaluation variance due to different annotator backgrounds.
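One common way to quantify human evaluation variance is an inter-annotator agreement statistic. The sketch below uses Cohen's kappa, a standard measure that this summary does not itself name, so treat it as one possible choice and not the survey's prescribed method. The label values and the two annotator lists are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Agreement between two annotators, corrected for chance agreement."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    n = len(labels_a)
    if n == 0:
        raise ValueError("no labels given")

    # Observed agreement: share of items where both annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if each annotator labeled at random with their own label rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of four model outputs by two reviewers.
annotator_1 = ["ok", "ok", "bad", "bad"]
annotator_2 = ["ok", "bad", "bad", "bad"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa={kappa:.2f}")
```

A low kappa on a small pilot batch is a signal to tighten the annotation guidelines before scaling up the human review.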
Core Entities
Models
- ChatGPT
- GPT-4
- GPT-3.5
- InstructGPT
- PaLM
- LLaMA
- Claude
- Vicuna
- Bard
- CodeGen
- CodeT5
Metrics
- Accuracy
- Exact Match
- F1
- ROUGE
- Expected Calibration Error (ECE)
- Area Under Curve (AUC)
- Attack Success Rate (ASR)
- Performance Drop Rate (PDR)
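Of the metrics listed, Expected Calibration Error is the least self-explanatory, so a small worked sketch may help. It assumes the common equal-width binning formulation: group predictions by confidence, then take the weighted average gap between accuracy and mean confidence in each bin. The bin count of 10 and the sample values are illustrative assumptions, not figures from the survey.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Weighted mean of |accuracy - mean confidence| over equal-width confidence bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("no predictions given")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is clamped into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        acc = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(acc - mean_conf)
    return ece


# Hypothetical model that is 90% confident on four answers but only gets three right.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
print(f"ECE={ece:.2f}")
```

A value near zero means stated confidence tracks actual accuracy; here the model is overconfident by 15 percentage points.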
Datasets
- GLUE
- SuperGLUE
- MMLU
- MATH
- Natural Questions
- TriviaQA
- USMLE
- PromptBench
- BIG-bench
- FRESHQA
Benchmarks
- HELM
- MT-Bench
- Chatbot Arena
- DynaBench
- PromptBench
- MMLU
- BIG-bench
- MMBench
- MME
- PandaLM

