Overview
This survey organizes the evaluation landscape and points to practical gaps (robustness, dynamic tests, trustworthy metrics). It is useful for planning evaluation but does not introduce new evaluation algorithms.
Citations: 195
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 4/6
Findings with evidence refs: 6/6
Results with explicit delta: 0/0
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 40%
Why It Matters For Business
Evaluation decides whether an LLM is fit for purpose: pick task‑specific tests, measure robustness and safety, and combine automated and human checks before deployment.
Who Should Care
Summary TLDR
This is a wide‑ranging survey of how researchers evaluate large language models (LLMs). It organizes evaluation around three questions: what to evaluate (tasks), where to evaluate (datasets and benchmarks), and how to evaluate (automatic, human, crowd, and adversarial protocols). The paper compiles ~46 benchmarks, catalogs metrics, and highlights where LLMs do well (generation, many NLP tasks, some QA) and where they fail (complex reasoning, robustness, some multilingual and factual tasks). It argues that evaluation itself needs to evolve (dynamic, trustworthy, behavioral tests) and maintains a continuously updated GitHub repository of resources.
Problem Statement
LLMs are widely used, but existing evaluation methods are fragmented: differing tasks, static benchmarks, and inconsistent metrics leave gaps in judging capability, robustness, safety, and societal risk. The paper asks what we should test, on which datasets, and with which protocols to get fair, useful evaluations.
Main Contribution
A structured review of LLM evaluation across three dimensions: what (tasks), where (datasets/benchmarks), and how (evaluation protocols).
A compiled catalog of popular benchmarks and datasets (Table 7) and a taxonomy of evaluation methods.
Key Findings
No single benchmark or protocol reliably ranks all LLM capabilities.
LLMs are strong at many generation and standard NLP tasks (summarization, sentiment, QA, classification).
What To Try In 7 Days
Run your core task through two benchmarks: one standard (e.g., MMLU or GLUE) and one domain test.
Do a 1‑day human review of 50 model outputs to check for hallucinations and safety issues.
Run prompt robustness tests: perturb prompts and measure Performance Drop Rate (PDR).
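The prompt robustness check in the last step can be sketched as follows. This is a minimal illustration, not the survey's own code: it assumes PDR is the relative drop in a task score when moving from the original prompt to a perturbed one, and the `predict` callable and the helper names are hypothetical stand-ins for whatever model call and metric you actually use.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical signature: predict(prompt, input_text) -> predicted label.
Predict = Callable[[str, str], str]


def accuracy(predict: Predict, prompt: str,
             examples: Sequence[Tuple[str, str]]) -> float:
    """Fraction of (input, expected) pairs the model gets right under `prompt`."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(1 for x, y in examples if predict(prompt, x) == y)
    return correct / len(examples)


def performance_drop_rate(clean_score: float, perturbed_score: float) -> float:
    """Assumed definition: PDR = 1 - perturbed_score / clean_score.

    0.0 means the perturbation had no effect; 1.0 means the score collapsed
    to zero. Negative values mean the perturbed prompt scored higher.
    """
    if clean_score <= 0:
        raise ValueError("clean_score must be positive to compute a relative drop")
    return 1.0 - perturbed_score / clean_score
```

In use, you would compute `accuracy` once with the original prompt and once with each perturbed variant (typos, paraphrases, reordered instructions), then report the PDR per perturbation type so that the most fragile prompt features stand out.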
Risks & Boundaries
Limitations
Survey covers literature up to mid‑2023; online services evolve quickly.
Does not produce new benchmarks or experimental comparisons of all models.
When Not To Use
When you need a single definitive leaderboard to pick one model for all tasks.
When you need a replacement for direct, domain-specific testing and human review in regulated domains.
Failure Modes
Dataset leakage and memorization bias when benchmarks become public.
Judge bias when using LLMs themselves as automatic evaluators.
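For the leakage failure mode above, one rough screening heuristic is to measure how many word n-grams of a benchmark item also appear in text the model may have seen. The sketch below is an assumption-laden illustration rather than a method taken from the survey: the function names, whitespace tokenization, and the default n-gram length are all hypothetical choices, and a high overlap only flags an item for manual inspection.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercase, whitespace-tokenized word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_text: str, n: int = 8) -> float:
    """Share of the item's n-grams that also occur in `corpus_text`.

    Returns 0.0 when the item is shorter than n tokens, since there is
    nothing to compare in that case.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    shared = item_grams & ngrams(corpus_text, n)
    return len(shared) / len(item_grams)
```

A ratio near 1.0 suggests the item appears almost verbatim in the reference text; exact matching of this kind will miss paraphrased or translated leaks, so it complements rather than replaces held-out or dynamically generated tests.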

