A practical survey of how, where and what to test in large language models

Overview

Decision Snapshot: Ready For Pilot

This survey organizes the evaluation landscape and points to practical gaps (robustness, dynamic tests, trustworthy metrics). It is useful for planning evaluation but does not introduce new evaluation algorithms.

Citations: 195

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/0

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 40%

Authors

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, Xing Xie

Why It Matters For Business

Evaluation decides whether an LLM is fit for purpose: pick task‑specific tests, measure robustness and safety, and combine automated and human checks before deployment.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist

Summary TLDR

This is a wide‑ranging survey of how researchers evaluate large language models (LLMs). It groups evaluations into three questions: what to evaluate (tasks), where to evaluate (datasets and benchmarks), and how to evaluate (automatic, human, crowd and adversarial protocols). The paper compiles 46 popular benchmarks, catalogs metrics, and highlights both where LLMs do well (generation, many NLP tasks, some QA) and where they fail (complex reasoning, robustness, some multilingual and factual tasks). It argues that evaluation itself needs to evolve (dynamic, trustworthy, behavioral tests) and maintains a GitHub repository of resources.

Problem Statement

LLMs are widely used, but existing evaluation methods are fragmented: different tasks, static benchmarks, and inconsistent metrics leave gaps in judging capability, robustness, safety and societal risk. The paper asks: what should we test, on which datasets, and with which protocols, to get fair and useful evaluations?

Main Contribution

A structured review of LLM evaluation across three dimensions: what (tasks), where (datasets/benchmarks), and how (evaluation protocols).

A compiled catalog of popular benchmarks and datasets (Table 7) and a taxonomy of evaluation methods.

Key Findings

No single benchmark or protocol reliably ranks all LLM capabilities.

Numbers: 46 popular benchmarks compiled (Sec. 4, Table 7)

Practical Use: Pick task‑specific benchmarks and multiple protocols rather than trusting one leaderboard.

Evidence Ref: Sec. 4, Table 7

LLMs are strong at many generation and standard NLP tasks (summarization, sentiment, QA, classification).

Numbers: ChatGPT often >2% higher than GPT-3 on several QA sets (Sec. 3.1.3)

Practical Use: Use modern LLMs for prototype systems in text generation and QA, but validate with domain tests and human review.

Evidence Ref: Sec. 3.1.3

What To Try In 7 Days

Run your core task through two benchmarks: one standard (e.g., MMLU or GLUE) and one domain test.

Do a 1‑day human review of 50 model outputs to check hallucination and safety.

Run prompt robustness tests: perturb prompts and measure Performance Drop Rate (PDR).
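
The prompt robustness step can be sketched in a few lines. This is a minimal illustration, assuming PDR is the relative drop in total task score when the original prompt is swapped for a perturbed one; the function name `performance_drop_rate` and the sample scores are hypothetical, not taken from the survey.

```python
def performance_drop_rate(original_scores, perturbed_scores):
    """Relative drop in total score after perturbing the prompt.

    Both arguments are per-example scores (e.g. 1 for correct, 0 for
    wrong) on the same examples, in the same order.
    """
    if len(original_scores) != len(perturbed_scores):
        raise ValueError("score lists must cover the same examples")
    baseline = sum(original_scores)
    if baseline == 0:
        raise ValueError("baseline score is zero; PDR is undefined")
    return 1.0 - sum(perturbed_scores) / baseline


# Hypothetical per-example correctness for one prompt and a perturbed variant.
original = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]   # 8 of 10 correct
perturbed = [1, 0, 1, 0, 1, 1, 0, 0, 1, 1]  # 6 of 10 correct
pdr = performance_drop_rate(original, perturbed)
print(f"PDR = {pdr:.2f}")
```

A PDR near 0 suggests the prompt change barely matters; a value near 1 means the perturbation wiped out most of the original performance. Averaging over several perturbations gives a steadier picture than any single variant.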

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/MLGroupJLU/LLM-eval-survey

Risks & Boundaries

Limitations

Survey covers literature up to mid‑2023; online services evolve quickly.

Does not produce new benchmarks or experimental comparisons of all models.

When Not To Use

When you need a single definitive leaderboard to pick one model for all tasks.

When you want to replace direct, domain-specific testing and human review in regulated domains.

Failure Modes

Dataset leakage and memorization bias when benchmarks become public.

Judge bias when using LLMs themselves as automatic evaluators.
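
One rough way to screen for the dataset leakage failure mode is to look for long word n-grams that a benchmark item shares with a reference corpus. The sketch below is an illustration under stated assumptions, not a method from the survey: the helper names, the 8-word window, and the toy strings are all hypothetical.

```python
def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(benchmark_item, corpus_text, n=8):
    """Fraction of the item's n-grams that also occur in the corpus text."""
    item_grams = word_ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    shared = item_grams & word_ngrams(corpus_text, n)
    return len(shared) / len(item_grams)


# Toy strings standing in for a benchmark question and two crawled documents.
item = "what is the capital of france and when was it founded as a city"
leaky_doc = ("quiz dump: what is the capital of france and when was it "
             "founded as a city answer paris")
clean_doc = ("the weather in lyon was mild for most of the spring season "
             "this year overall")

leaked_score = overlap_fraction(item, leaky_doc)
clean_score = overlap_fraction(item, clean_doc)
print(leaked_score, clean_score)
```

A high overlap fraction only flags an item for manual inspection; paraphrased or translated leaks would slip past an exact n-gram match like this one.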

Core Entities

Models

ChatGPT, GPT-4, GPT-3.5, InstructGPT, PaLM, LLaMA, Claude, Vicuna, Bard, CodeGen, CodeT5

Metrics

Accuracy, Exact Match, F1, ROUGE, Expected Calibration Error (ECE), Area Under Curve (AUC), Attack Success Rate (ASR), Performance Drop Rate (PDR)
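
Of the metrics listed, Expected Calibration Error is the least self-explanatory, so a small worked sketch may help. It assumes the common binned formulation (equal-width confidence bins, with each bin's gap between mean confidence and accuracy weighted by bin size); the function name, the half-open bin edges, and the sample data are illustrative choices rather than anything specified in the survey.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted mean |confidence - accuracy| over equal-width bins.

    `confidences` are predicted probabilities in [0, 1]; `correct` holds
    1 (right) or 0 (wrong) for the same predictions, in the same order.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need two non-empty lists of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Bins are half-open [lo, hi); a confidence of 1.0 joins the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


# Hypothetical model outputs: four confident answers, two hesitant ones.
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0, 1, 0]
ece = expected_calibration_error(confidences, correct)
print(f"ECE = {ece:.3f}")
```

In this toy data the confident bin claims 95% but is right 75% of the time, which accounts for most of the error; an ECE of 0 would mean stated confidence matches observed accuracy in every bin.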

Datasets

GLUE, SuperGLUE, MMLU, MATH, Natural Questions, TriviaQA, USMLE, PromptBench, BIG-bench, FRESHQA

Benchmarks

HELM, MT-Bench, Chatbot Arena, DynaBench, PromptBench, MMLU, BIG-bench, MMBench, MME, PandaLM
