Human Evaluation Papers — Parsed & Scored for Practitioners

Llama 2: open release of 7B–70B pretrained models and RLHF‑tuned chat models competitive on human tests

0.70

0.30

0.60

2,595

Llama 2 provides openly available pretrained and RLHF‑tuned chat models that are competitive with closed models on many human-evaluated tasks, lowering the entry cost for companies that need high-quality chat AI while allowing customization and internal safety tuning.

Key finding

Llama 2 pretrained on ~2 trillion tokens; models range 7B–70B parameters.

Numbers: 2.0T tokens; sizes 7B, 13B, 34B, 70B

Use strong LLMs (e.g., GPT-4) as scalable judges for human preference with checks for bias and math errors

0.70

0.60

0.70

433

High-quality LLMs (e.g., GPT-4) can automate preference labeling at ~80–85% human agreement, drastically cutting the time and cost of human evaluations for product iterations while remaining explainable.

Key finding

GPT-4 judgments align with human experts on non-tied MT-bench votes.

Numbers: 85% agreement (MT-bench non-tie, Table 5)
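A minimal sketch of how a non-tie agreement rate like the one above could be computed. The vote labels ("A", "B", "tie"), the sample data, and the rule that a pair is dropped when either side records a tie are assumptions for illustration, not the paper's exact protocol.

```python
def non_tie_agreement(judge_votes, human_votes):
    """Share of comparisons where judge and human pick the same winner.

    Assumption: a comparison is skipped if either side voted "tie".
    """
    kept = [(j, h) for j, h in zip(judge_votes, human_votes)
            if j != "tie" and h != "tie"]
    if not kept:
        raise ValueError("no non-tied votes to compare")
    return sum(j == h for j, h in kept) / len(kept)


# Hypothetical votes for six pairwise comparisons.
judge = ["A", "B", "A", "tie", "B", "A"]
human = ["A", "B", "B", "A", "tie", "A"]
rate = non_tie_agreement(judge, human)
print(f"non-tie agreement: {rate:.2f}")
```

Reporting the number of dropped ties alongside the rate helps readers judge how much of the data the headline figure actually covers.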

ChatGPT can match commercial translators for well-resourced languages; GPT-4 and 'pivot prompting' fix many weaknesses.

0.60

0.20

0.60

313

Large LMs like ChatGPT can replace or augment translation stacks for many high-resource language needs. Using a stronger engine (GPT-4) or pivoting through a major language improves coverage for low-resource and distant pairs. This lowers integration time for prototyping and can cut reliance on commercial APIs for some

Key finding

Prompt wording matters but has only modest effect.

Numbers: Best prompt (TP3) BLEU=24.73 vs TP1=23.25 (Table 3).

A practical survey of how, where and what to test in large language models

0.70

0.40

0.60

195

Evaluation decides whether an LLM is fit for purpose: pick task‑specific tests, measure robustness and safety, and combine automated and human checks before deployment.

Key finding

No single benchmark or protocol reliably ranks all LLM capabilities.

Numbers: 46 popular benchmarks compiled (Sec.4, Table 7)

DeepSeek: scaling recipes and a 2T‑token bilingual pretraining run that yields 7B and 67B models competitive on code, math, and chat

0.70

0.60

0.70

82

The paper gives practical scaling recipes and hyperparameter fits so teams can plan compute, model size, and data investments more predictably; it shows a 67B open model can match or beat larger baselines on code/math when paired with curated bilingual data and alignment.

Key finding

Optimal batch size grows and optimal learning rate falls with compute; fitted power‑law relations give near‑optimal hyperparameters across budgets.

Numbers: near‑optimal region defined as ≤0.25% above min loss; fitted across 1e17–2e19 FLOPs
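To make the idea of a fitted power law concrete, here is a small sketch that fits y = a * x**b by least squares in log-log space and checks the "within 0.25% of minimum loss" criterion. The coefficients and data points are synthetic placeholders, not the paper's fitted values, and the function names are hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a = math.exp(my - b * mx)
    return a, b


def is_near_optimal(loss, min_loss, tol=0.0025):
    """True if loss is at most `tol` (0.25% by default) above the minimum."""
    return loss <= min_loss * (1 + tol)


# Synthetic compute budgets (FLOPs) and batch sizes generated from
# a made-up law: batch = 2.0 * C**0.3.
compute = [1e17, 1e18, 1e19]
batch = [2.0 * c ** 0.3 for c in compute]
a, b = fit_power_law(compute, batch)
print(f"a={a:.3f}, b={b:.3f}")
```

In practice the inputs would be the best batch size or learning rate found at each compute budget, and the fitted curve is then extrapolated to larger budgets.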

Instruction tuning, not model size, drives LLM zero-shot news summarization; benchmark references are often worse than generated summaries.

0.60

0.35

0.40

64

If you want usable zero-shot news summaries, use instruction-tuned LLMs rather than simply the largest models; validate or replace public benchmark references before trusting automatic metrics.

Key finding

Instruction tuning yields much stronger zero-shot summarization than model scale.

Numbers: Zero-shot Instruct Davinci faithfulness 0.99 vs GPT-3 175B faithfulness 0.76 on CNN/DM (Table 2)

A single-source survey of how we test LLMs: benchmarks, gaps, and practical directions

0.60

0.40

0.60

61

LLM evaluations show accuracy alone is insufficient: businesses must test truthfulness, bias, tool use, and robustness to avoid legal risks, bad UX, or harmful outputs.

Key finding

Public adoption exploded: ChatGPT reached 100 million users within two months of launch.

Numbers: 100M users in two months

A practical benchmark and playbook showing LLMs can speed social-science labeling and generate useful explanations, but not fully replace experts

0.60

0.50

0.60

61

LLMs can cut labeling time by producing reasonable draft labels and high-quality draft explanations; use them to scale annotation and speed exploratory analysis while keeping humans in the loop to validate and correct outputs.

Key finding

Zero-shot LLMs sometimes match or exceed human agreement on specific classification tasks.

Numbers: Misinfo F1=77.4, κ=0.55 vs human κ=0.51
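For readers who want to reproduce this kind of comparison on their own labels, the sketch below computes a chance-corrected agreement score. It assumes κ here means Cohen's kappa for two raters; the entry above does not say which kappa variant the paper uses, and the label lists are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's own label frequencies.
    expected = sum(count_a[k] * count_b[k]
                   for k in set(labels_a) | set(labels_b)) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one label")
    return (observed - expected) / (1 - expected)


# Hypothetical binary labels (1 = misinformation, 0 = not).
model = [1, 1, 0, 0, 1, 0, 1, 0]
human = [1, 1, 0, 0, 0, 1, 1, 0]
kappa = cohen_kappa(model, human)
print(f"kappa = {kappa:.2f}")
```

Computing the same statistic between two human annotators gives the baseline against which the model-to-human score can be compared.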

LACA: use GPT-3.5 to speed deductive qualitative coding while checking reliability

0.60

0.50

0.70

61

LLMs can cut the time and cost of large-scale manual coding while keeping results comparable to humans for many categories; validate on a small sample before scaling.

Key finding

GPT-3.5 often matches human agreement on many coding tasks.

Numbers: Human-model Gwet's AC1 frequently ≥0.76; examples MAGA 0.98, MEDI 0.96
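As an illustration of the statistic cited above, this sketch implements the standard two-rater form of Gwet's AC1. The codes and sample data are hypothetical, and the paper's own computation may differ (for example in how it handles more than two raters).

```python
from collections import Counter


def gwet_ac1(labels_a, labels_b, categories):
    """Gwet's AC1 for two raters over a fixed set of categories."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    q = len(categories)
    if q < 2:
        raise ValueError("AC1 needs at least two categories")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    counts = Counter(labels_a) + Counter(labels_b)
    # pi_k: average share of items placed in category k across both raters.
    pi = [counts[k] / (2 * n) for k in categories]
    expected = sum(p * (1 - p) for p in pi) / (q - 1)
    return (observed - expected) / (1 - expected)


# Hypothetical codes where one category dominates (1 = code applies).
human = [1, 1, 1, 1, 1, 1, 1, 0]
model = [1, 1, 1, 1, 1, 1, 0, 1]
ac1 = gwet_ac1(human, model, categories=[0, 1])
print(f"AC1 = {ac1:.2f}")
```

AC1 is often preferred over Cohen's kappa when one category dominates, because kappa can drop sharply under skewed label distributions even when raw agreement is high.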

ChatGPT is weak at standard supervised IE but surprisingly strong at open extraction, explains itself well, yet is overconfident

0.50

0.40

59

ChatGPT can propose high-quality extraction candidates and readable explanations without labels, but it is not a drop-in replacement for supervised IE when you need precise, well-calibrated automated extraction.

Key finding

ChatGPT underperforms supervised baselines on Standard-IE tasks.

Numbers: Standard-IE full-test Micro-F1: ChatGPT often << SOTA (e.g., EE trigger/arg 16.6/7.8 vs SOTA ~72/56).

Appending short emotional phrases to prompts measurably improves LLM outputs

0.60

0.50

0.40

57

A very low-cost prompt change (add one short emotional sentence) can raise automated and human-perceived output quality, reduce hallucinations, and improve responsibility—useful for chat assistants, content generation, and QA systems where marginal gains matter.

Key finding

EmotionPrompt raised average deterministic benchmark scores on Instruction Induction.

Numbers: 8.00% relative improvement on Instruction Induction (Table 1)

Practical review of data, training, and evaluation methods to align LLMs with human preferences

0.60

0.40

0.70

54

Aligning LLMs reduces risky outputs and increases usefulness; using parameter-efficient tuning cuts compute costs and enables faster iteration.

Key finding

Small sets of high-quality instructions can suffice to produce alignment effects.

Numbers: LLaMA needs ~8K instructions (IFS); other work reports ~6K high-quality instructions

Large blinded study: LLMs' ideas judged more novel than experts' but slightly less feasible

0.40

0.70

0.40

41

LLM ideation can quickly surface novel research directions, but outputs need human vetting for feasibility and implementation; blindly trusting LLM ideas or LLM-only evaluation risks wasted effort.

Key finding

AI-generated ideas were rated more novel than ideas written by human experts.

Numbers: Novelty: Human 4.84 vs AI 5.64 (1–10 scale); p<0.01 (Test 1)

Systematic benchmark: GPT-series and LLaMA variants vs. fine-tuned BioNLP models across 12 biomedical tasks

0.60

0.40

0.70

41

If you need high-accuracy extraction or classification in biomedical text, fine-tuned domain models remain the practical choice; use GPT-4 for reasoning or prototyping high-level QA but budget for much higher inference costs and add output validation.

Key finding

Fine-tuned, domain-specific models still outperform zero- and few-shot LLMs on most BioNLP tasks.

Numbers: Macro-average: SOTA fine-tuned 0.6536 vs. best LLM zero/few-shot ~0.51

AnnoLLM: have GPT‑3.5 explain examples, then use those explanations as few‑shot prompts to label data

0.60

0.50

0.70

34

You can cheaply scale annotation for rule-like labeling tasks by prompting LLMs with self‑generated explanations; this can cut human labeling needs for some tasks and bootstrap retrieval datasets quickly.

Key finding

AnnoLLM outperforms crowdsourced annotators on the QK task.

Numbers: 75.60% (AnnoLLM test) vs 71.5% (crowd)

Psy-LLM: fine-tuned Chinese LLM for scalable online mental-health Q&A

0.40

0.35

0.60

34

An LLM-based Chinese Q&A assistant can reduce counsellor load, provide fast triage and scale service availability cheaply, but should be used under human supervision.

Key finding

PanGu 350M produced lower perplexity than WenZhong on the evaluation data.

Numbers: Perplexity: PanGu 34.56 vs WenZhong 38.40

Practical survey of methods, attacks, and evaluations for aligning large language models

0.45

0.40

0.50

34

Misaligned LLMs can produce legal, reputational, and safety failures. Alignment methods reduce harmful outputs but need governance, red-teaming, and evaluation to manage adversarial and privacy risks.

Key finding

Reinforcement Learning from Human Feedback (RLHF) is the most common non-recursive oversight method for aligning LLMs.

CMExam: 60K+ Chinese medical multiple-choice questions with explanations and fine-grained annotations

0.50

0.45

0.40

32

CMExam gives a reliable, large-scale way to measure clinical QA performance for Chinese medical LLMs so teams can identify domain gaps and cost-effectively fine-tune small models.

Key finding

GPT-4 is the top zero-shot answer predictor on CMExam.

Numbers: 61.6% accuracy (GPT-4) vs 71.6% (human)

Human judges prefer LLM summaries; reference summaries often contain more hallucinations.

0.70

0.40

0.60

32

Zero-shot LLMs can produce higher-quality, more factual summaries than many human references and fine-tuned models, so businesses can often deploy LLM summarization directly and shift effort to dataset curation and verification.

Key finding

Human judges prefer LLM summaries over human-written and fine-tuned model summaries in pairwise comparisons.

Numbers: Human preference scores for LLMs exceed 50% across tasks (Figure 4).

LLMs (text‑davinci‑003, ChatGPT) can mimic expert human ratings on story quality and adversarial text, cheaply and reproducibly.

0.60

0.40

0.80

31

LLMs can provide fast, cheap, and reproducible quality checks during model development, reducing reliance on costly expert rounds while preserving relative system comparisons.

Key finding

A strong InstructGPT (text‑davinci‑003) correlates positively with expert teacher ratings on individual stories.

Numbers: Kendall's τ up to 0.38 (relevance) vs teachers
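A rank correlation like the one reported above can be computed as in the following sketch. It implements Kendall's tau-b (the tie-aware variant) by direct pair counting; which variant the paper used is not stated in this entry, and the ratings are made up.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length lists of numeric ratings."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominators
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + tied_x_only) * (untied + tied_y_only))
    if denom == 0:
        raise ValueError("tau-b is undefined when one list is constant")
    return (concordant - discordant) / denom


# Hypothetical 1-5 ratings of five stories.
llm_scores = [1, 2, 3, 4, 5]
teacher_scores = [1, 3, 2, 4, 5]
tau = kendall_tau_b(llm_scores, teacher_scores)
print(f"tau-b = {tau:.2f}")
```

The O(n^2) pair loop is fine for a few hundred items; for larger sets a library routine would be the more practical choice.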

Practical review: how large language models can help — and where they fall short — in language teaching and automated assessment

0.40

0.35

0.60

31

LLMs let EdTech scale content creation and interactive features quickly, but they add compute cost and require human oversight to avoid quality, bias and calibration problems.

Key finding

LLMs produce better open-ended text generation than prior small models, enabling plausible content generation for reading and chat practice.

HaluEval: 35k test cases (human + synthetic) to measure whether LLMs spot made-up facts.

0.70

0.60

0.50

29

Models can produce believable but false facts. That creates risk for customer-facing apps, search, and decision tools. HaluEval lets you measure how often your model fabricates facts and whether it can flag them.

Key finding

ChatGPT produces unverifiable or conflicting statements in a sizable fraction of real user responses.

Numbers: 977 of 5,000 annotated responses (19.5%)

LegalBench: 162 lawyer-crafted tasks to test LLM legal reasoning

0.40

0.60

0.50

28

LegalBench gives legal teams and ML practitioners a practical suite to test LLMs on many lawyer-defined tasks before deployment, exposing brittle cases, prompt sensitivity, and task-by-task risk.

Key finding

GPT-4 is the strongest model across most legal reasoning categories in this evaluation.

Numbers: Issue 82.9, Rule-recall 59.2, Conclusion 89.9, Interpretation 75.2, Rhetorical 79.4 (balanced-accuracy, Table 2)

How LLMs are reshaping healthcare: capabilities, data needs, risks, and where to start

0.40

0.30

0.70

28

LLMs now match or approach clinician-level performance on some exam-style tasks and can speed documentation, triage, and literature review—but risk, privacy, and integration costs mean businesses must plan governance and hybrid human+AI workflows.

Key finding

Top LLMs approach human performance on exam-style medical questions.

Numbers: USMLE: GPT-4 86.7%, Med‑PaLM 2 86.5%, Human 87.0%