LLM-as-a-Judge Papers — Parsed & Scored for Practitioners

Use strong LLMs (e.g., GPT-4) as scalable judges for human preference with checks for bias and math errors

0.70

0.60

0.70

433

High-quality LLMs (e.g., GPT-4) can automate preference labeling at ~80–85% human agreement, drastically cutting the time and cost of human evaluations for product iterations while remaining explainable.

Key finding

GPT-4 judgments align with human experts on non-tied MT-bench votes.

Numbers: 85% agreement (MT-bench non-tie, Table 5)
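One way to read the agreement figure above is as the share of non-tied comparisons where the LLM judge and the human pick the same winner. A minimal sketch, assuming votes are encoded as "A", "B", or "tie" and that a comparison is dropped when either side votes "tie"; the encoding, the tie-handling rule, and the function name are illustrative assumptions, not taken from the paper:

```python
def non_tie_agreement(judge_votes, human_votes):
    """Share of comparisons where judge and human chose the same winner.

    Assumption: a comparison is skipped if either voter said "tie".
    """
    pairs = [
        (j, h)
        for j, h in zip(judge_votes, human_votes)
        if j != "tie" and h != "tie"
    ]
    if not pairs:
        raise ValueError("no non-tied comparisons to score")
    return sum(j == h for j, h in pairs) / len(pairs)


# Toy data: three comparisons survive the tie filter, two of them agree.
judge = ["A", "B", "tie", "A", "B"]
human = ["A", "A", "B", "A", "tie"]
print(non_tie_agreement(judge, human))
```

Reporting the number of comparisons that survive the tie filter alongside the rate makes it easier to judge how much weight the percentage can bear.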

Ragas: reference-free checks for RAG faithfulness, relevance, and context focus

0.60

0.50

65

Provides fast, automated checks to catch ungrounded answers and noisy retrieval, reducing time spent on manual labeling and lowering hallucination risk in RAG deployments.

Key finding

Ragas matches human judgements on faithfulness with very high accuracy.

Numbers: Faithfulness accuracy 0.95 on WikiEval (Table 1)

A practical benchmark and playbook showing LLMs can speed social-science labeling and generate useful explanations, but not fully replace experts

0.60

0.50

0.60

61

LLMs can cut labeling time by producing reasonable draft labels and high-quality draft explanations; use them to scale annotation and speed exploratory analysis while keeping humans in the loop to validate and correct outputs.

Key finding

Zero-shot LLMs sometimes match or exceed human agreement on specific classification tasks.

Numbers: Misinfo F1=77.4, κ=0.55 vs human κ=0.51
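The κ values above are agreement scores corrected for chance. A minimal sketch of Cohen's kappa for two label sequences, assuming the standard two-rater definition; the paper's exact aggregation across annotators is not shown here, and the function name is illustrative:

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both raters labeled independently
    # according to their own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one label")
    return (observed - expected) / (1 - expected)


llm_labels = [1, 1, 0, 0]
human_labels = [1, 0, 0, 0]
print(cohens_kappa(llm_labels, human_labels))
```

Comparing LLM-vs-human kappa against human-vs-human kappa on the same items, as the numbers above do, gives a baseline for how much agreement the task itself allows.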

Instruction finetuning small open LLMs (Alpaca, FLAN-T5) boosts mental-health prediction to match or beat much larger models

0.25

0.60

0.55

59

Finetuning small open LLMs on a variety of labeled mental-health texts can yield classifiers that match or beat much larger models, reducing inference cost and vendor dependence while preserving multi-task flexibility.

Key finding

Instruction finetuning markedly improves performance over prompting.

Numbers: Alpaca finetuned: +23.4% balanced accuracy vs Alpaca zero-shot

ChatGPT can judge summary factuality zero‑shot but shows lexical bias, false reasoning, and prompt sensitivity

0.60

0.40

0.70

50

ChatGPT offers a ready-to-use, zero-shot factuality evaluator that can reduce annotation and training costs and often aligns better with human judgments, but it needs calibration for paraphrase-heavy or domain-specific text.

Key finding

ChatGPT (zero-shot + CoT) often matches or beats prior factuality metrics on multiple benchmarks.

Numbers: CoGenSum BA 74.3% vs SummaC ZS 70.4%; SummEval 83.3% vs 78.7% (Table 2)

Llama Guard — an adaptable LLM filter that flags unsafe user prompts and AI responses

0.70

0.35

0.45

44

Llama Guard is a deployable, customizable safety filter that runs locally, adapts to new policies via prompts or light fine-tuning, and matches or beats common moderation APIs on public and internal tests.

Key finding

High in-policy classification performance on internal test set.

Numbers: AUPRC prompt=0.945; response=0.953 (Table 2)

Large blinded study: LLMs' ideas judged more novel than experts' ideas but slightly less feasible

0.40

0.70

0.40

41

LLM ideation can quickly surface novel research directions, but outputs need human vetting for feasibility and implementation; blindly trusting LLM ideas or LLM-only evaluation risks wasted effort.

Key finding

AI-generated ideas were rated more novel than ideas from human experts.

Numbers: Novelty: Human 4.84 vs AI 5.64 (1–10 scale); p<0.01 (Test 1)

AnnoLLM: have GPT‑3.5 explain examples, then use those explanations as few‑shot prompts to label data

0.60

0.50

0.70

34

You can cheaply scale annotation for rule-like labeling tasks by prompting LLMs with self‑generated explanations; this can cut human labeling needs for some tasks and bootstrap retrieval datasets quickly.

Key finding

AnnoLLM outperforms crowdsourced annotators on the QK task.

Numbers: 75.60% (AnnoLLM test) vs 71.5% (crowd)
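The explain-then-annotate idea in this entry can be sketched as plain prompt assembly: each demonstration carries an LLM-written explanation, and the new item is left open for the model to explain and label. This is an illustrative sketch only; the field names, the prompt layout, and the function name are assumptions and not the paper's actual template:

```python
def build_fewshot_prompt(task_instruction, demos, new_input):
    """Assemble a few-shot prompt whose demos include explanations.

    demos: list of dicts with hypothetical keys "input", "explanation",
    and "label". The layout below is an assumed format.
    """
    parts = [task_instruction.strip(), ""]
    for demo in demos:
        parts.append(f"Input: {demo['input']}")
        parts.append(f"Explanation: {demo['explanation']}")
        parts.append(f"Label: {demo['label']}")
        parts.append("")
    # Leave the final item open so the model writes its own
    # explanation before committing to a label.
    parts.append(f"Input: {new_input}")
    parts.append("Explanation:")
    return "\n".join(parts)


demos = [
    {
        "input": "query: running shoes | keyword: trail sneakers",
        "explanation": "Both refer to athletic footwear for running.",
        "label": "relevant",
    },
]
prompt = build_fewshot_prompt(
    "Decide whether the keyword is relevant to the query.",
    demos,
    "query: laptop bag | keyword: garden hose",
)
print(prompt)
```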

Detect hallucinated facts from any black‑box LLM by sampling its own alternative outputs

0.60

0.70

0.45

33

You can flag likely false claims from closed-source LLMs without buying or building knowledge bases; this reduces misinformation risk in customer-facing text generation.

Key finding

Prompt-based SelfCheckGPT achieved the strongest results at both sentence and passage levels.

Numbers: Sentence AUC-PR (NonFact)=93.42; Passage Pearson=78.32 (Table 2)
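The sampling idea in this entry is that a claim the model repeats across independently sampled outputs is more likely to be grounded than one that appears only once. The sketch below is a toy illustration that uses word overlap as a stand-in for the consistency check; it is not the prompt-based variant reported above, and the function names and scoring rule are assumptions:

```python
import re


def _tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def inconsistency_scores(sentences, samples):
    """Score each sentence in [0, 1]; higher means less supported.

    Toy proxy: 1 minus the mean fraction of the sentence's words
    that appear in each sampled alternative output.
    """
    if not samples:
        raise ValueError("need at least one sampled output")
    sample_tokens = [_tokens(s) for s in samples]
    scores = []
    for sentence in sentences:
        words = _tokens(sentence)
        if not words:
            scores.append(0.0)  # nothing to check in an empty sentence
            continue
        support = sum(len(words & st) / len(words) for st in sample_tokens)
        scores.append(1.0 - support / len(sample_tokens))
    return scores


sentences = ["Paris is the capital of France.", "It has 90 million residents."]
samples = [
    "Paris is the capital of France and has about 2 million residents.",
    "The capital of France is Paris.",
]
print(inconsistency_scores(sentences, samples))
```

Word overlap will miss paraphrases and negations, so a real deployment would swap in a stronger consistency check, such as asking an LLM whether each sample supports the sentence.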

LLMs (text‑davinci‑003, ChatGPT) can mimic expert human ratings on story quality and adversarial text, cheaply and reproducibly.

0.60

0.40

0.80

31

LLMs can provide fast, cheap, and reproducible quality checks during model development, reducing reliance on costly expert rounds while preserving relative system comparisons.

Key finding

Ratings from a strong InstructGPT model (text‑davinci‑003) correlate positively with expert teacher ratings on individual stories.

Numbers: Kendall's τ up to 0.38 (relevance) vs teachers

Practical review: how large language models can help — and where they fall short — in language teaching and automated assessment

0.40

0.35

0.60

31

LLMs let EdTech scale content creation and interactive features quickly, but they add compute cost and require human oversight to avoid quality, bias and calibration problems.

Key finding

LLMs produce better open-ended text generation than prior small models, enabling plausible content generation for reading and chat practice.

LLM graders favor answers based on their position; simple calibration and a little human help fix it

0.60

0.45

0.60

29

If you auto-grade or compare models with LLMs, order effects can flip results and mislead decisions; applying Multiple Evidence Calibration (MEC) and Balanced Position Calibration (BPC), plus targeted human checks, improves reliability and cuts annotation cost.

Key finding

LLM evaluators frequently conflict when candidate order is swapped.

Numbers: GPT-4 conflict rate 46.3% (Vicuna vs ChatGPT); ChatGPT 82.5% (Table 2)
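A conflict rate like the one above can be estimated by judging every pair twice, once in each order, and counting how often the verdict changes. A minimal sketch, assuming a hypothetical `judge(first, second)` callable that returns "first", "second", or "tie"; the paper's exact conflict definition may differ, and the two toy judges exist only to make the example runnable:

```python
def conflict_rate(judge, pairs):
    """Fraction of pairs whose verdict changes when order is swapped."""
    if not pairs:
        raise ValueError("need at least one pair")
    # Map a verdict on (b, a) back into the orientation of (a, b).
    flip = {"first": "second", "second": "first", "tie": "tie"}
    conflicts = 0
    for a, b in pairs:
        original = judge(a, b)
        swapped = flip[judge(b, a)]
        if original != swapped:
            conflicts += 1
    return conflicts / len(pairs)


def always_first(first, second):
    """Toy judge with maximal position bias."""
    return "first"


def longer_wins(first, second):
    """Toy judge that ignores position entirely."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


pairs = [("short", "a longer answer"), ("same", "size"), ("detailed reply", "ok")]
print(conflict_rate(always_first, pairs))
print(conflict_rate(longer_wins, pairs))
```

Pairs that conflict are natural candidates for the targeted human checks this entry recommends.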

Train a lightweight judge-model (PandaLM) to pick better hyperparameters for instruction-tuned LLMs, reducing human/API cost while matching/

0.60

0.55

0.70

28

PandaLM reduces the cost and privacy risk of hyperparameter tuning by replacing paid API or large-scale human evaluation with a runnable judge model that selects better tuning settings.

Key finding

PandaLM-70B matches or slightly exceeds GPT-4 on a human-aligned test set.

Numbers: PandaLM-70B accuracy 0.6687 vs GPT-4 0.6647 (Table 2)

Do LLMs write like a personality? GPT-3.5 and GPT-4 can be prompted to express Big Five traits and people often recognize them

0.40

0.60

0.30

24

Prompted LLMs can reliably take on personality-like profiles and produce believable narratives; this matters for products that personalize voice, human simulation, or content moderation and suggests disclosure policies are needed.

Key finding

LLM personas' self-reported BFI scores match their prompted traits with very large effects.

Numbers: GPT-4 Cohen's d: EXT 5.47; AGR 4.22; CON 4.39; NEU 5.17; OPN 6.30 (p<.001)

COBBLER shows many LLMs are biased evaluators and disagree with humans

0.30

0.45

0.30

24

Using LLMs as automatic scorers risks amplifying biases and diverging from human judgments, which can corrupt leaderboards, model selection, or downstream data labeling.

Key finding

LLMs show biased evaluation choices in a large fraction of comparisons.

Numbers: ≈40% of comparisons across models were labeled biased

ChatGPT/GPT‑4 can directly rank search passages with simple prompts; distilled small models inherit that power.

0.65

0.55

0.60

23

LLMs can directly re-rank search results zero‑shot and produce supervisory labels to train small, cheaper re‑rankers; this can cut inference cost and maintenance versus training large supervised re‑rankers on noisy labels.

Key finding

GPT‑4 outperforms strong supervised re‑rankers on standard benchmarks when using permutation prompts.

Numbers: nDCG@10: GPT‑4 53.68 vs monoT5 (3B) 51.36 on BEIR (avg), delta +2.32
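The nDCG@10 metric quoted above rewards placing highly relevant passages near the top of the ranking. A minimal sketch, assuming linear gain (some evaluation toolkits use 2^rel - 1 instead) and assuming every judged passage appears in the ranked list, so the ideal ordering can be derived from the list itself; function names are illustrative:

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k graded relevances."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so position 1 -> log2(2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the given ranking divided by DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant passages at all
    return dcg_at_k(ranked_relevances, k) / ideal


# Graded relevance of passages in the order a re-ranker returned them.
print(ndcg_at_k([3, 2, 1]))
print(ndcg_at_k([0, 1]))
```

Benchmark tables usually report this value multiplied by 100 and averaged over queries and datasets.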

SuperCLUE: open + closed Chinese tests plus a user arena to predict what real users prefer

0.70

0.50

0.60

22

Closed multiple-choice scores do not guarantee user satisfaction; combine closed tests with multi-turn open evaluations and use an LLM judge like GPT-4 to estimate real-user preference faster and cheaper.

Key finding

CArena contains 9.9k real user votes used as the gold standard for user preference.

Numbers: 9.9k votes (Section 3, CArena)

A practical survey of using LLMs as automated evaluators, covering methods, apps, benchmarks, and risks

0.70

0.40

0.80

21

LLM judges let teams scale evaluation and feedback in minutes, reduce human labeling cost, and produce human-readable explanations that speed iteration.

Key finding

LLMs can match or exceed crowd annotators on some annotation tasks.

Numbers: GPT-4 83.6% vs MTurk 81.5% (annotation accuracy)

BigToM: a 5,000-item, model‑written benchmark that tests Theory-of‑Mind with causal templates

0.60

0.65

0.50

20

If you deploy LLMs to reason about human intentions, use controlled ToM checks: GPT‑4 often matches human patterns but is unreliable on harder inferences, and other models usually perform worse.

Key finding

Model-written benchmark (BigToM) is large and well-rated by humans.

Numbers: 5,000 items; expert structure-agreement 93.94%; expert mean quality ≈4.34/5

Practical survey of LLM evaluation metrics, statistical meaning, and biomedical examples

0.60

0.30

0.50

19

Choosing the right metrics avoids misleading conclusions about model quality and reduces costly deployment mistakes; adding uncertainty and bias checks makes model comparisons actionable and safer.

Key finding

Most LLM papers rely heavily on Multiple-Classification (MC) metrics like accuracy, precision, recall and F1.

ChatGPT can score generated text without references — explicit numeric scores work best; pairwise comparisons often underperform.

0.60

0.40

0.50

19

You can use ChatGPT to score generated text without references and get evaluations closer to human judgments than many automatic metrics, which speeds up model iteration and reduces reliance on hand-built references.

Key finding

ChatGPT's Explicit Score aligns with human judgments better than many automatic metrics on multiple tasks.

Numbers: SummEval (coherence) Spearman: ChatGPT (greedy) 52.2 vs BARTScore 33.4 (Table 1).

Fine-tuned open-source LLMs can act as fast, accurate judges for other LLMs

0.70

0.40

0.60

18

JudgeLM lets teams run fast, reproducible, and local automatic evaluations instead of slow human/API judging; this lowers cost and speeds model iteration while keeping judgments consistent.

Key finding

Large fine-tuned JudgeLM reaches near-GPT-4 agreement on the authors' benchmark.

Numbers: Agreement 90.06% (JudgeLM-33B, 100K finetune)

LiveBench: a live, ground-truth-scored benchmark that resists test-set contamination

0.70

0.60

0.70

18

A frequently-updated, ground-truth-scored benchmark prevents inflated claims from contaminated test data and shows real capability gaps—use it to validate model improvements and guard against overfitting to public test sets.

Key finding

Even the top models score well below saturation on LiveBench.

Numbers: Top LiveBench score 64.7% (o1-preview-2024-09-12).

GAOKAO-Bench: using China’s college exam (2010–2022) to test LLMs on real exam questions

0.60

0.50

0.60

18

GAOKAO-Bench exposes realistic task gaps: LLMs are good at knowledge and language tasks but weaker at multi-step math and physics. Use this to choose models, design human-in-the-loop checks, and pilot automated grading.

Key finding

GPT-4 attains strong exam performance but below full marks.

Numbers: Converted totals: sciences 434, humanities 480 (GPT-4-0613).