LLMs (text-davinci-003, ChatGPT) can mimic expert human ratings on story quality and adversarial text, cheaply and reproducibly.

Overview

Decision Snapshot: Needs Validation

LLM evaluation is ready as a development-time tool (reproducible, cheap) but not a full replacement for humans in deployment or for factual/emotional tasks.

Citations: 31

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 13

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 40%

Authors

Cheng-Han Chiang, Hung-yi Lee

Why It Matters For Business

LLMs can provide fast, cheap, and reproducible quality checks during model development, reducing reliance on costly expert rounds while preserving relative system comparisons.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

The paper defines "LLM evaluation": give an LLM the exact human evaluation prompt, the sample, and the rating question, then parse the LLM's output as a score. On two tasks (open-ended story scoring and adversarial-text quality), the best LLMs (text-davinci-003 and ChatGPT) rank outputs similarly to expert English teachers, are stable under small changes to the instructions and sampling settings, and are far cheaper and faster than hiring experts. The authors note clear limits: LLMs can be biased by safety tuning, lack reliable factual knowledge or emotions, and may refuse or alter judgments on policy-sensitive content. They recommend using LLM evaluation as a fast, reproducible development tool, not a full human replacement.

Problem Statement

Human evaluation is necessary but costly, slow, inconsistent and hard to reproduce. The paper asks: can strong LLMs act as reliable, cheaper substitutes for human judges when scoring text quality?

Main Contribution

Define LLM evaluation: feed LLMs the same instruction, sample, and rating question used in human studies and parse their free-text replies into Likert scores.

Show that strong LLMs (text-davinci-003, ChatGPT) correlate with expert teachers on story quality and adversarial-sample quality and reproduce relative rankings.
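
The procedure defined above (same instruction, sample, and question, then parse the reply) can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `build_prompt` and `parse_likert`, the example strings, and the regex rule are all assumptions, and the summary does not state which parsing rule the paper used. The model call itself is left out.

```python
import re
from typing import Optional

# Assumed rule: take a standalone digit 1-5 that is not part of a longer
# number such as "3.5" or "10". The paper's own parsing rule is not given here.
_SCORE = re.compile(r"(?<!\d)(?<!\d\.)([1-5])(?!\.?\d)")


def build_prompt(instruction: str, sample: str, question: str) -> str:
    """Join the rater instruction, the sample, and the rating question."""
    return "\n\n".join(part.strip() for part in (instruction, sample, question))


def parse_likert(reply: str) -> Optional[int]:
    """Return the first standalone 1-5 digit in the reply, or None."""
    match = _SCORE.search(reply)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    prompt = build_prompt(
        "Please rate the story fragment.",             # hypothetical instruction
        "Once upon a time, a robot learned to bake.",  # hypothetical sample
        "How grammatically correct is the text? (1 = lowest, 5 = highest)",
    )
    print(prompt)
    print(parse_likert("I would rate the grammaticality a 4 out of 5."))
```

Returning `None` rather than a default score keeps refusals and unparseable replies visible, so they can be counted separately instead of silently skewing the mean.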

Key Findings

A strong InstructGPT (text-davinci-003) correlates positively with expert teacher ratings on individual stories.

Numbers: Kendall's τ up to 0.38 (relevance) vs. teachers

Practical Use: Use text-davinci-003 to rank items similarly to experts, especially for relevance checks.

Evidence Ref: Table 2; §3.3.1
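
For readers who want to run the same kind of check on their own ratings, here is a minimal pure-Python sketch of Kendall's τ. It assumes the tie-corrected τ-b variant, which suits Likert scores because they tie often; the summary does not say which variant the paper reports. The function name is hypothetical. `scipy.stats.kendalltau` computes τ-b by default and is the more practical choice when SciPy is available.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length lists of ratings."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue          # tied in both lists: excluded from both factors
        elif dx == 0:
            ties_x += 1       # tied only in x
        elif dy == 0:
            ties_y += 1       # tied only in y
        elif dx * dy > 0:
            concordant += 1   # both raters order the pair the same way
        else:
            discordant += 1   # the raters disagree on the order
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


if __name__ == "__main__":
    llm_scores = [4, 3, 5, 2, 4]      # hypothetical LLM ratings
    teacher_scores = [5, 3, 4, 2, 4]  # hypothetical teacher ratings
    print(kendall_tau_b(llm_scores, teacher_scores))
```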

text-davinci-003 and ChatGPT prefer human-written stories over GPT-2 outputs, matching expert teachers.

Numbers: Preference differences statistically significant by Welch's t-test (p < 0.05)

Practical Use: LLM evaluation with these LLMs can replace low-quality crowd labels for relative system comparisons.

Evidence Ref: Table 1; §3.3
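
Welch's t-test compares two group means without assuming equal variances. The sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. The ratings are made-up placeholders, not data from the paper, and the function name is hypothetical. Turning the statistic into a p-value needs the t distribution; `scipy.stats.ttest_ind(a, b, equal_var=False)` does that in one call.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Per-group squared standard error, using the sample variance (n - 1).
    se_a = variance(a) / len(a)
    se_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, df


if __name__ == "__main__":
    human_written = [4, 5, 4, 5]  # hypothetical Likert ratings
    gpt2_written = [2, 3, 2, 3]   # hypothetical Likert ratings
    print(welch_t(human_written, gpt2_written))
```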

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Kendall's τ between text-davinci-003 and teachers	τ = 0.14–0.38 across attributes; strongest on relevance	—	—	WritingPrompts (200 stories)	Table 2; §3.3.1	Table 2
LLM vs human mean Likert (adversarial fluency)	ChatGPT: benign 4.32, TextFooler 2.12, PWWS 2.42, BAE 3.71	Human: benign 4.55, TextFooler 2.17, PWWS 2.16, BAE 3.01	ChatGPT minus human: benign -0.23, TextFooler -0.05, PWWS +0.26, BAE +0.70	AG-News adversarial samples (100 per attack)	Table 4; §4.3	Table 4
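
The gap between ChatGPT and human means can be recomputed directly from the fluency row above. The sketch uses only the numbers in that row; the variable names are illustrative.

```python
# Mean fluency ratings copied from the results row above.
chatgpt = {"benign": 4.32, "TextFooler": 2.12, "PWWS": 2.42, "BAE": 3.71}
human = {"benign": 4.55, "TextFooler": 2.17, "PWWS": 2.16, "BAE": 3.01}

# Signed gap (ChatGPT minus human), rounded to two decimals.
delta = {name: round(chatgpt[name] - human[name], 2) for name in chatgpt}

for name, gap in delta.items():
    print(f"{name}: {gap:+.2f}")
```

On these numbers the gap is largest for BAE and close to zero for TextFooler, so the size of the offset depends on the attack.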

What To Try In 7 Days

Run LLM evaluation (same prompt used for humans) with text-davinci-003 on a held-out set to get a development-time ranking.

Sanity-check LLM scores with a small panel of experts on a few examples before trusting absolute values.

Fix one prompt and seed; document model and sampling parameters to ensure reproducibility across runs.
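
One way to document a run, sketched under assumptions: a frozen record of the model name, prompt, and sampling parameters, plus a short hash to attach to every score file. The class name, field names, and example values are hypothetical, and whether a hosted model actually honors a seed depends on the provider.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalRunConfig:
    """Everything needed to repeat an LLM evaluation run (assumed fields)."""
    model: str
    prompt: str
    temperature: float
    top_p: float
    seed: int
    n_samples: int

    def fingerprint(self) -> str:
        """Short stable hash of the config, for tagging result files."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


if __name__ == "__main__":
    config = EvalRunConfig(
        model="text-davinci-003",
        prompt="Please rate the story fragment.",  # hypothetical prompt
        temperature=1.0,                           # hypothetical settings
        top_p=0.9,
        seed=0,
        n_samples=3,
    )
    print(json.dumps(asdict(config), indent=2))
    print(config.fingerprint())
```

Two runs with the same fingerprint used the same prompt and settings, so any change in scores between them comes from the model or its sampling rather than from the setup.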

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://www.kaggle.com/datasets/ratthachat/writingprompts

Risks & Boundaries

Limitations

Not reliable for tasks requiring factual verification: LLMs can hallucinate or hold incorrect facts (§5).

Safety and sentiment bias: safety-tuned models (ChatGPT) can systematically downscore content they label harmful or profane (§3.3, §5).

When Not To Use

Tasks requiring reliable factual or world-knowledge checks (fact-checking).

Evaluations that depend on human emotions, lived experience, or preferences.

Failure Modes

Systematic rating bias (e.g., some LLMs consistently rate higher or lower than humans).

Prompt sensitivity causing absolute-score shifts if prompt wording changes.

Core Entities

Models

text-davinci-003, text-curie-001, T0 (T0pp), ChatGPT, GPT-2 (fine-tuned medium)

Metrics

5-point Likert scale, Kendall's τ, Krippendorff's α, Welch's t-test (statistical significance)

Datasets

WritingPrompts, AG-News, Yoo et al. adversarial samples (against BERT on AG-News)

Context Entities

Models

BERT-base-uncased (victim classifier)

Metrics

inter-annotator agreement (exact percentage)

Datasets

WritingPrompts (Kaggle mirror)
