LLMs (text‑davinci‑003, ChatGPT) can mimic expert human ratings on story quality and adversarial text, cheaply and reproducibly.

May 3, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

LLM evaluation is ready as a development-time tool (reproducible, cheap) but is not a full replacement for humans in deployment or for tasks that depend on factual knowledge or emotion.

Citations: 31

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 13

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 40%

Authors

Cheng-Han Chiang, Hung-yi Lee

Links

Abstract / PDF / Data

Why It Matters For Business

LLMs can provide fast, cheap, and reproducible quality checks during model development, reducing reliance on costly expert rounds while preserving relative system comparisons.

Who Should Care

Summary TLDR

The paper defines "LLM evaluation": give an LLM the exact human evaluation prompt, the sample, and the rating question, then parse the LLM's output as a score. On two tasks (open-ended story scoring and adversarial-text quality), the best LLMs (text-davinci-003 and ChatGPT) rank outputs similarly to expert English teachers, are stable under small changes to instructions and sampling, and are far cheaper and faster than hiring experts. The authors note clear limits: LLMs can be biased by safety tuning, lack reliable factual knowledge and emotions, and may refuse or alter judgments on policy-sensitive content. They recommend using LLM evaluation as a fast, reproducible development tool, not as a full replacement for human evaluation.

Problem Statement

Human evaluation is necessary but costly, slow, inconsistent and hard to reproduce. The paper asks: can strong LLMs act as reliable, cheaper substitutes for human judges when scoring text quality?

Main Contribution

Define LLM evaluation: feed LLMs the same instruction, sample, and rating question used in human studies and parse their free-text replies into Likert scores.

Show that strong LLMs (text‑davinci‑003, ChatGPT) correlate with expert teachers on story quality and adversarial-sample quality and reproduce relative rankings.
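The procedure defined above (instruction, sample, and rating question in; Likert score out) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `build_prompt`, `parse_likert`, the prompt layout, and the regex parsing rule are all assumptions, and the LLM call itself is left out.

```python
import re
from typing import Optional


def build_prompt(instruction: str, sample: str, question: str) -> str:
    """Join the human-study instruction, the sample, and the rating question."""
    parts = (instruction, sample, question)
    return "\n\n".join(p.strip() for p in parts)


# Assumed parsing rule: take the first standalone integer from 1 to 5.
# Decimals such as "3.5" and larger numbers such as "10" are rejected.
_LIKERT = re.compile(r"(?<![\d.])([1-5])(?!\.?\d)")


def parse_likert(reply: str) -> Optional[int]:
    """Return the 1-5 score found in a free-text reply, or None (e.g. a refusal)."""
    match = _LIKERT.search(reply)
    return int(match.group(1)) if match else None
```

Returning `None` keeps refusals and unparseable replies visible instead of silently scoring them. A reply that restates the scale before the rating ("on a scale of 1 to 5 ...") would be misread by this simple rule, so parsed scores deserve a manual spot check.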

Key Findings

A strong InstructGPT (text‑davinci‑003) correlates positively with expert teacher ratings on individual stories.

Numbers: Kendall's τ up to 0.38 (relevance) vs. teachers

Practical Use: Use text‑davinci‑003 to rank items similarly to experts, especially for relevance checks.

Evidence Ref: Table 2; §3.3.1
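For readers who want to run the same kind of rank-correlation check on their own ratings, here is a minimal sketch of Kendall's τ. It assumes the tau-b variant, which corrects for the ties that are common with 5-point Likert scores; the summary does not say which variant the paper uses, and `kendall_tau_b` is an illustrative name.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length lists of ratings."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both ratings: excluded from every count
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else float("nan")
```

The pairwise loop is quadratic, which is fine for a few hundred stories. If SciPy is available, `scipy.stats.kendalltau` computes the same tau-b statistic by default and also returns a p-value.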

text‑davinci‑003 and ChatGPT prefer human-written stories over GPT-2 outputs, matching expert teachers.

Numbers: Preference differences statistically significant by Welch's t-test (p < 0.05)

Practical Use: LLM evaluation with these LLMs can replace low-quality crowd labels for relative system comparisons.

Evidence Ref: Table 1; §3.3
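Welch's t-test compares two group means without assuming equal variances. The sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom with the standard library only; `welch_t` is an illustrative name and any ratings passed to it are the caller's own, not the paper's data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df
```

Turning `t` and `df` into a p-value needs the t-distribution CDF, which the standard library does not provide; with SciPy, `scipy.stats.ttest_ind(a, b, equal_var=False)` runs the same test and returns the p-value directly.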

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Kendall's τ between text-davinci-003 and teachers | τ = 0.14 to 0.38 across attributes; strongest on relevance | | | WritingPrompts (200 stories) | Table 2; §3.3.1 | Table 2 |
| LLM vs human mean Likert (adversarial fluency) | ChatGPT: benign 4.32, TextFooler 2.12, PWWS 2.42, BAE 3.71 | Human: benign 4.55, TextFooler 2.17, PWWS 2.16, BAE 3.01 | LLMs rate adversarial samples higher than humans (e.g., BAE +0.7) | AG-News adversarial samples (100 per attack) | Table 4; §4.3 | Table 4 |
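The delta in the adversarial-fluency row is a plain difference of mean Likert scores. A short worked example, using only the ChatGPT and human means listed on this page (the variable names are illustrative):

```python
# Mean fluency ratings as listed in the row above.
chatgpt = {"benign": 4.32, "TextFooler": 2.12, "PWWS": 2.42, "BAE": 3.71}
human = {"benign": 4.55, "TextFooler": 2.17, "PWWS": 2.16, "BAE": 3.01}

# Signed gap per condition: positive means ChatGPT rated higher than humans.
delta = {name: round(chatgpt[name] - human[name], 2) for name in chatgpt}

# Conditions ordered from highest to lowest mean rating for each rater.
chatgpt_order = sorted(chatgpt, key=chatgpt.get, reverse=True)
human_order = sorted(human, key=human.get, reverse=True)
```

On these values the gap is positive for PWWS (+0.26) and BAE (+0.70) and slightly negative for benign (-0.23) and TextFooler (-0.05). Both raters place benign text first and BAE second, while TextFooler and PWWS swap places, with human means only 0.01 apart.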

What To Try In 7 Days

Run LLM evaluation (same prompt used for humans) with text-davinci-003 on a held-out set to get a development-time ranking.

Sanity-check LLM scores with a small panel of experts on a few examples before trusting absolute values.

Fix one prompt and seed; document model and sampling parameters to ensure reproducibility across runs.
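One way to document model and sampling parameters, as the last step suggests, is to freeze them in a small record and log a fingerprint with every score. This is a sketch under assumptions: `EvalRunConfig` and its field names are hypothetical, and not every LLM API accepts or honors a seed.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalRunConfig:
    """Hypothetical record of everything that can change an LLM rating."""
    model: str
    prompt: str
    temperature: float
    top_p: float
    seed: int
    n_samples: int = 1

    def fingerprint(self) -> str:
        """Short stable hash of the config, suitable for tagging result files."""
        blob = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()[:12]
```

Two runs with the same fingerprint used the same prompt and settings, so any difference in scores comes from sampling noise or a model-side change rather than from an undocumented edit.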

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Not reliable for tasks requiring factual verification: LLMs can hallucinate or hold incorrect facts (§5).

Safety and sentiment bias: safety-tuned models (ChatGPT) can systematically downscore content they label harmful or profane (§3.3, §5).

When Not To Use

Tasks requiring reliable factual or world-knowledge checks (fact-checking).

Evaluations that depend on human emotions, lived experience, or preferences.

Failure Modes

Systematic rating bias (e.g., some LLMs consistently rate higher or lower than humans).

Prompt sensitivity causing absolute-score shifts if prompt wording changes.
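Both failure modes can be monitored with two small checks: the mean signed offset between LLM and human scores on the same items, and whether the system-level ranking survives a prompt change. The sketch below is illustrative only; `mean_offset` and `same_ranking` are hypothetical helper names.

```python
from statistics import mean


def mean_offset(llm_scores, human_scores):
    """Average of (LLM - human) over paired items; positive means the LLM rates higher."""
    if len(llm_scores) != len(human_scores):
        raise ValueError("score lists must be paired item by item")
    return mean(l - h for l, h in zip(llm_scores, human_scores))


def same_ranking(means_a, means_b):
    """True if two {system: mean score} dicts order the systems identically."""
    def order(means):
        return sorted(means, key=means.get, reverse=True)
    return order(means_a) == order(means_b)
```

A non-zero offset with an unchanged ranking is the benign case for relative comparisons: absolute scores shift, but the systems stay in the same order.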

Core Entities

Models

text-davinci-003, text-curie-001, T0 (T0pp), ChatGPT, GPT-2 (fine-tuned medium)

Metrics

5-point Likert scale, Kendall's τ, Krippendorff's α, Welch's t-test (statistical significance)

Datasets

WritingPrompts, AG-News, Yoo et al. adversarial samples (against BERT on AG-News)

Context Entities

Models

BERT-base-uncased (victim classifier)

Metrics

inter-annotator agreement (exact percentage)

Datasets

WritingPrompts (Kaggle mirror)