LLMs (text‑davinci‑003, ChatGPT) can mimic expert human ratings on story quality and adversarial text, cheaply and reproducibly.

May 3, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.8

Citation Count

31

Authors

Cheng-Han Chiang, Hung-yi Lee

Links

Abstract / PDF

Why It Matters For Business

LLMs can provide fast, cheap, and reproducible quality checks during model development, reducing reliance on costly expert rounds while preserving relative system comparisons.

Summary TLDR

The paper defines "LLM evaluation": give an LLM the exact human evaluation prompt, the sample, and the rating question, then parse the LLM's output as a score. On two tasks (open-ended story scoring and adversarial-text quality), the best LLMs (text-davinci-003 and ChatGPT) rank outputs similarly to expert English teachers, are stable to small instruction and sampling changes, and are far cheaper and faster than hiring experts. The authors note clear limits: LLMs can be biased by safety tuning, lack reliable factual knowledge and emotions, and may refuse or alter judgments on policy-sensitive content. They recommend using LLM evaluation as a fast, reproducible development tool, not a full replacement for human evaluation.

Problem Statement

Human evaluation is necessary but costly, slow, inconsistent and hard to reproduce. The paper asks: can strong LLMs act as reliable, cheaper substitutes for human judges when scoring text quality?

Main Contribution

Define LLM evaluation: feed LLMs the same instruction, sample, and rating question used in human studies and parse their free-text replies into Likert scores.

Show that strong LLMs (text‑davinci‑003, ChatGPT) correlate with expert teachers on story quality and adversarial-sample quality and reproduce relative rankings.

Measure sensitivity to prompts and sampling, report cost/time gains, and discuss ethical and failure modes of replacing humans with LLM judges.
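
The prompt-and-parse loop described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the parsing rule (take the first standalone digit from 1 to 5 in the reply) are assumptions made for this example, and the call to the model itself is left out.

```python
import re
from typing import Optional


def build_prompt(instruction: str, sample: str, question: str) -> str:
    """Join the human-study instruction, the sample, and the rating question."""
    return f"{instruction.strip()}\n\n{sample.strip()}\n\n{question.strip()}"


# Hypothetical parsing rule: a digit 1-5 that is not part of a longer
# number (e.g. "10") or a decimal (e.g. "4.5").
_SCORE = re.compile(r"(?<!\d)(?<!\d\.)([1-5])(?!\d|\.\d)")


def parse_likert(reply: str) -> Optional[int]:
    """Return the first standalone 1-5 rating in the reply, or None if absent."""
    match = _SCORE.search(reply)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    prompt = build_prompt(
        "Please rate the story fragment.",
        "Once upon a time ...",
        "How grammatically correct is the text? (1 = lowest, 5 = highest)",
    )
    print(prompt)
    print(parse_likert("I would rate the grammaticality as 4 out of 5."))
    print(parse_likert("I cannot rate this text."))
```

A reply that restates the scale before giving a score ("On a scale of 1 to 5 ...") would defeat this simple rule, so a real pipeline would likely need a stricter pattern or a constrained output format.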

Key Findings

A strong InstructGPT (text‑davinci‑003) correlates positively with expert teacher ratings on individual stories.

Numbers: Kendall's τ up to 0.38 (relevance) vs teachers

text‑davinci‑003 and ChatGPT prefer human-written stories over GPT-2 outputs, matching expert teachers.

Numbers: Preference differences statistically significant by Welch's t-test (p < 0.05)

LLM evaluation is far cheaper and faster than hiring expert annotators.

Numbers: Human teacher cost US$140 vs InstructGPT cost <US$5 for 200 stories

LLM ratings are stable to small instruction and sampling changes.

Numbers: Mean rating shifts ≤ 0.25 across instruction variants and temperatures

On adversarial text quality, LLMs detect the damage but rate some attacks more leniently than experts.

Numbers: ChatGPT fluency of 4.32 for benign text vs 2.12 for TextFooler (Likert)

Results

Kendall's τ between text-davinci-003 and teachers

Value: τ = 0.14–0.38 across attributes; strongest on relevance
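
For readers who want to compute this kind of rank agreement on their own ratings, here is a minimal pure-Python sketch of Kendall's τ. It uses the tie-corrected τ-b variant because Likert ratings contain many ties; which variant the paper used is not stated here, so treat that choice as an assumption. The example ratings are made up.

```python
from math import sqrt
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Tie-corrected Kendall rank correlation (tau-b) via pairwise comparison."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied on both raters: excluded from every count
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    if denom == 0:
        raise ValueError("tau-b is undefined when one rater gives a constant score")
    return (concordant - discordant) / denom


if __name__ == "__main__":
    llm_scores = [1, 2, 2, 3]      # illustrative values only
    teacher_scores = [1, 3, 2, 3]  # illustrative values only
    print(kendall_tau_b(llm_scores, teacher_scores))
```

The double loop is quadratic in the number of items, which is fine for a few hundred stories; `scipy.stats.kendalltau` is the usual choice for larger sets.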

LLM vs human mean Likert (adversarial fluency)

Value (ChatGPT): benign 4.32, TextFooler 2.12, PWWS 2.42, BAE 3.71

Baseline (human): benign 4.55, TextFooler 2.17, PWWS 2.16, BAE 3.01
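
The per-attack gap between the two raters follows directly from the means listed above. The short sketch below only does that subtraction; the variable names are chosen for this example.

```python
# Mean fluency ratings as listed above (5-point Likert).
chatgpt = {"benign": 4.32, "TextFooler": 2.12, "PWWS": 2.42, "BAE": 3.71}
human = {"benign": 4.55, "TextFooler": 2.17, "PWWS": 2.16, "BAE": 3.01}

# Positive gap: ChatGPT rates the text higher (more leniently) than the humans.
gaps = {name: round(chatgpt[name] - human[name], 2) for name in chatgpt}

# Drop from benign text to each attack, per rater.
attacks = ["TextFooler", "PWWS", "BAE"]
chatgpt_drop = {a: round(chatgpt["benign"] - chatgpt[a], 2) for a in attacks}
human_drop = {a: round(human["benign"] - human[a], 2) for a in attacks}

if __name__ == "__main__":
    print("ChatGPT minus human:", gaps)
    print("ChatGPT drop from benign:", chatgpt_drop)
    print("Human drop from benign:", human_drop)
```

On these numbers the gap is positive for PWWS and BAE and slightly negative for TextFooler and benign text, so the leniency is attack-dependent rather than uniform.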

Statistical significance of LLM preference

Value: Welch's t-test p < 0.05 for text-davinci-003 preference
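
As a reminder of what the test computes, here is a minimal standard-library sketch of Welch's t statistic and its Welch–Satterthwaite degrees of freedom. The two rating lists are invented for illustration and are not data from the paper; the p-value step is omitted because it needs the t distribution.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, degrees of freedom) for two samples with unequal variances."""
    # Squared standard error of each sample mean (sample variance / n).
    se_a = variance(a) / len(a)
    se_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1))
    return t, df


if __name__ == "__main__":
    human_written = [4, 5, 4, 5, 4]   # hypothetical ratings
    gpt2_generated = [2, 3, 2, 3, 2]  # hypothetical ratings
    t, df = welch_t(human_written, gpt2_generated)
    print(f"t = {t:.3f}, df = {df:.1f}")
```

In practice `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and also returns the p-value.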

Cost to evaluate 200 stories

Value: Human US$140 vs InstructGPT <US$5
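
Broken down per story, the figures above work out as follows. The LLM cost is reported only as an upper bound, so the ratio computed here is a lower bound on the saving.

```python
N_STORIES = 200
HUMAN_TOTAL_USD = 140.0
LLM_TOTAL_USD_UPPER = 5.0  # reported only as "<US$5", so this is an upper bound

human_per_story = HUMAN_TOTAL_USD / N_STORIES          # US$ per story
llm_per_story_upper = LLM_TOTAL_USD_UPPER / N_STORIES  # upper bound, US$ per story
min_cost_ratio = HUMAN_TOTAL_USD / LLM_TOTAL_USD_UPPER # humans cost at least this many times more

if __name__ == "__main__":
    print(f"human: ${human_per_story:.2f} per story")
    print(f"LLM:   <${llm_per_story_upper:.3f} per story")
    print(f"ratio: at least {min_cost_ratio:.0f}x")
```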

Sensitivity to instruction/sampling

Value: Mean score changes ≤ 0.25 when adding persona/explain or varying temperature
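
A sensitivity check of this kind can be reproduced on your own data by comparing the mean rating under each prompt or sampling variant with a baseline run. The sketch below assumes you already hold one list of parsed scores per variant; the variant names and values are hypothetical.

```python
from statistics import mean
from typing import Dict, Sequence


def max_mean_shift(baseline: Sequence[float],
                   variants: Dict[str, Sequence[float]]) -> float:
    """Largest absolute difference between a variant's mean rating and the baseline mean."""
    base = mean(baseline)
    return max(abs(mean(scores) - base) for scores in variants.values())


if __name__ == "__main__":
    baseline = [4, 3, 5, 4]          # hypothetical scores, original instruction
    variants = {
        "persona": [4, 4, 5, 4],     # hypothetical scores, persona added
        "explain": [4, 3, 4, 4],     # hypothetical scores, explanation requested
    }
    print(max_mean_shift(baseline, variants))
```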

Who Should Care

What To Try In 7 Days

Run LLM evaluation (same prompt used for humans) with text-davinci-003 on a held-out set to get a development-time ranking.

Sanity-check LLM scores with a small panel of experts on a few examples before trusting absolute values.

Fix one prompt and seed; document model and sampling parameters to ensure reproducibility across runs.
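
One lightweight way to document a run is to store the model name, prompt, and sampling parameters in a single record and log a short hash of it next to every score. This is a sketch under assumptions: the field names are hypothetical, and whether a hosted model honors a seed depends on the provider.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalRunConfig:
    """Everything needed to repeat an evaluation run (hypothetical field set)."""
    model: str
    prompt: str
    temperature: float
    top_p: float
    seed: int

    def fingerprint(self) -> str:
        """Short, stable hash of the full configuration."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


if __name__ == "__main__":
    config = EvalRunConfig(
        model="text-davinci-003",
        prompt="Please rate the story fragment ...",
        temperature=1.0,
        top_p=0.9,
        seed=0,
    )
    print(json.dumps(asdict(config), indent=2))
    print("run id:", config.fingerprint())
```

Two runs with the same fingerprint used the same recorded settings; a changed prompt or temperature yields a different one, which makes accidental drift easy to spot in logs.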

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Not reliable for tasks requiring factual verification: LLMs can hallucinate or hold incorrect facts (§5).
  • Safety and sentiment bias: safety-tuned models (ChatGPT) can systematically downscore content they label harmful or profane (§3.3, §5).
  • Lack of emotions and human perspective: LLMs may refuse emotion-based ratings or judge differently from humans on subjective attributes (§5, limitations).
  • Cannot use visual formatting cues that humans can use in instructions; LLMs only accept raw text (§5).
  • Provider changes may break reproducibility if the hosted model updates or becomes unavailable (§5).

When Not To Use

  • Tasks requiring reliable factual or world-knowledge checks (fact-checking).
  • Evaluations that depend on human emotions, lived experience, or preferences.
  • Final human-facing assessments before deployment without follow-up human testing.
  • Content that triggers LLM content filters or refusal (policy-sensitive inputs).

Failure Modes

  • Systematic rating bias (e.g., some LLMs consistently rate higher or lower than humans).
  • Prompt sensitivity causing absolute-score shifts if prompt wording changes.
  • Model refusal or redaction on sensitive inputs, creating missing data.
  • Disagreement with human experts on subjective attributes (likability, coherence).
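
Refusals and unparseable replies are easier to manage when they are tracked explicitly instead of being silently dropped. A minimal sketch, assuming each reply has already been parsed to an integer score or to `None` when no score could be extracted (the function name is hypothetical):

```python
from statistics import mean
from typing import Dict, Optional, Sequence


def summarize_scores(scores: Sequence[Optional[int]]) -> Dict[str, Optional[float]]:
    """Mean over rated items plus the share of items with no usable score."""
    rated = [s for s in scores if s is not None]
    n_missing = len(scores) - len(rated)
    return {
        "n_total": len(scores),
        "n_missing": n_missing,
        "missing_rate": n_missing / len(scores) if scores else 0.0,
        "mean": mean(rated) if rated else None,
    }


if __name__ == "__main__":
    # None marks a refusal or a reply with no parseable rating (illustrative data).
    print(summarize_scores([4, None, 5, 3, None]))
```

Reporting the missing rate alongside the mean shows whether refusals are concentrated in one system's outputs, which would bias a comparison.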

Core Entities

Models

  • text-davinci-003
  • text-curie-001
  • T0 (T0pp)
  • ChatGPT
  • GPT-2 (fine-tuned medium)

Metrics

  • 5-point Likert scale
  • Kendall's τ
  • Krippendorff's α
  • Welch's t-test (statistical significance)

Datasets

  • WritingPrompts
  • AG-News
  • Yoo et al. adversarial samples (against BERT on AG-News)

Context Entities

Models

  • BERT-base-uncased (victim classifier)

Metrics

  • inter-annotator agreement (exact percentage)

Datasets

  • WritingPrompts (Kaggle mirror)