Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.8
Citation Count
31
Why It Matters For Business
LLMs can provide fast, cheap, and reproducible quality checks during model development, reducing reliance on costly expert rounds while preserving relative system comparisons.
Summary TLDR
The paper defines "LLM evaluation": give an LLM the exact human evaluation prompt, the sample, and the rating question, then parse the LLM's output as a score. On two tasks (open-ended story scoring and adversarial-text quality), the best LLMs (text-davinci-003 and ChatGPT) rank outputs similarly to expert English teachers, are stable to small instruction and sampling changes, and are far cheaper and faster than hiring experts. The authors note clear limits: LLMs can be biased by safety tuning, lack reliable factual knowledge or emotions, and may refuse or alter judgments on policy-sensitive content. They recommend using LLM evaluation as a fast, reproducible development tool, not a full human replacement.
Problem Statement
Human evaluation is necessary but costly, slow, inconsistent and hard to reproduce. The paper asks: can strong LLMs act as reliable, cheaper substitutes for human judges when scoring text quality?
Main Contribution
Define LLM evaluation: feed LLMs the same instruction, sample, and rating question used in human studies and parse their free-text replies into Likert scores.
Show that strong LLMs (text-davinci-003, ChatGPT) correlate with expert teachers on story quality and adversarial-sample quality and reproduce relative rankings.
Measure sensitivity to prompts and sampling, report cost/time gains, and discuss ethical and failure modes of replacing humans with LLM judges.
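The procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `build_prompt` and `parse_likert` are hypothetical, and the parsing rule (take the first integer that falls on the scale) is an assumed heuristic that a real study would need to validate against its own reply formats.

```python
import re
from typing import Optional


def build_prompt(instruction: str, sample: str, question: str) -> str:
    """Join the three parts shown to human raters into one plain-text prompt."""
    return f"{instruction.strip()}\n\n{sample.strip()}\n\n{question.strip()}"


def parse_likert(reply: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the first integer in [low, high] found in the reply, else None.

    None covers refusals and replies without a usable number, so they can be
    counted as missing data instead of being coerced into a score.
    Assumed heuristic: the first in-range integer is the rating.
    """
    for token in re.findall(r"\b\d+\b", reply):
        value = int(token)
        if low <= value <= high:
            return value
    return None


if __name__ == "__main__":
    prompt = build_prompt(
        "Please rate the story fragment.",          # placeholder instruction
        "Once upon a time ...",                     # placeholder sample
        "How grammatically correct is it (1-5)?",   # placeholder question
    )
    print(prompt)
    print(parse_likert("I would rate the grammaticality as 4 out of 5."))
```

Returning `None` rather than a default score keeps refusals visible, which matters because the summary lists model refusal on sensitive inputs as a failure mode.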
Key Findings
A strong InstructGPT model (text-davinci-003) correlates positively with expert teacher ratings on individual stories.
text-davinci-003 and ChatGPT prefer human-written stories over GPT-2 outputs, matching expert teachers.
LLM evaluation is far cheaper and faster than hiring expert annotators.
LLM ratings are stable to small instruction and sampling changes.
On adversarial-text quality, LLMs detect the degradation but rate adversarial samples less harshly than experts do.
Results
Kendall's τ between text-davinci-003 and teachers
LLM vs human mean Likert (adversarial fluency)
Statistical significance of LLM preference
Cost to evaluate 200 stories
Sensitivity to instruction/sampling
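For readers who want to reproduce the rank-agreement check, Kendall's τ between two sets of ratings can be computed as sketched here. This is the tie-corrected τ-b variant, chosen as an assumption because Likert ratings contain many ties; the summary does not state which variant the paper used. The ratings in the usage lines are made-up placeholders, not results from the paper.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length rating lists (handles ties)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both raters: contributes to neither term
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denominator = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    if denominator == 0:
        return float("nan")  # undefined when one rater gives a constant score
    return (concordant - discordant) / denominator


if __name__ == "__main__":
    llm_scores = [4, 3, 5, 2, 4]      # placeholder LLM ratings
    teacher_scores = [5, 3, 4, 2, 4]  # placeholder expert ratings
    print(kendall_tau_b(llm_scores, teacher_scores))
```

The pairwise loop is quadratic in the number of samples, which is fine for a few hundred stories; `scipy.stats.kendalltau` is a faster alternative if SciPy is available.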
Who Should Care
What To Try In 7 Days
Run LLM evaluation (same prompt used for humans) with text-davinci-003 on a held-out set to get a development-time ranking.
Sanity-check LLM scores with a small panel of experts on a few examples before trusting absolute values.
Fix one prompt and seed; document model and sampling parameters to ensure reproducibility across runs.
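One way to document a run, sketched under assumptions: the record type `EvalRunConfig` and its fields are hypothetical, the parameter values in the usage lines are illustrative rather than the paper's settings, and a hosted model may not honour a seed at all. Hashing the prompt template makes accidental wording changes detectable between runs.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalRunConfig:
    """Hypothetical record of everything needed to repeat an evaluation run."""
    model: str
    temperature: float
    top_p: float
    seed: int
    prompt_sha256: str


def make_config(model: str, prompt_template: str, temperature: float,
                top_p: float, seed: int) -> EvalRunConfig:
    """Build a config whose prompt hash changes if the template text changes."""
    digest = hashlib.sha256(prompt_template.encode("utf-8")).hexdigest()
    return EvalRunConfig(model, temperature, top_p, seed, digest)


def to_json(config: EvalRunConfig) -> str:
    """Serialize with sorted keys so two identical configs give identical text."""
    return json.dumps(asdict(config), sort_keys=True, indent=2)


if __name__ == "__main__":
    template = "Please rate the story fragment.\n\n{story}\n\n{question}"
    config = make_config("text-davinci-003", template,
                         temperature=1.0, top_p=0.9, seed=0)  # illustrative values
    print(to_json(config))
```

Storing this JSON next to the raw model replies lets a later reader tell whether a score shift came from the model, the sampling settings, or the prompt.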
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Not reliable for tasks requiring factual verification: LLMs can hallucinate or hold incorrect facts (§5).
- Safety and sentiment bias: safety-tuned models (ChatGPT) can systematically downscore content they label harmful or profane (§3.3, §5).
- Lack of emotions and human perspective: LLMs may refuse emotion-based ratings or judge differently from humans on subjective attributes (§5, limitations).
- Cannot use visual formatting cues that humans can use in instructions; LLMs only accept raw text (§5).
- Provider changes may break reproducibility if the hosted model updates or becomes unavailable (§5).
When Not To Use
- Tasks requiring reliable factual or world-knowledge checks (fact-checking).
- Evaluations that depend on human emotions, lived experience, or preferences.
- Final human-facing assessments before deployment without follow-up human testing.
- Content that triggers LLM content filters or refusal (policy-sensitive inputs).
Failure Modes
- Systematic rating bias (e.g., some LLMs consistently rate higher or lower than humans).
- Prompt sensitivity causing absolute-score shifts if prompt wording changes.
- Model refusal or redaction on sensitive inputs, creating missing data.
- Disagreement with human experts on subjective attributes (likability, coherence).
Core Entities
Models
- text-davinci-003
- text-curie-001
- T0 (T0pp)
- ChatGPT
- GPT-2 (fine-tuned medium)
Metrics
- 5-point Likert scale
- Kendall's τ
- Krippendorff's α
- Welch's t-test (statistical significance)
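As a reference for the significance test listed above, the Welch statistic and its approximate degrees of freedom can be computed with the standard library alone. This is a sketch of the textbook formula, not the paper's analysis script; the two rating lists are made-up placeholders, and turning the statistic into a p-value needs a t-distribution CDF (for example `scipy.stats.ttest_ind(a, b, equal_var=False)` if SciPy is available).

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, degrees_of_freedom) for Welch's unequal-variance t-test.

    Each sample needs at least two values, and at least one sample must
    have non-zero variance, otherwise the statistic is undefined.
    """
    var_a = variance(a) / len(a)  # squared standard error of mean(a)
    var_b = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1)
    )
    return t, df


if __name__ == "__main__":
    human_written = [4, 5, 4, 5]  # placeholder ratings for human-written stories
    gpt2_written = [2, 3, 2, 3]   # placeholder ratings for GPT-2 stories
    print(welch_t(human_written, gpt2_written))
```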
Datasets
- WritingPrompts
- AG-News
- Yoo et al. adversarial samples (against BERT on AG-News)
Context Entities
Models
- BERT-base-uncased (victim classifier)
Metrics
- inter-annotator agreement (exact percentage)
Datasets
- WritingPrompts (Kaggle mirror)

