ChatGPT can score generated text without references — explicit numeric scores work best; pairwise comparisons often underperform.

Overview

Decision Snapshot: Needs Validation

The experiments cover four tasks and several datasets and consistently show ChatGPT Explicit scoring beating many baselines, but results depend on proprietary models and limited prompt variations.

Citations: 19

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 40%

Authors

Yi Chen, Rui Wang, Haiyun Jiang, Shuming Shi, Ruifeng Xu

Links

Abstract / PDF / Data

Why It Matters For Business

You can use ChatGPT to score generated text without references and get evaluations closer to human judgments than many automatic metrics, which speeds up model iteration and reduces reliance on hand-built references.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO

Summary TLDR

This study tests using large language models (LLMs) — mainly ChatGPT and text-davinci variants — as reference-free judges of generated text. The authors compare three approaches: Explicit Score (ask the model to give a numeric score), Implicit Score (use the model's yes/no token probability), and Pairwise Comparison (ask which of two texts is better). Across four tasks and several datasets, ChatGPT's Explicit Score correlates with human judgments better than many classic metrics. Implicit Scores can help on hard comparisons but are generally less discriminative. Direct pairwise comparison with ChatGPT often performs worse than single-text scoring. Prompt wording and decoding (greedy vs. top‑

Problem Statement

Automatic metrics that compare generated text to references miss many valid outputs. We need reliable reference-free ways to judge text quality (coherence, fluency, consistency, relevance) across tasks. The paper asks: can LLMs (ChatGPT, text-davinci) be used as reference-free evaluators, which scoring method works best, and how do prompt and decoding choices affect reliability?

Main Contribution

Systematic comparison of three reference-free LLM-based evaluators: Explicit Score, Implicit Score, and Pairwise Comparison.

Large-scale empirical tests across four tasks: summarization, dialogue, story generation, paraphrase (using SummEval, FED, OpenMEVA-ROC, Twitter-Extend).

Key Findings

ChatGPT's Explicit Score aligns with human judgments better than many automatic metrics on multiple tasks.

Numbers: SummEval (coherence) Spearman: ChatGPT (greedy) 52.2 vs BARTScore 33.4 (Table 1).

Practical Use: Prefer asking ChatGPT for a single numeric score (0–100) and use greedy decoding to get evaluations that better match human ratings.

Evidence Ref: Table 1 (SummEval COH)
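The practical advice above can be sketched in a few lines. The example below is a minimal illustration, not the paper's prompt: the template wording, the function names `build_explicit_prompt` and `parse_score`, and the default task and aspect are all assumptions made for this sketch. The model call itself is left out; greedy decoding would be configured on whichever client you use.

```python
import re
from typing import Optional

# Hypothetical prompt template (an assumption for illustration;
# the paper's exact prompt wording is not reproduced here).
PROMPT_TEMPLATE = (
    "Score the following {task} output for {aspect} on a scale from 0 to 100, "
    "where 0 is the worst and 100 is the best. Reply with the number only.\n\n"
    "Text:\n{text}\n\nScore:"
)


def build_explicit_prompt(text: str, task: str = "summarization",
                          aspect: str = "coherence") -> str:
    """Fill the template for a single candidate text (per-sample scoring)."""
    return PROMPT_TEMPLATE.format(task=task, aspect=aspect, text=text)


def parse_score(reply: str) -> Optional[float]:
    """Return the first number in the reply if it lies in [0, 100], else None."""
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    value = float(match.group())
    return value if 0.0 <= value <= 100.0 else None
```

Returning None for unparseable or out-of-range replies keeps bad generations from silently entering the score set; how often that happens is worth tracking alongside the scores themselves.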

Explicit scoring outperforms Implicit (model-confidence) scoring on most datasets.

Numbers: FED overall Spearman: text-davinci-003 Implicit 30.3 vs ChatGPT Explicit (greedy) 49.9 (Table 2).

Practical Use: Use explicit numeric prompts with ChatGPT instead of relying on model token-probabilities when possible.

Evidence Ref: Table 2 (FED overall)
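For contrast, an Implicit Score derived from yes/no token probabilities might look like the sketch below. It assumes you already have log-probabilities for the first answer token from a model that exposes them. The function name `implicit_score`, the input format, and the choice to normalize "yes" against "no" are assumptions for illustration and may differ from the paper's exact definition.

```python
import math
from typing import Dict


def implicit_score(token_logprobs: Dict[str, float]) -> float:
    """Probability mass on 'yes' relative to 'yes' + 'no'.

    token_logprobs maps candidate first tokens (e.g. " Yes", "No") to their
    log-probabilities. The normalization is an assumption of this sketch.
    """
    p_yes = 0.0
    p_no = 0.0
    for token, logprob in token_logprobs.items():
        label = token.strip().lower()
        if label == "yes":
            p_yes += math.exp(logprob)
        elif label == "no":
            p_no += math.exp(logprob)
    total = p_yes + p_no
    if total == 0.0:
        return 0.5  # neither token observed; treat as uninformative
    return p_yes / total
```

Because this score is squeezed into the model's confidence on a single binary question, many texts can end up with near-identical values, which is consistent with the summary's note that Implicit Scores are generally less discriminative.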

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
SummEval (coherence) Spearman	ChatGPT Explicit (greedy) 52.2	BARTScore 33.4	+18.8	SummEval (sample-level)	Table 1: ChatGPT (greedy) 52.2 vs BARTScore 33.4	Table 1
FED (overall) Spearman	ChatGPT Explicit (greedy) 49.9	text-davinci-003 Implicit 30.3	+19.6	FED (dataset-level)	Table 2: ChatGPT (greedy) 49.9 vs text-davinci-003 implicit 30.3	Table 2

What To Try In 7 Days

Run ChatGPT Explicit Score (0–100) on a representative sample of your outputs using greedy decoding.

Compare Spearman/Kendall correlations against your current automatic metrics using a small human-labeled subset.

Avoid pairwise prompts when the candidate pool is low quality; use per-sample explicit scoring in the first round.
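The correlation check in the steps above can be run without extra dependencies. The sketch below computes Spearman correlation as the Pearson correlation of average ranks; the function names and the toy scores are made up for illustration, and a statistics library would serve equally well if one is available.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (invented): human ratings vs. LLM explicit scores for five outputs.
human_ratings = [1, 2, 3, 4, 5]
llm_scores = [20, 10, 60, 70, 90]
rho = spearman(human_ratings, llm_scores)
```

On this toy data `rho` comes out at 0.9. The figures quoted in the findings, such as 52.2, are correlations expressed on a 0 to 100 scale, so the equivalent here would read as 90.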

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://github.com/MilkWhite/LLMs_for_Reference_Free_Text_Quality_Evaluation

Risks & Boundaries

Limitations

Meta-evaluation relies on correlation with human labels; human label noise can distort metric ranking.

Coverage limited to four short-text tasks; results may not hold for very high-quality or long-form text.

When Not To Use

When you need reference-based exact matches or fine-grained factual consistency checking.

When evaluating very high-quality outputs where discrimination is subtle.

Failure Modes

ChatGPT tends to judge many candidate texts as generally low quality, making pairwise distinctions unstable.

Pairwise prompts can bias early token choices (prompt artifacts) and reduce ranking reliability.

Core Entities

Models

ChatGPT (gpt3.5-turbo-0301), text-davinci-003, text-davinci-001, GPT-2 (baseline)

Metrics

Explicit Score (LLM-generated numeric 0–100), Implicit Score (yes/no token probability), Pairwise Comparison, ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, MoverScore, PRISM, BARTScore (+CNN, +CNN+Para), ParaScore, iBLEU, Perplexity, Kendall's Tau-b, Spearman, Pearson

Datasets

SummEval, FED (dialog-level), OpenMEVA-ROC, Twitter-Para (Twitter (Extend))
