ChatGPT can score generated text without references — explicit numeric scores work best; pairwise comparisons often underperform.

April 3, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The experiments cover four tasks and several datasets and consistently show ChatGPT's Explicit Score beating many baselines, but the results depend on proprietary models and a limited set of prompt variations.

Citations: 19

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 40%

Authors

Yi Chen, Rui Wang, Haiyun Jiang, Shuming Shi, Ruifeng Xu

Why It Matters For Business

You can use ChatGPT to score generated text without references and get evaluations closer to human judgments than many automatic metrics, which speeds up model iteration and reduces reliance on hand-built references.

Summary TLDR

This study tests using large language models (LLMs) — mainly ChatGPT and text-davinci variants — as reference-free judges of generated text. The authors compare three approaches: Explicit Score (ask the model to give a numeric score), Implicit Score (use the model's yes/no token probability), and Pairwise Comparison (ask which of two texts is better). Across four tasks and several datasets, ChatGPT's Explicit Score correlates with human judgments better than many classic metrics. Implicit Scores can help on hard comparisons but are generally less discriminative. Direct pairwise comparison with ChatGPT often performs worse than single-text scoring. Prompt wording and decoding (greedy vs. top‑

Problem Statement

Automatic metrics that compare generated text to references miss many valid outputs. We need reliable reference-free ways to judge text quality (coherence, fluency, consistency, relevance) across tasks. The paper asks: can LLMs (ChatGPT, text-davinci) be used as reference-free evaluators, which scoring method works best, and how do prompt and decoding choices affect reliability?

Main Contribution

Systematic comparison of three reference-free LLM-based evaluators: Explicit Score, Implicit Score, and Pairwise Comparison.

Large-scale empirical tests across four tasks: summarization, dialogue, story generation, paraphrase (using SummEval, FED, OpenMEVA-ROC, Twitter-Extend).
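The three evaluator styles differ mainly in what the model is asked to produce. The sketch below illustrates that difference with hypothetical prompt builders; the function names and the prompt wording are assumptions made for illustration and are not the prompts used in the paper.

```python
# Hypothetical prompt templates for the three evaluation styles.
# The wording is an illustrative assumption, not the paper's prompts.


def explicit_prompt(text: str, aspect: str) -> str:
    """Explicit Score: ask for a single numeric score between 0 and 100."""
    return (
        f"Score the following text for {aspect} on a scale from 0 to 100.\n"
        f"Text: {text}\n"
        "Score:"
    )


def implicit_prompt(text: str, aspect: str) -> str:
    """Implicit Score: ask a yes/no question; the score is later read
    from the probabilities the model assigns to the answer tokens."""
    return (
        f"Is the following text good in terms of {aspect}? Answer Yes or No.\n"
        f"Text: {text}\n"
        "Answer:"
    )


def pairwise_prompt(text_a: str, text_b: str, aspect: str) -> str:
    """Pairwise Comparison: ask which of two candidates is better."""
    return (
        f"Which text is better in terms of {aspect}? Answer A or B.\n"
        f"Text A: {text_a}\n"
        f"Text B: {text_b}\n"
        "Answer:"
    )


if __name__ == "__main__":
    print(explicit_prompt("The cat sat on the mat.", "fluency"))
```

Only the explicit and pairwise styles can be used with a chat-only interface; the implicit style needs access to token probabilities.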

Key Findings

ChatGPT's Explicit Score aligns with human judgments better than many automatic metrics on multiple tasks.

Numbers: SummEval (coherence) Spearman: ChatGPT (greedy) 52.2 vs BARTScore 33.4 (Table 1).

Practical Use: Prefer asking ChatGPT for a single numeric score (0–100) and use greedy decoding to get evaluations that better match human ratings.

Evidence Ref: Table 1 (SummEval COH)
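An explicit score arrives as free text, so it has to be parsed before it can be compared with human ratings. The following is a minimal sketch, assuming the reply contains the score as its first number; `parse_score` is a hypothetical helper, not something from the paper.

```python
import re
from typing import Optional


def parse_score(reply: str, low: float = 0.0, high: float = 100.0) -> Optional[float]:
    """Extract the first number in a model reply and clamp it to [low, high].

    Returns None when the reply contains no number. Assumes the score is
    the first number in the reply, which fails for replies such as
    "on a 0-100 scale I give 85".
    """
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    value = float(match.group())
    return min(max(value, low), high)


if __name__ == "__main__":
    for reply in ["Score: 85", "I'd give it 72.", "150", "n/a"]:
        print(repr(reply), "->", parse_score(reply))
```

Greedy decoding is commonly approximated by setting the sampling temperature to 0, which also makes replies more repeatable and the parsing step easier to audit.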

Explicit scoring outperforms Implicit (model-confidence) scoring on most datasets.

Numbers: FED overall Spearman: text-davinci-003 Implicit 30.3 vs ChatGPT Explicit (greedy) 49.9 (Table 2).

Practical Use: Use explicit numeric prompts with ChatGPT instead of relying on model token-probabilities when possible.

Evidence Ref: Table 2 (FED overall)
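For comparison, an implicit score can be derived from the log-probabilities of the "Yes" and "No" answer tokens. The sketch below renormalizes the probability of "Yes" over those two tokens. This normalization is an assumption chosen for illustration and may differ from the exact formula in the paper; `implicit_score` is a hypothetical name.

```python
import math


def implicit_score(logprob_yes: float, logprob_no: float) -> float:
    """Probability of 'Yes', renormalized over the 'Yes' and 'No' tokens.

    Equivalent to p_yes / (p_yes + p_no), written as a logistic function
    of the log-probability difference for numerical stability.
    NOTE: this normalization is an illustrative assumption.
    """
    diff = logprob_no - logprob_yes
    if diff > 700.0:  # math.exp would overflow; the score is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


if __name__ == "__main__":
    # Example: the model assigns 0.6 to "Yes" and 0.2 to "No".
    print(implicit_score(math.log(0.6), math.log(0.2)))
```

Because the result is a confidence rather than a graded judgment, many texts can end up with scores bunched near 0 or 1, which is consistent with the summary's note that implicit scores are less discriminative.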

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
SummEval (coherence) Spearman | ChatGPT Explicit (greedy) 52.2 | BARTScore 33.4 | +18.8 | SummEval (sample-level) | Table 1: ChatGPT (greedy) 52.2 vs BARTScore 33.4 | Table 1
FED (overall) Spearman | ChatGPT Explicit (greedy) 49.9 | text-davinci-003 Implicit 30.3 | +19.6 | FED (dataset-level) | Table 2: ChatGPT (greedy) 49.9 vs text-davinci-003 implicit 30.3 | Table 2

What To Try In 7 Days

Run ChatGPT Explicit Score (0–100) on a representative sample of your outputs using greedy decoding.

Compare Spearman/Kendall correlations against your current automatic metrics using a small human-labeled subset.

Avoid pairwise prompts for low-quality candidate pools; prefer per-sample explicit scoring in the first round.
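The correlation check in the steps above can be run without extra dependencies. The following is a minimal pure-Python sketch of Spearman's rank correlation with average ranks for ties; the function names and the sample data are hypothetical.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    human = [1, 2, 3, 4, 5]        # hypothetical human ratings
    model = [10, 30, 20, 40, 50]   # hypothetical explicit scores
    print(round(spearman(human, model), 3))
```

Running the same function on your existing automatic metric and on the LLM scores, against the same human-labeled subset, gives a like-for-like comparison.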

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Meta-evaluation relies on correlation with human labels; human label noise can distort metric ranking.

Coverage limited to four short-text tasks; results may not hold for very high-quality or long-form text.

When Not To Use

When you need reference-based exact matches or fine-grained factual consistency checking.

When evaluating very high-quality outputs where discrimination is subtle.

Failure Modes

ChatGPT tends to judge many candidate texts as generally low quality, making pairwise distinctions unstable.

Pairwise prompts can bias early token choices (prompt artifacts) and reduce ranking reliability.

Core Entities

Models

ChatGPT (gpt3.5-turbo-0301)
text-davinci-003
text-davinci-001
GPT-2 (baseline)

Metrics

Explicit Score (LLM-generated numeric 0–100)
Implicit Score (yes/no token probability)
Pairwise Comparison
ROUGE-1
ROUGE-2
ROUGE-L
BERTScore
MoverScore
PRISM
BARTScore (+CNN, +CNN+Para)
ParaScore
iBLEU
Perplexity
Kendall's Tau-b
Spearman
Pearson

Datasets

SummEval
FED (dialog-level)
OpenMEVA-ROC
Twitter-Para (Twitter (Extend))