Overview
The experiments cover four tasks and several datasets and consistently show ChatGPT's Explicit Score beating many baselines, but the results depend on proprietary models and were tested with only a limited set of prompt variations.
Citations: 19
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
You can use ChatGPT to score generated text without references and get evaluations closer to human judgments than many automatic metrics, which speeds up model iteration and reduces reliance on hand-built references.
Summary TLDR
This study tests using large language models (LLMs) — mainly ChatGPT and text-davinci variants — as reference-free judges of generated text. The authors compare three approaches: Explicit Score (ask the model to give a numeric score), Implicit Score (use the model's yes/no token probability), and Pairwise Comparison (ask which of two texts is better). Across four tasks and several datasets, ChatGPT's Explicit Score correlates with human judgments better than many classic metrics. Implicit Scores can help on hard comparisons but are generally less discriminative. Direct pairwise comparison with ChatGPT often performs worse than single-text scoring. Prompt wording and decoding (greedy vs. top‑
Problem Statement
Automatic metrics that compare generated text to references miss many valid outputs. We need reliable reference-free ways to judge text quality (coherence, fluency, consistency, relevance) across tasks. The paper asks: can LLMs (ChatGPT, text-davinci) be used as reference-free evaluators, which scoring method works best, and how do prompt and decoding choices affect reliability?
Main Contribution
Systematic comparison of three reference-free LLM-based evaluators: Explicit Score, Implicit Score, and Pairwise Comparison.
Large-scale empirical tests across four tasks: summarization, dialogue, story generation, paraphrase (using SummEval, FED, OpenMEVA-ROC, Twitter-Extend).
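The three evaluator styles named above can be sketched as prompt builders plus small reply parsers. This is a minimal sketch under stated assumptions, not the authors' implementation: the prompt wording, every function name, the clamping to a 0-100 range, and the yes/no normalization formula are all illustrative, and no model API is called.

```python
import math
import re
from typing import Optional


def explicit_prompt(text: str, aspect: str) -> str:
    """Explicit Score: ask for a numeric rating (wording is hypothetical)."""
    return (
        f"Score the following text for {aspect} on a scale from 0 to 100.\n"
        f"Text: {text}\n"
        "Score:"
    )


def implicit_prompt(text: str, aspect: str) -> str:
    """Implicit Score: ask a yes/no question; the score comes from token probabilities."""
    return (
        f"Is the following text good in terms of {aspect}? Answer Yes or No.\n"
        f"Text: {text}\n"
        "Answer:"
    )


def pairwise_prompt(text_a: str, text_b: str, aspect: str) -> str:
    """Pairwise Comparison: ask which of two texts is better."""
    return (
        f"Which text is better in terms of {aspect}? Answer A or B.\n"
        f"Text A: {text_a}\n"
        f"Text B: {text_b}\n"
        "Answer:"
    )


def parse_explicit(reply: str) -> Optional[float]:
    """Take the first number in the reply and clamp it to [0, 100]."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    return min(100.0, max(0.0, float(match.group())))


def implicit_score(logprob_yes: float, logprob_no: float) -> float:
    """Assumed normalization: P(yes) / (P(yes) + P(no)) from token log-probabilities."""
    p_yes = math.exp(logprob_yes)
    p_no = math.exp(logprob_no)
    return p_yes / (p_yes + p_no)


def parse_pairwise(reply: str) -> Optional[str]:
    """Accept only replies that start with a standalone 'A' or 'B'."""
    match = re.match(r"\s*([AB])\b", reply)
    return match.group(1) if match else None
```

The parsers return `None` for replies they cannot interpret, so unparseable outputs can be counted separately rather than silently scored.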
Key Findings
ChatGPT's Explicit Score aligns with human judgments better than many automatic metrics on multiple tasks.
Explicit scoring outperforms Implicit (model-confidence) scoring on most datasets.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| SummEval (coherence) Spearman | ChatGPT Explicit (greedy) 52.2 | BARTScore 33.4 | +18.8 | SummEval (sample-level) | Table 1: ChatGPT (greedy) 52.2 vs BARTScore 33.4 | Table 1 |
| FED (overall) Spearman | ChatGPT Explicit (greedy) 49.9 | text-davinci-003 Implicit 30.3 | +19.6 | FED (dataset-level) | Table 2: ChatGPT (greedy) 49.9 vs text-davinci-003 implicit 30.3 | Table 2 |
What To Try In 7 Days
Run ChatGPT Explicit Score (0–100) on a representative sample of your outputs using greedy decoding.
Compare Spearman/Kendall correlations against your current automatic metrics using a small human-labeled subset.
Avoid pairwise prompts when the candidate pool is low quality; use per-sample explicit scoring as the first pass.
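The correlation check in the steps above needs only a few lines of code. The sketch below implements Spearman's rho (Pearson correlation computed on average ranks) and Kendall's tau-a in plain Python so that it has no dependencies. The score lists are made-up illustration data, not results from the paper, and tau-a applies no tie correction, so a statistics library routine may be the safer choice on real data with many ties.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs; ties count as neither."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical data: human ratings and two metrics on the same five outputs.
human = [1, 2, 3, 4, 5]
llm_scores = [10, 35, 30, 70, 90]
old_metric = [0.4, 0.1, 0.5, 0.3, 0.2]

print("LLM score:", spearman(human, llm_scores), kendall_tau_a(human, llm_scores))
print("Old metric:", spearman(human, old_metric), kendall_tau_a(human, old_metric))
```

Whichever metric correlates more strongly with the human-labeled subset is the better proxy for that subset; with only a handful of labels the gap can easily be noise, so a larger sample is worth the effort before switching metrics.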
Risks & Boundaries
Limitations
Meta-evaluation relies on correlation with human labels; human label noise can distort metric ranking.
Coverage limited to four short-text tasks; results may not hold for very high-quality or long-form text.
When Not To Use
When you need reference-based exact matches or fine-grained factual consistency checking.
When evaluating very high-quality outputs where discrimination is subtle.
Failure Modes
ChatGPT tends to judge many candidate texts as generally low quality, making pairwise distinctions unstable.
Pairwise prompts can bias early token choices (prompt artifacts) and reduce ranking reliability.

