A broad third-party benchmark shows ChatGPT is a strong zero-shot performer but an unreliable reasoner and prone to hallucination

February 8, 2023 | 9 min

Overview

Decision Snapshot: Needs Validation

The paper gives a wide, practical snapshot: ChatGPT is production-ready for many low-risk tasks, useful for prototyping, but not reliable for complex reasoning or critical factual decisions without external checks.

Citations: 352

Evidence Strength: 0.78

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 8/8

Findings with evidence refs: 8/8

Results with explicit delta: 7/8

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 75%

Production readiness: 70%

Novelty: 25%

Authors

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, Pascale Fung

Links

Abstract / PDF / Code

Why It Matters For Business

ChatGPT is a practical zero-shot workhorse: it saves time on many tasks and can replace some fine-tuned models for quick proofs of concept, but its factual and reasoning errors mean you must validate outputs before customer-facing or safety-critical use.

Summary TLDR

This paper runs a practical, third-party benchmark of ChatGPT (Dec 15, 2022 UI) across 21–23 public datasets for 8 NLP tasks plus a new flag-drawing test. Key results: ChatGPT beats prior zero-shot LLMs on most tasks (9/13 datasets), even surpasses some fully fine-tuned models on 4 tasks, but shows weak and inconsistent reasoning (avg ~63.4% over 10 reasoning categories), frequent extrinsic hallucinations, and poor generation for low-resource/non-Latin scripts. Multi-turn interaction helps: ~8 ROUGE-1 points for summarization and ~2 ChrF++ for low-resource MT via post-editing. Code for dataset extraction is released.

Problem Statement

Interactive LLMs like ChatGPT are widely used but lack independent, reproducible third-party evaluation across tasks, languages, vision capability, reasoning, hallucination, and the value of multi-turn interaction. The paper fills that gap with quantitative tests using public datasets.

Main Contribution

A reproducible evaluation framework for interactive LLMs using 21–23 public datasets across 8 NLP tasks and a new 50-flag multimodal task

Extensive zero-shot benchmarking of ChatGPT (Dec 15, 2022 UI) on multitask, multilingual, and multimodal fronts, plus a follow-up GPT-4 comparison

Key Findings

ChatGPT often outperforms prior zero-shot LLMs.

Numbers: 9/13 evaluated datasets (zero-shot comparisons)

Practical Use: Use ChatGPT as a strong zero-shot baseline for many NLP tasks, but still compare to task-specific fine-tuned models for dialogue and knowledge-grounded tasks

Evidence Ref: Table 1; §2.1

ChatGPT can beat some fully fine-tuned models on certain tasks.

Numbers: surpassed fine-tuned SOTA on 4 datasets

Practical Use: For select tasks, try zero-shot ChatGPT before investing in custom fine-tuning; validate on held-out, task-specific metrics

Evidence Ref: Abstract; §2.1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Summarization ROUGE-1 | 35.29 (ChatGPT, sample subset) | 44.47 (fine-tuned SOTA BART) | -9.18 | CNN/DM (subset) | Table 1 (CNN/DM); §C.1 | Table 1
Machine Translation ChrF++ (XXX→Eng) | 58.64 (ChatGPT, HRL avg) | 63.5 (fine-tuned SOTA) | -4.86 | FLoRes-200 HRL subset | Table 1; §C.2 | Table 1
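The Delta column is simply the ChatGPT score minus the fine-tuned baseline, so a negative value means ChatGPT trails the baseline. A minimal Python sketch that recomputes both deltas from the reported values; the dictionary keys and the `delta` helper are illustrative names, not part of the paper's released code.

```python
# Scores copied from the Results table: (ChatGPT value, fine-tuned baseline).
results = {
    "CNN/DM ROUGE-1": (35.29, 44.47),
    "FLoRes-200 HRL ChrF++ (XXX->Eng)": (58.64, 63.5),
}


def delta(value: float, baseline: float) -> float:
    """Signed gap (value - baseline), rounded to two decimals as in the table."""
    return round(value - baseline, 2)


# Hypothetical helper output: one signed delta per table row.
deltas = {name: delta(value, baseline) for name, (value, baseline) in results.items()}

for name, d in deltas.items():
    print(f"{name}: {d:+.2f}")
```

Both rows come out negative, which matches the table: on these two benchmarks zero-shot ChatGPT is still behind the fine-tuned systems.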

What To Try In 7 Days

Run ChatGPT zero-shot on your core text tasks as a baseline and compare to current models (use 30–200 held-out samples)

Add a short multi-turn refinement step (one follow-up prompt) for summarization and translation, and measure whether the reported gains (~8 ROUGE-1 points, ~2 ChrF++) carry over to your data

For non-English targets, have native speakers validate outputs and consider post-editing pipelines before deployment
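To check whether a follow-up prompt actually helps on your own data, you need a score for the first draft and for the refined output against the same reference. The sketch below uses a simplified ROUGE-1 F1 (lowercased whitespace tokens, no stemming), so its numbers will not match an official ROUGE implementation exactly. The function names and the toy strings are invented for illustration and do not come from the paper.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def refinement_gain(first_draft: str, refined: str, reference: str) -> float:
    """Change in ROUGE-1 F1 (0-100 scale) between the first draft and the refined output."""
    return 100 * (rouge1_f1(refined, reference) - rouge1_f1(first_draft, reference))


# Toy example with invented strings; replace with your held-out samples.
reference = "the cat sat on the mat"
first_draft = "a cat was there"
refined = "the cat sat on a mat"

gain = refinement_gain(first_draft, refined, reference)
print(f"ROUGE-1 F1 gain from one refinement turn: {gain:+.1f} points")
```

Averaging `refinement_gain` over your 30–200 held-out samples gives a number you can compare against the paper's reported improvement, bearing in mind the simplified tokenization.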

Agent Features

Memory
short-term multi-turn context (dialog history); no reliable long-term belief-state retention across turns
Tool Use
code generation as multimodal bridge (SVG/Canvas code output)
Frameworks
RLHF (used for alignment)
Architectures
Transformer (GPT family); RLHF-fine-tuned dialog model
Collaboration
human-in-the-loop editing; multi-turn prompt post-editing
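"Code generation as multimodal bridge" means the model never emits pixels; it emits markup such as SVG that a renderer turns into an image. The sketch below is a hand-written Python example of that kind of artifact for a generic three-stripe flag, plus a well-formedness check of the sort you might run on model output. The function names, dimensions, and colors are assumptions for illustration; the paper's actual prompts and outputs are not reproduced here.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def horizontal_tricolor_svg(colors, width=300, height=200):
    """Return SVG markup for a flag with three equal horizontal stripes."""
    if len(colors) != 3:
        raise ValueError("expected exactly three stripe colors")
    stripe = height / 3
    rects = "".join(
        f'<rect x="0" y="{i * stripe:.2f}" width="{width}" '
        f'height="{stripe:.2f}" fill="{color}"/>'
        for i, color in enumerate(colors)
    )
    return (
        f'<svg xmlns="{SVG_NS}" width="{width}" height="{height}" '
        f'viewBox="0 0 {width} {height}">{rects}</svg>'
    )


def count_rects(svg_text: str) -> int:
    """Parse SVG text and count <rect> elements; raises ParseError if malformed."""
    root = ET.fromstring(svg_text)
    return len(root.findall(f"{{{SVG_NS}}}rect"))


# Illustrative colors only; not tied to any specific national flag.
svg = horizontal_tricolor_svg(["red", "white", "blue"])
print(svg)
print("stripes found:", count_rects(svg))
```

A structural check like `count_rects` only confirms the markup parses and has the expected shapes; judging whether a generated flag is correct still needs a rendered image and a human look.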

Optimization Features

Training Optimization
RL

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Small per-task samples (30–200) limit statistical strength of some claims

Experiments use the web UI (Dec 15, 2022 version), not the API; results may differ on newer model versions
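To see why samples of 30–200 limit statistical strength, it helps to put a confidence interval around an accuracy measured at each size. The Python sketch below uses the Wilson score interval, a standard choice for binomial proportions; the paper does not report intervals this way, and the 63% accuracy used here is only an illustrative figure near the reported reasoning average.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Same underlying accuracy (~63%) measured on the smallest and largest sample sizes.
lo_30, hi_30 = wilson_interval(successes=19, n=30)
lo_200, hi_200 = wilson_interval(successes=126, n=200)

print(f"n=30:  [{lo_30:.3f}, {hi_30:.3f}]  width={hi_30 - lo_30:.3f}")
print(f"n=200: [{lo_200:.3f}, {hi_200:.3f}]  width={hi_200 - lo_200:.3f}")
```

The interval at n=30 is more than twice as wide as at n=200, so small differences between models on the 30-sample tasks should be read as suggestive rather than conclusive.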

When Not To Use

High-stakes decision-making or scientific claims without external verification

Production machine translation into extremely low-resource or non-Latin scripts without native post-editing

Failure Modes

Extrinsic hallucination: plausible but unverifiable or false facts injected into outputs

Lazy or incomplete reasoning: fails inductive or multi-hop steps unless prompted to 'do reasonable inference'

Core Entities

Models

ChatGPT (Dec 15 2022 UI), GPT-4, InstructGPT, text-davinci-002, text-davinci-003, BART (fine-tuned baselines), NLLB-200

Metrics

ROUGE-1, ROUGE-2, ChrF++, BLEU, Macro F1, JGA (joint goal acc), AUC, HTER, SacreBLEU, METEOR, BERTScore

Datasets

CNN/DailyMail, SAMSum, FLoRes-200, NusaX, bAbI, EntailmentBank, CLUTRR, MATH, TimeDial, StepGame, SpartQA, CommonsenseQA, PiQA, Pep-3k, HotpotQA, OpenDialKG, MultiWOZ2.2, COVID-Social, COVID-Scientific, TruthfulQA, Flag drawing (50 flags)

Benchmarks

FLoRes-200, HotpotQA, CommonsenseQA, MATH, SAMSum/CNN-DM

Context Entities

Models

GPT-3 family, ST-MoE-32B, XLM-R, NLLB

Datasets

BIG-Bench, AI LM Harness, HELM