A broad third-party benchmark shows ChatGPT is a strong zero-shot performer but an unreliable reasoner and prone to hallucination

Overview

Decision SnapshotNeeds Validation

The paper gives a wide, practical snapshot: ChatGPT is production-ready for many low-risk tasks, useful for prototyping, but not reliable for complex reasoning or critical factual decisions without external checks.

Citations352

Evidence Strength0.78

Confidence0.85

Risk Signals11

Trust Signals

Findings with numeric evidence: 8/8

Findings with evidence refs: 8/8

Results with explicit delta: 7/8

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 75%

Production readiness: 70%

Novelty: 25%

Authors

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, Pascale Fung

Links

Abstract / PDF / Code

Why It Matters For Business

ChatGPT is a practical zero-shot workhorse: it saves time on many tasks and can replace some fine-tuned models for quick proofs of concept, but its factual and reasoning errors mean you must validate outputs before customer-facing or safety-critical use.

Who Should Care

Product Manager ML Engineer CTO Founder Data Scientist

Summary TLDR

This paper runs a practical, third-party benchmark of ChatGPT (Dec 15, 2022 UI) across 21–23 public datasets for 8 NLP tasks plus a new flag-drawing test. Key results: ChatGPT beats prior zero-shot LLMs on most tasks (9/13 datasets), even surpasses some fully fine-tuned models on 4 tasks, but shows weak and inconsistent reasoning (avg ~63.4% over 10 reasoning categories), frequent extrinsic hallucinations, and poor generation for low-resource/non-Latin scripts. Multi-turn interaction helps: ~8 ROUGE-1 points for summarization and ~2 ChrF++ for low-resource MT via post-editing. Code for dataset extraction is released.

Problem Statement

Interactive LLMs like ChatGPT are widely used but lack independent, reproducible third-party evaluation across tasks, languages, vision capability, reasoning, hallucination, and the value of multi-turn interaction. The paper fills that gap with quantitative tests using public datasets.

Main Contribution

A reproducible evaluation framework for interactive LLMs using 21–23 public datasets across 8 NLP tasks and a new 50-flag multimodal task

Extensive zero-shot benchmarking of ChatGPT (Dec 15, 2022 UI) on multitask, multilingual, and multimodal fronts, plus a follow-up GPT-4 comparison

Key Findings

ChatGPT often outperforms prior zero-shot LLMs.

Numbers9/13 evaluated datasets (zero-shot comparisons)

Practical UseUse ChatGPT as a strong zero-shot baseline for many NLP tasks, but still compare to task‑specific fine-tuned models for dialogue and knowledge-grounded tasks

Evidence RefTable 1; §2.1

ChatGPT can beat some fully fine-tuned models on certain tasks.

Numberssurpassed fine-tuned SOTA on 4 datasets

Practical UseFor select tasks, try zero-shot ChatGPT before investing in custom fine-tuning; validate on held-out, task-specific metrics

Evidence RefAbstract; §2.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Summarization ROUGE-1	35.29 (ChatGPT, sample subset)	44.47 (fine-tuned SOTA BART)	-9.18	CNN/DM (subset)	Table 1 (CNN/DM); §C.1	Table 1
Machine Translation ChrF++ (XXX→Eng)	58.64 (ChatGPT, HRL avg)	63.5 (fine-tuned SOTA)	-4.86	FLoRes-200 HRL subset	Table 1; §C.2	Table 1

What To Try In 7 Days

Run ChatGPT zero-shot on your core text tasks as a baseline and compare to current models (use 30–200 held-out samples)

Add a short multi-turn refinement step (one follow-up prompt) for summarization and translation to capture ~8 ROUGE-1 / ~2 ChrF++ gains

For non-English targets, test native-speaker validation and consider post-editing pipelines before deployment

Agent Features

Memory

short-term multi-turn context (dialog history)no reliable long-term belief-state retention across turns

Tool Use

code generation as multimodal bridge (SVG/Canvas code output)

Frameworks

RLHF (used for alignment)

Architectures

Transformer (GPT family)RLHF-fine-tuned dialog model

Collaboration

human-in-the-loop editing; multi-turn prompt post-editing

Optimization Features

Training Optimization

Reproducibility

Code AvailableYes

Data AvailableYes

Open Source StatusPartial

LicenseUnknown

Code URLs

https://github.com/HLTCHKUST/chatgpt-evaluat

Risks & Boundaries

Limitations

Small per-task samples (30–200) limit statistical strength of some claims

Experiments use web UI (Dec 15, 2022) not API; results may differ on newer model versions

When Not To Use

High-stakes decision-making or scientific claims without external verification

Production machine translation into extremely low-resource or non-Latin scripts without native post-editing

Failure Modes

Extrinsic hallucination: plausible but unverifiable or false facts injected into outputs

Lazy or incomplete reasoning: fails inductive or multi-hop steps unless prompted to 'do reasonable inference'

Core Entities

Models

ChatGPT (Dec 15 2022 UI)GPT-4InstructGPTtext-davinci-002text-davinci-003BART (fine-tuned baselines)NLLB-200

Metrics

ROUGE-1ROUGE-2ChrF++BLEUMacro F1JGA (joint goal acc)AUCHTERSacreBLEUMETEORBERTScore

Datasets

CNN/DailyMailSAMSumFLoRes-200NusaXbAbIEntailmentBankCLUTRRMATHTimeDialStepGameSpartQACommonsenseQAPiQAPep-3kHotpotQAOpenDialKGMultiWOZ2.2COVID-SocialCOVID-ScientificTruthfulQAFlag drawing (50 flags)

Benchmarks

FLoRes-200HotpotQACommonsenseQAMATHSAMSum/CNN-DM

Context Entities

Models

GPT-3 familyST-MoE-32BXLM-RNLLB

Datasets

BIG-BenchAI LM HarnessHELM

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

ChatGPT often outperforms prior zero-shot LLMs.

ChatGPT can beat some fully fine-tuned models on certain tasks.

Results

What To Try In 7 Days

Agent Features

Optimization Features

Reproducibility

Code URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

Context Entities

Models

Datasets

You May Also Want to Read

LEXAM — 340 real law exams, 4.9k questions, and an expert-validated LLM judge for legal reasoning

Key finding

MULTICOM: a multilingual commonsense generation benchmark showing LLMs are better in English

Key finding

ID-MoCQA: 15,590 bilingual Indonesian multi-hop cultural QA items show models can identify regions but fail at situational cultural answers

Key finding

ERI: 57,750 engineering instruction-response items across 9 fields to test LLM reasoning and agent tool-use

Key finding

ElecBench — a domain benchmark that tests LLMs on power-dispatch scenarios across six practical metrics.

Key finding