A broad third-party benchmark shows ChatGPT is a strong zero-shot performer but an unreliable reasoner and prone to hallucination

February 8, 2023 · 9 min

Overview

Production Readiness

0.7

Novelty Score

0.25

Cost Impact Score

0.75

Citation Count

352

Authors

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, Pascale Fung

Links

Abstract / PDF

Why It Matters For Business

ChatGPT is a practical zero-shot workhorse: it saves time on many tasks and can replace some fine-tuned models for quick proofs of concept, but its factual and reasoning errors mean you must validate outputs before customer-facing or safety-critical use.

Summary TLDR

This paper runs a practical, third-party benchmark of ChatGPT (Dec 15, 2022 UI) across 21–23 public datasets for 8 NLP tasks plus a new flag-drawing test. Key results: ChatGPT beats prior zero-shot LLMs on most tasks (9/13 datasets) and even surpasses some fully fine-tuned models on 4 tasks, but it shows weak and inconsistent reasoning (average accuracy ~63.4% over 10 reasoning categories), frequent extrinsic hallucinations, and poor generation for low-resource and non-Latin-script languages. Multi-turn interaction helps: ~8 ROUGE-1 points for summarization and ~2 ChrF++ points for low-resource MT via post-editing. Code for dataset extraction is released.

Problem Statement

Interactive LLMs like ChatGPT are widely used but lack independent, reproducible third-party evaluation across tasks, languages, vision capability, reasoning, hallucination, and the value of multi-turn interaction. The paper fills that gap with quantitative tests using public datasets.

Main Contribution

A reproducible evaluation framework for interactive LLMs using 21–23 public datasets across 8 NLP tasks and a new 50-flag multimodal task

Extensive zero-shot benchmarking of ChatGPT (Dec 15, 2022 UI) on multitask, multilingual, and multimodal fronts, plus a follow-up GPT-4 comparison

Targeted analyses of reasoning (10 categories, 634 samples), hallucination types, and gains from multi-turn interactivity; evaluation code released

Key Findings

ChatGPT often outperforms prior zero-shot LLMs.

Numbers: 9/13 evaluated datasets (zero-shot comparisons)

ChatGPT can beat some fully fine-tuned models on certain tasks.

Numbers: surpassed fine-tuned SOTA on 4 datasets

Reasoning is inconsistent and unreliable overall.

Numbers: 63.41% average accuracy across 10 reasoning categories (634 samples)

ChatGPT frequently hallucinates extrinsic facts.

Numbers: 35.38% failure rate on TruthfulQA (imitative falsehoods); extrinsic hallucinations found across tasks

Language generation degrades on low-resource and non-Latin-script targets.

Numbers: ChrF++ Eng→LRL: 21.57 (ChatGPT) vs 41.9 (fine-tuned SOTA); very low language-identification accuracy for extremely low-resource languages (X-LRL)

Multi-turn interaction measurably improves outputs.

Numbers: +7.99 ROUGE-1 on summarization; ~2 ChrF++ points for low-resource MT (as reported)

ChatGPT can produce simple multimodal artifacts via code as an intermediate form.

Numbers: Flag-drawing A-grade (errorless) improved from 4% (Turn 1) to 24% (Turn 3)

Evaluation sample sizes are small and UI-version dependent.

Numbers: Per-task ChatGPT eval: 30–200 samples; experiments done with the Dec 15, 2022 web UI

Results

Summarization ROUGE-1

Value: 35.29 (ChatGPT, sample subset)

Baseline: 44.47 (fine-tuned SOTA BART)

Machine Translation ChrF++ (XXX→Eng)

Value: 58.64 (ChatGPT, HRL avg)

Baseline: 63.5 (fine-tuned SOTA)

Machine Translation ChrF++ (Eng→LRL)

Value: 21.57 (ChatGPT)

Baseline: 41.9 (fine-tuned SOTA)

Sentiment Analysis Macro F1 (NusaX Eng)

Value: 83.24 (ChatGPT)

Baseline: 92.6 (fine-tuned SOTA)

Reasoning accuracy

Value: 63.41% (ChatGPT average across 10 categories, 634 samples)

Misinformation detection (COVID-scientific)

Value: 92% (ChatGPT, 46/50)

Baseline: 74.7 (fine-tuned SOTA, as reported)

Interactivity gain (summarization)

Value: +7.99 ROUGE-1 (refined 2nd-turn summaries)

Baseline: first-turn ChatGPT summary

Flag drawing (A-grade, 0 errors)

Value: 24% (Turn 3)

Baseline: 4% (Turn 1)
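
To make the interactivity-gain row concrete, here is a minimal sketch of how a ROUGE-1 difference between a first-turn and a refined summary could be computed. It is a simplified unigram-overlap F1 with lowercase word tokenization and no stemming, so its scores will not match an official ROUGE package exactly. The function names and the toy sentences are illustrative assumptions, not the paper's code or data.

```python
import re
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1: F1 over clipped unigram matches (no stemming)."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((cand & ref).values())  # multiset intersection clips counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def interactivity_gain(first_turn: str, refined: str, reference: str) -> float:
    """ROUGE-1 points gained by the refined summary over the first-turn one."""
    return 100 * (rouge1_f1(refined, reference) - rouge1_f1(first_turn, reference))


# Toy example (hypothetical sentences, not from the paper's datasets).
reference = "the cat sat on the mat"
first_turn = "a cat was there"
refined = "the cat sat on a mat"
gain = interactivity_gain(first_turn, refined, reference)
print(f"gain: {gain:.2f} ROUGE-1 points")
```

Averaging this per-example difference over a held-out set gives a number comparable in spirit to the +7.99 figure above, though the exact value depends on the tokenizer and ROUGE variant used.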

Who Should Care

What To Try In 7 Days

Run ChatGPT zero-shot on your core text tasks as a baseline and compare to current models (use 30–200 held-out samples)

Add a short multi-turn refinement step (one follow-up prompt) for summarization and translation to capture ~8 ROUGE-1 / ~2 ChrF++ gains

For non-English targets, test native-speaker validation and consider post-editing pipelines before deployment
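
The first step above can be sketched as a small comparison harness. This is only an outline under stated assumptions: each model is wrapped in a hypothetical `predict(text) -> label` callable that you supply (for example, a function that sends a zero-shot prompt to a chat model and parses the reply), and exact-match accuracy stands in for whatever metric fits your task.

```python
import random
from typing import Callable, Sequence

Example = tuple[str, str]  # (input text, gold label)


def sample_eval_set(data: Sequence[Example], n: int = 100, seed: int = 0) -> list[Example]:
    """Draw a reproducible held-out sample (the paper used 30-200 per task)."""
    rng = random.Random(seed)
    return rng.sample(list(data), min(n, len(data)))


def accuracy(predict: Callable[[str], str], examples: Sequence[Example]) -> float:
    """Exact-match accuracy after trimming whitespace and lowercasing."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(
        predict(text).strip().lower() == gold.strip().lower()
        for text, gold in examples
    )
    return correct / len(examples)


def compare(models: dict[str, Callable[[str], str]], examples: Sequence[Example]) -> dict[str, float]:
    """Score every model on the same sample so the numbers are comparable."""
    return {name: accuracy(fn, examples) for name, fn in models.items()}


# Stand-in predictors; replace with your current model and a zero-shot LLM call.
data = [(f"review {i}", "positive" if i % 2 == 0 else "negative") for i in range(300)]
held_out = sample_eval_set(data, n=50, seed=1)
scores = compare(
    {
        "always_positive": lambda text: "positive",
        "oracle": lambda text: dict(data)[text],
    },
    held_out,
)
print(scores)
```

Fixing the seed and scoring all candidates on the same sample keeps the comparison fair; with only 30 to 200 examples, small differences between models should be read cautiously.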

Agent Features

Memory

  • short-term multi-turn context (dialog history)
  • no reliable long-term belief-state retention across turns

Tool Use

  • code generation as multimodal bridge (SVG/Canvas code output)
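
To illustrate what "code as a multimodal bridge" means, the sketch below builds the kind of SVG markup a text-only model can emit to draw a simple banded flag. It is an illustration only: the helper name `band_flag_svg` and its parameters are assumptions, the colors are generic CSS names rather than official flag shades, and nothing here reproduces the paper's prompts or ChatGPT's actual outputs.

```python
import xml.etree.ElementTree as ET


def band_flag_svg(colors, width=300, height=200, vertical=True):
    """Return SVG markup for a flag made of equal-sized color bands."""
    n = len(colors)
    if n == 0:
        raise ValueError("need at least one color")
    rects = []
    for i, color in enumerate(colors):
        if vertical:
            x, y, w, h = i * width / n, 0, width / n, height
        else:
            x, y, w, h = 0, i * height / n, width, height / n
        rects.append(
            f'<rect x="{x:g}" y="{y:g}" width="{w:g}" height="{h:g}" fill="{color}"/>'
        )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}" '
        f'viewBox="0 0 {width} {height}">' + "".join(rects) + "</svg>"
    )


# A vertical blue-white-red tricolor; save the string to a .svg file to view it.
svg = band_flag_svg(["blue", "white", "red"])
ET.fromstring(svg)  # raises ParseError if the markup is not well-formed XML
print(svg)
```

Because the image exists only as text, each follow-up turn can ask for a targeted fix (a wrong color, a missing band), which is how the multi-turn flag results above were obtained.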

Frameworks

  • RLHF (used for alignment)

Architectures

  • Transformer (GPT family)
  • RLHF-fine-tuned dialog model

Collaboration

  • human-in-the-loop editing; multi-turn prompt post-editing

Optimization Features

Training Optimization

  • RL

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Small per-task samples (30–200) limit statistical strength of some claims
  • Experiments use web UI (Dec 15, 2022) not API; results may differ on newer model versions
  • Automatic metrics can undercount the quality of open-ended responses; the authors note that human judgments sometimes diverge
  • Multimodal test uses code-mediated generation (SVG), not native vision-language models
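
The sample-size caveat can be quantified with a confidence interval. The sketch below uses the Wilson score interval for a binomial proportion, a standard choice for small samples; applying it here is our illustration and not an analysis the paper performs.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# The COVID-Scientific result reported above: 46 correct out of 50.
low, high = wilson_interval(46, 50)
print(f"46/50 -> {low:.3f} to {high:.3f}")
```

For 46/50 the interval runs from about 81% to 97%, so a headline 92% on 50 samples is compatible with a fairly wide range of true accuracies.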

When Not To Use

  • High-stakes decision-making or scientific claims without external verification
  • Production machine translation into extremely low-resource or non-Latin scripts without native post-editing
  • Tasks requiring reliable multi-hop, mathematical, or spatial reasoning

Failure Modes

  • Extrinsic hallucination: plausible but unverifiable or false facts injected into outputs
  • Lazy or incomplete reasoning: fails inductive or multi-hop steps unless prompted to 'do reasonable inference'
  • Belief-state drift in multi-turn task-oriented dialogue unless user explicitly restates earlier constraints
  • Poor generation quality for non-Latin scripts and extremely low-resource languages
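
Belief-state drift is typically measured with joint goal accuracy (JGA), listed under Metrics: a turn counts as correct only if every slot in the predicted dialogue state matches the gold state. A minimal sketch, assuming each state is a plain slot-to-value dictionary (the slot names and dialogue below are hypothetical, not taken from MultiWOZ2.2):

```python
def joint_goal_accuracy(predicted: list[dict], gold: list[dict]) -> float:
    """Fraction of turns whose full predicted state equals the gold state."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of turns")
    if not gold:
        raise ValueError("need at least one turn")
    exact_matches = sum(p == g for p, g in zip(predicted, gold))
    return exact_matches / len(gold)


# Gold states accumulate constraints turn by turn.
gold_states = [
    {"area": "north"},
    {"area": "north", "food": "thai"},
    {"area": "north", "food": "thai", "price": "cheap"},
]
# The prediction "drifts" on turn 3 by dropping the earlier area constraint.
predicted_states = [
    {"area": "north"},
    {"area": "north", "food": "thai"},
    {"food": "thai", "price": "cheap"},
]
jga = joint_goal_accuracy(predicted_states, gold_states)
print(f"JGA: {jga:.3f}")
```

Because one forgotten slot fails the whole turn, JGA is sensitive to exactly the kind of drift described above.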

Core Entities

Models

  • ChatGPT (Dec 15, 2022 UI)
  • GPT-4
  • InstructGPT
  • text-davinci-002
  • text-davinci-003
  • BART (fine-tuned baselines)
  • NLLB-200

Metrics

  • ROUGE-1
  • ROUGE-2
  • ChrF++
  • BLEU
  • Macro F1
  • JGA (joint goal accuracy)
  • AUC
  • HTER
  • SacreBLEU
  • METEOR
  • BERTScore

Datasets

  • CNN/DailyMail
  • SAMSum
  • FLoRes-200
  • NusaX
  • bAbI
  • EntailmentBank
  • CLUTRR
  • MATH
  • TimeDial
  • StepGame
  • SpartQA
  • CommonsenseQA
  • PiQA
  • Pep-3k
  • HotpotQA
  • OpenDialKG
  • MultiWOZ2.2
  • COVID-Social
  • COVID-Scientific
  • TruthfulQA
  • Flag drawing (50 flags)

Benchmarks

  • FLoRes-200
  • HotpotQA
  • CommonsenseQA
  • MATH
  • SAMSum/CNN-DM

Context Entities

Models

  • GPT-3 family
  • ST-MoE-32B
  • XLM-R
  • NLLB

Datasets

  • BIG-Bench
  • EleutherAI LM Harness
  • HELM