Overview
Production Readiness
0.7
Novelty Score
0.25
Cost Impact Score
0.75
Citation Count
352
Why It Matters For Business
ChatGPT is a practical zero-shot workhorse: it saves time on many tasks and can replace some fine-tuned models for quick proofs of concept, but its factual and reasoning errors mean you must validate outputs before customer-facing or safety-critical use.
Summary TLDR
This paper runs a practical, third-party benchmark of ChatGPT (Dec 15, 2022 UI) across 21–23 public datasets for 8 NLP tasks plus a new flag-drawing test. Key results: ChatGPT beats prior zero-shot LLMs on most tasks (9/13 datasets), even surpasses some fully fine-tuned models on 4 tasks, but shows weak and inconsistent reasoning (avg ~63.4% over 10 reasoning categories), frequent extrinsic hallucinations, and poor generation for low-resource/non-Latin scripts. Multi-turn interaction helps: ~8 ROUGE-1 points for summarization and ~2 ChrF++ for low-resource MT via post-editing. Code for dataset extraction is released.
Problem Statement
Interactive LLMs like ChatGPT are widely used but lack independent, reproducible third-party evaluation across tasks, languages, vision capability, reasoning, hallucination, and the value of multi-turn interaction. The paper fills that gap with quantitative tests using public datasets.
Main Contribution
A reproducible evaluation framework for interactive LLMs using 21–23 public datasets across 8 NLP tasks and a new 50-flag multimodal task
Extensive zero-shot benchmarking of ChatGPT (Dec 15, 2022 UI) on multitask, multilingual, and multimodal fronts, plus a follow-up GPT-4 comparison
Targeted analyses of reasoning (10 categories, 634 samples), hallucination types, and gains from multi-turn interactivity; evaluation code released
Key Findings
ChatGPT often outperforms prior zero-shot LLMs.
ChatGPT can beat some fully fine-tuned models on certain tasks.
Reasoning is inconsistent and unreliable overall.
ChatGPT frequently hallucinates extrinsic facts.
Language generation degrades on low-resource and non-Latin-script targets.
Multi-turn interaction measurably improves outputs.
ChatGPT can produce simple multimodal artifacts via code as an intermediate form.
Evaluation sample sizes are small and UI-version dependent.
Results
Summarization ROUGE-1
Machine Translation ChrF++ (XXX→Eng)
Machine Translation ChrF++ (Eng→LRL)
Sentiment Analysis Macro F1 (NusaX Eng)
Accuracy
Misinformation detection (COVID-scientific)
Interactivity gain (summarization)
Flag drawing (A-grade, 0 errors)
Who Should Care
What To Try In 7 Days
Run ChatGPT zero-shot on your core text tasks as a baseline and compare to current models (use 30–200 held-out samples)
Add a short multi-turn refinement step (one follow-up prompt) for summarization and translation to capture ~8 ROUGE-1 / ~2 ChrF++ gains
For non-English targets, test native-speaker validation and consider post-editing pipelines before deployment
Agent Features
Memory
- short-term multi-turn context (dialog history)
- no reliable long-term belief-state retention across turns
Tool Use
- code generation as multimodal bridge (SVG/Canvas code output)
Frameworks
- RLHF (used for alignment)
Architectures
- Transformer (GPT family)
- RLHF-fine-tuned dialog model
Collaboration
- human-in-the-loop editing; multi-turn prompt post-editing
Optimization Features
Training Optimization
- RL
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Small per-task samples (30–200) limit statistical strength of some claims
- Experiments use web UI (Dec 15, 2022) not API; results may differ on newer model versions
- Automatic metrics undercount quality for open responses; authors note human judgments sometimes diverge
- Multimodal test uses code-mediated generation (SVG), not native vision-language models
When Not To Use
- High-stakes decision-making or scientific claims without external verification
- Production machine translation into extremely low-resource or non-Latin scripts without native post-editing
- Tasks requiring reliable multi-hop, mathematical, or spatial reasoning
Failure Modes
- Extrinsic hallucination: plausible but unverifiable or false facts injected into outputs
- Lazy or incomplete reasoning: fails inductive or multi-hop steps unless prompted to 'do reasonable inference'
- Belief-state drift in multi-turn task-oriented dialogue unless user explicitly restates earlier constraints
- Poor generation quality for non-Latin scripts and extremely low-resource languages
Core Entities
Models
- ChatGPT (Dec 15 2022 UI)
- GPT-4
- InstructGPT
- text-davinci-002
- text-davinci-003
- BART (fine-tuned baselines)
- NLLB-200
Metrics
- ROUGE-1
- ROUGE-2
- ChrF++
- BLEU
- Macro F1
- JGA (joint goal acc)
- AUC
- HTER
- SacreBLEU
- METEOR
- BERTScore
Datasets
- CNN/DailyMail
- SAMSum
- FLoRes-200
- NusaX
- bAbI
- EntailmentBank
- CLUTRR
- MATH
- TimeDial
- StepGame
- SpartQA
- CommonsenseQA
- PiQA
- Pep-3k
- HotpotQA
- OpenDialKG
- MultiWOZ2.2
- COVID-Social
- COVID-Scientific
- TruthfulQA
- Flag drawing (50 flags)
Benchmarks
- FLoRes-200
- HotpotQA
- CommonsenseQA
- MATH
- SAMSum/CNN-DM
Context Entities
Models
- GPT-3 family
- ST-MoE-32B
- XLM-R
- NLLB
Datasets
- BIG-Bench
- AI LM Harness
- HELM

