Overview
The paper uses many public datasets and clear baselines to show consistent, large accuracy gaps between zero-shot ChatGPT and supervised models on semantic multilingual tasks; the results give strong support for its practical recommendations.
Citations: 51
Evidence Strength: 0.90
Confidence: 0.80
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 7/7
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 30%
Novelty: 40%
Why It Matters For Business
Zero-shot ChatGPT is adequate for quick syntactic tasks such as POS tagging, but it is not reliable for production semantic tasks across many languages. Invest in task- and language-specific models for higher accuracy and lower operational risk.
Summary TLDR
This paper runs a broad, zero-shot evaluation of ChatGPT across 7 NLP tasks and 34–37 languages. ChatGPT does well at POS tagging (a syntactic task) and often beats a supervised baseline there, but it performs substantially worse than task-specific supervised models on semantic tasks like NER, relation extraction, NLI, QA, commonsense reasoning, and summarization. English prompts usually produce better results than prompts in the target language. The authors recommend task-specific or smaller supervised models for production multilingual systems.
Problem Statement
Can ChatGPT, trained on mixed-language web data, be used reliably for real NLP tasks across many non-English languages, or do we still need task- and language-specific models? The paper measures zero-shot ChatGPT performance on multiple tasks and resource levels to answer this.
Main Contribution
A broad, zero-shot evaluation of ChatGPT on 7 tasks (POS, NER, Relation Extraction, NLI, QA, Commonsense Reasoning, Summarization) across 34–37 languages and multiple resource levels.
Direct comparison with state-of-the-art supervised multilingual models on public datasets.
Key Findings
ChatGPT generally underperforms supervised task-specific models on semantic multilingual tasks.
POS tagging is an important exception: ChatGPT matches or exceeds supervised models on many languages.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| POS accuracy (average) | ChatGPT (en) 84.5% vs XLM-R 79.3% | XLM-R (supervised) 79.3% avg | +5.2 pp | XGLUE-POS (test sets) | ChatGPT outperformed XLM-R on 13/17 languages | Table 2 |
| NER F1 (average) | ChatGPT (en) 29.9% vs DAMO 88.4% | DAMO (supervised) 88.4% avg | -58.5 pp | MultiCoNER (test sets) | ChatGPT F1 <40% on all languages; high spurious prediction rates | Table 3; Table 4 |
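
The Delta column is a difference in percentage points (model score minus baseline score), not a relative change. A minimal sketch of that arithmetic, using only the two rows shown above; the function name `delta_pp` and the dictionary keys are illustrative and do not come from the paper:

```python
def delta_pp(model_score: float, baseline_score: float) -> float:
    """Return model minus baseline in percentage points, rounded to one decimal."""
    return round(model_score - baseline_score, 1)


# (model score, baseline score) in percent, copied from the table rows above.
rows = {
    "XGLUE-POS accuracy": (84.5, 79.3),
    "MultiCoNER F1": (29.9, 88.4),
}

deltas = {name: delta_pp(model, baseline) for name, (model, baseline) in rows.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f} pp")
```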
What To Try In 7 Days
Run a quick A/B comparison of zero-shot ChatGPT against a small fine-tuned model on your target language and task to measure the gap.
If you need sequence labeling (NER/RE), test a supervised model with a CRF layer and a small labeled set rather than ChatGPT.
When using ChatGPT, try English task descriptions and measure improvement over target-language prompts.
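
The comparisons suggested above can share one small harness that scores each system on the same labeled examples. The sketch below is an assumption about how such a check might look, not the paper's evaluation code: the predictor functions are toy stand-ins (hypothetical) that you would replace with a real ChatGPT call using an English prompt, a call using a target-language prompt, and your fine-tuned model.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]        # (input text, gold label)
Predictor = Callable[[str], str]  # maps input text to a predicted label


def accuracy(predict: Predictor, examples: List[Example]) -> float:
    """Percentage of examples whose predicted label equals the gold label."""
    if not examples:
        raise ValueError("need at least one labeled example")
    correct = sum(1 for text, gold in examples if predict(text) == gold)
    return 100.0 * correct / len(examples)


def compare(systems: Dict[str, Predictor], examples: List[Example]) -> Dict[str, float]:
    """Score every system on the same examples so the gaps are comparable."""
    return {name: accuracy(fn, examples) for name, fn in systems.items()}


# Toy data and toy predictors (hypothetical placeholders for real systems).
examples = [("a", "POS"), ("b", "NEG"), ("c", "POS"), ("d", "NEG")]
systems = {
    "zero_shot_english_prompt": lambda text: "POS",
    "zero_shot_target_prompt": lambda text: "NEG" if text == "a" else "POS",
    "small_finetuned": lambda text: "POS" if text in {"a", "c"} else "NEG",
}

scores = compare(systems, examples)
gap = scores["small_finetuned"] - scores["zero_shot_english_prompt"]
print(scores, f"gap = {gap:+.1f} pp")
```

Keeping the example set fixed across systems is the main design choice here: it makes the reported gap attributable to the system or prompt language rather than to a change in the data.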
Risks & Boundaries
Limitations
Only zero-shot ChatGPT was evaluated; no fine-tuning or few-shot chain-of-thought experiments were run.
Coverage is broad but not exhaustive: many languages and tasks remain untested.
When Not To Use
Named entity recognition in noisy/ambiguous text.
Relation extraction and other fine-grained IE tasks where supervised models excel.
Failure Modes
High rate of spurious or verbose outputs for NER (many extra predicted entities).
Strong bias toward English: English prompts often give better results than target-language prompts.
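
To monitor the first failure mode on your own data, one option is to track the share of predicted entities that have no match in the gold annotations. The sketch below assumes that definition of "spurious" (exact match on surface text and type); the paper may compute its rate differently, and the names `Entity` and `spurious_rate` are illustrative.

```python
from typing import Set, Tuple

Entity = Tuple[str, str]  # (surface text, entity type)


def spurious_rate(predicted: Set[Entity], gold: Set[Entity]) -> float:
    """Fraction of predicted entities that do not appear in the gold set."""
    if not predicted:
        return 0.0  # nothing predicted, so nothing spurious
    return len(predicted - gold) / len(predicted)


# Toy example: two correct entities plus two extra, incorrect ones.
gold = {("Paris", "LOC"), ("Marie Curie", "PER")}
predicted = {
    ("Paris", "LOC"),
    ("Marie Curie", "PER"),
    ("Nobel Prize", "PER"),
    ("physics", "LOC"),
}

rate = spurious_rate(predicted, gold)
print(f"spurious rate: {rate:.2f}")
```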

