Overview
Production Readiness
0.3
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
51
Why It Matters For Business
Zero-shot ChatGPT is good enough for quick grammar-level tasks such as POS tagging, but it is not reliable for production semantic tasks across many languages. Invest in task- and language-specific models for higher accuracy and lower operational risk.
Summary TLDR
This paper runs a broad, zero-shot evaluation of ChatGPT across 7 NLP tasks and 34–37 languages. ChatGPT does well at POS tagging (a syntactic task) and often beats a supervised baseline there, but it performs substantially worse than task-specific supervised models on semantic tasks like NER, relation extraction, NLI, QA, commonsense reasoning, and summarization. English prompts usually produce better results than prompts in the target language. The authors recommend task-specific or smaller supervised models for production multilingual systems.
Problem Statement
Can ChatGPT, trained on mixed-language web data, be used reliably for real NLP tasks across many non-English languages, or do we still need task- and language-specific models? The paper measures zero-shot ChatGPT performance on multiple tasks and resource levels to answer this.
Main Contribution
A broad, zero-shot evaluation of ChatGPT on 7 tasks (POS, NER, Relation Extraction, NLI, QA, Commonsense Reasoning, Summarization) across 34–37 languages and multiple resource levels.
Direct comparison with state-of-the-art supervised multilingual models on public datasets.
Analysis of prompt language (English vs target language), success rates, and qualitative failure modes (verbosity, spurious outputs, English bias).
Key Findings
ChatGPT generally underperforms supervised task-specific models on semantic multilingual tasks.
POS tagging is an important exception: ChatGPT matches or exceeds supervised models on many languages.
Named entity extraction is particularly weak and noisy for ChatGPT.
Prompting in English usually yields better zero-shot results than using the target language.
Summaries from ChatGPT tend to be much longer than human references and score poorly on ROUGE.
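To illustrate why overly long summaries score poorly on ROUGE, here is a minimal sketch of a ROUGE-L F-measure built on the longest common subsequence (LCS). It assumes lowercased whitespace tokens and is not the official scorer used in the paper; the example sentences are invented for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """Return (precision, recall, f1) of a simplified ROUGE-L."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(cand)  # share of the candidate that matches
    recall = lcs / len(ref)      # share of the reference that is covered
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example texts, not taken from XL-Sum.
reference = "the council approved the new budget"
concise = "the council approved the budget"
verbose = ("in a long meeting held on tuesday the council members finally "
           "approved the new budget after hours of heated debate")

concise_scores = rouge_l(reference, concise)
verbose_scores = rouge_l(reference, verbose)
print("concise P/R/F1:", [round(s, 3) for s in concise_scores])
print("verbose P/R/F1:", [round(s, 3) for s in verbose_scores])
```

The verbose candidate covers the whole reference (recall 1.0), yet its precision collapses because most of its tokens are not in the reference, so its F-measure ends up well below that of the concise candidate.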
Results
POS tagging: Accuracy
NER: F1 (average)
Relation Extraction: micro-F1 (average)
NLI: Accuracy
QA (XQuAD): average EM / F1
Commonsense Reasoning: Accuracy
Summarization: ROUGE-L (English)
Who Should Care
What To Try In 7 Days
Run a quick A/B comparison: zero-shot ChatGPT vs a small fine-tuned model on your target language and task, and measure the gap.
If you need sequence labeling (NER/RE), test a supervised model with a CRF layer and a small labeled set rather than ChatGPT.
When using ChatGPT, try English task descriptions and measure improvement over target-language prompts.
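A minimal sketch of the comparison suggested above, assuming you have already collected predictions from both systems on the same labeled examples. The function names, the `lang`/`label` field names, and the toy data are hypothetical and are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_language(examples, predictions):
    """Map language code -> accuracy for one system's predictions."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex, pred in zip(examples, predictions):
        total[ex["lang"]] += 1
        correct[ex["lang"]] += int(pred == ex["label"])
    return {lang: correct[lang] / total[lang] for lang in total}


def compare_systems(examples, preds_a, preds_b):
    """Return language -> (acc_a, acc_b, acc_b - acc_a)."""
    acc_a = accuracy_by_language(examples, preds_a)
    acc_b = accuracy_by_language(examples, preds_b)
    return {
        lang: (acc_a[lang], acc_b[lang], acc_b[lang] - acc_a[lang])
        for lang in acc_a
    }


# Toy, hand-written NLI-style data for illustration only.
examples = [
    {"lang": "de", "label": "entailment"},
    {"lang": "de", "label": "neutral"},
    {"lang": "hi", "label": "contradiction"},
    {"lang": "hi", "label": "neutral"},
]
zero_shot_preds = ["entailment", "entailment", "neutral", "neutral"]
fine_tuned_preds = ["entailment", "neutral", "contradiction", "entailment"]

report = compare_systems(examples, zero_shot_preds, fine_tuned_preds)
for lang, (acc_a, acc_b, gap) in sorted(report.items()):
    print(f"{lang}: zero-shot={acc_a:.2f} fine-tuned={acc_b:.2f} gap={gap:+.2f}")
```

The same helper can compare an English-prompt run against a target-language-prompt run: pass the two prediction lists as `preds_a` and `preds_b` and read the per-language gap.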
Reproducibility
Data URLs
- XGLUE-POS (HuggingFace Datasets)
- MultiCoNER
- SMiLER
- XNLI
- XQuAD
- X-CSQA
- IndicNLPSuite
- XL-Sum
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Only zero-shot ChatGPT was evaluated; no fine-tuning, few-shot, or chain-of-thought experiments were run.
- Coverage is broad but not exhaustive: many languages and tasks remain untested.
- Prompt design was intentionally simple (single-stage); alternative prompting may change results.
- Models beyond GPT-3.5-era ChatGPT (e.g., GPT-4, large BLOOM variants) were not included.
When Not To Use
- Named entity recognition in noisy/ambiguous text.
- Relation extraction and other fine-grained IE tasks where supervised models excel.
- Production QA, NLI, commonsense reasoning across non-English languages without validation.
- Abstractive summarization when concise, factual summaries are required.
Failure Modes
- A high rate of spurious, verbose outputs for NER (many extra predicted entities).
- Strong bias toward English: English prompts often give better results than target-language prompts.
- Long and unfocused summaries that hurt ROUGE scores.
- Wide per-language variance: some low-resource languages perform unexpectedly well, others unexpectedly poorly.
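The NER failure mode above can be quantified with a short sketch like the following. It assumes that a "spurious" prediction is any predicted (entity text, type) pair absent from the gold annotations, and that the spurious rate is taken over all predictions; the paper's exact definition may differ. The function name and the example entities are hypothetical.

```python
def ner_scores(gold, predicted):
    """Exact-match precision, recall, F1, and spurious rate for one sentence.

    gold and predicted are iterables of (entity_text, entity_type) pairs.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    spurious = len(pred_set - gold_set)  # predicted but not in gold
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    spurious_rate = spurious / len(pred_set) if pred_set else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "spurious_rate": spurious_rate,
    }


# Invented example: the model finds both gold entities but adds two extras.
gold = [("Angela Merkel", "PER"), ("Berlin", "LOC")]
predicted = [
    ("Angela Merkel", "PER"),
    ("Berlin", "LOC"),
    ("Tuesday", "LOC"),
    ("meeting", "ORG"),
]

scores = ner_scores(gold, predicted)
print(scores)
```

Even with perfect recall, the two extra entities halve precision and pull F1 down, which is the pattern a verbose zero-shot model tends to produce.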
Core Entities
Models
- ChatGPT
- XLM-RoBERTa (XLM-R)
- mT5-XXL
- mT5-IL
- DAMO
- TRT
- IndicBERT
- BLOOM
Metrics
- Accuracy
- F1
- micro F1
- macro F1
- exact match (EM)
- ROUGE-1
- ROUGE-2
- ROUGE-L
- spurious prediction %
Datasets
- XGLUE-POS
- MultiCoNER
- SMiLER
- XNLI
- XQuAD
- X-CSQA
- IndicNLPSuite (Wikipedia Cloze QA)
- XL-Sum
Benchmarks
- XGLUE
- MultiCoNER
- SMiLER
- XNLI
- XQuAD
- X-CSQA
- XL-Sum

