Large multilingual evaluation shows ChatGPT is strong at grammar but weak at multilingual semantic tasks

April 12, 2023 | 8 min

Overview

Decision Snapshot: Needs Validation

The paper uses many public datasets and clear baselines to show consistent, large accuracy gaps between zero-shot ChatGPT and supervised models on multilingual semantic tasks; the results give strong evidence for the practical recommendations.

Citations: 51

Evidence Strength: 0.90

Confidence: 0.80

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 7/7

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 30%

Novelty: 40%

Authors

Viet Dac Lai, Nghia Trung Ngo, Amir Pouran Ben Veyseh, Hieu Man, Franck Dernoncourt, Trung Bui, Thien Huu Nguyen

Why It Matters For Business

ChatGPT zero-shot is good for quick grammar-level tasks (like POS tagging) but not reliable for production semantic tasks across many languages; invest in task- and language-specific models for higher accuracy and lower operational risk.

Who Should Care

Summary TLDR

This paper runs a broad, zero-shot evaluation of ChatGPT across 7 NLP tasks and 34–37 languages. ChatGPT does well at POS tagging (a syntactic task) and often beats a supervised baseline there, but it performs substantially worse than task-specific supervised models on semantic tasks like NER, relation extraction, NLI, QA, commonsense reasoning, and summarization. English prompts usually produce better results than prompts in the target language. The authors recommend task-specific or smaller supervised models for production multilingual systems.

Problem Statement

Can ChatGPT, trained on mixed-language web data, be used reliably for real NLP tasks across many non-English languages, or do we still need task- and language-specific models? The paper measures zero-shot ChatGPT performance on multiple tasks and resource levels to answer this.

Main Contribution

A broad, zero-shot evaluation of ChatGPT on 7 tasks (POS, NER, Relation Extraction, NLI, QA, Commonsense Reasoning, Summarization) across 34–37 languages and multiple resource levels.

Direct comparison with state-of-the-art supervised multilingual models on public datasets.

Key Findings

ChatGPT generally underperforms supervised task-specific models on semantic multilingual tasks.

Numbers: XNLI avg acc: ChatGPT (en) 57.0% vs mT5-XXL 87.1%

Practical Use: Do not rely on zero-shot ChatGPT for cross-lingual semantic tasks in production; fine-tuned multilingual models give far better accuracy on evaluated benchmarks.

Evidence Ref: Table 6

POS tagging is an important exception: ChatGPT matches or exceeds supervised models on many languages.

Numbers: XGLUE-POS avg acc: ChatGPT (en) 84.5% vs XLM-R 79.3%; ChatGPT outperformed XLM-R on 13/17 languages

Practical Use: For syntactic preprocessing like POS tagging, try zero-shot ChatGPT as a fast option; still validate per language.

Evidence Ref: Table 2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | ChatGPT (en) 84.5% vs XLM-R 79.3% | XLM-R (supervised) 79.3% avg | +5.2 pp | XGLUE-POS (test sets) | ChatGPT outperformed XLM-R on 13/17 languages; Table 2 | Table 2 |
| NER F1 (average) | ChatGPT (en) 29.9% vs DAMO 88.4% | DAMO (supervised) 88.4% avg | -58.5 pp | MultiCoNER (test sets) | ChatGPT F1 <40% on all languages; high spurious prediction rates; Table 3 and 4 | Table 3; Table 4 |
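The Delta column is a plain difference in percentage points (system score minus baseline score). As a small worked check, the sketch below recomputes both deltas from the scores quoted in the rows above; the dictionary layout and the `delta_pp` helper are illustrative names, not anything from the paper.

```python
# Scores (in percent) copied from the two Results rows above.
results = {
    "XGLUE-POS accuracy": {"chatgpt_en": 84.5, "baseline": 79.3},
    "MultiCoNER F1": {"chatgpt_en": 29.9, "baseline": 88.4},
}


def delta_pp(system: float, baseline: float) -> float:
    """Return system minus baseline in percentage points, to one decimal."""
    return round(system - baseline, 1)


deltas = {
    name: delta_pp(row["chatgpt_en"], row["baseline"])
    for name, row in results.items()
}

for name, delta in deltas.items():
    # The explicit sign makes it obvious which system is ahead.
    print(f"{name}: {delta:+.1f} pp")
```

A positive value means zero-shot ChatGPT is ahead of the supervised baseline; a negative value means it trails.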

What To Try In 7 Days

Run a quick A/B comparison: ChatGPT zero-shot vs a small fine-tuned model on your target language and task to measure the gap.

If you need sequence labeling or information extraction (NER/RE), test a supervised model with a CRF layer and a small labeled set rather than ChatGPT.

When using ChatGPT, try English task descriptions and measure improvement over target-language prompts.
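The comparisons suggested above can share one small harness. The following is a minimal sketch, assuming each system can be wrapped as a function from input text to a predicted label; the three predictor functions and the toy NLI-style data are hypothetical stand-ins, and no real model or API is called.

```python
from typing import Callable, Dict, Sequence

# A predictor maps one input text to one predicted label.
Predictor = Callable[[str], str]


def accuracy(predict: Predictor, texts: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of examples where the prediction equals the gold label."""
    if len(texts) != len(gold) or not texts:
        raise ValueError("texts and gold must be non-empty and the same length")
    correct = sum(predict(text) == label for text, label in zip(texts, gold))
    return correct / len(texts)


def compare(systems: Dict[str, Predictor], texts: Sequence[str],
            gold: Sequence[str]) -> Dict[str, float]:
    """Score every system on the same labeled evaluation set."""
    return {name: accuracy(fn, texts, gold) for name, fn in systems.items()}


# Toy labeled set (hypothetical); replace with your own target-language data.
texts = ["pair-1", "pair-2", "pair-3", "pair-4"]
gold = ["entailment", "neutral", "contradiction", "neutral"]


# Hypothetical stand-ins: swap in real calls to your own systems.
def zero_shot_english_prompt(text: str) -> str:
    return "neutral"


def zero_shot_target_prompt(text: str) -> str:
    return "entailment"


def fine_tuned_model(text: str) -> str:
    canned = {"pair-1": "entailment", "pair-2": "neutral",
              "pair-3": "contradiction", "pair-4": "entailment"}
    return canned[text]


scores = compare(
    {
        "zero-shot (English prompt)": zero_shot_english_prompt,
        "zero-shot (target-language prompt)": zero_shot_target_prompt,
        "fine-tuned": fine_tuned_model,
    },
    texts,
    gold,
)

for name, score in scores.items():
    print(f"{name}: {score:.2%}")
```

Keeping the evaluation set and metric fixed across all systems is the point of the design: any difference in the scores then comes from the system or the prompt language, not from the data.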

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

XGLUE-POS (HuggingFace Datasets), MultiCoNER, SMiLER, XNLI, XQuAD, X-CSQA, IndicNLPSuite, XL-Sum

Risks & Boundaries

Limitations

Only zero-shot ChatGPT was evaluated; no fine-tuning, few-shot, or chain-of-thought experiments were run.

Coverage is broad but not exhaustive: many languages and tasks remain untested.

When Not To Use

Named entity recognition in noisy/ambiguous text.

Relation extraction and other fine-grained IE tasks where supervised models excel.

Failure Modes

High rate of spurious or verbose outputs for NER (many extra predicted entities).

Strong bias toward English: English prompts often give better results than target-language prompts.
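The spurious-output failure mode listed above can be quantified on your own data. The sketch below assumes one simple definition: the percentage of predicted (text, type) entity pairs that match no gold entity. The paper's exact definition of its spurious prediction metric may differ, and the example entities are made up for illustration.

```python
from typing import Iterable, Tuple

# An entity is represented as a (surface text, entity type) pair.
Entity = Tuple[str, str]


def spurious_rate(predicted: Iterable[Entity], gold: Iterable[Entity]) -> float:
    """Percentage of distinct predicted entities with no exact gold match.

    Assumed definition for illustration only; returns 0.0 when nothing
    was predicted, since there is nothing that could be spurious.
    """
    predicted_set = set(predicted)
    if not predicted_set:
        return 0.0
    spurious = predicted_set - set(gold)
    return 100.0 * len(spurious) / len(predicted_set)


# Made-up example: two correct entities plus two extra predictions.
gold_entities = [("Marie Curie", "PER"), ("Paris", "LOC")]
predicted_entities = [
    ("Marie Curie", "PER"),
    ("Paris", "LOC"),
    ("Nobel", "ORG"),
    ("1903", "DATE"),
]

rate = spurious_rate(predicted_entities, gold_entities)
print(f"spurious predictions: {rate:.1f}%")
```

Tracking this number alongside F1 helps separate two different problems: missing gold entities versus inventing extra ones.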

Core Entities

Models

ChatGPT, XLM-RoBERTa (XLM-R), mT5-XXL, mT5-IL, DAMO, TRT, IndicBERT, BLOOM

Metrics

Accuracy, F1, micro F1, macro F1, exact match (EM), ROUGE-1, ROUGE-2, ROUGE-L, spurious prediction %

Datasets

XGLUE-POS, MultiCoNER, SMiLER, XNLI, XQuAD, X-CSQA, IndicNLPSuite (Wikipedia Cloze QA), XL-Sum

Benchmarks

XGLUE, MultiCoNER, SMiLER, XNLI, XQuAD, X-CSQA, XL-Sum