Large multilingual evaluation shows ChatGPT is strong at grammar but weak at multilingual semantic tasks

Overview

Decision Snapshot: Needs Validation

The paper uses many public datasets and clear baselines to show consistent, large accuracy gaps between zero-shot ChatGPT and supervised models on semantic multilingual tasks; the results are strong evidence for its practical recommendations.

Citations: 51

Evidence Strength: 0.90

Confidence: 0.80

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 7/7

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 30%

Novelty: 40%

Authors

Viet Dac Lai, Nghia Trung Ngo, Amir Pouran Ben Veyseh, Hieu Man, Franck Dernoncourt, Trung Bui, Thien Huu Nguyen

Links

Abstract / PDF / Data

Why It Matters For Business

ChatGPT zero-shot is good for quick grammar-level tasks (like POS tagging) but not reliable for production semantic tasks across many languages; invest in task- and language-specific models for higher accuracy and lower operational risk.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist

Summary TLDR

This paper runs a broad, zero-shot evaluation of ChatGPT across 7 NLP tasks and 34–37 languages. ChatGPT does well at POS tagging (a syntactic task) and often beats a supervised baseline there, but it performs substantially worse than task-specific supervised models on semantic tasks like NER, relation extraction, NLI, QA, commonsense reasoning, and summarization. English prompts usually produce better results than prompts in the target language. The authors recommend task-specific or smaller supervised models for production multilingual systems.

Problem Statement

Can ChatGPT, trained on mixed-language web data, be used reliably for real NLP tasks across many non-English languages, or do we still need task- and language-specific models? The paper measures zero-shot ChatGPT performance on multiple tasks and resource levels to answer this.

Main Contribution

A broad, zero-shot evaluation of ChatGPT on 7 tasks (POS, NER, Relation Extraction, NLI, QA, Commonsense Reasoning, Summarization) across 34–37 languages and multiple resource levels.

Direct comparison with state-of-the-art supervised multilingual models on public datasets.

Key Findings

ChatGPT generally underperforms supervised task-specific models on semantic multilingual tasks.

Numbers: XNLI avg acc: ChatGPT (en) 57.0% vs mT5-XXL 87.1%

Practical Use: Do not rely on zero-shot ChatGPT for cross-lingual semantic tasks in production; fine-tuned multilingual models give far better accuracy on evaluated benchmarks.

Evidence Ref: Table 6

POS tagging is an important exception: ChatGPT matches or exceeds supervised models on many languages.

Numbers: XGLUE-POS avg acc: ChatGPT (en) 84.5% vs XLM-R 79.3%; ChatGPT outperformed XLM-R on 13/17 languages

Practical Use: For syntactic preprocessing like POS tagging, try zero-shot ChatGPT as a fast option; still validate per language.

Evidence Ref: Table 2

Results

Metric	ChatGPT (en)	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy (average)	84.5%	XLM-R (supervised) 79.3%	+5.2 pp	XGLUE-POS (test sets)	ChatGPT outperformed XLM-R on 13/17 languages	Table 2
NER F1 (average)	29.9%	DAMO (supervised) 88.4%	-58.5 pp	MultiCoNER (test sets)	ChatGPT F1 <40% on all languages; high spurious prediction rates	Table 3; Table 4
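
The Delta column is the ChatGPT score minus the supervised baseline, in percentage points. A minimal sketch of that arithmetic follows, using only the averages from the Results table; the dictionary keys and the `delta_pp` helper are illustrative names, not something the paper defines.

```python
# Averages (in percent) copied from the Results table.
# Key names and the helper below are hypothetical, chosen for illustration.
results = {
    "XGLUE-POS accuracy": {"chatgpt_en": 84.5, "baseline": 79.3},
    "MultiCoNER F1": {"chatgpt_en": 29.9, "baseline": 88.4},
}


def delta_pp(chatgpt: float, baseline: float) -> float:
    """Return the ChatGPT-minus-baseline gap in percentage points."""
    return round(chatgpt - baseline, 1)


deltas = {
    name: delta_pp(row["chatgpt_en"], row["baseline"])
    for name, row in results.items()
}

for name, gap in deltas.items():
    # A positive gap means ChatGPT is ahead of the supervised baseline.
    print(f"{name}: {gap:+.1f} pp")
```

A positive value means zero-shot ChatGPT is ahead of the baseline (POS tagging); a negative value means it trails (NER).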

What To Try In 7 Days

Run a quick A/B comparison: ChatGPT zero-shot vs a small fine-tuned model on your target language and task to measure the gap.

If you need sequence labeling (NER/RE), test a supervised model with a CRF layer and a small labeled set rather than ChatGPT.

When using ChatGPT, try English task descriptions and measure improvement over target-language prompts.
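
The comparisons suggested above can be scored with a small per-language harness. The sketch below is one possible way to do it, not the paper's evaluation code: it assumes you have already collected predictions from both systems, and the field names (`lang`, `label`), function names, and toy data are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_language(examples, predictions):
    """Compute accuracy per language.

    examples: list of dicts with hypothetical keys "lang" and "label".
    predictions: list of predicted labels, aligned with examples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for example, predicted in zip(examples, predictions):
        total[example["lang"]] += 1
        correct[example["lang"]] += int(predicted == example["label"])
    return {lang: correct[lang] / total[lang] for lang in total}


def gap_report(examples, preds_a, preds_b):
    """Return system A minus system B accuracy, per language, in pp."""
    acc_a = accuracy_by_language(examples, preds_a)
    acc_b = accuracy_by_language(examples, preds_b)
    return {lang: round((acc_a[lang] - acc_b[lang]) * 100, 1) for lang in acc_a}


# Toy, made-up data for illustration only (not results from the paper).
examples = [
    {"lang": "de", "label": "entailment"},
    {"lang": "de", "label": "neutral"},
    {"lang": "vi", "label": "contradiction"},
    {"lang": "vi", "label": "entailment"},
]
zero_shot_preds = ["entailment", "entailment", "neutral", "neutral"]
fine_tuned_preds = ["entailment", "neutral", "contradiction", "entailment"]

gaps = gap_report(examples, zero_shot_preds, fine_tuned_preds)
print(gaps)
```

The same `gap_report` call can compare English-prompt outputs against target-language-prompt outputs by passing those two prediction lists instead.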

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

XGLUE-POS (HuggingFace Datasets), MultiCoNER, SMiLER, XNLI, XQuAD, X-CSQA, IndicNLPSuite, XL-Sum

Risks & Boundaries

Limitations

Only zero-shot ChatGPT was evaluated; no fine-tuning or few-shot chain-of-thought experiments were run.

Coverage is broad but not exhaustive: many languages and tasks remain untested.

When Not To Use

Named entity recognition in noisy/ambiguous text.

Relation extraction and other fine-grained IE tasks where supervised models excel.

Failure Modes

High rate of spurious, verbose outputs for NER (many extra predicted entities).

Strong bias toward English: English prompts often give better results than target-language prompts.
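
To check for the spurious-output failure mode on your own data, one option is to measure how many predicted entities match nothing in the gold annotations. The sketch below assumes a simple definition (share of predicted entities absent from the gold set); the paper's exact definition of its spurious prediction metric may differ, and the function name and toy entities are hypothetical.

```python
def spurious_rate(gold, predicted):
    """Fraction of predicted entities that match no gold entity.

    gold, predicted: sets of (span_text, entity_type) tuples.
    Assumed definition for illustration; the paper's metric may differ.
    """
    if not predicted:
        return 0.0
    spurious = predicted - gold  # predictions with no exact gold match
    return len(spurious) / len(predicted)


# Toy, made-up annotations for illustration only.
gold_entities = {("Hanoi", "LOC"), ("Adobe", "ORG")}
predicted_entities = {
    ("Hanoi", "LOC"),
    ("Adobe", "ORG"),
    ("research", "ORG"),  # extra prediction, not in gold
    ("Monday", "LOC"),    # extra prediction, not in gold
}

rate = spurious_rate(gold_entities, predicted_entities)
print(f"spurious predictions: {rate:.0%}")
```

Exact matching on both span text and type is a strict choice; a looser variant could match on span text alone.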

Core Entities

Models

ChatGPT, XLM-RoBERTa (XLM-R), mT5-XXL, mT5-IL, DAMO, TRT, IndicBERT, BLOOM

Metrics

Accuracy, F1, micro F1, macro F1, exact match (EM), ROUGE-1, ROUGE-2, ROUGE-L, spurious prediction %

Datasets

XGLUE-POS, MultiCoNER, SMiLER, XNLI, XQuAD, X-CSQA, IndicNLPSuite (Wikipedia Cloze QA), XL-Sum

Benchmarks

XGLUE, MultiCoNER, SMiLER, XNLI, XQuAD, X-CSQA, XL-Sum
