Large multilingual evaluation shows ChatGPT is strong at grammar but weak at multilingual semantic tasks

April 12, 2023 · 8 min

Overview

Production Readiness

0.3

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

51

Authors

Viet Dac Lai, Nghia Trung Ngo, Amir Pouran Ben Veyseh, Hieu Man, Franck Dernoncourt, Trung Bui, Thien Huu Nguyen

Links

Abstract / PDF

Why It Matters For Business

ChatGPT zero-shot is good for quick grammar-level tasks (like POS tagging) but not reliable for production semantic tasks across many languages; invest in task- and language-specific models for higher accuracy and lower operational risk.

Summary TLDR

This paper runs a broad, zero-shot evaluation of ChatGPT across 7 NLP tasks and 34–37 languages. ChatGPT does well at POS tagging (a syntactic task) and often beats a supervised baseline there, but it performs substantially worse than task-specific supervised models on semantic tasks like NER, relation extraction, NLI, QA, commonsense reasoning, and summarization. English prompts usually produce better results than prompts in the target language. The authors recommend task-specific or smaller supervised models for production multilingual systems.

Problem Statement

Can ChatGPT, trained on mixed-language web data, be used reliably for real NLP tasks across many non-English languages, or do we still need task- and language-specific models? The paper measures zero-shot ChatGPT performance on multiple tasks and resource levels to answer this.

Main Contribution

A broad, zero-shot evaluation of ChatGPT on 7 tasks (POS, NER, Relation Extraction, NLI, QA, Commonsense Reasoning, Summarization) across 34–37 languages and multiple resource levels.

Direct comparison with state-of-the-art supervised multilingual models on public datasets.

Analysis of prompt language (English vs target language), success rates, and qualitative failure modes (verbosity, spurious outputs, English bias).

Key Findings

ChatGPT generally underperforms supervised task-specific models on semantic multilingual tasks.

Numbers: XNLI avg acc: ChatGPT (en) 57.0% vs mT5-XXL 87.1%

POS tagging is an important exception: ChatGPT matches or exceeds supervised models on many languages.

Numbers: XGLUE-POS avg acc: ChatGPT (en) 84.5% vs XLM-R 79.3%; ChatGPT outperformed XLM-R on 13/17 languages

Named entity extraction is particularly weak and noisy for ChatGPT.

Numbers: MultiCoNER avg F1: ChatGPT (en) 29.9% vs DAMO 88.4%; per-type spurious rates up to 57%

Prompting in English usually yields better zero-shot results than using the target language.

Numbers: Across tasks (XNLI, XQuAD, X-CSQA), target-language prompts often score roughly 10 or more points lower than English prompts

Summaries from ChatGPT tend to be much longer than human references and score poorly on ROUGE.

Numbers: XL-Sum English: ChatGPT ROUGE-L 13.38 vs mT5-XXL 32.51; avg model summary length 612 chars vs gold 126 chars
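To see why overlong summaries score poorly, here is a minimal sketch of a sentence-level ROUGE-L F1 built on the longest common subsequence. It assumes plain whitespace tokenization with no stemming, so its scores will not match the official ROUGE toolkit or the paper's numbers; the example sentences are invented for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 on lowercase whitespace tokens (an assumption)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of the candidate that matches
    recall = lcs / len(ref)      # share of the reference that is covered
    return 2 * precision * recall / (precision + recall)


# Invented example: the verbose summary covers the same content plus padding.
reference = "the council approved the new budget"
concise = "the council approved the budget"
verbose = concise + " after a long debate that covered many unrelated topics in great detail"

print(round(rouge_l_f1(concise, reference), 3))
print(round(rouge_l_f1(verbose, reference), 3))
```

Both candidates recover the same five reference tokens, so recall is identical; the padded one loses on precision, which pulls the F1 down.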

Results

POS tagging accuracy (XGLUE-POS, average)

Value: ChatGPT (en) 84.5% vs XLM-R 79.3%

Baseline: XLM-R (supervised) 79.3% avg

NER F1 (average)

Value: ChatGPT (en) 29.9% vs DAMO 88.4%

Baseline: DAMO (supervised) 88.4% avg

Relation Extraction micro-F1 (average)

Value: ChatGPT (en) 69.4% vs mT5-IL 85.0%

Baseline: mT5-IL (supervised) 85.0% avg

NLI accuracy (XNLI, average)

Value: ChatGPT (en) 57.0% vs mT5-XXL 87.1%

Baseline: mT5-XXL (supervised) 87.1% avg

QA (XQuAD) avg EM / F1

Value: ChatGPT (en) EM 35.6% / F1 53.5% vs mT5-XXL EM 71.3% / F1 85.2%

Baseline: mT5-XXL (supervised) EM 71.3% / F1 85.2%

Commonsense reasoning accuracy (X-CSQA)

Value: ChatGPT (en) 47.8% vs TRT 59.0%

Baseline: TRT (supervised) 59.0%

Summarization ROUGE-L (English)

Value: ChatGPT ROUGE-L 13.38 vs mT5-XXL 32.51

Baseline: mT5-XXL (supervised) ROUGE-L 32.51 (English)
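The gaps implied by the figures above can be tabulated with a few lines of arithmetic. This is only a convenience sketch: the numbers are copied from this Results list, the row labels are shorthand chosen here, and the two bare "Accuracy" entries are identified by their baselines.

```python
# (label, ChatGPT with English prompts, supervised baseline), copied from Results.
results = [
    ("Accuracy (XLM-R baseline)", 84.5, 79.3),
    ("NER F1", 29.9, 88.4),
    ("Relation extraction micro-F1", 69.4, 85.0),
    ("Accuracy (mT5-XXL baseline)", 57.0, 87.1),
    ("QA exact match", 35.6, 71.3),
    ("QA F1", 53.5, 85.2),
    ("Accuracy (TRT baseline)", 47.8, 59.0),
    ("Summarization ROUGE-L", 13.38, 32.51),
]


def gap(chatgpt, baseline):
    """ChatGPT minus baseline, in metric points; negative means ChatGPT trails."""
    return round(chatgpt - baseline, 2)


gaps = {label: gap(chatgpt, baseline) for label, chatgpt, baseline in results}

# Print the rows from largest deficit to largest lead.
for label, g in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{label:30s} {g:+.2f}")
```

Sorted this way, NER shows the widest deficit, and the row against XLM-R (the POS tagging result) is the only one where ChatGPT leads.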

Who Should Care

What To Try In 7 Days

Run a quick A/B comparison of zero-shot ChatGPT against a small fine-tuned model on your target language and task to measure the gap.

If you need sequence labeling (NER/RE), test a supervised model with a CRF layer and a small labeled set rather than ChatGPT.

When using ChatGPT, try English task descriptions and measure improvement over target-language prompts.
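One way to set up such a comparison is a small harness that scores two systems on the same labeled examples, broken down by language. The sketch below is an assumption about how you might structure it, not anything from the paper: `zero_shot_llm` and `fine_tuned_model` are hypothetical toy stand-ins that you would replace with a real API call and a real model, and the four examples are invented.

```python
from collections import defaultdict


def accuracy_by_language(examples, predict):
    """examples: (language, text, gold_label) tuples; predict: callable(language, text) -> label."""
    correct, total = defaultdict(int), defaultdict(int)
    for language, text, gold in examples:
        total[language] += 1
        correct[language] += int(predict(language, text) == gold)
    return {lang: correct[lang] / total[lang] for lang in total}


def compare(examples, system_a, system_b):
    """Per-language accuracy of system_a minus system_b (0-1 scale)."""
    acc_a = accuracy_by_language(examples, system_a)
    acc_b = accuracy_by_language(examples, system_b)
    return {lang: acc_a[lang] - acc_b[lang] for lang in acc_a}


# Hypothetical stand-in for an LLM call: always answers "positive".
def zero_shot_llm(language, text):
    return "positive"


# Hypothetical stand-in for a fine-tuned classifier: a keyword rule.
def fine_tuned_model(language, text):
    tokens = text.split()
    return "negative" if "not" in tokens or "nicht" in tokens else "positive"


# Invented toy data; use a held-out set from your own task and languages.
examples = [
    ("en", "this is good", "positive"),
    ("en", "this is not good", "negative"),
    ("de", "das ist gut", "positive"),
    ("de", "das ist nicht gut", "negative"),
]

print(compare(examples, zero_shot_llm, fine_tuned_model))
```

The same `compare` call can also pit an English-prompt predictor against a target-language-prompt predictor, since both are just callables with the same signature.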

Reproducibility

Data Urls

  • XGLUE-POS (HuggingFace Datasets)
  • MultiCoNER
  • SMiLER
  • XNLI
  • XQuAD
  • X-CSQA
  • IndicNLPSuite
  • XL-Sum

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only zero-shot ChatGPT was evaluated; no fine-tuning, few-shot, or chain-of-thought experiments were run.
  • Coverage is broad but not exhaustive: many languages and tasks remain untested.
  • Prompt design was intentionally simple (single-stage); alternative prompting may change results.
  • Model versions beyond GPT-3.5-era ChatGPT (e.g., GPT-4, BLOOM large variants) were not included.

When Not To Use

  • Named entity recognition in noisy/ambiguous text.
  • Relation extraction and other fine-grained IE tasks where supervised models excel.
  • Production QA, NLI, commonsense reasoning across non-English languages without validation.
  • Abstractive summarization when concise, factual summaries are required.

Failure Modes

  • High spurious/verbose outputs for NER (many extra predicted entities).
  • Strong bias toward English: English prompts often give better results than target-language prompts.
  • Long and unfocused summaries that hurt ROUGE scores.
  • Wide per-language variance: some low-resource languages unexpectedly perform better or worse.
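The spurious-output failure mode for NER can be measured on your own data with set arithmetic over predicted and gold entities. The sketch below assumes one plausible definition of "spurious" (a predicted entity with no exact match in the gold set); the paper's exact definition may differ, and the entities and type labels are invented for illustration.

```python
def spurious_rate(predicted, gold):
    """Share of predicted (span, type) pairs that are absent from the gold set."""
    predicted, gold = set(predicted), set(gold)
    if not predicted:
        return 0.0
    return len(predicted - gold) / len(predicted)


def entity_f1(predicted, gold):
    """Exact-match entity-level F1 over (span, type) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented example: the model finds both gold entities but adds two extras.
gold = [("Marie Curie", "PER"), ("Paris", "LOC")]
predicted = [
    ("Marie Curie", "PER"),
    ("Paris", "LOC"),
    ("Nobel Prize", "PROD"),
    ("physics", "PROD"),
]

print(spurious_rate(predicted, gold))
print(round(entity_f1(predicted, gold), 3))
```

In this example recall is perfect, yet the extra predictions halve precision, which is the pattern a high spurious rate produces.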

Core Entities

Models

  • ChatGPT
  • XLM-RoBERTa (XLM-R)
  • mT5-XXL
  • mT5-IL
  • DAMO
  • TRT
  • IndicBERT
  • BLOOM

Metrics

  • Accuracy
  • F1
  • micro F1
  • macro F1
  • exact match (EM)
  • ROUGE-1
  • ROUGE-2
  • ROUGE-L
  • spurious prediction %
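For the QA metrics in this list, exact match and token-level F1 can be sketched as below. This is a simplified variant of SQuAD-style scoring under stated assumptions: it lowercases, strips ASCII punctuation, and splits on whitespace, and it skips the English article removal used by the official SQuAD script. Whitespace tokens are a poor fit for languages written without spaces, so treat it as illustrative only; the question and answers are invented.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation, collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Invented example: a verbose answer contains the gold span but is not an exact match.
gold_answer = "in 1903"
short_answer = "In 1903."
verbose_answer = "She won the prize in 1903."

print(exact_match(short_answer, gold_answer), token_f1(short_answer, gold_answer))
print(exact_match(verbose_answer, gold_answer), token_f1(verbose_answer, gold_answer))
```

A verbose but correct answer gets no exact-match credit and only partial F1, which is one reason EM figures sit well below F1 for a model that tends to answer in full sentences.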

Datasets

  • XGLUE-POS
  • MultiCoNER
  • SMiLER
  • XNLI
  • XQuAD
  • X-CSQA
  • IndicNLPSuite (Wikipedia Cloze QA)
  • XL-Sum

Benchmarks

  • XGLUE
  • MultiCoNER
  • SMiLER
  • XNLI
  • XQuAD
  • X-CSQA
  • XL-Sum