Quantizing large multilingual LLMs often hides big drops for non‑Latin languages and hard tasks

July 3, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The study provides consistent multi-method evidence (automatic, LLM/RM judges, and humans) showing quantization harms are real for multilingual settings; results are strongest for the evaluated models and languages.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 40%

Authors

Kelly Marchisio, Saurabh Dash, Hongyu Chen, Dennis Aumiller, Ahmet Üstün, Sara Hooker, Sebastian Ruder

Why It Matters For Business

Quantization cuts serving cost but can reduce perceived quality for many languages and for hard tasks; businesses must validate quantized models with human tests in their target languages to avoid degrading user experience.

Summary TLDR

This empirical study measures how post-training quantization (4/8-bit, weight-only and weight+activation) affects multilingual large language models (Command R/R+ and Aya 23). Automatic benchmarks show small average drops, but human raters find much larger quality losses, especially for non-Latin scripts and hard tasks like math. Mitigations (group-wise scaling, SmoothQuant) help but do not fully eliminate harms. The paper urges human-centered multilingual testing before deploying quantized models globally.
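To make the group-wise scaling mitigation concrete, here is a minimal sketch of symmetric round-to-nearest weight quantization, comparing one shared scale against one scale per group. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy weight vector, and the group size are all made up for this example.

```python
def quantize_dequantize(values, bits):
    """Symmetric round-to-nearest quantization with one shared absmax scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    absmax = max(abs(v) for v in values)
    if absmax == 0:
        return list(values)
    scale = absmax / qmax
    # Quantize to an integer grid, clamp, then map back to floats.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def groupwise_quantize_dequantize(values, bits, group_size):
    """Same scheme, but every contiguous group of weights gets its own scale."""
    restored = []
    for start in range(0, len(values), group_size):
        restored.extend(quantize_dequantize(values[start:start + group_size], bits))
    return restored


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Toy weights (made up): mostly small values plus a single outlier.
weights = [0.02, -0.03, 0.01, 0.04, -0.02, 0.03, -0.01, 1.50]

err_single = mean_abs_error(weights, quantize_dequantize(weights, bits=4))
err_grouped = mean_abs_error(
    weights, groupwise_quantize_dequantize(weights, bits=4, group_size=4)
)
print(f"one scale: {err_single:.4f}  group-wise: {err_grouped:.4f}")
```

With a single scale, the outlier stretches the 4-bit grid so far that every small weight rounds to zero. Group-wise scaling confines that damage to the group containing the outlier, which is one intuition for why it helps without removing the error entirely.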

Problem Statement

Quantization is widely used to cut model cost, but prior work mainly evaluates in English. It is unknown whether quantization preserves quality across many languages and realistic prompts. The paper asks: which languages and tasks suffer from quantization, and can common mitigation techniques help?

Main Contribution

Comprehensive multilingual evaluation of post-training quantization on four SOTA multilingual models (from 103B down to 8B parameters) across 20+ languages.

Contrast of automatic metrics, LLM/RM-as-a-Judge, and human evaluation to show automatic metrics understate harms.

Key Findings

Automatic metrics understate human-observed quality drops.

Numbers: Japanese: -1.7% auto vs -16.0% human

Practical Use: Don’t rely on automatic benchmarks alone; run human checks for target languages before shipping quantized models.

Evidence Ref: Abstract; Fig.1; §4.7; Table 8

Non-Latin script languages suffer more from quantization.

Numbers: 103B W4: Latin avg -0.7% vs non-Latin -1.9% (auto)

Practical Use: Expect larger regressions for Chinese/Japanese/Korean; test those languages explicitly after quantizing.

Evidence Ref: §4.3; Table 3; Table 4
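A per-script comparison like the Latin versus non-Latin averages can be reproduced on your own benchmark results with a few lines. The sketch below assumes the percentages are relative changes against the FP16 baseline; the scores in the table are hypothetical placeholders, not numbers from the paper, and the script labels are an illustrative grouping.

```python
from collections import defaultdict

# Hypothetical placeholder scores, NOT results from the paper.
# language -> (script group, FP16 score, quantized score)
scores = {
    "fr": ("latin", 70.0, 69.5),
    "es": ("latin", 72.0, 71.6),
    "ja": ("non_latin", 60.0, 58.8),
    "ko": ("non_latin", 58.0, 57.0),
}


def relative_change_pct(baseline, quantized):
    """Relative change of the quantized score against the baseline, in percent."""
    return 100.0 * (quantized - baseline) / baseline


def mean_change_by_group(table):
    """Average the per-language relative change within each script group."""
    buckets = defaultdict(list)
    for group, base, quant in table.values():
        buckets[group].append(relative_change_pct(base, quant))
    return {group: sum(vals) / len(vals) for group, vals in buckets.items()}


by_group = mean_change_by_group(scores)
for group, change in sorted(by_group.items()):
    print(f"{group}: {change:+.2f}%")
```

Averaging per-language relative changes, rather than pooling raw scores, keeps a high-scoring language from masking a regression in a lower-scoring one.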

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Human-evaluated quality (Internal suite) | W4-g average -10.5% vs FP16 | FP16 | -10.5% | Internal Evaluation Suite (human annotation) | Table 8; §4.7 | Table 8
Automatic vs human mismatch (Japanese) | auto -1.7% vs human -16.0% | FP16 | auto -1.7% / human -16.0% | mMMLU/MGSM/FLORES vs Internal human prompts | Abstract; Fig.1; §4.7 | Fig.1; Table 8

What To Try In 7 Days

Run human pairwise checks on 10–20 realistic prompts per target language after quantizing.

Compare W8, W8A8, and W4 variants; measure both automatic and human win-rates.

Test group-wise scaling and SmoothQuant on a small calibration set and measure MGSM (math) and language-specific drops.
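The pairwise checks above produce win/loss/tie tallies per language, and with only 10–20 prompts the uncertainty is large. A minimal sketch for turning tallies into a win-rate with a Wilson score interval follows; counting a tie as half a win is a simplifying assumption of this example, and the tallies are hypothetical.

```python
import math


def win_rate_with_interval(wins, losses, ties=0, z=1.96):
    """Win-rate of the quantized model against FP16, ties counted as half a win,
    with a Wilson score interval (z=1.96 gives roughly 95% coverage)."""
    n = wins + losses + ties
    if n == 0:
        raise ValueError("no comparisons recorded")
    p = (wins + 0.5 * ties) / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical tallies: language -> (wins, losses, ties) for the quantized model.
tallies = {"ja": (6, 13, 1), "fr": (9, 10, 1)}
for lang, (w, l, t) in tallies.items():
    rate, low, high = win_rate_with_interval(w, l, t)
    verdict = "regression likely" if high < 0.5 else "inconclusive"
    print(f"{lang}: win-rate {rate:.2f} [{low:.2f}, {high:.2f}] -> {verdict}")
```

If the whole interval sits below 0.5, the quantized model is probably worse for that language. An interval that straddles 0.5 means the sample is too small to decide, which is a cue to add prompts rather than to ship.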

Optimization Features

Model Optimization
Weight-only quantization (W8, W4, W4-g)
Weight-and-activation quantization (W8A8)
Per-column vs group-wise scaling
Quantile quantization (NF4)

System Optimization
SmoothQuant to smooth activation distributions for better W8A8 behavior
bitsandbytes LLM.int8() for mixed FP16/8 execution

Training Optimization
Post-training quantization (PTQ) focus; no QAT experiments

Inference Optimization
W8A8 enables low-precision matrix multiply hardware; up to ~2× throughput
Weight-only reduces model memory footprint (≈2× for 8-bit, ≈4× for 4-bit)
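The ≈2× and ≈4× memory figures follow from bits-per-weight arithmetic. The sketch below is a back-of-envelope estimate that takes the nominal parameter counts from the model names and ignores scales, zero-points, activations, and the KV cache, so real deployments will need somewhat more memory.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes); excludes quantization
    scales, zero-points, activations, and the KV cache."""
    return num_params * bits_per_weight / 8 / 1e9


# Nominal parameter counts taken from the model names (an approximation).
models = {
    "Command R+ (103B)": 103e9,
    "Command R / Aya 23 (35B)": 35e9,
    "Aya 23 (8B)": 8e9,
}

for name, params in models.items():
    fp16 = weight_memory_gb(params, 16)
    w8 = weight_memory_gb(params, 8)
    w4 = weight_memory_gb(params, 4)
    print(
        f"{name}: FP16 {fp16:.0f} GB, "
        f"W8 {w8:.0f} GB ({fp16 / w8:.0f}x), "
        f"W4 {w4:.1f} GB ({fp16 / w4:.0f}x)"
    )
```

Group-wise scaling adds a small overhead on top of these figures. As an illustration, one 16-bit scale per 128 weights (an assumed group size) costs an extra 16/128 = 0.125 bits per weight.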

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

The study covers only two model families (Command R/R+ and Aya 23); generality to other families is suggested but not proven.

Human evaluation covers a subset of languages (English, Spanish, French, Korean, Japanese); under-represented languages may see larger harms.

When Not To Use

Don’t deploy aggressively quantized (4-bit) models for user-facing multilingual products without human checks.

Avoid 4-bit quantization for math/reasoning workloads or safety‑critical multilingual services.

Failure Modes

Large human-perceived quality drops not visible in automatic metrics.

Disparate language regressions: non-Latin scripts (ja/ko/zh) degrade more.

Core Entities

Models

Command R+ (103B), Command R (35B), Aya 23 (35B), Aya 23 (8B)

Metrics

Accuracy, SacreBLEU, Line-level pass rate (LPR), Human win-rate, LLM/RM win-rate

Datasets

mMMLU, MGSM, FLORES-200, Language Confusion, Belebele, Aya Dolly-200, Internal Evaluation Suite

Benchmarks

mMMLU, MGSM, FLORES-200, Language Confusion, Belebele

Context Entities

Models

LLaMA-style baselines (context in related work)

Datasets

mC4 (used to proxy language data size)