Quantizing large multilingual LLMs often hides big drops for non‑Latin languages and hard tasks

Overview

Decision Snapshot: Needs Validation

The study provides consistent multi-method evidence (automatic, LLM/RM judges, and humans) showing quantization harms are real for multilingual settings; results are strongest for the evaluated models and languages.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 40%

Authors

Kelly Marchisio, Saurabh Dash, Hongyu Chen, Dennis Aumiller, Ahmet Üstün, Sara Hooker, Sebastian Ruder


Why It Matters For Business

Quantization cuts serving cost but can reduce perceived quality for many languages and for hard tasks; businesses must validate quantized models with human tests in their target languages to avoid degrading user experience.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

This empirical study measures how post-training quantization (4/8-bit, weight-only and weight+activation) affects multilingual large language models (Command R/R+ and Aya 23). Automatic benchmarks show small average drops, but human raters find much larger quality losses, especially for non-Latin scripts and hard tasks like math. Mitigations (group-wise scaling, SmoothQuant) help but do not fully eliminate harms. The paper urges human-centered multilingual testing before deploying quantized models globally.

Problem Statement

Quantization is widely used to cut model cost, but prior work mainly evaluates in English. It is unknown whether quantization preserves quality across many languages and realistic prompts. The paper asks: which languages and tasks suffer from quantization, and can common mitigation techniques help?

Main Contribution

Comprehensive multilingual evaluation of post-training quantization on four SOTA multilingual models (103B→8B) across 20+ languages.

Contrast of automatic metrics, LLM/RM-as-a-Judge, and human evaluation to show automatic metrics understate harms.

Key Findings

Automatic metrics understate human-observed quality drops.

Numbers: Japanese: -1.7% auto vs -16.0% human

Practical Use: Don’t rely on automatic benchmarks alone; run human checks for target languages before shipping quantized models.

Evidence Ref: Abstract; Fig.1; §4.7; Table 8

Non-Latin script languages suffer more from quantization.

Numbers: 103B W4: Latin avg -0.7% vs non-Latin -1.9% (auto)

Practical Use: Expect larger regressions for Chinese/Japanese/Korean; test those languages explicitly after quantizing.

Evidence Ref: §4.3; Table 3; Table 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Human-evaluated quality (Internal suite)	W4-g average -10.5% vs FP16	FP16	-10.5%	Internal Evaluation Suite (human annotation)	Table 8; §4.7	Table 8
Automatic vs human mismatch (Japanese)	auto -1.7% vs human -16.0%	FP16	auto -1.7% / human -16.0%	mMMLU/MGSM/FLORES vs Internal human prompts	Abstract; Fig.1; §4.7	Fig.1; Table 8

What To Try In 7 Days

Run human pairwise checks on 10–20 realistic prompts per target language after quantizing.

Compare W8, W8A8, and W4 variants; measure both automatic and human win-rates.

Test group-wise scaling and SmoothQuant on a small calibration set, then measure MGSM (math) scores and per-language drops against the unquantized baseline.
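The pairwise check above can be tallied with a few lines of Python. This is a minimal sketch, not the paper's evaluation protocol: the `win_rates` function, the outcome labels, and the choice to count ties as half a win are all assumptions made for illustration. It reports a Wilson score interval per language so that small samples are not over-read.

```python
import math
from collections import defaultdict

def win_rates(judgments, z=1.96):
    """Tally pairwise human judgments per language.

    judgments: iterable of (language, outcome), where outcome is one of
    "quant", "fp16", or "tie" (hypothetical labels for this sketch).
    Returns {language: (win_rate, ci_low, ci_high, n)} for the quantized
    model, counting a tie as half a win (an assumption).
    """
    tallies = defaultdict(lambda: {"quant": 0, "fp16": 0, "tie": 0})
    for lang, outcome in judgments:
        tallies[lang][outcome] += 1

    results = {}
    for lang, t in tallies.items():
        n = t["quant"] + t["fp16"] + t["tie"]
        p = (t["quant"] + 0.5 * t["tie"]) / n
        # Wilson score interval (approximate once ties are folded in).
        denom = 1 + z * z / n
        centre = (p + z * z / (2 * n)) / denom
        half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
        results[lang] = (p, centre - half, centre + half, n)
    return results

# Made-up example data: 20 Japanese prompts judged by a human rater.
example = [("ja", "fp16")] * 12 + [("ja", "quant")] * 6 + [("ja", "tie")] * 2
for lang, (p, lo, hi, n) in win_rates(example).items():
    print(f"{lang}: win-rate {p:.2f} (95% CI {lo:.2f} to {hi:.2f}, n={n})")
```

With only 10 to 20 prompts per language the interval stays wide, so a result like this is best treated as a signal to collect more judgments rather than as a final verdict.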

Optimization Features

Model Optimization

Weight-only quantization (W8, W4, W4-g)

Weight-and-activation quantization (W8A8)

Per-column vs group-wise scaling

Quantile quantization (NF4)
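To illustrate why group-wise scaling can help, the sketch below quantizes one weight column with a single shared scale and then with one scale per small group. It assumes symmetric round-to-nearest absmax quantization; the function names, the toy weights, and the group size of 4 are illustrative choices, not the paper's settings.

```python
def quantize_absmax(values, bits=4):
    """Symmetric round-to-nearest quantization with one shared scale.
    Returns the dequantized values so the error can be inspected."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit
    absmax = max(abs(v) for v in values)
    if absmax == 0:
        return list(values)
    scale = absmax / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def quantize_groupwise(values, bits=4, group_size=4):
    """Same scheme, but each group of `group_size` weights gets its own scale."""
    out = []
    for i in range(0, len(values), group_size):
        out.extend(quantize_absmax(values[i:i + group_size], bits))
    return out

def mean_abs_error(original, approx):
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)

# Toy column with one outlier weight (made-up numbers).
column = [0.02, -0.03, 0.01, 0.04, 0.05, -0.02, 0.03, 8.0]
print("per-column error:", mean_abs_error(column, quantize_absmax(column)))
print("group-wise error:", mean_abs_error(column, quantize_groupwise(column)))
```

With a single scale, the outlier sets the step size and every small weight rounds to zero. With per-group scales, only the group that contains the outlier pays that cost, at the price of storing more scale values.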

System Optimization

SmoothQuant to smooth activation distributions for better W8A8 behavior

bitsandbytes LLM.int8() for mixed FP16/8 execution
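The SmoothQuant idea can be sketched in plain Python: a per-input-channel scale moves activation outliers into the weights while leaving the layer output unchanged. The scale formula follows the published SmoothQuant definition, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha); the function names, the toy matrices, and alpha = 0.5 are assumptions for illustration, and a real setup would estimate the activation maxima from calibration data.

```python
def smooth_scales(X, W, alpha=0.5):
    """X: activations as [tokens][in_channels]; W: weights as [in_channels][out_channels].
    Assumes every channel has a non-zero activation and weight maximum."""
    scales = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(w) for w in W[j])
        scales.append((act_max ** alpha) / (w_max ** (1 - alpha)))
    return scales

def apply_smoothing(X, W, scales):
    """Divide each activation channel by its scale and multiply the matching weight row."""
    X_s = [[x / s for x, s in zip(row, scales)] for row in X]
    W_s = [[w * s for w in row] for row, s in zip(W, scales)]
    return X_s, W_s

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Toy layer where input channel 1 carries activation outliers (made-up numbers).
X = [[0.1, 20.0], [-0.2, -16.0]]
W = [[0.5, -0.4], [0.3, 0.2]]
X_s, W_s = apply_smoothing(X, W, smooth_scales(X, W))
print("output before:", matmul(X, W))
print("output after: ", matmul(X_s, W_s))
print("channel-1 activation max before/after:",
      max(abs(r[1]) for r in X), max(abs(r[1]) for r in X_s))
```

The two outputs agree up to floating-point error, while the outlier channel's activation range shrinks, which is what makes 8-bit activation quantization less lossy.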

Training Optimization

Post-training quantization (PTQ) focus; no QAT experiments

Inference Optimization

W8A8 enables low-precision matrix multiply hardware; up to ~2× throughput

Weight-only reduces model memory footprint (≈2× for 8-bit, ≈4× for 4-bit)
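The memory ratios above follow from simple arithmetic on bits per weight. The sketch below works it through for the model sizes listed in this summary; it uses the nominal parameter counts from the model names and ignores quantization scales, activations, and the KV cache, so real footprints will differ somewhat.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9

# Nominal sizes taken from the model names; treated as approximations.
models = {
    "Command R+ (103B)": 103e9,
    "Command R (35B)": 35e9,
    "Aya 23 (35B)": 35e9,
    "Aya 23 (8B)": 8e9,
}

for name, n_params in models.items():
    fp16 = weight_memory_gb(n_params, 16)
    w8 = weight_memory_gb(n_params, 8)
    w4 = weight_memory_gb(n_params, 4)
    print(f"{name}: FP16 {fp16:.1f} GB, W8 {w8:.1f} GB, W4 {w4:.1f} GB")
```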

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

The study covers only two model families (Command R/R+ and Aya 23); generality to other families is suggested but not proven.

Human evaluation covers a subset of languages (English, Spanish, French, Korean, Japanese); under-represented languages may see larger harms.

When Not To Use

Don’t deploy aggressively quantized (4-bit) models for user-facing multilingual products without human checks.

Avoid 4-bit quantization for math/reasoning workloads or safety‑critical multilingual services.

Failure Modes

Large human-perceived quality drops not visible in automatic metrics.

Disparate language regressions: non-Latin scripts (ja/ko/zh) degrade more.

Core Entities

Models

Command R+ (103B), Command R (35B), Aya 23 (35B), Aya 23 (8B)

Metrics

Accuracy, SacreBLEU, Line-level pass rate (LPR), Human win-rate, LLM/RM win-rate

Datasets

mMMLU, MGSM, FLORES-200, Language Confusion, Belebele, Aya Dolly-200, Internal Evaluation Suite

Benchmarks

mMMLU, MGSM, FLORES-200, Language Confusion, Belebele

Context Entities

Models

LLaMA-style baselines (context in related work)

Datasets

mC4 (used to proxy language data size)

