Overview
The study provides consistent evidence from three evaluation methods (automatic benchmarks, LLM/RM judges, and human raters) that quantization degrades quality in multilingual settings; the evidence is strongest for the models and languages that were actually evaluated.
Citations: 2
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 6/6
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
Quantization cuts serving cost but can reduce perceived quality for many languages and for hard tasks; businesses must validate quantized models with human tests in their target languages to avoid degrading user experience.
Summary TLDR
This empirical study measures how post-training quantization (4/8-bit, weight-only and weight+activation) affects multilingual large language models (Command R/R+ and Aya 23). Automatic benchmarks show small average drops, but human raters find much larger quality losses, especially for non-Latin scripts and hard tasks like math. Mitigations (group-wise scaling, SmoothQuant) help but do not fully eliminate harms. The paper urges human-centered multilingual testing before deploying quantized models globally.
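To make the group-wise scaling mitigation concrete, here is a minimal sketch of 4-bit weight-only quantization with one shared scale versus one scale per group. It assumes plain round-to-nearest symmetric quantization, which may differ from the quantizer used in the study; the function names, the toy weights, and the group size of 4 are all illustrative choices, not taken from the paper.

```python
def quantize_symmetric(values, bits=4):
    """Round-to-nearest symmetric quantization with one shared scale (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit codes
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def quantize_groupwise(values, bits=4, group_size=4):
    """Quantize each contiguous group with its own scale, then dequantize."""
    restored = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        codes, scale = quantize_symmetric(group, bits)
        restored.extend(dequantize(codes, scale))
    return restored


def mean_abs_error(original, restored):
    """Average absolute reconstruction error."""
    return sum(abs(x - y) for x, y in zip(original, restored)) / len(original)


# Toy weights (hypothetical): four small values followed by four large ones.
weights = [0.02, -0.01, 0.03, -0.04, 0.9, -1.2, 0.7, 1.1]

per_tensor = dequantize(*quantize_symmetric(weights, bits=4))
per_group = quantize_groupwise(weights, bits=4, group_size=4)

err_tensor = mean_abs_error(weights, per_tensor)
err_group = mean_abs_error(weights, per_group)
print(f"per-tensor error: {err_tensor:.4f}, group-wise error: {err_group:.4f}")
```

With a single scale, the large weights set the step size and the small weights all round to zero; giving each group its own scale keeps the small weights representable. The error shrinks but does not vanish, which matches the summary's point that mitigations help without fully eliminating harms.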
Problem Statement
Quantization is widely used to cut model cost, but prior work mainly evaluates in English. It is unknown whether quantization preserves quality across many languages and realistic prompts. The paper asks: which languages and tasks suffer from quantization, and can common mitigation techniques help?
Main Contribution
Comprehensive multilingual evaluation of post-training quantization on four state-of-the-art multilingual models (from 103B down to 8B parameters) across 20+ languages.
Contrast of automatic metrics, LLM/RM-as-a-Judge, and human evaluation to show automatic metrics understate harms.
Key Findings
Automatic metrics understate human-observed quality drops.
Non-Latin script languages suffer more from quantization.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Human-evaluated quality (Internal suite) | W4-g average -10.5% vs FP16 | FP16 | -10.5% | Internal Evaluation Suite (human annotation) | Table 8; §4.7 | Table 8 |
| Automatic vs human mismatch (Japanese) | auto -1.7% vs human -16.0% | FP16 | auto -1.7% / human -16.0% | mMMLU/MGSM/FLORES vs Internal human prompts | Abstract; Fig.1; §4.7 | Fig.1; Table 8 |
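
The deltas in the table read as percentage changes of the quantized model relative to the FP16 baseline. The sketch below shows that arithmetic and the gap between the automatic and human deltas, assuming the deltas are relative changes; the helper name and the example scores of 50.0 and 42.0 are hypothetical, and only the -1.7% and -16.0% figures come from the table.

```python
def relative_change_pct(quantized_score, baseline_score):
    """Percentage change of the quantized score relative to the baseline."""
    if baseline_score == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * (quantized_score - baseline_score) / baseline_score


# Hypothetical raw scores, chosen only to illustrate the formula.
example_delta = relative_change_pct(quantized_score=42.0, baseline_score=50.0)

# Deltas reported for Japanese in the table above.
auto_delta_ja = -1.7
human_delta_ja = -16.0

# How much larger the human-observed drop is than the automatic one,
# in percentage points.
mismatch_points = human_delta_ja - auto_delta_ja

print(f"example delta: {example_delta:.1f}%")
print(f"human minus automatic (ja): {mismatch_points:.1f} points")
```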
What To Try In 7 Days
Run human pairwise checks on 10–20 realistic prompts per target language after quantizing.
Compare W8, W8A8, and W4 variants; measure both automatic and human win-rates.
Test group-wise scaling and SmoothQuant on a small calibration set, then measure MGSM (math) scores and per-language quality drops.
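The pairwise checks above can be tallied with a few lines of code. This is a minimal sketch, assuming each human judgment is recorded as one of three labels ("quantized", "baseline", or "tie"); the label names, the function name, and the sample annotations are hypothetical and not drawn from the study.

```python
from collections import Counter

ALLOWED_LABELS = ("quantized", "baseline", "tie")


def win_rates(judgments):
    """Return the share of each label among pairwise preference judgments."""
    counts = Counter(judgments)
    unknown = set(counts) - set(ALLOWED_LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments provided")
    return {label: counts[label] / total for label in ALLOWED_LABELS}


# Hypothetical annotations: 10 prompts per language, one judgment per prompt.
judgments_by_language = {
    "ja": ["baseline"] * 6 + ["quantized"] * 2 + ["tie"] * 2,
    "fr": ["baseline"] * 4 + ["quantized"] * 4 + ["tie"] * 2,
}

rates_by_language = {
    lang: win_rates(labels) for lang, labels in judgments_by_language.items()
}

for lang, rates in rates_by_language.items():
    print(lang, {label: round(share, 2) for label, share in rates.items()})
```

Keeping ties as a separate label avoids inflating either side's win rate. With only 10 to 20 prompts per language the rates are noisy, so treat them as a screening signal rather than a measurement.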
Risks & Boundaries
Limitations
The study covers only two model families (Command R/R+ and Aya 23); generalization to other families is suggested but not proven.
Human evaluation covers a subset of languages (English, Spanish, French, Korean, Japanese); under-represented languages may see larger harms.
When Not To Use
Don’t deploy aggressively quantized (4-bit) models for user-facing multilingual products without human checks.
Avoid 4-bit quantization for math/reasoning workloads or safety‑critical multilingual services.
Failure Modes
Large human-perceived quality drops not visible in automatic metrics.
Disparate language regressions: non-Latin scripts (ja/ko/zh) degrade more.

