Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.7
Citation Count
2
Why It Matters For Business
Quantization cuts serving cost but can reduce perceived quality for many languages and for hard tasks; businesses must validate quantized models with human tests in their target languages to avoid degrading user experience.
Summary TLDR
This empirical study measures how post-training quantization (4/8-bit, weight-only and weight+activation) affects multilingual large language models (Command R/R+ and Aya 23). Automatic benchmarks show small average drops, but human raters find much larger quality losses, especially for non-Latin scripts and hard tasks like math. Mitigations (group-wise scaling, SmoothQuant) help but do not fully eliminate harms. The paper urges human-centered multilingual testing before deploying quantized models globally.
Problem Statement
Quantization is widely used to cut model cost, but prior work mainly evaluates in English. It is unknown whether quantization preserves quality across many languages and realistic prompts. The paper asks: which languages and tasks suffer from quantization, and can common mitigation techniques help?
Main Contribution
Comprehensive multilingual evaluation of post-training quantization on four state-of-the-art multilingual models (from 103B down to 8B parameters) across 20+ languages.
Contrast of automatic metrics, LLM/RM-as-a-Judge, and human evaluation to show automatic metrics understate harms.
Analysis of which languages and tasks are most affected and an empirical look at mitigation strategies (group-wise scaling, SmoothQuant).
Key Findings
Automatic metrics understate human-observed quality drops.
Non-Latin script languages suffer more from quantization.
Hard reasoning tasks degrade fastest under quantization.
Mitigations partially recover quality but have trade-offs.
Sometimes quantization improves performance.
Results
Human-evaluated quality (Internal suite)
Automatic vs human mismatch (Japanese)
Math reasoning (MGSM)
Non-Latin vs Latin avg (103B W4)
Mitigation impact (group-wise scaling)
Occasional positive effect
Who Should Care
What To Try In 7 Days
Run human pairwise checks on 10–20 realistic prompts per target language after quantizing.
Compare W8, W8A8, and W4 variants; measure both automatic and human win-rates.
Test group-wise scaling and SmoothQuant on a small calibration set, then measure MGSM (math) accuracy and per-language quality drops.
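The pairwise human check above reduces to a per-language win-rate. A minimal sketch of that bookkeeping follows. The labels (`quantized`, `baseline`, `tie`), the convention of counting a tie as half a win, and the sample records are all assumptions made for illustration, not the paper's protocol or data.

```python
from collections import Counter

VALID_LABELS = {"quantized", "baseline", "tie"}  # hypothetical label set


def win_rate(judgments, ties_count_half=True):
    """Share of comparisons won by the quantized model.

    Assumption: a tie counts as half a win unless ties_count_half is False.
    """
    counts = Counter(judgments)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments given")
    wins = counts["quantized"]
    if ties_count_half:
        wins += 0.5 * counts["tie"]
    return wins / total


def win_rates_by_language(records):
    """records: iterable of (language_code, judgment) pairs."""
    by_language = {}
    for language, judgment in records:
        by_language.setdefault(language, []).append(judgment)
    return {lang: win_rate(js) for lang, js in by_language.items()}


# Made-up rater judgments, for illustration only.
records = [
    ("ja", "baseline"), ("ja", "baseline"), ("ja", "tie"), ("ja", "quantized"),
    ("fr", "quantized"), ("fr", "baseline"), ("fr", "tie"), ("fr", "tie"),
]
rates = win_rates_by_language(records)
for lang, rate in sorted(rates.items()):
    print(f"{lang}: quantized win-rate {rate:.3f}")
```

A win-rate well below 0.5 for one language, next to a flat automatic metric, is the kind of mismatch the study warns about. With only 10–20 prompts per language the estimate is noisy, so treat it as a screening signal.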
Optimization Features
Model Optimization
- Weight-only quantization (W8, W4, W4-g)
- Weight-and-activation quantization (W8A8)
- Per-column vs group-wise scaling
- Quantile quantization (NF4)
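To make the per-column vs group-wise distinction concrete, the sketch below quantizes one toy weight column two ways: with a single scale for the whole column, and with one scale per small group. It assumes symmetric absmax round-to-nearest quantization and invented weights; the paper's exact quantizer and group size may differ.

```python
def quantize_absmax(values, bits=4):
    """Symmetric round-to-nearest quantization with one shared scale.

    Returns the dequantized values so the error can be measured directly.
    """
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        return list(values)             # all-zero input needs no quantization
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def quantize_groupwise(values, group_size, bits=4):
    """Same quantizer, but each group of `group_size` values gets its own scale."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(quantize_absmax(values[start:start + group_size], bits))
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy column: small weights plus a single large outlier (invented values).
column = [0.01, -0.02, 0.03, -0.01, 0.02, 0.015, -0.025, 8.0]

err_column = mse(column, quantize_absmax(column))         # one scale per column
err_group = mse(column, quantize_groupwise(column, 4))    # one scale per 4 values

print(f"per-column MSE: {err_column:.6f}")
print(f"group-wise MSE: {err_group:.6f}")
```

The outlier forces a large scale, so with one scale per column every small weight rounds to zero. Group-wise scaling confines that damage to the outlier's own group, at the cost of storing more scale values.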
System Optimization
- SmoothQuant to smooth activation distributions for better W8A8 behavior
- bitsandbytes LLM.int8() for mixed FP16/INT8 execution
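The idea behind SmoothQuant is to move activation outliers into the weights with a per-input-channel scale s_j = max|X_j|^α / max|W_j|^(1-α), so that X·W is unchanged while the activations become easier to quantize. The sketch below illustrates that rescaling on tiny hand-written matrices. It is pure Python, the values are invented, and α = 0.5 is an assumed default rather than the paper's setting.

```python
def smooth_scales(X, W, alpha=0.5):
    """Per-input-channel smoothing scales.

    X: tokens x in_features, W: in_features x out_features.
    Assumes every channel has a nonzero activation and weight maximum.
    """
    scales = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(w) for w in W[j])
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales


def apply_smoothing(X, W, scales):
    """Divide activation channels by s_j and multiply the matching weight rows by s_j."""
    Xs = [[x / s for x, s in zip(row, scales)] for row in X]
    Ws = [[w * s for w in row] for row, s in zip(W, scales)]
    return Xs, Ws


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def max_abs(M):
    return max(abs(v) for row in M for v in row)


# Invented example: the second activation channel carries large outliers.
X = [[0.1, 20.0], [-0.2, -16.0]]
W = [[0.5, -0.4], [0.02, 0.01]]

scales = smooth_scales(X, W)
Xs, Ws = apply_smoothing(X, W, scales)

Y, Ys = matmul(X, W), matmul(Xs, Ws)
print("activation range before:", max_abs(X), "after:", round(max_abs(Xs), 3))
print("outputs match:", all(abs(a - b) < 1e-9 for r, rs in zip(Y, Ys) for a, b in zip(r, rs)))
```

The scales are estimated from calibration activations, which is why the choice of calibration data (English-only in this study) can matter for other languages.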
Training Optimization
- Post-training quantization (PTQ) focus; no QAT experiments
Inference Optimization
- W8A8 lets inference run on low-precision matrix-multiply hardware, for up to ~2× throughput
- Weight-only quantization shrinks the weight memory footprint relative to a 16-bit baseline (≈2× for 8-bit, ≈4× for 4-bit)
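The footprint ratios above follow from simple arithmetic: weight memory is roughly parameter count × bits per weight ÷ 8. A back-of-the-envelope sketch for the model sizes named in this summary is given below. It assumes a 16-bit baseline and ignores quantization scales, activations, and the KV cache, so real savings will be somewhat smaller.

```python
def weight_memory_gb(n_params, bits):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes).

    Ignores per-group scale/zero-point overhead and any runtime buffers.
    """
    return n_params * bits / 8 / 1e9


# Parameter counts as listed in this summary.
model_sizes = {"103B": 103e9, "35B": 35e9, "8B": 8e9}

footprints = {
    name: {bits: weight_memory_gb(n, bits) for bits in (16, 8, 4)}
    for name, n in model_sizes.items()
}

for name, by_bits in footprints.items():
    row = ", ".join(f"{bits}-bit: {gb:.1f} GB" for bits, gb in by_bits.items())
    print(f"{name}: {row}")
```

Group-wise scaling adds one scale per group of weights, so its footprint is slightly above the 4-bit figure computed here.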
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- The study covers only two model families (Command R/R+ and Aya 23); generalization to other families is suggested but not proven.
- Human evaluation covers a subset of languages (English, Spanish, French, Korean, Japanese); under-represented languages may see larger harms.
- Training mixes are not public, so correlations with exact training data sizes are inferred from proxies (mC4).
- Calibration for quantization used English samples, which may bias mitigation effectiveness for other languages.
When Not To Use
- Don’t deploy aggressively quantized (4-bit) models for user-facing multilingual products without human checks.
- Avoid 4-bit quantization for math/reasoning workloads or safety-critical multilingual services.
- Skip naive per-column 4-bit quantization for non-Latin languages without group-wise scaling or further tuning.
Failure Modes
- Large human-perceived quality drops not visible in automatic metrics.
- Disparate language regressions: non-Latin scripts (ja/ko/zh) degrade more.
- Math and other reasoning failures increase with lower bit-width.
- Automated judges (LLM/RM) can disagree with human raters in some settings.
Core Entities
Models
- Command R+ (103B)
- Command R (35B)
- Aya 23 (35B)
- Aya 23 (8B)
Metrics
- Accuracy
- SacreBLEU
- Line-level pass rate (LPR)
- Human win-rate
- LLM/RM win-rate
Datasets
- mMMLU
- MGSM
- FLORES-200
- Language Confusion
- Belebele
- Aya Dolly-200
- Internal Evaluation Suite
Benchmarks
- mMMLU
- MGSM
- FLORES-200
- Language Confusion
- Belebele
Context Entities
Models
- LLaMA-style baselines (context in related work)
Datasets
- mC4 (used to proxy language data size)

