Quantizing large multilingual LLMs often hides big drops for non‑Latin languages and hard tasks

July 3, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.7

Citation Count

2

Authors

Kelly Marchisio, Saurabh Dash, Hongyu Chen, Dennis Aumiller, Ahmet Üstün, Sara Hooker, Sebastian Ruder

Why It Matters For Business

Quantization cuts serving cost but can reduce perceived quality for many languages and for hard tasks; businesses must validate quantized models with human tests in their target languages to avoid degrading user experience.

Summary TLDR

This empirical study measures how post-training quantization (4/8-bit, weight-only and weight+activation) affects multilingual large language models (Command R/R+ and Aya 23). Automatic benchmarks show small average drops, but human raters find much larger quality losses, especially for non-Latin scripts and hard tasks like math. Mitigations (group-wise scaling, SmoothQuant) help but do not fully eliminate harms. The paper urges human-centered multilingual testing before deploying quantized models globally.

Problem Statement

Quantization is widely used to cut model cost, but prior work mainly evaluates in English. It is unknown whether quantization preserves quality across many languages and realistic prompts. The paper asks: which languages and tasks suffer from quantization, and can common mitigation techniques help?

Main Contribution

Comprehensive multilingual evaluation of post-training quantization on four state-of-the-art multilingual models (8B to 103B parameters) across 20+ languages.

Contrast of automatic metrics, LLM/RM-as-a-Judge, and human evaluation to show automatic metrics understate harms.

Analysis of which languages and tasks are most affected and an empirical look at mitigation strategies (group-wise scaling, SmoothQuant).

Key Findings

Automatic metrics understate human-observed quality drops.

Numbers: Japanese, -1.7% (automatic) vs -16.0% (human)

Non-Latin script languages suffer more from quantization.

Numbers: 103B W4, Latin avg -0.7% vs non-Latin -1.9% (automatic)

Hard reasoning tasks degrade fastest under quantization.

Numbers: 35B W4-g on MGSM, -13.1% (accuracy)

Mitigations partially recover quality but have trade-offs.

Numbers: Group-wise scaling recovers >6 percentage points on MGSM for Latin-script/Indo-European (Ltn/IE) languages

Sometimes quantization improves performance.

Numbers: Aya/35B W8A8, +1.3% avg on some tasks

Results

Human-evaluated quality (Internal suite)

Value: W4-g average -10.5% vs FP16

Baseline: FP16

Automatic vs human mismatch (Japanese)

Value: automatic -1.7% vs human -16.0%

Baseline: FP16

Math reasoning (MGSM)

Value: 35B W4-g, -13.1% accuracy

Baseline: FP16

Non-Latin vs Latin avg (103B W4)

Value: Latin -0.7% vs non-Latin -1.9% (avg automatic)

Baseline: FP16

Mitigation impact (group-wise scaling)

Value: Recovers >6 percentage points on MGSM for Ltn/IE

Baseline: naive per-column W4

Occasional positive effect

Value: Aya/35B W8A8, +1.3% avg on tasks

Baseline: FP16

What To Try In 7 Days

Run human pairwise checks on 10–20 realistic prompts per target language after quantizing.

Compare W8, W8A8, and W4 variants; measure both automatic and human win-rates.

Test group-wise scaling and SmoothQuant on a small calibration set, then measure MGSM (math) accuracy and language-specific quality drops.
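
A per-language win-rate is one simple way to summarize the pairwise human checks suggested above. The sketch below is illustrative only: the function name `win_rates`, the verdict labels, and the choice to count a tie as half a win are assumptions, not the paper's evaluation protocol.

```python
from collections import defaultdict

VALID_VERDICTS = {"quantized", "baseline", "tie"}

def win_rates(judgments):
    """Return {language: win rate of the quantized model}.

    judgments: iterable of (language, verdict) pairs, where verdict is
    "quantized", "baseline", or "tie". Ties count as half a win
    (an assumption made for this sketch).
    """
    wins = defaultdict(float)
    totals = defaultdict(int)
    for language, verdict in judgments:
        if verdict not in VALID_VERDICTS:
            raise ValueError(f"unknown verdict: {verdict!r}")
        totals[language] += 1
        if verdict == "quantized":
            wins[language] += 1.0
        elif verdict == "tie":
            wins[language] += 0.5
    return {lang: wins[lang] / totals[lang] for lang in totals}

# Hypothetical rater verdicts, not data from the paper.
example = [
    ("ja", "baseline"), ("ja", "baseline"), ("ja", "quantized"), ("ja", "tie"),
    ("fr", "quantized"), ("fr", "tie"),
]
print(win_rates(example))
```

A win rate well below 0.5 for one language while others stay near 0.5 is the kind of disparity the study reports; with only 10–20 prompts per language the estimate is noisy, so treat it as a screening signal rather than a measurement.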

Optimization Features

Model Optimization

  • Weight-only quantization (W8, W4, W4-g)
  • Weight-and-activation quantization (W8A8)
  • Per-column vs group-wise scaling
  • Quantile quantization (NF4)
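
To make the per-column vs group-wise distinction concrete, here is a minimal NumPy sketch of symmetric absmax quantization that is simulated in floating point (quantize, then dequantize). It is not the paper's implementation: the function name `quantize_absmax`, the weight layout `(in_features, out_features)`, and the group size of 32 are assumptions chosen for illustration.

```python
import numpy as np

def quantize_absmax(w, group_size=None, bits=4):
    """Simulated symmetric absmax quantization of a 2-D weight matrix.

    w: array of shape (in_features, out_features).
    group_size=None -> one scale per column (per-column scaling).
    group_size=g    -> one scale per block of g consecutive rows
                       within each column (group-wise scaling).
    Returns the dequantized weights, same shape as w.
    """
    qmax = 2 ** (bits - 1) - 1                # 7 for 4-bit
    rows, cols = w.shape
    g = rows if group_size is None else group_size
    if rows % g != 0:
        raise ValueError("in_features must be divisible by group_size")
    grouped = w.reshape(rows // g, g, cols)   # [group, row in group, column]
    scale = np.abs(grouped).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid dividing by zero
    q = np.clip(np.round(grouped / scale), -qmax, qmax)
    return (q * scale).reshape(rows, cols)

# Synthetic weights with one outlier row, to show why finer scales help.
rng = np.random.default_rng(0)
w = rng.normal(size=(128, 8))
w[0, :] = 20.0
mse_col = np.mean((w - quantize_absmax(w)) ** 2)
mse_grp = np.mean((w - quantize_absmax(w, group_size=32)) ** 2)
print(f"per-column MSE: {mse_col:.4f}, group-wise MSE: {mse_grp:.4f}")
```

With a single scale per column, one outlier stretches the quantization grid for every weight in that column. Group-wise scaling confines the damage to the outlier's group, at the cost of storing more scale values.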

System Optimization

  • SmoothQuant to smooth activation distributions for better W8A8 behavior
  • bitsandbytes LLM.int8() for mixed FP16/8 execution
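
The idea behind SmoothQuant can be sketched in a few lines: activation outliers are shifted into the weights using a per-input-channel factor s_j = max|X_j|^α / max|W_j|^(1-α), which leaves the layer output unchanged. The NumPy code below is a simplified illustration, not the reference implementation or the paper's setup: the function name `smooth`, the `eps` guard, and α = 0.5 are assumptions.

```python
import numpy as np

def smooth(x, w, alpha=0.5, eps=1e-8):
    """Rescale activations and weights so that x @ w is unchanged.

    x: calibration activations, shape (tokens, in_features).
    w: weights, shape (in_features, out_features).
    Returns (x_smoothed, w_smoothed, s), with s of shape (in_features,).
    """
    act_max = np.maximum(np.abs(x).max(axis=0), eps)  # per input channel
    w_max = np.maximum(np.abs(w).max(axis=1), eps)    # per input channel
    s = act_max ** alpha / w_max ** (1.0 - alpha)
    return x / s, w * s[:, None], s

# Synthetic activations with one outlier channel.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))
x[:, 3] *= 50.0
w = rng.normal(size=(16, 8))

x_s, w_s, s = smooth(x, w)
print("output preserved:", np.allclose(x_s @ w_s, x @ w))
print("activation absmax before/after:", np.abs(x).max(), np.abs(x_s).max())
```

Here the division is applied to the activations directly for clarity; in a deployed model it would normally be absorbed into the preceding layer's parameters so that no extra work happens at inference time. Since the factors depend on calibration activations, an English-only calibration set (noted under Limitations) may yield factors that fit other languages less well.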

Training Optimization

  • Post-training quantization (PTQ) focus; no QAT experiments

Inference Optimization

  • W8A8 allows the use of low-precision matrix-multiply hardware, giving up to ~2× throughput
  • Weight-only quantization reduces the model's weight memory footprint relative to FP16 (≈2× for 8-bit, ≈4× for 4-bit)
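
The ≈2× and ≈4× figures follow from simple arithmetic on bits per weight. The sketch below computes weight storage only; it deliberately ignores quantization scales, activations, and the KV cache, so real footprints will be somewhat larger. The helper name `weight_memory_gb` is hypothetical, and the parameter count is the 103B figure listed for Command R+.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

params = 103e9  # Command R+ (103B)
for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(params, bits):.1f} GB")
# FP16: 206.0 GB
# 8-bit: 103.0 GB
# 4-bit: 51.5 GB
```

Group-wise scaling adds a small overhead on top of these numbers, because one scale value is stored per group rather than per column.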

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • The study covers only two model families (Command R/R+ and Aya 23); generality to other families is suggested but not proven.
  • Human evaluation covers a subset of languages (English, Spanish, French, Korean, Japanese); under-represented languages may see larger harms.
  • Training mixes are not public, so correlations with exact training data sizes are inferred from proxies (mC4).
  • Calibration for quantization used English samples, which may bias mitigation effectiveness for other languages.

When Not To Use

  • Don’t deploy aggressively quantized (4-bit) models for user-facing multilingual products without human checks.
  • Avoid 4-bit quantization for math/reasoning workloads or safety‑critical multilingual services.
  • Skip naive per-column 4-bit quantization for non-Latin languages without group-wise scaling or further tuning.

Failure Modes

  • Large human-perceived quality drops not visible in automatic metrics.
  • Disparate language regressions: non-Latin scripts (ja/ko/zh) degrade more.
  • Math and other reasoning failures increase with lower bit-width.
  • Automated judges (LLM/RM) can disagree with human raters in some settings.

Core Entities

Models

  • Command R+ (103B)
  • Command R (35B)
  • Aya 23 (35B)
  • Aya 23 (8B)

Metrics

  • Accuracy
  • SacreBLEU
  • Line-level pass rate (LPR)
  • Human win-rate
  • LLM/RM win-rate

Datasets

  • mMMLU
  • MGSM
  • FLORES-200
  • Language Confusion
  • Belebele
  • Aya Dolly-200
  • Internal Evaluation Suite

Benchmarks

  • mMMLU
  • MGSM
  • FLORES-200
  • Language Confusion
  • Belebele

Context Entities

Models

  • LLaMA-style baselines (context in related work)

Datasets

  • mC4 (used to proxy language data size)