Compression can preserve or break LLM trust: 4-bit quantization often keeps or even improves ethics/fairness, pruning and 3-bit quantization

March 18, 2024 · 9 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

4

Authors

Junyuan Hong, Jinhao Duan, Chenhui Zhang, Zhangheng Li, Chulin Xie, Kelsey Lieberman, James Diffenderfer, Brian Bartoldson, Ajay Jaiswal, Kaidi Xu, Bhavya Kailkhura, Dan Hendrycks, Dawn Song, Zhangyang Wang, Bo Li

Links

Abstract / PDF

Why It Matters For Business

Compression can cut serving cost and enable deployment on consumer GPUs, but it can also change model safety in ways that accuracy tests miss. Choose compression methods and bit widths using trust tests, not MMLU alone.

Summary TLDR

The paper systematically measures how popular, training-free compression techniques change an LLM's trustworthiness across eight dimensions (toxicity, fairness, ethics, privacy, adversarial/OOD robustness, stereotypes, robustness to adversarial demonstrations) plus standard utility (MMLU). Main takeaways: weight quantization (especially 4-bit, using AWQ/GPTQ) usually preserves benign accuracy and often preserves or improves some trust metrics; structured pruning (N:M, 50%) tends to degrade trust; extreme 3-bit quantization can cause large, unpredictable safety failures (GPTQ can break instruction-following and raise toxicity). The authors release code, models, and a modified DecodingTrust/ M

Problem Statement

We lack a comprehensive, multi-dimension view of how compression changes LLM trust. Developers compress large models to save compute, but most evaluations only check accuracy or perplexity. This leaves safety risks (toxicity, privacy, fairness, robustness, ethics, etc.) unmeasured and possibly hidden.

Main Contribution

A broad, systematic benchmark of compression effects on trustworthiness: 3 popular 13B models, 5 training-free compression methods, and 8 trust metrics (DecodingTrust) plus MMLU.

Empirical finding that quantization (4-bit, AWQ/GPTQ) often preserves utility and can improve some trust metrics, while pruning at hardware-friendly patterns (2:4) often harms trust.

Practical guidance: recommended compression regimes (4-bit sweet spot), caveats about 3-bit quantization, and a short checklist for trustworthy compression.

Key Findings

4-bit post-training quantization usually preserves trustworthiness within small margins.

Numbers: ≤5-point drop across 8 trust metrics (LLAMA2 13b Chat, 4-bit)

Quantization can sometimes improve ethics or fairness.

Numbers: Ethics 54.1 → 76.3 (GPTQ 4-bit); Fairness EOD reduced by >0.2 (few-shot)

Pruning at practical 2:4 (50%) sparsity often degrades trust and utility.

Numbers: Trust drops up to ~40 points on some metrics for pruning methods at 50% sparsity

Extreme 3-bit quantization can cause catastrophic trust failures, especially with GPTQ.

Numbers: Toxicity/OOD drops up to ~30–50 points for GPTQ 3-bit; large variance across seeds

AWQ is more stable than GPTQ at low bits; GPTQ is sensitive to calibration randomness.

Numbers: AWQ shows smaller drops and lower variance than GPTQ at 3/4 bits

Standard utility metrics (MMLU) miss many safety failures introduced by compression.

Numbers: Models with small MMLU drops can still have large trust drops (up to tens of points)
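
One way to act on the last finding is to gate a compressed model on every trust dimension separately instead of on a single utility number. The following is a minimal sketch under stated assumptions: the function name `flag_trust_regressions`, the 5-point threshold, and all scores are illustrative placeholders, not values or tooling from the paper.

```python
from typing import Dict, List


def flag_trust_regressions(
    dense: Dict[str, float],
    compressed: Dict[str, float],
    max_drop: float = 5.0,
) -> List[str]:
    """Return the metrics whose score fell by more than max_drop points."""
    flagged = []
    for name, base in dense.items():
        drop = base - compressed[name]  # positive means the compressed model is worse
        if drop > max_drop:
            flagged.append(name)
    return sorted(flagged)


# Placeholder scores on a 0-100 scale (higher is better); NOT results from the paper.
dense_scores = {"mmlu": 50.0, "toxicity": 80.0, "ood": 70.0, "ethics": 60.0}
compressed_scores = {"mmlu": 48.5, "toxicity": 45.0, "ood": 66.0, "ethics": 62.0}

regressions = flag_trust_regressions(dense_scores, compressed_scores)
print(regressions)
```

With these placeholder numbers the MMLU drop is small and passes the gate, while the toxicity score is flagged, which is the pattern the finding warns about.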

Results

Trustworthiness preservation at 4-bit

Value: ≤5-point average drop across 8 trust metrics (LLAMA2 13b Chat, 4-bit)

Baseline: 13b dense

Ethics improvement (example)

Value: 54.1 → 76.3 (GPTQ 4-bit for LLAMA2 13b Chat)

Baseline: 13b dense, Ethics score 54.1

Pruning harm at 50% (2:4 structured)

Value: Trust drops up to ~40 points on some metrics

Baseline: 13b dense

Extreme quantization worst-case degradation

Value: GPTQ 3-bit: toxicity/OOD drops ~30–50 points; AWQ less severe

Baseline: 13b dense

Inference speedup example

Value: ~3.2–3.3× speedup for 13b→4-bit (AWQ) vs FP16

Baseline: Hugging Face FP16 13b
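
The speedup above is a measured figure; the memory side of the trade-off can be estimated with simple arithmetic. This sketch assumes a nominal 13e9 parameters and counts weight storage only. It ignores quantization scales and zero-points, activations, and the KV cache, so real footprints will be somewhat larger.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params_13b = 13e9  # nominal parameter count, an assumption for illustration

fp16_gb = weight_memory_gb(params_13b, 16)
int4_gb = weight_memory_gb(params_13b, 4)
int3_gb = weight_memory_gb(params_13b, 3)

print(f"FP16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB, 3-bit: {int3_gb:.2f} GB")
```

Under these assumptions, moving from FP16 to 4-bit shrinks the weights from about 26 GB to about 6.5 GB. Going further to 3-bit saves only about 1.6 GB more, which is a small gain relative to the trust failures reported at that bit width.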

Who Should Care

What To Try In 7 Days

Quantize a production 13B model to 4-bit (AWQ) and run your safety suite (toxicity, privacy, fairness, OOD).

Avoid one-shot structured pruning at 2:4 without per-metric validation; compare trust metrics before/after.

If using GPTQ, run compression with multiple calibration seeds and measure variance in trust scores.
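
The multi-seed check can be sketched as a small harness that repeats compression and evaluation per seed and summarizes the spread of each trust score. Everything here is hypothetical scaffolding: `run_and_score` stands for your own "quantize with this calibration seed, then run the safety suite" step, and the stand-in below returns placeholder numbers, not results from the paper.

```python
import statistics
from typing import Callable, Dict, List, Sequence


def trust_variance_across_seeds(
    run_and_score: Callable[[int], Dict[str, float]],
    seeds: Sequence[int],
) -> Dict[str, Dict[str, float]]:
    """Call run_and_score once per seed and summarize each metric."""
    per_metric: Dict[str, List[float]] = {}
    for seed in seeds:
        for name, value in run_and_score(seed).items():
            per_metric.setdefault(name, []).append(value)
    return {
        name: {
            "mean": statistics.mean(values),
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
            "min": min(values),
            "max": max(values),
        }
        for name, values in per_metric.items()
    }


def fake_run_and_score(seed: int) -> Dict[str, float]:
    """Hypothetical stand-in; replace with real compression plus evaluation."""
    placeholder_toxicity = {0: 70.0, 1: 40.0, 2: 55.0}
    return {"toxicity": placeholder_toxicity[seed], "fairness": 80.0}


summary = trust_variance_across_seeds(fake_run_and_score, seeds=[0, 1, 2])
print(summary["toxicity"])
```

Reporting the minimum alongside the mean matters here: a deployment ships one compressed checkpoint, so the worst seed is as relevant as the average.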

Optimization Features

Infra Optimization

  • example speedup: 3.2–3.3× inference for 13B→4-bit AWQ vs FP16

Model Optimization

  • post-training weight quantization (GPTQ, AWQ)
  • semi-structured pruning (2:4 N:M) with SparseGPT/Wanda
  • magnitude pruning baseline
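
To make the 2:4 pattern concrete, here is a minimal pure-Python sketch of the magnitude baseline: within every group of 4 consecutive weights, the 2 with the largest absolute value are kept and the rest are zeroed. It is illustrative only. SparseGPT and Wanda choose which weights to drop using calibration data rather than raw magnitude, and `prune_n_of_m` is a name made up for this example.

```python
from typing import List


def prune_n_of_m(row: List[float], n: int = 2, m: int = 4) -> List[float]:
    """Keep the n largest-magnitude weights in each group of m; zero the rest."""
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    pruned: List[float] = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n largest |w| within this group.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse = prune_n_of_m(weights)
print(sparse)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Every group ends up with exactly two zeros, which is the 50% sparsity level the findings report as harmful to trust when applied in one shot.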

System Optimization

  • activation-aware scaling (AWQ) to preserve salient weights

Training Optimization

  • none (focus on training-free, post-training methods)

Inference Optimization

  • lower-bit weight inference (3/4/8-bit)
  • hardware-friendly N:M structured sparsity for speedups
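
The effect of bit width on weight precision can be illustrated with plain round-to-nearest uniform quantization of a single weight group. This is a deliberately simplified sketch and is neither GPTQ nor AWQ: both add calibration-driven steps on top of the basic rounding shown here. The function names and example weights are assumptions for illustration.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Asymmetric uniform quantization of one weight group to `bits` bits."""
    levels = (1 << bits) - 1  # highest integer code, e.g. 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes: List[int], scale: float, w_min: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + w_min for c in codes]


def max_abs_error(weights: List[float], bits: int) -> float:
    """Largest reconstruction error after a quantize/dequantize round trip."""
    codes, scale, w_min = quantize_group(weights, bits)
    restored = dequantize_group(codes, scale, w_min)
    return max(abs(a - b) for a, b in zip(weights, restored))


group = [-0.50, -0.20, -0.05, 0.00, 0.10, 0.25, 0.40, 0.50]
for bits in (8, 4, 3):
    print(bits, max_abs_error(group, bits))
```

The worst-case rounding error is half a quantization step. The step roughly doubles with each bit removed, so the error bound at 3-bit is about twice the bound at 4-bit.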

Reproducibility

Data Urls

  • modified DecodingTrust on project website (link above)
  • MMLU public dataset

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Focus on training-free, post-training compression (GPTQ, AWQ, SparseGPT, Wanda); does not evaluate distillation or retraining-based compression.
  • Most experiments center on 13B-class models; results may not generalize to much smaller or much larger models.
  • Calibration randomness causes large variance; some results depend on calibration set choice and seeds.
  • Privacy and some OOD measures are sensitive to evaluation details and refusal-rate handling.

When Not To Use

  • When you require extreme compression (3-bit) in safety-sensitive systems without extensive trust testing.
  • When applying aggressive structured pruning (50% N:M) without per-dimension validation.
  • When the source dense model is unaligned: compression preserves the source model's weaknesses.

Failure Modes

  • Loss of instruction-following after aggressive quantization, producing malformed or unsafe outputs.
  • Sharp increases in toxicity when using GPTQ at 3-bit due to low refusal rates.
  • Higher privacy leakage or lower PII protection at low bit quantization in some setups.
  • Large run-to-run variance tied to calibration data causing unpredictable trust outcomes.

Core Entities

Models

  • LLAMA2 13b
  • LLAMA2 13b Chat
  • Vicuna 13b Chat

Metrics

  • Normalized DecodingTrust scores (0-100 points)
  • Accuracy
  • Equalized Odds Difference (EOD)
  • False Positive Rate (FPR) for Ethics
  • Refusal rate
  • MT-Bench score (1-10)
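
For readers unfamiliar with EOD, the sketch below computes it under one common definition: the larger of the true-positive-rate gap and the false-positive-rate gap between groups. This is an assumption about the definition; the benchmark's exact implementation may differ in details. The labels, predictions, and group names are made-up illustrative data.

```python
from typing import List, Sequence


def _positive_rate(
    preds: Sequence[int],
    labels: Sequence[int],
    groups: Sequence[str],
    group: str,
    label_value: int,
) -> float:
    """P(pred = 1 | label = label_value, group): TPR if label_value is 1, FPR if 0."""
    selected = [
        p for p, y, g in zip(preds, labels, groups)
        if g == group and y == label_value
    ]
    if not selected:
        raise ValueError(f"no examples with label {label_value} in group {group!r}")
    return sum(selected) / len(selected)


def equalized_odds_difference(
    preds: Sequence[int], labels: Sequence[int], groups: Sequence[str]
) -> float:
    """Larger of the TPR gap and the FPR gap across groups (0 is perfectly fair)."""
    names: List[str] = sorted(set(groups))
    tprs = [_positive_rate(preds, labels, groups, g, 1) for g in names]
    fprs = [_positive_rate(preds, labels, groups, g, 0) for g in names]
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))


labels = [1, 1, 0, 0, 1, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
preds = [1, 1, 0, 0, 1, 0, 1, 0]

eod = equalized_odds_difference(preds, labels, groups)
print(eod)  # 0.5
```

In this toy data, group "a" is classified perfectly while group "b" has a TPR and an FPR of 0.5 each, giving an EOD of 0.5. A reduction of more than 0.2, as reported for few-shot fairness, is therefore a large move on a 0 to 1 scale.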

Datasets

  • DecodingTrust (modified)
  • MMLU
  • C4 (calibration sets)
  • Enron PII (privacy tests)
  • RealtimeQA (OOD knowledge)

Benchmarks

  • DecodingTrust (8 trust dimensions)
  • MMLU
  • AdvGLUE++
  • AdvDemonstration
  • MT-Bench (instruction-following probe)