Learn orthonormal rotations to remove outliers and make 4-bit LLMs accurate and fast

May 26, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The method is simple to add (learn small rotation matrices), works across many open LLMs, and needs only modest calibration data and time. The gains are backed by results on multiple models and tasks, and the method is compatible with GPTQ.

Citations: 6

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 85%

Novelty: 70%

Authors

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, Tijmen Blankevoort

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SpinQuant makes extreme low-bit LLM inference practical: big memory and latency savings with near-full accuracy, using a small calibration step and without changing model APIs.

Who Should Care

Summary TLDR

SpinQuant inserts small learned rotation matrices into transformer residuals and attention heads to spread out activation and weight outliers, then optimizes those rotations on the Stiefel manifold (Cayley SGD). This makes post-training quantization far more reliable. On many models (LLaMA-2/3, Mistral) SpinQuant closes most of the accuracy gap for extreme 4-bit quantization (weights, activations, KV cache). It is compatible with standard weight-quantizers (GPTQ), adds little inference overhead when rotations are merged into weights, and needs only a few minutes to a few hours to optimize depending on model size.
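The claim that merged rotations leave the model's outputs unchanged can be illustrated with a minimal NumPy sketch. It assumes a toy pair of linear maps with no normalization or nonlinearity between them; the names `W_prev` and `W_next` are hypothetical, and `R` is a random orthonormal matrix standing in for a learned rotation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_model, d_out = 16, 32, 8

# Hypothetical toy weights: W_prev writes into a hidden stream, W_next reads from it.
W_prev = rng.standard_normal((d_in, d_model))
W_next = rng.standard_normal((d_model, d_out))

# Random orthonormal matrix from a QR decomposition (stand-in for a learned rotation).
R, _ = np.linalg.qr(rng.standard_normal((d_model, d_model)))

# Fold the rotation into the weights offline, so no extra op runs at inference time.
W_prev_rot = W_prev @ R      # the producer now emits rotated activations
W_next_rot = R.T @ W_next    # the consumer undoes the rotation

x = rng.standard_normal((4, d_in))
y_ref = (x @ W_prev) @ W_next
y_rot = (x @ W_prev_rot) @ W_next_rot

max_diff = float(np.max(np.abs(y_ref - y_rot)))
print(f"max abs difference between original and rotated outputs: {max_diff:.2e}")
```

Because R is orthonormal, R times its transpose is the identity, so the two products agree up to floating-point rounding. Only the intermediate activations and the stored weights change, which is what quantization sees.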

Problem Statement

Post-training quantization of LLMs reduces cost but fails when activation or weight outliers blow up the quantization range. Random rotations can reduce outliers but give high variance and inconsistent results. The paper asks: can we learn rotation matrices that (1) do not change full-precision outputs, (2) reduce outliers, and (3) minimize quantized-network loss to make low-bit LLMs accurate and stable?
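A small synthetic experiment shows why outliers hurt and why a rotation helps. This is a sketch under stated assumptions, not the paper's setup: it uses symmetric per-token round-to-nearest 4-bit quantization, one artificially inflated channel, and a random (not learned) orthonormal rotation. `fake_quant` is a hypothetical helper.

```python
import numpy as np

def fake_quant(x, bits=4):
    """Symmetric per-row round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x), axis=-1, keepdims=True) / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
d = 256
x = rng.standard_normal((64, d))
x[:, 3] += 30.0  # synthetic outlier channel (assumption, for illustration only)

# Random orthonormal rotation from a QR decomposition.
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Quantize directly: the outlier sets the scale, so small values collapse to zero.
err_plain = float(np.mean((fake_quant(x) - x) ** 2))

# Quantize in the rotated basis, then rotate back to compare in the same basis.
err_rot = float(np.mean((fake_quant(x @ R) @ R.T - x) ** 2))

print(f"MSE without rotation: {err_plain:.4f}")
print(f"MSE with rotation:    {err_rot:.4f}")
```

The rotation spreads the outlier's energy across all channels, which shrinks the per-row quantization step. A different random seed gives a different rotation and a different error, which is the variance the learned rotations are meant to remove.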

Main Contribution

Define rotation parameterizations for transformer residuals and attention that leave the full-precision output numerically unchanged but reduce outliers for quantization.

Introduce SpinQuant: learn orthonormal rotations (R1,R2) on the Stiefel manifold via Cayley SGD to directly minimize quantized-network loss, with optional online Hadamard rotations (R3,R4) for extreme activation/KV quantization.
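The idea of optimizing a rotation while keeping it orthonormal can be sketched with a plain Cayley-transform gradient step. This is a simplified illustration, not the paper's implementation: it has no momentum, uses a normalized step size, and replaces the quantized-network loss with a smooth toy proxy (mean fourth power of the rotated activations). All function names are hypothetical.

```python
import numpy as np

def cayley_step(R, G, lr):
    """One descent step on R that stays orthonormal via the Cayley transform."""
    A = G @ R.T - R @ G.T                    # skew-symmetric direction: A.T == -A
    A = A / (np.linalg.norm(A) + 1e-12)      # normalized step (illustrative choice)
    eye = np.eye(R.shape[0])
    # (I + lr/2 A)^-1 (I - lr/2 A) is orthogonal for any lr because A is skew-symmetric.
    return np.linalg.solve(eye + 0.5 * lr * A, (eye - 0.5 * lr * A) @ R)

def outlier_loss(X, R):
    """Toy proxy for outlier severity: mean fourth power of rotated activations."""
    return float(np.mean((X @ R) ** 4))

def outlier_grad(X, R):
    """Gradient of outlier_loss with respect to R."""
    Z = X @ R
    return X.T @ (4.0 * Z ** 3) / Z.size

rng = np.random.default_rng(0)
d = 16
X = rng.standard_normal((64, d))
X[:, 3] += 10.0                              # synthetic outlier channel (assumption)

R = np.eye(d)                                # start from the unrotated model
loss_before = outlier_loss(X, R)
for _ in range(200):
    R = cayley_step(R, outlier_grad(X, R), lr=0.05)
loss_after = outlier_loss(X, R)

ortho_error = float(np.max(np.abs(R.T @ R - np.eye(d))))
print(f"loss before: {loss_before:.1f}, loss after: {loss_after:.1f}")
print(f"max deviation from orthonormality: {ortho_error:.2e}")
```

The design point is that the update multiplies R by an orthogonal matrix, so no re-projection or re-orthogonalization is needed after each step.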

Key Findings

Learned rotations reduce the zero-shot accuracy gap to full precision to 2.9 points on LLaMA-2 7B in W4A4KV4.

Numbers: W4A4KV4 gap = 2.9 points (LLaMA-2 7B)

Practical Use: You can run LLaMA-2 7B with 4-bit weights, activations, and KV cache at near-full accuracy using SpinQuant; use the had variant (with online Hadamard rotations) for best accuracy.

Evidence Ref: Abstract; Sec.4.2; Table 1

SpinQuant outperforms prior PTQ baselines by large margins on extreme 4-bit settings.

Numbers: beats LLM-QAT by 19.1 pts and SmoothQuant by 25.0 pts (LLaMA-2 7B, W4A4KV4)

Practical Use: If prior PTQ methods fail on W4A4KV4, try SpinQuant to regain substantial accuracy with modest extra calibration work.

Evidence Ref: Abstract; Sec.4.2; Table 1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 2.9-point gap (LLaMA-2 7B, W4A4KV4, SpinQuant had) | full-precision | - | 8 zero-shot commonsense tasks | Sec.4.2; Table 1 | Table 1
Improvement over LLM-QAT / SmoothQuant | ↑19.1 pts vs LLM-QAT; ↑25.0 pts vs SmoothQuant (LLaMA-2 7B, W4A4KV4) | LLM-QAT, SmoothQuant | 19.1 / 25.0 | 8 zero-shot commonsense tasks | Abstract; Sec.4.2; Table 1 | Abstract; Table 1

What To Try In 7 Days

Run SpinQuant no_had on a 7B/3B model: optimize rotations on 800 WikiText2 samples (100 iters) and measure zero-shot task accuracy.

Combine learned rotations with your existing GPTQ weight quantizer to test W4A8 and W4A4KV4 trade-offs.

Benchmark CPU latency with and without online Hadamard to decide between no_had (faster) and had (more accurate).

Optimization Features

Token Efficiency
Accuracy
Infra Optimization
no special kernels required for SpinQuant no_had; optional Hadamard kernels speed up had variant
Model Optimization
learned orthonormal rotations (R1,R2); online Hadamard rotations (R3,R4) for activations/KV
System Optimization
compatible with GPTQ weight post-training quantizer
Training Optimization
Cayley SGD on Stiefel manifold (optimize rotations only); uses small calibration set (128–800 samples) and ~100 iterations
Inference Optimization
merge rotation into weights (no runtime change) for no_had; fast Hadamard transform when online rotations are used (~8% extra latency)
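For readers unfamiliar with the fast Hadamard transform mentioned above, here is a minimal NumPy sketch of the textbook butterfly algorithm. It is an illustration only: the function name `fwht` is hypothetical, it requires a power-of-two length, and production kernels would be fused GPU or CPU implementations rather than this loop.

```python
import numpy as np

def fwht(x):
    """Normalized fast Walsh-Hadamard transform along the last axis.

    The last dimension must be a power of two. Cost is O(n log n) per vector,
    versus O(n^2) for multiplying by an explicit Hadamard matrix.
    """
    x = np.array(x, dtype=np.float64)        # work on a copy
    n = x.shape[-1]
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    lead = x.shape[:-1]
    h = 1
    while h < n:
        # View the data as blocks of size 2h, each split into two halves of size h.
        y = x.reshape(*lead, n // (2 * h), 2, h)
        a = y[..., 0, :] + y[..., 1, :]      # butterfly: sum of the two halves
        b = y[..., 0, :] - y[..., 1, :]      # butterfly: difference of the two halves
        x = np.stack([a, b], axis=-2).reshape(*lead, n)
        h *= 2
    return x / np.sqrt(n)                    # scaling makes the transform orthonormal

# A single large spike is spread evenly over all positions.
spike = np.zeros(8)
spike[0] = 8.0
spread = fwht(spike)
print(spread)

# The normalized transform is its own inverse.
v = np.random.default_rng(0).standard_normal((3, 16))
roundtrip_error = float(np.max(np.abs(fwht(fwht(v)) - v)))
print(f"round-trip error: {roundtrip_error:.2e}")
```

Because the transform has a fixed structure and no learned parameters, it can be applied online to activations or the KV cache at a cost that grows as n log n rather than n squared.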

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Data URLs

WikiText2 (used for calibration and evaluation); C4 (used in ablation)

Risks & Boundaries

Limitations

Requires a small calibration dataset and a short optimization run (minutes to hours depending on model size).

Online Hadamard rotations add ≈8% inference latency; weigh accuracy vs latency.

When Not To Use

If you cannot spend any time on calibration or optimization (not even a few minutes), skip rotation learning.

If you need the had variant's extra accuracy but cannot accept its roughly 8% latency overhead.

Failure Modes

An insufficient optimization budget (too few samples or iterations) can yield suboptimal rotations.

Rotation may improve SNR for important layers but worsen less-critical layers; overall impact depends on layer sensitivity.
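One way to check the per-layer effect described above is to measure the signal-to-quantization-noise ratio (in dB) of each weight matrix before and after rotation. The sketch below is an assumption-laden illustration: the two "layers" are synthetic matrices, quantization is symmetric per-row 4-bit round-to-nearest, the rotation is random rather than learned, and `fake_quant` and `sqnr_db` are hypothetical helpers.

```python
import numpy as np

def fake_quant(w, bits=4):
    """Symmetric per-row round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w), axis=-1, keepdims=True) / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def sqnr_db(w, w_q):
    """Signal-to-quantization-noise ratio in decibels."""
    signal = np.sum(w ** 2)
    noise = np.sum((w - w_q) ** 2)
    return float(10.0 * np.log10(signal / noise))

rng = np.random.default_rng(0)
d = 128
layers = {
    "outlier_layer": rng.standard_normal((d, d)),
    "gaussian_layer": rng.standard_normal((d, d)),
}
layers["outlier_layer"][:, 5] += 25.0   # synthetic outlier column (assumption)

R, _ = np.linalg.qr(rng.standard_normal((d, d)))

report = {}
for name, W in layers.items():
    before = sqnr_db(W, fake_quant(W))
    W_rot = W @ R
    after = sqnr_db(W_rot, fake_quant(W_rot))
    report[name] = (before, after)
    print(f"{name}: {before:.1f} dB -> {after:.1f} dB")
```

A per-layer report like this makes it visible when a rotation helps an outlier-heavy layer but leaves an already well-behaved layer roughly unchanged or slightly worse.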

Core Entities

Models

LLaMA-2 7B, LLaMA-2 13B, LLaMA-2 70B, LLaMA-3 1B, LLaMA-3 3B, LLaMA-3 8B, Mistral-7B

Metrics

Accuracy, WikiText2 perplexity, signal-to-quantization-noise ratio (dB), inference ms/token

Datasets

WikiText2, C4

Benchmarks

BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC-easy, ARC-challenge, OBQA