Learn orthonormal rotations to remove outliers and make 4-bit LLMs accurate and fast

May 26, 2024 · 8 min

Overview

Production Readiness

0.85

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

6

Authors

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, Tijmen Blankevoort

Why It Matters For Business

SpinQuant makes extreme low-bit LLM inference practical: big memory and latency savings with near-full accuracy, using a small calibration step and without changing model APIs.

Summary TLDR

SpinQuant inserts small learned rotation matrices into transformer residuals and attention heads to spread out activation and weight outliers, then optimizes those rotations on the Stiefel manifold (Cayley SGD). This makes post-training quantization far more reliable. On many models (LLaMA-2/3, Mistral) SpinQuant closes most of the accuracy gap for extreme 4-bit quantization (weights, activations, KV cache). It is compatible with standard weight-quantizers (GPTQ), adds little inference overhead when rotations are merged into weights, and needs only a few minutes to a few hours to optimize depending on model size.

Problem Statement

Post-training quantization of LLMs reduces cost but fails when activation or weight outliers blow up the quantization range. Random rotations can reduce outliers but give high variance and inconsistent results. The paper asks: can we learn rotation matrices that (1) do not change full-precision outputs, (2) reduce outliers, and (3) minimize quantized-network loss to make low-bit LLMs accurate and stable?
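
Requirement (1) can be checked on a single linear layer. The sketch below is not taken from the paper's code: it uses NumPy, a random orthogonal matrix in place of a learned one, and synthetic activations with one inflated channel (all names, sizes, and the `peak_to_rms` helper are illustrative assumptions). It rotates the activations, folds the inverse rotation into the weights, and compares the peak-to-RMS ratio before and after.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Draw a random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Fix column signs so the result does not depend on the QR sign convention.
    return q * np.sign(np.diag(r))

def peak_to_rms(a):
    """Largest magnitude divided by root-mean-square: a crude outlier measure."""
    return np.abs(a).max() / np.sqrt(np.mean(a ** 2))

rng = np.random.default_rng(0)
d_in, d_out, n_tokens = 64, 32, 16

x = rng.standard_normal((n_tokens, d_in))
x[:, 3] *= 50.0                    # synthetic outlier channel (assumption)
w = rng.standard_normal((d_in, d_out))

r = random_orthogonal(d_in, rng)
x_rot = x @ r                      # rotate the activations
w_rot = r.T @ w                    # fold the inverse rotation into the weights

# (x R)(R^T W) = x W, so the full-precision output is unchanged.
output_matches = np.allclose(x @ w, x_rot @ w_rot)
ratio_before = peak_to_rms(x)
ratio_after = peak_to_rms(x_rot)
```

The rotation preserves the total energy of the activations, so a lower peak-to-RMS ratio means the same signal is spread over more channels and fits a narrower quantization range.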

Main Contribution

Define rotation parameterizations for the transformer residual stream and attention blocks that leave full-precision outputs unchanged while reducing outliers before quantization.

Introduce SpinQuant: learn orthonormal rotations (R1,R2) on the Stiefel manifold via Cayley SGD to directly minimize quantized-network loss, with optional online Hadamard rotations (R3,R4) for extreme activation/KV quantization.

Show broad empirical gains across seven LLMs and multiple bit-widths: SpinQuant narrows accuracy gaps for W4A4KV4 and W4A8 settings, is compatible with GPTQ, and adds modest latency when using Hadamard transforms.
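
To illustrate how an update can stay on the manifold of orthonormal matrices, here is a minimal sketch of a Cayley-transform descent step in NumPy. It is a simplification under stated assumptions, not the paper's optimizer: it uses plain descent without momentum, normalizes the step direction for stability, and minimizes a toy objective (the mean fourth power of rotated synthetic activations) rather than the quantized-network loss. The names `cayley_step` and `toy_loss_and_grad` are hypothetical.

```python
import numpy as np

def cayley_step(r, grad, lr):
    """One descent step on an orthogonal matrix r using the Cayley transform."""
    a = grad @ r.T - r @ grad.T              # skew-symmetric: a.T == -a
    a = a / (np.linalg.norm(a) + 1e-12)      # simplification: normalized step
    eye = np.eye(r.shape[0])
    # r_new = (I + lr/2 * a)^-1 (I - lr/2 * a) r, which is orthogonal whenever r is.
    return np.linalg.solve(eye + 0.5 * lr * a, (eye - 0.5 * lr * a) @ r)

def toy_loss_and_grad(r, x):
    """Toy objective (assumption): mean fourth power of the rotated activations."""
    z = x @ r
    loss = np.mean(z ** 4)
    grad = x.T @ (4.0 * z ** 3) / z.size     # derivative of loss with respect to r
    return loss, grad

rng = np.random.default_rng(0)
d = 16
x = rng.standard_normal((256, d))
x[:, 3] *= 10.0                              # synthetic outlier channel

r, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal start
loss_start, _ = toy_loss_and_grad(r, x)
for _ in range(200):
    _, g = toy_loss_and_grad(r, x)
    r = cayley_step(r, g, lr=0.05)
loss_end, _ = toy_loss_and_grad(r, x)

orth_error = np.abs(r.T @ r - np.eye(d)).max()
```

Because the Cayley transform of a skew-symmetric matrix is itself orthogonal, every iterate remains a valid rotation without any re-orthogonalization step; only the objective would need to change to target quantized-network loss.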

Key Findings

Learned rotations reduce the zero-shot accuracy gap to full precision to 2.9 points on LLaMA-2 7B in W4A4KV4.

Numbers: W4A4KV4 gap = 2.9 points (LLaMA-2 7B)

SpinQuant outperforms prior quantization baselines by large margins in the extreme 4-bit setting.

Numbers: beats LLM-QAT by 19.1 pts and SmoothQuant by 25.0 pts (LLaMA-2 7B, W4A4KV4)

Random rotations give high variance in quantized accuracy, so the choice of rotation matters.

Numbers: accuracy varies by up to 13 points across 100 random rotations (W4A4, LLaMA-2 7B)

Optimizing rotations with Cayley SGD reliably outperforms both random Hadamard rotations and random floating-point rotations.

Numbers: learned rotations improve on random ones by up to 16.2 pts (Mistral-7B, had setting)

4-bit quantization yields a large speedup, and the online Hadamard transform adds only modest overhead.

Numbers: MacBook M1 Pro: FP16 177.15 ms/token → SpinQuant no_had 58.88 ms/token (~3× speedup); had adds ~8% latency

Results

Accuracy

Value: 2.9 points gap (LLaMA-2 7B, W4A4KV4, SpinQuant had)

Baseline: full-precision

Improvement over LLM-QAT / SmoothQuant

Value: ↑19.1 pts vs LLM-QAT; ↑25.0 pts vs SmoothQuant (LLaMA-2 7B, W4A4KV4)

Baseline: LLM-QAT, SmoothQuant

Random-rotation variance

Value: up to 13 points difference across 100 random rotations (W4A4, LLaMA-2 7B)

Baseline: random rotations

Optimization time

Value: 13–30 min for small models; ~3.5 hours for 70B (Cayley SGD, 800 samples, 100 iterations)

Baseline: none

Inference speed on CPU

Value: FP16 177.15 ms/token → SpinQuant no_had 58.88 ms/token (~3×)

Baseline: FP16 (W16A16)

Who Should Care

What To Try In 7 Days

Run SpinQuant no_had on a 3B or 7B model: optimize the rotations on 800 WikiText2 samples for 100 iterations, then measure zero-shot task accuracy.

Combine learned rotations with your existing GPTQ weight quantizer to test W4A8 and W4A4KV4 trade-offs.

Benchmark CPU latency with and without online Hadamard to decide between no_had (faster) and had (more accurate).
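
Before running a full model, the effect these experiments look for can be previewed on one synthetic linear layer. The sketch below is an assumption-laden toy, not the SpinQuant pipeline: it uses NumPy, simple round-to-nearest symmetric 4-bit fake quantization (per token for activations, per output channel for weights), a random rotation instead of a learned one, and hypothetical helper names `fake_quant` and `sqnr_db`. It reports the signal-to-quantization-noise ratio of the layer output with and without the rotation.

```python
import numpy as np

def fake_quant(a, bits=4, axis=-1):
    """Symmetric round-to-nearest quantize-dequantize, one scale per slice along `axis`."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(a).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)           # avoid division by zero
    return np.clip(np.round(a / scale), -qmax - 1, qmax) * scale

def sqnr_db(ref, approx):
    """Signal-to-quantization-noise ratio in dB."""
    return 10.0 * np.log10(np.sum(ref ** 2) / np.sum((ref - approx) ** 2))

rng = np.random.default_rng(0)
d_in, d_out, n_tokens = 256, 64, 64

x = rng.standard_normal((n_tokens, d_in))
x[:, 3] *= 20.0                                        # synthetic outlier channel
w = rng.standard_normal((d_in, d_out))
r, _ = np.linalg.qr(rng.standard_normal((d_in, d_in)))  # random rotation (not learned)

y_ref = x @ w
# W4A4 without rotation: per-token activations, per-output-channel weights.
y_plain = fake_quant(x, axis=-1) @ fake_quant(w, axis=0)
# W4A4 with rotation: quantize the rotated activations and rotated weights.
y_rot = fake_quant(x @ r, axis=-1) @ fake_quant(r.T @ w, axis=0)

sqnr_plain = sqnr_db(y_ref, y_plain)
sqnr_rot = sqnr_db(y_ref, y_rot)
print(f"SQNR without rotation: {sqnr_plain:.1f} dB, with rotation: {sqnr_rot:.1f} dB")
```

On real models the same comparison would be made per layer on calibration activations; the toy only shows why an outlier channel wastes most of a 4-bit range and how a rotation recovers it.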

Optimization Features

Token Efficiency

  • Accuracy

Infra Optimization

  • no special kernels required for SpinQuant no_had; optional Hadamard kernels speed up had variant

Model Optimization

  • learned orthonormal rotations (R1,R2)
  • online Hadamard rotations (R3,R4) for activations/KV

System Optimization

  • compatible with GPTQ weight post-training quantizer

Training Optimization

  • Cayley SGD on Stiefel manifold (optimize rotations only)
  • uses small calibration set (128–800 samples) and ~100 iterations

Inference Optimization

  • merge rotation into weights (no runtime change) for no_had
  • fast Hadamard transform when online rotations are used (~8% extra latency)
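
The online-rotation pattern can be sketched with a fast Walsh-Hadamard transform, which applies an orthonormal Hadamard rotation in O(n log n) additions instead of a dense matrix multiply. The code below is a NumPy illustration under assumptions (power-of-two width, the hypothetical function name `fwht`, synthetic data); the kernels used in the paper's measurements are not reproduced here. The weights are rotated once offline, and the activations are rotated at runtime, so the product is unchanged.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis."""
    x = np.array(x, dtype=np.float64)        # work on a copy
    n = x.shape[-1]
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("last dimension must be a power of two")
    h = 1
    while h < n:
        # Split into blocks of 2h and combine the two halves of each block.
        x = x.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a, b = x[..., 0, :], x[..., 1, :]
        x = np.stack((a + b, a - b), axis=-2).reshape(*x.shape[:-3], n)
        h *= 2
    return x / np.sqrt(n)                    # scaling makes the transform orthonormal

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))             # activations: 4 tokens, width 64
w = rng.standard_normal((64, 8))             # weights of the following linear layer

w_had = fwht(w.T).T                          # offline: fold the Hadamard matrix into the weights
y_online = fwht(x) @ w_had                   # runtime: rotate activations, then multiply
outputs_match = np.allclose(y_online, x @ w)
```

The orthonormal Hadamard matrix is symmetric and its own inverse, which is why the same `fwht` routine serves for both the offline weight rotation and the online activation rotation.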

Reproducibility

Data Urls

  • WikiText2 (used for calibration and evaluation)
  • C4 (used in ablation)

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Requires a small calibration dataset and a short optimization run (minutes to hours depending on model size).
  • Online Hadamard rotations add ≈8% inference latency; weigh accuracy vs latency.
  • Primary evaluation is zero-shot commonsense tasks and WikiText2 perplexity; other downstream tasks may show different gains.
  • SpinQuant optimizes rotations for quantized networks but does not change pre-trained weights; extremely bad outliers or architectural differences may still limit gains.

When Not To Use

  • If you cannot afford any calibration data or optimization time, skip rotation learning.
  • If you need the had variant's extra accuracy but cannot accept its ≈8% latency overhead.
  • If your use case differs substantially from zero-shot commonsense reasoning and you cannot re-evaluate on your own tasks.

Failure Modes

  • An insufficient optimization budget (too few samples or iterations) can yield suboptimal rotations.
  • Rotation may improve SNR for important layers but worsen less-critical layers; overall impact depends on layer sensitivity.
  • Deployment targets whose quantization hardware cannot consume merged rotations or run Hadamard kernels need extra engineering.

Core Entities

Models

  • LLaMA-2 7B
  • LLaMA-2 13B
  • LLaMA-2 70B
  • LLaMA-3 1B
  • LLaMA-3 3B
  • LLaMA-3 8B
  • Mistral-7B

Metrics

  • Accuracy
  • WikiText2 perplexity
  • signal-to-quantization-noise ratio (dB)
  • inference ms/token

Datasets

  • WikiText2
  • C4

Benchmarks

  • BoolQ
  • PIQA
  • SIQA
  • HellaSwag
  • WinoGrande
  • ARC-easy
  • ARC-challenge
  • OBQA