Learn orthonormal rotations to remove outliers and make 4-bit LLMs accurate and fast

Overview

Decision Snapshot: Ready For Pilot

The method is simple to add (it learns small rotation matrices), works across many open LLMs, and needs only modest calibration data and time. Gains are backed by multiple models and tasks, and the method is compatible with GPTQ.

Citations: 6

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 85%

Novelty: 70%

Authors

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, Tijmen Blankevoort

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SpinQuant makes extreme low-bit LLM inference practical: big memory and latency savings with near-full accuracy, using a small calibration step and without changing model APIs.

Who Should Care

ML Engineer, Engineering Lead, Founder, Product Manager

Summary TLDR

SpinQuant inserts small learned rotation matrices into transformer residuals and attention heads to spread out activation and weight outliers, then optimizes those rotations on the Stiefel manifold (Cayley SGD). This makes post-training quantization far more reliable. On many models (LLaMA-2/3, Mistral) SpinQuant closes most of the accuracy gap for extreme 4-bit quantization (weights, activations, KV cache). It is compatible with standard weight-quantizers (GPTQ), adds little inference overhead when rotations are merged into weights, and needs only a few minutes to a few hours to optimize depending on model size.

Problem Statement

Post-training quantization of LLMs reduces cost but fails when activation or weight outliers blow up the quantization range. Random rotations can reduce outliers but give high variance and inconsistent results. The paper asks: can we learn rotation matrices that (1) do not change full-precision outputs, (2) reduce outliers, and (3) minimize quantized-network loss to make low-bit LLMs accurate and stable?
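The effect of outliers on the quantization range can be illustrated with a small NumPy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the synthetic activations with one scaled-up channel, the symmetric per-token 4-bit quantizer, and the helper names `quantize_per_token`, `sqnr_db`, and `random_rotation`. It uses a random rotation, which is the baseline the paper improves on, not a learned one.

```python
import numpy as np

def quantize_per_token(x, bits=4):
    """Symmetric fake quantization with one scale per row (token)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4 bits
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale                                # back to float for comparison

def sqnr_db(x, x_q):
    """Signal-to-quantization-noise ratio in dB."""
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - x_q) ** 2))

def random_rotation(dim, rng):
    """Random orthonormal matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

rng = np.random.default_rng(0)
dim = 256
x = rng.standard_normal((64, dim))   # 64 synthetic "tokens"
x[:, 3] *= 30.0                      # hypothetical outlier channel

R = random_rotation(dim, rng)
x_rot = x @ R                        # the rotation spreads the outlier over all channels

sqnr_plain = sqnr_db(x, quantize_per_token(x))
sqnr_rot = sqnr_db(x_rot, quantize_per_token(x_rot))
print(f"SQNR without rotation: {sqnr_plain:.1f} dB")
print(f"SQNR with rotation:    {sqnr_rot:.1f} dB")
```

Without the rotation, the outlier channel sets the scale for most rows and the remaining channels collapse onto a few grid points. A different random seed gives a different rotation and a different SQNR, which is the variance that motivates learning the rotation instead.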

Main Contribution

Define rotation parameterizations for transformer residuals and attention that leave full-precision outputs numerically unchanged but reduce outliers for quantization.

Introduce SpinQuant: learn orthonormal rotations (R1,R2) on the Stiefel manifold via Cayley SGD to directly minimize quantized-network loss, with optional online Hadamard rotations (R3,R4) for extreme activation/KV quantization.
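To make the idea of optimizing a rotation while keeping it orthonormal concrete, here is a minimal sketch of a Cayley-transform update. It is a simplified illustration, not the SpinQuant optimizer: the toy least-squares loss, the step size, and the names `cayley_step` and `loss_and_grad` are assumptions of this sketch, and the published Cayley SGD method differs in details that are not reproduced here. In SpinQuant the loss would be the quantized network's loss rather than this toy objective.

```python
import numpy as np

def cayley_step(R, G, lr):
    """One simplified Cayley update that keeps R orthonormal.

    R: (n, n) orthonormal matrix. G: Euclidean gradient dL/dR.
    """
    A = G @ R.T - R @ G.T                 # skew-symmetric: A.T == -A
    I = np.eye(R.shape[0])
    # The Cayley transform of a skew-symmetric matrix is orthonormal,
    # so Q @ R stays on the manifold without any re-projection.
    Q = np.linalg.solve(I + 0.5 * lr * A, I - 0.5 * lr * A)
    return Q @ R

# Toy objective (assumption): find R so that X @ R matches a rotated target.
rng = np.random.default_rng(0)
n = 8
X = rng.standard_normal((32, n)) / np.sqrt(32)
R_target, _ = np.linalg.qr(rng.standard_normal((n, n)))
T = X @ R_target

def loss_and_grad(R):
    E = X @ R - T
    return np.sum(E ** 2), 2.0 * X.T @ E

R = np.eye(n)
initial_loss, _ = loss_and_grad(R)
for _ in range(300):
    _, G = loss_and_grad(R)
    R = cayley_step(R, G, lr=0.02)
final_loss, _ = loss_and_grad(R)

orth_error = np.abs(R.T @ R - np.eye(n)).max()
print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}, max |R^T R - I| = {orth_error:.2e}")
```

The design point is that the update is multiplicative: each step multiplies R by another orthonormal matrix, so orthonormality holds up to floating-point error at every iteration.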

Key Findings

Learned rotations reduce the zero-shot accuracy gap to full precision to 2.9 points on LLaMA-2 7B in W4A4KV4.

Numbers: W4A4KV4 gap = 2.9 points (LLaMA-2 7B)

Practical Use: You can run LLaMA-2 7B with 4-bit weights, activations, and KV cache at close to full-precision accuracy using SpinQuant; use the SpinQuant had variant for best accuracy.

Evidence Ref: Abstract; Sec.4.2; Table 1

SpinQuant outperforms prior PTQ baselines by large margins on extreme 4-bit settings.

Numbers: beats LLM-QAT by 19.1 pts and SmoothQuant by 25.0 pts (LLaMA-2 7B, W4A4KV4)

Practical Use: If prior PTQ methods fail on W4A4KV4, try SpinQuant to regain substantial accuracy with modest extra calibration work.

Evidence Ref: Abstract; Sec.4.2; Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	2.9 points gap (LLaMA-2 7B, W4A4KV4, SpinQuant had)	full-precision	—	8 zero-shot commonsense tasks	Sec.4.2; Table 1	Table 1
Improvement over LLM-QAT / SmoothQuant	↑19.1 pts vs LLM-QAT; ↑25.0 pts vs SmoothQuant (LLaMA-2 7B, W4A4KV4)	LLM-QAT, SmoothQuant	19.1 / 25.0	8 zero-shot commonsense tasks	Abstract; Sec.4.2; Table 1	Abstract; Table 1

What To Try In 7 Days

Run SpinQuant no_had on a 3B or 7B model: optimize rotations on 800 WikiText2 samples (100 iterations) and measure zero-shot task accuracy.

Combine learned rotations with your existing GPTQ weight quantizer to test W4A8 and W4A4KV4 trade-offs.

Benchmark CPU latency with and without online Hadamard to decide between no_had (faster) and had (more accurate).

Optimization Features

Token Efficiency

Accuracy

Infra Optimization

no special kernels required for SpinQuant no_had; optional Hadamard kernels speed up had variant

Model Optimization

learned orthonormal rotations (R1,R2)

online Hadamard rotations (R3,R4) for activations/KV
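An online Hadamard rotation is cheap because it can be applied with a fast Walsh-Hadamard transform instead of a dense matrix multiply. The following is a minimal NumPy sketch of that transform, not the kernel used by SpinQuant; the function name `fwht`, the normalization by the square root of the length, and the power-of-two length restriction are choices made for this illustration.

```python
import numpy as np

def fwht(x):
    """Normalized fast Walsh-Hadamard transform along the last axis.

    Uses O(n log n) additions instead of an O(n^2) matrix product.
    The last dimension must be a power of two.
    """
    x = np.array(x, dtype=np.float64)     # work on a copy
    n = x.shape[-1]
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("last dimension must be a power of two")
    lead = x.shape[:-1]
    h = 1
    while h < n:
        # View the data as blocks of size 2h and apply the butterfly.
        y = x.reshape(*lead, n // (2 * h), 2, h)
        a = y[..., 0, :] + y[..., 1, :]
        b = y[..., 0, :] - y[..., 1, :]
        x = np.stack([a, b], axis=-2).reshape(*lead, n)
        h *= 2
    return x / np.sqrt(n)                 # makes the transform orthonormal

# A single large entry (an "outlier") is spread evenly across all positions.
spike = np.zeros(16)
spike[0] = 16.0
spread = fwht(spike)
print(np.abs(spike).max(), np.abs(spread).max())

# Reference check against an explicit Hadamard matrix built with Kronecker products.
H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
H16 = H2
for _ in range(3):
    H16 = np.kron(H16, H2)
H16 /= np.sqrt(16)
v = np.random.default_rng(0).standard_normal((5, 16))
print(np.allclose(fwht(v), v @ H16))
```

Because the normalized Hadamard matrix is symmetric and orthonormal, applying `fwht` twice returns the original input, so the same routine serves as both the rotation and its inverse.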

System Optimization

compatible with GPTQ weight post-training quantizer

Training Optimization

Cayley SGD on Stiefel manifold (optimize rotations only)

uses small calibration set (128–800 samples) and ~100 iterations

Inference Optimization

merge rotation into weights (no runtime change) for no_had

fast Hadamard transform when online rotations are used (~8% extra latency)
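Why merging adds no runtime cost can be seen in a toy two-layer example: a rotation R applied to the hidden stream is absorbed into the weight that writes to it, and its transpose into the weight that reads from it. This is a sketch under stated assumptions, not the SpinQuant code: the layer shapes and variable names are hypothetical, and the RMSNorm here has no learnable scale (the sketch assumes any such scale has already been folded into the next weight matrix).

```python
import numpy as np

def rmsnorm(h, eps=1e-6):
    """RMSNorm without a learnable scale (assumed folded into the next weight)."""
    return h / np.sqrt(np.mean(h ** 2, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
d_in, d_model, d_out = 16, 32, 8
x = rng.standard_normal((4, d_in))
W_in = rng.standard_normal((d_in, d_model))     # writes to the hidden stream
W_out = rng.standard_normal((d_model, d_out))   # reads from the hidden stream

# Any orthonormal R works for this identity; SpinQuant would use a learned one.
R, _ = np.linalg.qr(rng.standard_normal((d_model, d_model)))

# Reference full-precision output.
y_ref = rmsnorm(x @ W_in) @ W_out

# Merge the rotation offline: same number of matmuls at inference time.
W_in_merged = W_in @ R
W_out_merged = R.T @ W_out
y_merged = rmsnorm(x @ W_in_merged) @ W_out_merged

max_diff = np.abs(y_ref - y_merged).max()
print(f"max |y_ref - y_merged| = {max_diff:.2e}")
```

The identity holds because a rotation preserves each row's norm, so the scale-free RMSNorm commutes with it, and R times its transpose cancels between the two weights. Only the intermediate activations change, which is what the quantizer benefits from.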

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/facebookresearch/SpinQuant

Data URLs

WikiText2 (used for calibration and evaluation)

C4 (used in ablation)

Risks & Boundaries

Limitations

Requires a small calibration dataset and a short optimization run (minutes to hours depending on model size).

Online Hadamard rotations add ≈8% inference latency; weigh accuracy vs latency.

When Not To Use

If you cannot run any calibration or optimization time (no extra minutes allowed), skip rotation learning.

If you need the had variant's extra accuracy but cannot accept its ≈8% latency overhead.

Failure Modes

An insufficient optimization budget (too few samples or iterations) can yield suboptimal rotations.

Rotation may improve SNR for important layers but worsen less-critical layers; overall impact depends on layer sensitivity.

Core Entities

Models

LLaMA-2 7B, LLaMA-2 13B, LLaMA-2 70B, LLaMA-3 1B, LLaMA-3 3B, LLaMA-3 8B, Mistral-7B

Metrics

Accuracy, WikiText2 perplexity, signal-to-quantization-noise ratio (dB), inference ms/token

Datasets

WikiText2, C4

Benchmarks

BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC-easy, ARC-challenge, OBQA
