Make LLM inference fully 4-bit by rotating away activation outliers

March 30, 2024 · 7 min

Overview

Production Readiness

0.8

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

3

Authors

Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman

Links

Abstract / PDF

Why It Matters For Business

QuaRot cuts the cost and memory footprint of production LLM inference by enabling true end-to-end 4-bit execution and large KV cache compression, which makes it practical to host large models on cheaper GPUs or smaller clusters.

Summary TLDR

QuaRot fuses randomized Hadamard rotations into model weights to remove large activation outliers, enabling end-to-end 4-bit inference (weights, activations, KV cache) without keeping special high-precision channels. On LLAMA2-70B, QuaRot raises WikiText-2 perplexity by at most 0.47, preserves ~99% of zero-shot accuracy, and yields up to 3.33× prefill speedup and up to 3.89× peak decoding memory savings on consumer GPUs. Code: github.com/spcl/QuaRot.

Problem Statement

Activation outliers (rare large values) make activation and KV-cache quantization hard. Prior fixes keep outliers in higher precision or rely on calibration. These workarounds prevent true end-to-end 4-bit inference with acceptable accuracy and memory savings.
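
To make the outlier problem concrete, here is a minimal pure-Python sketch of symmetric round-to-nearest quantization applied to one token. The helper names (`quantize_symmetric`, `mean_abs_error`) and the toy values are illustrative assumptions, not taken from the QuaRot code.

```python
def quantize_symmetric(x, bits=4):
    """Symmetric round-to-nearest quantization of one token (list of floats)."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4 bits
    scale = max(abs(v) for v in x) / qmax or 1.0  # one scale per token
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in x]
    return codes, scale

def mean_abs_error(x, bits=4):
    """Average reconstruction error after quantize -> dequantize."""
    codes, scale = quantize_symmetric(x, bits)
    return sum(abs(v - c * scale) for v, c in zip(x, codes)) / len(x)

# Toy activations (hypothetical values): same token, with and without one outlier.
token = [0.9, -0.7, 0.5, -0.3, 0.2, -0.1, 0.6, -0.8]
token_with_outlier = token[:-1] + [40.0]

err_clean = mean_abs_error(token)
err_outlier = mean_abs_error(token_with_outlier)
print(f"clean: {err_clean:.4f}  with outlier: {err_outlier:.4f}")
```

In this toy case the single 40.0 entry stretches the scale so far that the other seven values all round to zero, which is the failure the rotation is meant to avoid.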

Main Contribution

A practical method (QuaRot) that fuses randomized Hadamard rotations into transformer weights to remove activation outliers without changing model outputs.

An attention-aware extension that rotates keys and values so the KV cache can be quantized.

End-to-end pipelines and CUDA kernels that run all matrix multiplies in INT4, achieving large memory and prefill speed gains while keeping accuracy high.
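
The first contribution rests on a simple identity: for an orthogonal Q, (xQ)(QᵀW) = xW. The sketch below checks this on a toy layer and shows the outlier being spread across channels. It assumes a randomized Hadamard matrix built as H·diag(s)/√d with random signs s; the dimensions, values, and helper names are illustrative, not the repository's implementation.

```python
import math
import random

def hadamard(d):
    """Sylvester construction of a d x d Hadamard matrix; d must be a power of two."""
    assert d > 0 and d & (d - 1) == 0, "d must be a power of two"
    H = [[1]]
    while len(H) < d:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def randomized_hadamard(d, rng):
    """Orthogonal Q = H * diag(s) / sqrt(d), with random signs s in {-1, +1}."""
    signs = [rng.choice((-1, 1)) for _ in range(d)]
    norm = 1.0 / math.sqrt(d)
    return [[h * s * norm for h, s in zip(row, signs)] for row in hadamard(d)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

rng = random.Random(0)
d = 8
x = [[0.5, -0.3, 0.2, 40.0, -0.1, 0.4, -0.2, 0.3]]  # one token; channel 3 is an outlier
W = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(d)]

Q = randomized_hadamard(d, rng)
x_rot = matmul(x, Q)              # rotated activations
W_rot = matmul(transpose(Q), W)   # rotation folded into the weights offline

y_ref = matmul(x, W)              # original layer output
y_rot = matmul(x_rot, W_rot)      # output of the rotated layer

max_before = max(abs(v) for v in x[0])
max_after = max(abs(v) for v in x_rot[0])
print(f"max |x| before: {max_before:.2f}, after rotation: {max_after:.2f}")
```

The outputs agree up to floating-point error, while the largest activation magnitude drops because the rotation distributes the outlier's energy over all eight channels.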

Key Findings

4-bit end-to-end quantization on LLAMA2-70B with small accuracy loss

Numbers: WikiText-2 PPL +0.47 (3.32→3.79); zero-shot average drop ~1.09 points

Significant runtime and memory gains on consumer GPUs

Numbers: prefill speedup up to 3.33×; peak decoding memory saving up to 3.89×

6- and 8-bit quantization can be lossless without calibration

Numbers: 8-bit RTN restores FP16-level PPL; 6-bit RTN near-lossless

Hadamard rotations remove activation outliers

Numbers: activation distributions show no visible outliers after QuaRot (Figure 1)

Results

WikiText-2 Perplexity (LLAMA2-70B)

Value: 3.79 (QuaRot GPTQ INT4)

Baseline: 3.32 (FP16)

Zero-shot average (LLAMA2-70B)

Value: 75.98 (QuaRot GPTQ INT4)

Baseline: 77.07 (FP16)

Prefill speedup (single transformer block)

Value: up to 3.33×

Baseline: FP16

Peak decoding memory saving (KV cache)

Value: up to 3.89×

Baseline: FP16

RTN 8-bit quality

Value: near-FP16 (lossless within noise)

Baseline: FP16

Who Should Care

What To Try In 7 Days

Run QuaRot on a copy of your FP16 LLAMA-2/3 model using the published code and compare WikiText-2 PPL and a couple of zero-shot tasks.

Measure prefill throughput and KV memory usage on your target GPU, focusing on large batch and long context.

If you need a conservative rollout, try RTN 8-bit or 6-bit first; they are near-lossless and need no calibration data.
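
For the comparison step, a small helper along these lines may be enough. It assumes you already have per-token natural-log probabilities from your own evaluation loop; the function names are hypothetical, and the example figures are the LLAMA2-70B numbers reported above.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def compare(fp16_ppl, quant_ppl, fp16_acc, quant_acc):
    """Summarize how far the quantized model moved from the FP16 baseline."""
    return {
        "ppl_increase": quant_ppl - fp16_ppl,
        "acc_retained": quant_acc / fp16_acc,
    }

# Reported LLAMA2-70B figures: PPL 3.32 -> 3.79, zero-shot average 77.07 -> 75.98.
report = compare(3.32, 3.79, 77.07, 75.98)
print(report)
```

Setting your own thresholds on `ppl_increase` and `acc_retained` before running the experiment makes the go/no-go decision less subjective.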

Agent Features

Memory

  • KV cache quantization (group-wise asymmetric)
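
A minimal sketch of group-wise asymmetric quantization, as it might be applied to one cached key or value vector. Each group stores its own scale and zero offset. The group size of 4 and the function names are illustrative assumptions chosen to keep the example short.

```python
def quantize_groups_asymmetric(x, bits=4, group_size=4):
    """Quantize each group to unsigned codes in [0, 2**bits - 1] with its own (scale, min)."""
    qmax = 2 ** bits - 1
    groups = []
    for start in range(0, len(x), group_size):
        g = x[start:start + group_size]
        lo, hi = min(g), max(g)
        scale = (hi - lo) / qmax if hi > lo else 1.0
        codes = [min(qmax, max(0, round((v - lo) / scale))) for v in g]
        groups.append((codes, scale, lo))
    return groups

def dequantize_groups(groups):
    """Reconstruct floats as lo + code * scale, group by group."""
    return [lo + c * scale for codes, scale, lo in groups for c in codes]

# Hypothetical key vector whose two halves live in very different ranges.
keys = [0.0, 1.5, 3.0, 0.75, -8.0, -7.0, -6.5, -5.0]
groups = quantize_groups_asymmetric(keys)
restored = dequantize_groups(groups)
print(restored)
```

Because each group has its own range, the second group's offset near -8 does not waste code levels in the first group.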

Tool Use

  • Hadamard rotations fused into weights
  • online per-token symmetric quantization

Frameworks

  • PyTorch
  • CUTLASS INT4 kernels

Architectures

  • Transformer

Optimization Features

Infra Optimization

  • designed and measured on RTX 3090 consumer GPUs

Model Optimization

  • fuse randomized Hadamard rotations into weights to improve incoherence (fewer extreme values)
  • per-column GPTQ or RTN weight quantization
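
The RTN variant of per-column weight quantization can be sketched as follows, with a per-tensor version for contrast. This assumes a weight matrix laid out as (input features × output columns) and symmetric 4-bit codes; GPTQ is not shown, and the names and values are illustrative.

```python
def _clamp_round(v, scale, qmax):
    return max(-qmax, min(qmax, round(v / scale)))

def rtn_per_column(W, bits=4):
    """Round-to-nearest with one symmetric scale per output column."""
    qmax = 2 ** (bits - 1) - 1
    scales = [max(abs(v) for v in col) / qmax or 1.0 for col in zip(*W)]
    codes = [[_clamp_round(v, s, qmax) for v, s in zip(row, scales)] for row in W]
    return codes, scales

def rtn_per_tensor(W, bits=4):
    """Round-to-nearest with a single scale for the whole matrix (for comparison)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for row in W for v in row) / qmax or 1.0
    return [[_clamp_round(v, scale, qmax) for v in row] for row in W], scale

# Hypothetical weights: column 0 is small, column 1 is large.
W = [[0.7, -14.0], [-0.35, 7.0], [0.1, 2.0]]
codes_col, scales_col = rtn_per_column(W)
codes_tensor, scale_tensor = rtn_per_tensor(W)
print("per-column codes:", codes_col)
print("per-tensor codes:", codes_tensor)
```

With a single shared scale the small column collapses to zeros, while per-column scales keep it resolvable.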

System Optimization

  • 3×–3.3× prefill speedup (large batch)
  • ≈3.6–3.9× KV memory reduction during decoding

Inference Optimization

  • all INT4 matmuls with INT32 accumulation and FP16 cast
  • online Hadamard transforms to remove activation outliers
  • on-the-fly per-token symmetric activation quantization
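
The data flow in the list above can be simulated in pure Python: quantize activations per token, quantize weights per column, accumulate integer products exactly, then rescale and cast to FP16. This is only a numerical sketch under those assumptions; the real kernels are CUTLASS-based CUDA code, and the function names here are hypothetical.

```python
import struct

QMAX = 7  # symmetric 4-bit range is [-7, 7]

def to_fp16(v):
    """Round a Python float to IEEE half precision via the struct 'e' format."""
    return struct.unpack("e", struct.pack("e", v))[0]

def quantize(vec):
    """Symmetric round-to-nearest 4-bit codes and a single scale for one vector."""
    scale = max(abs(v) for v in vec) / QMAX or 1.0
    return [max(-QMAX, min(QMAX, round(v / scale))) for v in vec], scale

def int4_matmul(X, W):
    """Simulated INT4 GEMM: integer accumulation, then dequantize and cast to FP16."""
    rows = [quantize(row) for row in X]              # per-token activation scales
    cols = [quantize(list(col)) for col in zip(*W)]  # per-column weight scales
    out = []
    for x_codes, x_scale in rows:
        out_row = []
        for w_codes, w_scale in cols:
            acc = sum(a * b for a, b in zip(x_codes, w_codes))  # exact integer sum
            out_row.append(to_fp16(acc * x_scale * w_scale))
        out.append(out_row)
    return out

# Toy inputs chosen so that 4-bit codes represent them exactly.
X = [[3.5, -1.5, 0.5, 0.0]]
W = [[1.0, 0.5], [2.0, -3.5], [-7.0, 1.5], [3.0, 0.0]]
Y = int4_matmul(X, W)
print(Y)
```

On real activations the result is approximate; the inputs here were picked to sit exactly on the quantization grid so the integer path reproduces the floating-point product.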

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Benefits depend on fast INT4 GEMM kernels; you need CUDA/CUTLASS support or similar.
  • Random orthogonal transforms perform worse than Hadamard rotations. Fast Hadamard transforms are simplest when the hidden dimension is a power of two or has a large power-of-two factor.
  • LLAMA-3 and some smaller models are more sensitive to 4-bit quantization; quality varies by model size and group size.
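
The power-of-two constraint comes from how the fast Walsh-Hadamard transform works: it halves the problem at each stage, costing O(d log d) instead of O(d²). A minimal sketch for a power-of-two length follows; it is not the paper's CUDA kernel, and the function name is an assumption.

```python
import math

def fwht(x):
    """Normalized fast Walsh-Hadamard transform of a list whose length is a power of two."""
    d = len(x)
    assert d > 0 and d & (d - 1) == 0, "length must be a power of two"
    y = list(x)
    h = 1
    while h < d:
        # Butterfly stage: combine elements that are h apart.
        for start in range(0, d, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(d)
    return [v * norm for v in y]

spike = [1.0, 0.0, 0.0, 0.0]
spread = fwht(spike)        # a single spike is spread evenly over all entries
roundtrip = fwht(spread)    # the normalized transform is its own inverse
print(spread, roundtrip)
```

Applying the transform twice returns the input, which is why the same routine can serve as both the rotation and its inverse.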

When Not To Use

  • If your target hardware lacks fast INT4 support or optimized kernels.
  • When you cannot afford any drop in small-model accuracy; RTN 4-bit can fail on small models.
  • For extreme low-latency single-token decode on tiny batches, where the overhead of handling the INT4 cache may make it slower than FP16.

Failure Modes

  • Round-to-nearest (RTN) at 4 bits can cause large quality drops on small models; GPTQ is safer for small sizes.
  • Using random orthogonal matrices instead of structured Hadamard increases perplexity.
  • Group-size trade-offs: smaller groups improve accuracy but increase metadata and kernel complexity.
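
The metadata side of the group-size trade-off is simple arithmetic. The sketch below assumes each group stores one FP16 scale and one FP16 zero point alongside its 4-bit codes; the actual storage layout in the released code may differ, and the result covers KV cache entries only, not total peak memory.

```python
def kv_bits_per_element(bits=4, group_size=128, scale_bits=16, zero_bits=16):
    """Average storage per cached element: code bits plus amortized group metadata."""
    return bits + (scale_bits + zero_bits) / group_size

def kv_compression_vs_fp16(**kwargs):
    """Compression ratio relative to a 16-bit cache."""
    return 16 / kv_bits_per_element(**kwargs)

ratios = {g: round(kv_compression_vs_fp16(group_size=g), 2) for g in (32, 64, 128)}
print(ratios)
```

Under these assumptions, halving the group size from 128 to 64 costs about 0.2× of compression, which is the price paid for the accuracy gain of finer groups.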

Core Entities

Models

  • LLAMA-2 (7B,13B,70B)
  • LLAMA-3 (8B,70B)
  • Phi-3-mini-4k-instruct

Metrics

  • Perplexity (PPL)
  • Accuracy
  • Prefill time speedup
  • Peak decoding memory saving

Datasets

  • WikiText-2 (used for perplexity/calibration)
  • Zero-shot tasks: PIQA, WinoGrande, HellaSwag, LAMBADA, ARC-Easy, ARC-Challenge

Benchmarks

  • WikiText-2 perplexity
  • Accuracy