Overview
Production Readiness: 0.8
Novelty Score: 0.7
Cost Impact Score: 0.8
Citation Count: 3
Why It Matters For Business
QuaRot cuts the cost and memory footprint of production LLM inference by enabling true end-to-end 4-bit execution and large KV cache compression, which makes hosting large models on cheaper GPUs or smaller clusters practical.
Summary TLDR
QuaRot fuses randomized Hadamard rotations into model weights to remove large activation outliers, enabling end-to-end 4-bit inference (weights, activations, KV cache) without keeping any channels in higher precision. On LLAMA2-70B, QuaRot increases WikiText-2 perplexity by at most 0.47, preserves ~99% of zero-shot accuracy, and yields up to 3.33× prefill speedup and ~3.89× memory savings during decoding on consumer GPUs. Code: github.com/spcl/QuaRot.
Problem Statement
Activation outliers (rare large values) make activation and KV-cache quantization hard. Prior fixes keep outliers in higher precision or rely on calibration data, which prevents true end-to-end 4-bit inference with acceptable accuracy and memory savings.
Main Contribution
A practical method (QuaRot) that fuses randomized Hadamard rotations into transformer weights to remove activation outliers without changing model outputs.
An attention-aware extension that rotates keys and values so the KV cache can be quantized.
End-to-end pipelines and CUDA kernels that run all matrix multiplies in INT4, achieving large memory and prefill speed gains while keeping accuracy high.
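The claim that rotations can be fused into weights without changing model outputs rests on orthogonality: for an orthogonal Q, (xQ)(QᵀW) = xW. The snippet below is a minimal pure-Python sketch of that identity for a single linear layer, not the QuaRot implementation. The helper names (`hadamard`, `randomized_hadamard`, `matmul`, `transpose`) and the toy sizes are assumptions made for illustration, and the activations are rotated explicitly here, whereas the source describes fusing the rotation into the weights.

```python
import random

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h

def randomized_hadamard(n, rng):
    """Q = H * diag(s) / sqrt(n) with random signs s; Q is orthogonal."""
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    scale = n ** -0.5
    return [[h * s * scale for h, s in zip(row, signs)] for row in hadamard(n)]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(row) for row in zip(*a)]

rng = random.Random(0)
d, d_out = 8, 3                                  # toy sizes (assumption)
x = [[rng.gauss(0, 1) for _ in range(d)]]        # one token, row-vector convention
x[0][2] = 40.0                                   # inject an outlier channel
w = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(d)]

q = randomized_hadamard(d, rng)
x_rot = matmul(x, q)                 # rotated activations: outlier is spread out
w_rot = matmul(transpose(q), w)      # Q^T W, computable once offline

y_ref = matmul(x, w)
y_rot = matmul(x_rot, w_rot)
max_err = max(abs(a - b) for a, b in zip(y_ref[0], y_rot[0]))
print("max output difference:", max_err)
print("max |x| before:", max(abs(v) for v in x[0]),
      "after:", max(abs(v) for v in x_rot[0]))
```

The output difference is at floating-point noise level, while the largest activation magnitude shrinks because the outlier's energy is shared across all channels.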
Key Findings
4-bit end-to-end quantization on LLAMA2-70B with small accuracy loss
Significant runtime and memory gains on consumer GPUs
6- and 8-bit quantization can be lossless without calibration
Hadamard rotations remove activation outliers
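To see why a Hadamard rotation removes outliers, it helps to look at the peak-to-RMS ratio of a vector before and after the transform. The following is a hedged sketch using a normalized fast Walsh-Hadamard transform in pure Python. The function names and the synthetic activation vector are illustrative assumptions, not taken from the QuaRot code.

```python
import math

def fwht(vec):
    """Normalized fast Walsh-Hadamard transform, O(n log n).

    The length must be a power of two. With the 1/sqrt(n) scaling the
    transform is orthogonal, so it preserves the vector's L2 norm.
    """
    n = len(vec)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1 / math.sqrt(n)
    return [v * scale for v in out]

def peak_to_rms(vec):
    """Ratio of the largest magnitude to the root-mean-square value."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec))
    return max(abs(v) for v in vec) / rms

# Synthetic activations (assumption): small values plus one large outlier.
activations = [0.1] * 64
activations[5] = 50.0
rotated = fwht(activations)

print("peak/RMS before:", round(peak_to_rms(activations), 3))
print("peak/RMS after: ", round(peak_to_rms(rotated), 3))
```

A lower peak-to-RMS ratio means a 4-bit grid scaled to the largest value wastes fewer levels on a single channel, which is the intuition behind quantizing after rotation.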
Results
WikiText-2 Perplexity (LLAMA2-70B)
Zero-shot average (LLAMA2-70B)
Prefill speedup (single transformer block)
Peak decoding memory saving (KV cache)
RTN 8-bit quality
Who Should Care
What To Try In 7 Days
Run QuaRot on a copy of your FP16 LLAMA-2/3 model using the published code and compare WikiText-2 PPL and a couple of zero-shot tasks.
Measure prefill throughput and KV memory usage on your target GPU, focusing on large batch and long context.
If you need a conservative rollout, try RTN 8-bit or 6-bit first: both are near-lossless and need no calibration data.
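For the perplexity comparison in the steps above, a small helper keeps the arithmetic explicit. This is a sketch under the assumption that your evaluation script already yields per-token negative log-likelihoods in nats; the function names and the sample numbers are hypothetical and are not results from the paper.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))

def ppl_delta(baseline_nlls, quantized_nlls):
    """Perplexity increase of the quantized model over the baseline."""
    return perplexity(quantized_nlls) - perplexity(baseline_nlls)

# Made-up per-token NLLs, for illustration only.
fp16_nlls = [1.20, 0.95, 1.40, 1.10]
int4_nlls = [1.25, 1.00, 1.42, 1.18]
print("PPL increase:", round(ppl_delta(fp16_nlls, int4_nlls), 4))
```

Evaluating both models on the same tokenized text keeps the delta comparable across runs.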
Agent Features
Memory
- KV cache quantization (group-wise asymmetric)
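Group-wise asymmetric quantization stores, for each group of cached values, a scale and a zero point alongside the 4-bit codes. The sketch below shows one plausible form of that scheme in pure Python. The min/max range choice, the function names, and the sample vector are assumptions for illustration and may differ from what the published kernels do.

```python
def quantize_group_asym(values, bits=4):
    """Asymmetric quantization of one group: code = round((v - zero) / scale)."""
    levels = (1 << bits) - 1                      # 15 for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0             # constant group: any scale works
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo                       # lo acts as the zero point

def dequantize_group(codes, scale, zero):
    return [c * scale + zero for c in codes]

def quantize_groupwise(vec, group_size, bits=4):
    """Split vec into contiguous groups and quantize each independently."""
    if len(vec) % group_size:
        raise ValueError("vector length must be a multiple of group_size")
    return [quantize_group_asym(vec[i:i + group_size], bits)
            for i in range(0, len(vec), group_size)]

# Toy cached key vector (assumption): two groups with very different ranges.
kv_vector = [0.0, 1.5, -2.0, 3.0, 10.0, 10.5, 11.0, 9.0]
groups = quantize_groupwise(kv_vector, group_size=4)
restored = [v for codes, scale, zero in groups
            for v in dequantize_group(codes, scale, zero)]
print(restored)
```

Because each group has its own range, the second group (values near 10) is not forced onto the coarse grid that the first group's wider range would need.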
Tool Use
- Hadamard rotations fused into weights
- online per-token symmetric quantization
Frameworks
- PyTorch
- CUTLASS INT4 kernels
Architectures
- Transformer
Optimization Features
Infra Optimization
- designed and measured on RTX 3090 consumer GPUs
Model Optimization
- fuse randomized Hadamard rotations into weights to increase incoherence (spread outliers evenly across channels)
- per-column GPTQ or RTN weight quantization
System Optimization
- 3×–3.3× prefill speedup (large batch)
- ≈3.6–3.9× KV memory reduction during decoding
Inference Optimization
- all INT4 matmuls with INT32 accumulation and FP16 cast
- online Hadamard transforms to remove activation outliers
- on-the-fly per-token symmetric activation quantization
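The inference path listed above (per-token symmetric activation quantization, integer matmul, then rescaling) can be sketched in a few lines. This is an illustrative pure-Python model, not the CUTLASS kernel: Python integers stand in for the INT32 accumulator, the FP16 cast is not modeled, weights use round-to-nearest per column, and all names are hypothetical.

```python
def quantize_symmetric(values, bits=4):
    """Symmetric round-to-nearest: scale = max|v| / qmax, codes in [-qmax, qmax]."""
    qmax = (1 << (bits - 1)) - 1                  # 7 for INT4
    scale = max(abs(v) for v in values) / qmax or 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def int4_linear(x_rows, w):
    """Approximate y = x @ w using 4-bit codes and integer accumulation."""
    # Weights: one scale per output column, quantized once (offline).
    w_cols = [quantize_symmetric(col) for col in zip(*w)]
    out = []
    for row in x_rows:
        # Activations: one scale per token, computed on the fly.
        x_codes, x_scale = quantize_symmetric(row)
        out.append([
            # Integer dot product first, then a single rescale per output.
            sum(a * b for a, b in zip(x_codes, w_codes)) * x_scale * w_scale
            for w_codes, w_scale in w_cols
        ])
    return out

x_rows = [[0.9, -0.4, 0.1, 0.7]]
w = [[0.5, -0.2], [0.1, 0.8], [-0.7, 0.3], [0.2, -0.6]]
y_int4 = int4_linear(x_rows, w)
y_ref = [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
         for row in x_rows]
print("int4:", y_int4, "reference:", y_ref)
```

Keeping the scales outside the dot product is what allows the inner loop to run entirely on integers.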
Reproducibility
Code Urls
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Benefits depend on fast INT4 GEMM kernels; you need CUDA/CUTLASS support or similar.
- Random orthogonal transforms perform worse than Hadamard; fast Hadamard transforms are simplest when the hidden dimension has power-of-two factors.
- LLAMA-3 and some smaller models are more sensitive to 4-bit quantization; quality varies by model size and group size.
When Not To Use
- If your target hardware lacks fast INT4 support or optimized kernels.
- When you cannot afford any drop in small-model accuracy; RTN 4-bit can fail on small models.
- For extreme low-latency single-token decode on tiny batches, where INT4 cache overhead may make decoding slower than FP16.
Failure Modes
- Round-to-nearest (RTN) at 4 bits can cause large quality drops on small models; GPTQ is safer for small sizes.
- Using random orthogonal matrices instead of structured Hadamard increases perplexity.
- Group-size trade-offs: smaller groups improve accuracy but increase metadata and kernel complexity.
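The metadata cost behind the group-size trade-off can be estimated with simple arithmetic. The sketch below assumes each group stores one 16-bit scale and one 16-bit zero point next to its 4-bit codes; those storage widths are an assumption for illustration, and the numbers cover only the quantized tensor, not end-to-end peak memory.

```python
def effective_bits(group_size, bits=4, scale_bits=16, zero_bits=16):
    """Average stored bits per element, including per-group metadata."""
    return bits + (scale_bits + zero_bits) / group_size

def compression_vs_fp16(group_size, **kwargs):
    """Size ratio of an FP16 tensor to its group-quantized form."""
    return 16 / effective_bits(group_size, **kwargs)

for g in (32, 64, 128):
    print(f"group={g:4d}  bits/elem={effective_bits(g):.3f}  "
          f"ratio={compression_vs_fp16(g):.2f}x")
```

Halving the group size doubles the metadata overhead per element, so the accuracy gain from smaller groups has to be weighed against a lower compression ratio.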
Core Entities
Models
- LLAMA-2 (7B,13B,70B)
- LLAMA-3 (8B,70B)
- Phi-3-mini-4k-instruct
Metrics
- Perplexity (PPL)
- Accuracy
- Prefill time speedup
- Peak decoding memory saving
Datasets
- WikiText-2 (used for perplexity/calibration)
- Zero-shot tasks: PIQA, WinoGrande, HellaSwag, LAMBADA, ARC-Easy, ARC-Challenge
Benchmarks
- WikiText-2 perplexity
- Accuracy

