Keep early 'pivot' tokens' KV cache full‑precision to cut quantization error and restore LLM accuracy

March 2, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

Method is simple, low‑risk and plugs into existing quantizers; evidence covers multiple models, quantizers, and benchmarks, though tests do not cover every LLM task or long‑context behaviour.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Ruikang Liu, Haoli Bai, Haokun Lin, Yuening Li, Han Gao, Zhengzhuo Xu, Lu Hou, Jun Yao, Chun Yuan

Why It Matters For Business

A tiny precomputed full‑precision KV prefix fixes a large fraction of quality loss from 3–4 bit quantization, enabling cheaper LLM serving while keeping near‑full performance.

Summary TLDR

Quantizing LLMs to 3–4 bit weights/activations often fails because very large activation outliers appear on a few early tokens ("pivot tokens" such as [BOS], ",", "."). IntactKV stores the full‑precision key/value (KV) cache for those early tokens and feeds it as a prefix to the quantized model. This simple plugin adds almost no runtime overhead, can be calibrated with a tiny dataset (128 samples, ~10–20 minutes per 7–13B model), and consistently reduces quantization loss across AWQ, GPTQ, OmniQuant, and QuaRot on LLaMA, Vicuna, and other models. Examples: the MT‑Bench GPT‑4 score for a 13B Vicuna improved from 5.17 to 5.34, and to 5.44 after calibration (INT3); INT4 weight+activation results match or approach full‑

Problem Statement

Low-bit quantization reduces memory and compute but breaks LLM quality because extreme activation outliers appear on a few initial tokens (pivot tokens). Those tokens attract attention (attention sinks) and small quantization distortions there propagate widely, hurting generation and downstream accuracy.

Main Contribution

Discovery and analysis of token‑specific activation outliers ("pivot tokens") that cause attention sinks and high quantization sensitivity.

IntactKV: keep the KV cache of pivot tokens lossless from the full‑precision model and prepend it to the quantized model at inference, requiring no extra runtime cost.
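The prepend step can be pictured with a toy single-layer sketch. Everything here is an assumption made for illustration: `quantize_weights` is a plain round-to-nearest stand-in (not AWQ, GPTQ, OmniQuant, or QuaRot), `intact_kv_cache` is a hypothetical helper name, and keys/values depend only on the token vector, whereas in a real multi-layer model the prefix KV for every layer would come from one forward pass of the full-precision model.

```python
import random


def matvec(x, w):
    """Row vector x (length d_in) times matrix w (d_in x d_out)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def quantize_weights(w, bits=3):
    """Round-to-nearest absmax quantization of a weight matrix (stand-in quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for row in w for v in row) / qmax
    return [[round(v / scale) * scale for v in row] for row in w]


def kv_cache(tokens, w_k, w_v):
    """Per-token (key, value) pairs for one toy attention head."""
    return [(matvec(x, w_k), matvec(x, w_v)) for x in tokens]


def intact_kv_cache(tokens, fp_weights, q_weights, num_pivot=1):
    """Full-precision KV for the first num_pivot tokens, quantized-model KV for the rest."""
    prefix = kv_cache(tokens[:num_pivot], *fp_weights)  # computed once, offline
    rest = kv_cache(tokens[num_pivot:], *q_weights)     # computed by the quantized model
    return prefix + rest


rng = random.Random(0)
d = 4
w_k = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]
w_v = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]
fp_weights = (w_k, w_v)
q_weights = (quantize_weights(w_k), quantize_weights(w_v))

# Token 0 plays the role of [BOS] with one massive-activation channel (made-up values).
tokens = [[1000.0, 0.2, -0.1, 0.3]]
tokens += [[rng.gauss(0, 1) for _ in range(d)] for _ in range(3)]

fp_cache = kv_cache(tokens, *fp_weights)
q_cache = kv_cache(tokens, *q_weights)
mixed = intact_kv_cache(tokens, fp_weights, q_weights, num_pivot=1)

print(mixed[0] == fp_cache[0])   # True: the pivot entry is the lossless one
print(mixed[1:] == q_cache[1:])  # True: later entries come from the quantized model
```

Because the prefix is computed ahead of time, the quantized model only has to process the tokens that follow it, which is where the "no extra runtime cost" claim comes from.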

Key Findings

Pivot tokens produce very large activation peaks and create attention sinks that dominate attention scores.

Numbers: activation peaks >1e3 at pivot channels

Practical Use: Protect the KV cache for those first tokens so that a small quantization error does not cascade into large attention changes.

Evidence Ref: Figure 1, Appendix C.1
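One generic way to see why a peak above 1e3 is so damaging is to look at what it does to a shared quantization scale. The sketch below assumes symmetric absmax round-to-nearest quantization with a single scale for all values; the quantizers named in this summary use more refined schemes, and the numbers are made up for illustration.

```python
def absmax_quantize(values, bits=4):
    """Symmetric round-to-nearest quantization with one shared scale, returned dequantized."""
    qmax = 2 ** (bits - 1) - 1                          # 7 for INT4
    scale = max(abs(v) for v in values) / qmax or 1.0   # fall back to 1.0 for all-zero input
    return [round(v / scale) * scale for v in values]


normal = [0.5, -0.3, 0.8, 0.1]     # ordinary activations (illustrative values)
with_pivot = normal + [1000.0]     # the same values plus one >1e3 peak

print(absmax_quantize(normal))          # small values stay close to the originals
print(absmax_quantize(with_pivot)[:4])  # the step is now ~143, so all four collapse to 0.0
```

With the outlier present, every ordinary value falls below half a quantization step and is rounded away, which is the kind of distortion that keeping the pivot tokens' KV lossless is meant to sidestep.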

Keeping more pivot tokens' KV cache lossless monotonically reduces quantization MSE in attention and transformer outputs.

Practical Use: Make the IntactKV prefix cover at least the [BOS] token; for fine‑tuned chat models, include the system prompt to cover more pivots.

Evidence Ref: Figure 2
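A curve like the one this finding describes could be measured along the following lines. This is a toy sketch under loud assumptions: `distort` snaps cache entries to a coarse grid as a stand-in for quantization error, the data are random, and there is no attention sink, so the script makes no claim about the shape of the curve. The only guaranteed property is that the error is exactly zero once every entry is kept lossless.

```python
import math
import random


def attend(query, keys, values):
    """Single-head softmax attention output for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)                                   # subtract max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    width = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(width)]


def distort(vec, step=0.5):
    """Stand-in for quantization error: snap every entry to a coarse grid."""
    return [round(x / step) * step for x in vec]


def attention_mse(query, keys, values, num_intact):
    """MSE of the attention output when only the first num_intact KV pairs stay lossless."""
    lossy_keys = keys[:num_intact] + [distort(k) for k in keys[num_intact:]]
    lossy_values = values[:num_intact] + [distort(v) for v in values[num_intact:]]
    exact = attend(query, keys, values)
    approx = attend(query, lossy_keys, lossy_values)
    return sum((a - b) ** 2 for a, b in zip(exact, approx)) / len(exact)


rng = random.Random(0)
d, n = 8, 6
keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
query = [rng.gauss(0, 1) for _ in range(d)]

for num_intact in range(n + 1):
    print(num_intact, attention_mse(query, keys, values, num_intact))
```

On a real model the same loop would compare the quantized model's layer outputs against the full-precision ones while growing the lossless prefix one token at a time.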

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
PPL (C4) | AWQ+IntactKV reduces PPL compared to AWQ in INT3/INT4 settings | AWQ alone | example: LLaMA-7B INT4 OmniQuant 17.03 → 16.24 (−0.79) | C4 | Table 6 (INT4 results) | Table 6
MMLU (5-shot average) | Vicuna‑v1.3 13B AWQ 48.56 → +IntactKV 49.05 | AWQ (INT3-group128) | +0.49 pp | MMLU 5-shot | Table 3 (INT3 weight-only results) | Table 3

What To Try In 7 Days

Generate and save FP KV cache for the [BOS] token from your FP model and prepend it to your quantized model during inference.

If you deploy a chat/SFT model, generate FP KV for the system prompt instead of just [BOS]; measure MMLU/MT‑Bench before/after.

Run a 128‑sample calibration pass to fine‑tune the stored KV (20 epochs, AdamW lr=2e‑4) — takes ~10 min on a single device for 7B models.
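The calibration step treats the stored KV as trainable parameters initialised from the full-precision values. The sketch below is a toy version under stated assumptions: the MSE objective between quantized and full-precision outputs, the one-channel-per-parameter forward model in `loss_and_grad`, the full-batch step per epoch, and `weight_decay=0.0` are all illustrative choices not given in this summary. `adamw_step` is a hand-written update so the example runs with the standard library alone; a real run would use a framework optimizer such as `torch.optim.AdamW`.

```python
import math


def adamw_step(params, grads, state, lr=2e-4, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0):
    """One AdamW update on a flat list of floats (decoupled weight decay)."""
    state["t"] += 1
    t = state["t"]
    for i, g in enumerate(grads):
        params[i] -= lr * weight_decay * params[i]
        state["m"][i] = betas[0] * state["m"][i] + (1 - betas[0]) * g
        state["v"][i] = betas[1] * state["v"][i] + (1 - betas[1]) * g * g
        m_hat = state["m"][i] / (1 - betas[0] ** t)
        v_hat = state["v"][i] / (1 - betas[1] ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)


def loss_and_grad(stored_v, samples, fp_v):
    """MSE between toy quantized and FP outputs, and its gradient w.r.t. stored_v."""
    d = len(stored_v)
    loss, grad = 0.0, [0.0] * d
    for sink_weight, rest_error in samples:
        for j in range(d):
            # (quantized output - FP output) for this sample and channel
            diff = sink_weight * (stored_v[j] - fp_v[j]) + (1 - sink_weight) * rest_error[j]
            loss += diff * diff
            grad[j] += 2 * sink_weight * diff
    scale = len(samples) * d
    return loss / scale, [g / scale for g in grad]


fp_v = [0.3, -0.2, 0.5, 0.1]               # FP value vector of the pivot token (made up)
# 128 toy samples: attention weight on the pivot, plus a systematic error from the other tokens.
samples = [(0.6 + 0.3 * (i % 8) / 8, [0.05, -0.04, 0.03, 0.06]) for i in range(128)]

stored_v = list(fp_v)                      # start from the intact (lossless) value
state = {"t": 0, "m": [0.0] * 4, "v": [0.0] * 4}

initial_loss, _ = loss_and_grad(stored_v, samples, fp_v)
for epoch in range(20):                    # 20 epochs, one full-batch step each (assumption)
    _, grad = loss_and_grad(stored_v, samples, fp_v)
    adamw_step(stored_v, grad, state)
final_loss, _ = loss_and_grad(stored_v, samples, fp_v)

print(initial_loss, final_loss)            # the stored value drifts to absorb part of the error
```

The point of the design is that only the small stored prefix is updated; the quantized weights stay frozen, which is why the pass is cheap.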

Optimization Features

Token Efficiency: protects early tokens that disproportionately affect attention
Infra Optimization: works with existing group/per-head quantizers (group size 128)
Model Optimization: weight quantization, activation quantization, KV cache mixed precision
System Optimization: keeps a small FP16 prefix; the rest of the KV cache can be quantized
Training Optimization: lightweight calibration of stored KV (20 epochs, 128 samples)
Inference Optimization: zero additional runtime overhead (precomputed KV prefix); faster prefill stage when KV prefix is provided

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluation covers PPL, MMLU, commonsense QA and MT‑Bench but not all LLM abilities (e.g., very long contexts).

IntactKV requires a full‑precision model to generate the KV prefix (though the FP model can be discarded afterward).

When Not To Use

You cannot run a full‑precision model even once to generate the KV prefix.

Your application uses no stable early prompt/pivot tokens (randomized or adversarial prefixes).

Failure Modes

If pivot tokens differ across deployment prompts, the stored KV may not cover the actual pivot tokens and will then provide limited benefit.

When activation quantization forces the entire KV cache to low bits, the IntactKV prefix must be quantized too, which may reduce its benefit (though the loss is small in reported tests).

Core Entities

Models

LLaMA, LLaMA-2, LLaMA-3, Vicuna-v1.3, Vicuna-v1.5, OPT-6.7B, Mistral-7B

Metrics

PPL, Accuracy, MT-bench GPT-4 score, MSE (quantization loss)

Datasets

C4, WikiText2, Pile, ShareGPT, MMLU, MT-bench, OBQA, WinoGrande, ARC, BoolQ, HellaSwag, LAMBADA

Benchmarks

Perplexity (PPL), MMLU, Commonsense QA, MT-bench