Keep early 'pivot' tokens' KV cache full‑precision to cut quantization error and restore LLM accuracy

Overview

Decision Snapshot: Ready For Pilot

Method is simple, low‑risk and plugs into existing quantizers; evidence covers multiple models, quantizers, and benchmarks, though tests do not cover every LLM task or long‑context behaviour.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Ruikang Liu, Haoli Bai, Haokun Lin, Yuening Li, Han Gao, Zhengzhuo Xu, Lu Hou, Jun Yao, Chun Yuan

Links

Abstract / PDF / Code / Data

Why It Matters For Business

A tiny precomputed full‑precision KV prefix fixes a large fraction of quality loss from 3–4 bit quantization, enabling cheaper LLM serving while keeping near‑full performance.

Who Should Care

ML Engineer, Engineering Lead, Data Scientist, CTO

Summary TLDR

Quantization of LLMs (3–4 bit weights/activations) often fails because very large activation outliers appear on a few early tokens ("pivot tokens" such as [BOS], ",", "."). IntactKV stores the full‑precision key/value (KV) cache for those early tokens and feeds it as a prefix to the quantized model. This simple plugin adds almost no runtime overhead, can be calibrated with a tiny dataset (128 samples, ~10–20 minutes per 7–13B model), and consistently reduces quantization loss across AWQ/GPTQ/OmniQuant/QuaRot on LLaMA/Vicuna/others. Examples: the MT‑Bench GPT‑4 score for a 13B Vicuna improved from 5.17 to 5.34, and to 5.44 after calibration (INT3); INT4 weight+activation results match or approach full‑

Problem Statement

Low-bit quantization reduces memory and compute but breaks LLM quality because extreme activation outliers appear on a few initial tokens (pivot tokens). Those tokens attract attention (attention sinks) and small quantization distortions there propagate widely, hurting generation and downstream accuracy.

Main Contribution

Discovery and analysis of token‑specific activation outliers ("pivot tokens") that cause attention sinks and high quantization sensitivity.

IntactKV: keep the KV cache of pivot tokens lossless from the full‑precision model and prepend it to the quantized model at inference, requiring no extra runtime cost.

Key Findings

Pivot tokens produce very large activation peaks and create attention sinks that dominate attention scores.

Numbers: activation peaks >1e3 at pivot channels

Practical Use: Protect the KV cache for those first tokens so that a small quantization error does not cascade into large attention changes.

Evidence Ref: Figure 1, Appendix C.1
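
As a rough illustration of why peaks of this size are a problem for low-bit formats, the sketch below runs a generic symmetric per-tensor INT4 quantizer over a few made-up activation values, with and without one 1200.0 outlier sharing the same scale. The quantizer, function names, and numbers are assumptions for illustration, not the paper's setup.

```python
def quantize_dequantize(values, bits=4):
    """Symmetric per-tensor uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for INT4
    # One shared scale for the whole tensor; fall back to 1.0 for all-zero input.
    scale = max(abs(v) for v in values) / qmax or 1.0
    out = []
    for v in values:
        code = max(-qmax, min(qmax, round(v / scale)))  # clamped integer code
        out.append(code * scale)
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical "ordinary" activations plus one pivot-style outlier (> 1e3).
normal = [0.5, -1.2, 0.8, 2.0, -0.3]
outlier = 1200.0

deq_without = quantize_dequantize(normal)
deq_with = quantize_dequantize(normal + [outlier])[: len(normal)]

mse_without = mse(normal, deq_without)
mse_with = mse(normal, deq_with)

print(f"MSE on ordinary values, no outlier:   {mse_without:.5f}")
print(f"MSE on ordinary values, with outlier: {mse_with:.5f}")
```

With the outlier present, the shared scale grows so large that every ordinary value rounds to zero.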

Keeping more pivot tokens' KV cache lossless monotonically reduces quantization MSE in attention and transformer outputs.

Practical Use: Make the IntactKV prefix cover at least the [BOS] token; for fine‑tuned chat models, include the system prompt to cover more pivot tokens.

Evidence Ref: Figure 2
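
A minimal sketch of this finding on toy data: one query attends over four tokens, token 0 stands in for a pivot token with a dominant attention score, and the first n tokens keep exact keys and values while the rest are rounded to a coarse grid. The rounding quantizer, the data, and all function names are hypothetical; this is not the IntactKV code.

```python
import math


def fake_quantize(vec, step=0.5):
    """Round every entry to a coarse grid; a stand-in for low-bit KV quantization."""
    return [round(x / step) * step for x in vec]


def attention(query, keys, values):
    """Single-query, single-head softmax attention over a list of tokens."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[j] for e, v in zip(exps, values)) for j in range(dim)]


def output_mse(n_intact, query, keys, values):
    """Output MSE when only the first n_intact tokens keep their exact KV."""
    mixed_k = keys[:n_intact] + [fake_quantize(k) for k in keys[n_intact:]]
    mixed_v = values[:n_intact] + [fake_quantize(v) for v in values[n_intact:]]
    ref = attention(query, keys, values)
    out = attention(query, mixed_k, mixed_v)
    return sum((a - b) ** 2 for a, b in zip(ref, out)) / len(ref)


# Made-up toy data: token 0 plays the pivot token (attention sink).
query = [2.0, 1.0]
keys = [[3.2, 1.3], [0.2, 0.4], [0.6, -0.3], [-0.1, 0.7]]
values = [[0.2, -0.2], [1.3, 0.6], [-0.7, 1.2], [0.4, -1.1]]

errors = [output_mse(n, query, keys, values) for n in range(len(keys) + 1)]
for n, err in enumerate(errors):
    print(f"intact prefix = {n} tokens, output MSE = {err:.2e}")
```

In this toy setup, protecting token 0 alone removes most of the output error, and the error is exactly zero once every token is intact. Real models stack many layers and heads, so the sketch shows the mechanism rather than realistic magnitudes.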

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
PPL (C4)	AWQ+INTACTKV reduces PPL compared to AWQ in INT3/INT4 settings	AWQ alone	example: LLaMA-7B INT4 OmniQuant 17.03→16.24 (−0.79)	C4	Table 6 (INT4 results)	Table 6
MMLU (5-shot average)	Vicuna‑v1.3‑13B AWQ 48.56 → +INTACTKV 49.05	AWQ (INT3-group128)	+0.49 pp	MMLU 5-shot	Table 3 (INT3 weight-only results)	Table 3

What To Try In 7 Days

Generate and save FP KV cache for the [BOS] token from your FP model and prepend it to your quantized model during inference.

If you deploy a chat/SFT model, generate FP KV for the system prompt instead of just [BOS]; measure MMLU/MT‑Bench before/after.

Run a 128‑sample calibration pass to fine‑tune the stored KV (20 epochs, AdamW lr=2e‑4) — takes ~10 min on a single device for 7B models.
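
The calibration step can be pictured with a toy that treats the stored prefix as a trainable parameter. The sketch below assumes an MSE objective between full-precision and quantized attention outputs (this summary does not state the loss), trains only the prefix value vector, and swaps AdamW for plain gradient descent with a toy learning rate. All data and names are made up.

```python
import math


def fake_quantize(vec, step=0.5):
    """Coarse rounding as a stand-in for low-bit KV quantization."""
    return [round(x / step) * step for x in vec]


def attn_weights(query, keys):
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(weights, values):
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


# Hypothetical full-precision cache; token 0 is the protected prefix.
keys = [[1.2, 0.9], [0.2, 0.4], [0.6, -0.3], [-0.1, 0.7]]
values = [[0.2, -0.2], [1.3, 0.6], [-0.7, 1.2], [0.4, -1.1]]
calib_queries = [[2.0, 1.0], [1.0, 1.5], [1.5, 0.5]]  # tiny calibration set

# Targets come from the full-precision cache.
targets = [attend(attn_weights(q, keys), values) for q in calib_queries]

# Quantized cache: token 0 stays exact, the tail is quantized.
q_keys = keys[:1] + [fake_quantize(k) for k in keys[1:]]
q_tail_values = [fake_quantize(v) for v in values[1:]]
weights = [attn_weights(q, q_keys) for q in calib_queries]  # keys are frozen


def loss_and_grad(prefix_value):
    """Mean squared output error and its gradient w.r.t. the prefix value."""
    dim = len(prefix_value)
    n = len(calib_queries) * dim
    loss, grad = 0.0, [0.0] * dim
    for w, target in zip(weights, targets):
        out = attend(w, [prefix_value] + q_tail_values)
        for j in range(dim):
            residual = out[j] - target[j]
            loss += residual ** 2 / n
            grad[j] += 2.0 * w[0] * residual / n
    return loss, grad


prefix_value = list(values[0])      # start from the full-precision value
initial_loss, _ = loss_and_grad(prefix_value)
lr = 0.5                            # toy step size, not the paper's 2e-4
for epoch in range(20):             # 20 epochs, as in the recipe above
    _, grad = loss_and_grad(prefix_value)
    prefix_value = [p - lr * g for p, g in zip(prefix_value, grad)]
final_loss, _ = loss_and_grad(prefix_value)

print(f"loss before calibration: {initial_loss:.6f}")
print(f"loss after calibration:  {final_loss:.6f}")
```

The prefix shifts slightly away from its full-precision value to compensate for the error introduced by the quantized tail. In a real pilot the same idea would apply to the stored keys and values of every layer, optimized with the settings listed above.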

Optimization Features

Token Efficiency

protects early tokens that disproportionately affect attention

Infra Optimization

works with existing group/per-head quantizers (group size 128)

Model Optimization

weight quantization, activation quantization, KV cache mixed precision

System Optimization

keeps small FP16 prefix; rest of KV/cache can be quantized

Training Optimization

lightweight calibration of stored KV (20 epochs, 128 samples)

Inference Optimization

zero additional runtime overhead (precomputed KV prefix); faster prefill stage when KV prefix is provided

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/ruikangliu/IntactKV

Data URLs

https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered

Risks & Boundaries

Limitations

Evaluation covers PPL, MMLU, commonsense QA and MT‑Bench but not all LLM abilities (e.g., very long contexts).

INTACTKV requires a full‑precision model to generate the KV prefix (though the FP model can be discarded afterward).

When Not To Use

You cannot run a full‑precision model even once to generate the KV prefix.

Your application uses no stable early prompt/pivot tokens (randomized or adversarial prefixes).

Failure Modes

Pivot tokens differ across deployment prompts; the stored KV may not cover the actual pivot tokens and may then provide limited benefit.

When activation quantization forces all KV to low bits, quantizing INTACTKV may reduce its benefit (though loss is small in reported tests).

Core Entities

Models

LLaMA, LLaMA-2, LLaMA-3, Vicuna-v1.3, Vicuna-v1.5, OPT-6.7B, Mistral-7B

Metrics

PPL, Accuracy, MT-bench GPT-4 score, MSE (quantization loss)

Datasets

C4, WikiText2, Pile, ShareGPT, MMLU, MT-bench, OBQA, WinoGrande, ARC, BoolQ, HellaSwag, LAMBADA

Benchmarks

Perplexity (PPL), MMLU, Commonsense QA, MT-bench
