Overview
The method is simple, low‑risk, and plugs into existing quantizers. Evidence covers multiple models, quantizers, and benchmarks, but the tests do not cover every LLM task or long‑context behaviour.
Citations: 2
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
A tiny precomputed full‑precision KV prefix fixes a large fraction of quality loss from 3–4 bit quantization, enabling cheaper LLM serving while keeping near‑full performance.
Summary TLDR
Low‑bit quantization of LLMs (3–4 bit weights/activations) often degrades quality because very large activation outliers appear on a few early tokens ("pivot tokens" such as [BOS], ",", and "."). IntactKV stores the full‑precision key/value (KV) cache for those early tokens and feeds it as a prefix to the quantized model. This simple plug‑in adds almost no runtime overhead, can be calibrated with a tiny dataset (128 samples, ~10–20 minutes per 7–13B model), and consistently reduces quantization loss across AWQ, GPTQ, OmniQuant, and QuaRot on LLaMA, Vicuna, and other models. Example: the MT‑Bench GPT‑4 score for a 13B Vicuna improved from 5.17 to 5.34, and to 5.44 after calibration (INT3); INT4 weight+activation results match or approach full‑
Problem Statement
Low-bit quantization reduces memory and compute but degrades LLM quality because extreme activation outliers appear on a few initial tokens (pivot tokens). Those tokens attract attention (they act as attention sinks), so small quantization distortions there propagate widely, hurting generation and downstream accuracy.
Main Contribution
Discovery and analysis of token‑specific activation outliers ("pivot tokens") that cause attention sinks and high quantization sensitivity.
IntactKV: generate the KV cache of the pivot tokens losslessly with the full‑precision model and prepend it to the quantized model at inference, at no extra runtime cost.
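The prefix idea can be illustrated with a minimal single-layer, single-head sketch in plain Python. Everything here is a toy assumption rather than the authors' implementation: the weights, the hidden states, the `quantize_row` round-to-nearest quantizer (a stand-in for AWQ/GPTQ), and the helper names are all hypothetical. In a real model the prefix KV would be stored for every layer and head.

```python
import math

def quantize_row(row, bits=3):
    """Symmetric round-to-nearest quantization (toy stand-in for a real quantizer)."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in row) / levels or 1.0
    return [round(x / scale) * scale for x in row]

def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def build_cache(hidden, fp_w, q_w, n_intact):
    """Return [(key, value), ...]; the first n_intact entries use full-precision weights."""
    cache = []
    for t, h in enumerate(hidden):
        w_k, w_v = fp_w if t < n_intact else q_w
        cache.append((matvec(w_k, h), matvec(w_v, h)))
    return cache

def attend(q, cache):
    """Single-head softmax attention of query q over the cached keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k, _ in cache]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    dim = len(cache[0][1])
    return [sum(e / z * v[d] for e, (_, v) in zip(exps, cache)) for d in range(dim)]

# Hypothetical toy weights and hidden states; token 0 plays the pivot token.
W_K = [[0.9, -0.31], [0.12, 0.77]]
W_V = [[0.45, 0.2], [-0.63, 0.81]]
FP_W = (W_K, W_V)
Q_W = ([quantize_row(r) for r in W_K], [quantize_row(r) for r in W_V])
HIDDEN = [[8.0, -6.0], [0.3, 0.5], [-0.2, 0.4]]

fp_cache = build_cache(HIDDEN, FP_W, Q_W, n_intact=len(HIDDEN))  # full-precision reference
quant_cache = build_cache(HIDDEN, FP_W, Q_W, n_intact=0)         # plain quantized model
intact_cache = build_cache(HIDDEN, FP_W, Q_W, n_intact=1)        # lossless prefix for token 0

query = [1.0, 0.5]
print(attend(query, fp_cache), attend(query, quant_cache), attend(query, intact_cache))
```

The only difference between `quant_cache` and `intact_cache` is the first entry, which is computed once with the full-precision weights and then reused, so the quantized path does no extra work per generated token.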
Key Findings
Pivot tokens produce very large activation peaks and create attention sinks that dominate attention scores.
Keeping more pivot tokens' KV cache lossless monotonically reduces quantization MSE in attention and transformer outputs.
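A toy measurement of the second finding, assuming a single attention head and using coarse rounding (`coarse`, a hypothetical helper) as a crude stand-in for low-bit quantization error. The keys, values, and query are made-up numbers, so this only illustrates how such a curve could be computed; it does not reproduce the paper's measurements.

```python
import math

def attend(q, keys, values):
    """Single-head softmax attention of query q over keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e / z * v[d] for e, v in zip(exps, values)) for d in range(len(values[0]))]

def coarse(vec, step=0.5):
    """Round to a coarse grid; a crude stand-in for low-bit quantization error."""
    return [round(x / step) * step for x in vec]

def mse_by_intact_count(q, keys, values):
    """Attention-output MSE when the first n tokens keep exact K/V, for n = 0..len(keys)."""
    ref = attend(q, keys, values)
    errors = []
    for n in range(len(keys) + 1):
        ks = keys[:n] + [coarse(k) for k in keys[n:]]
        vs = values[:n] + [coarse(v) for v in values[n:]]
        out = attend(q, ks, vs)
        errors.append(sum((o - r) ** 2 for o, r in zip(out, ref)) / len(ref))
    return errors

# Hypothetical toy data; token 0 has an outlier key and draws most of the attention.
KEYS = [[6.2, -4.3], [0.3, 0.6], [-0.4, 0.2]]
VALUES = [[0.1, -0.2], [1.3, 0.7], [-0.6, 1.1]]
errors = mse_by_intact_count([1.0, 0.5], KEYS, VALUES)
print(errors)
```

On this toy input the error shrinks at every step as more leading tokens are kept exact, and it is zero once all of them are. The largest drop comes from protecting token 0, the one that receives most of the attention weight.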
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| PPL (C4) | AWQ+IntactKV reduces PPL compared to AWQ in INT3/INT4 settings | AWQ alone | example: LLaMA-7B INT4 OmniQuant 17.03→16.24 (−0.79) | C4 | Table 6 (INT4 results) | Table 6 |
| MMLU (5-shot average) | Vicuna‑v1.3‑13B AWQ 48.56 → +IntactKV 49.05 | AWQ (INT3-group128) | +0.49 pp | MMLU 5-shot | Table 3 (INT3 weight-only results) | Table 3 |
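The deltas in the table follow directly from the quoted before/after values. A small arithmetic check, with hypothetical helper names, also gives the relative perplexity change, which the table does not state:

```python
def delta(before, after, ndigits=2):
    """Absolute change, rounded to the precision used in the table."""
    return round(after - before, ndigits)

def relative_change_pct(before, after, ndigits=1):
    """Relative change in percent."""
    return round(100.0 * (after - before) / before, ndigits)

ppl_delta = delta(17.03, 16.24)             # perplexity on C4, lower is better
mmlu_delta = delta(48.56, 49.05)            # MMLU 5-shot accuracy, percentage points
ppl_rel = relative_change_pct(17.03, 16.24)

print(ppl_delta, mmlu_delta, ppl_rel)       # -0.79 0.49 -4.6
```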
What To Try In 7 Days
Generate and save FP KV cache for the [BOS] token from your FP model and prepend it to your quantized model during inference.
If you deploy a chat/SFT model, generate FP KV for the system prompt instead of just [BOS]; measure MMLU/MT‑Bench before/after.
Run a 128‑sample calibration pass to fine‑tune the stored KV (20 epochs, AdamW, lr=2e‑4); this takes ~10 min on a single device for 7B models.
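The calibration step can be pictured as treating the stored prefix KV as trainable parameters and minimizing the gap between quantized and full-precision outputs. The sketch below is a deliberately simplified assumption, not the recipe above: it uses plain gradient descent instead of AdamW, a toy learning rate of 0.1 instead of 2e‑4, trains only the prefix value vector (the prefix key stays frozen), and runs on made-up single-head data. All names are hypothetical.

```python
import math

def attend(q, keys, values):
    """Return (attention output, attention weights) for a single head."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    w = [e / z for e in exps]
    out = [sum(wi * v[d] for wi, v in zip(w, values)) for d in range(len(values[0]))]
    return out, w

def calibrate_prefix_value(queries, fp_keys, fp_values, q_keys, q_values,
                           epochs=20, lr=0.1):
    """Tune the stored prefix value v0 so quantized attention outputs match FP outputs."""
    targets = [attend(q, fp_keys, fp_values)[0] for q in queries]
    v0 = list(fp_values[0])             # start from the lossless prefix value
    keys = [fp_keys[0]] + q_keys[1:]    # prefix key stays full precision (frozen here)
    losses = []
    for _ in range(epochs):
        values = [v0] + q_values[1:]
        grad = [0.0] * len(v0)
        loss = 0.0
        for q, target in zip(queries, targets):
            out, w = attend(q, keys, values)
            err = [o - t for o, t in zip(out, target)]
            loss += sum(e * e for e in err)
            # d(loss)/d(v0) = 2 * (attention weight on the prefix) * err
            for d in range(len(v0)):
                grad[d] += 2.0 * w[0] * err[d]
        losses.append(loss)             # loss measured before this epoch's update
        v0 = [v - lr * g for v, g in zip(v0, grad)]
    return v0, losses

# Hypothetical toy data: full-precision K/V and a coarsely rounded "quantized" copy.
FP_KEYS = [[6.2, -4.3], [0.3, 0.6], [-0.4, 0.2]]
FP_VALUES = [[0.1, -0.2], [1.3, 0.7], [-0.6, 1.1]]
Q_KEYS = [[6.0, -4.5], [0.5, 0.5], [-0.5, 0.0]]
Q_VALUES = [[0.0, 0.0], [1.5, 0.5], [-0.5, 1.0]]
QUERIES = [[1.0, 0.5], [0.5, 1.0], [-0.5, 0.2]]

v0, losses = calibrate_prefix_value(QUERIES, FP_KEYS, FP_VALUES, Q_KEYS, Q_VALUES)
print(losses[0], losses[-1])
```

Because only the prefix entry is trained, the number of tuned parameters is tiny compared with the model, which is consistent with the short calibration time quoted above.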
Risks & Boundaries
Limitations
Evaluation covers PPL, MMLU, commonsense QA and MT‑Bench but not all LLM abilities (e.g., very long contexts).
IntactKV requires a full‑precision model to generate the KV prefix (though the FP model can be discarded afterward).
When Not To Use
You cannot run a full‑precision model even once to generate the KV prefix.
Your application uses no stable early prompt/pivot tokens (randomized or adversarial prefixes).
Failure Modes
Pivot tokens differ across deployment prompts; the stored KV may then not cover the actual pivot tokens and provide only limited benefit.
When activation quantization forces all KV to low bits, quantizing IntactKV may reduce its benefit (though the loss is small in the reported tests).
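One way to guard against the first failure mode is to check, before serving a request, how much of the stored prefix the incoming prompt actually shares. This is a minimal sketch assuming a causal decoder, where the KV of a token depends only on the tokens before it, so the leading matched part of a stored prefix stays valid. The token IDs and function names are hypothetical.

```python
def usable_prefix_len(stored_ids, prompt_ids):
    """Number of leading stored tokens that the prompt reproduces exactly."""
    n = 0
    for stored, actual in zip(stored_ids, prompt_ids):
        if stored != actual:
            break
        n += 1
    return n

def can_reuse_whole_prefix(stored_ids, prompt_ids):
    """True if the prompt starts with every token of the stored prefix."""
    return usable_prefix_len(stored_ids, prompt_ids) == len(stored_ids)

# Hypothetical token IDs: 1 stands for [BOS], the rest for a system prompt.
STORED = [1, 15, 22]
print(usable_prefix_len(STORED, [1, 15, 22, 7, 9]))    # 3
print(usable_prefix_len(STORED, [1, 15, 99, 4]))       # 2
print(can_reuse_whole_prefix(STORED, [2, 15, 22]))     # False
```

If only part of the prefix matches, the stored KV could be truncated to the matched length; if nothing matches, the request falls back to the plain quantized model.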

