Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
2
Why It Matters For Business
A tiny precomputed full‑precision KV prefix fixes a large fraction of quality loss from 3–4 bit quantization, enabling cheaper LLM serving while keeping near‑full performance.
Summary TLDR
Quantization of LLMs (3–4 bit weights/activations) often fails because very large activation outliers appear on a few early tokens ("pivot tokens" like [BOS], ",", "."). IntactKV stores the full‑precision key/value (KV) cache for those early tokens and feeds it as a prefix to the quantized model. This simple plugin costs almost no runtime overhead, can be calibrated with a tiny dataset (128 samples, ~10–20 minutes per 7–13B model), and consistently reduces quantization loss across AWQ/GPTQ/OmniQuant/QuaRot on LLaMA/Vicuna/others. Examples: MT‑Bench GPT‑4 score for a 13B Vicuna improved from 5.17→5.34 and to 5.44 after calibration (INT3); INT4 weight+activation results match or approach full‑
Problem Statement
Low-bit quantization reduces memory and compute but breaks LLM quality because extreme activation outliers appear on a few initial tokens (pivot tokens). Those tokens attract attention (attention sinks) and small quantization distortions there propagate widely, hurting generation and downstream accuracy.
Main Contribution
Discovery and analysis of token‑specific activation outliers ("pivot tokens") that cause attention sinks and high quantization sensitivity.
IntactKV: keep the KV cache of pivot tokens lossless from the full‑precision model and prepend it to the quantized model at inference, requiring no extra runtime cost.
Calibrate IntactKV as a tiny set of trainable KV parameters to compensate residual quantization error (cheap: 128 samples, ~10–20 minutes for 7–13B).
Mathematical bound showing removing pivot token errors lowers the attention output error; broad empirical gains across quantizers, models, and benchmarks.
Key Findings
Pivot tokens produce very large activation peaks and create attention sinks that dominate attention scores.
Keeping more pivot tokens' KV cache lossless monotonically reduces quantization MSE in attention and transformer outputs.
Small but consistent downstream accuracy gains across multiple tasks and quantizers.
Calibrating IntactKV recovers generation quality judged by GPT‑4 on MT‑Bench.
IntactKV improves weight+activation quantization and helps reach new SOTA in some settings.
Results
PPL (C4)
MMLU (5-shot average)
Commonsense QA (avg acc)
MT‑Bench (GPT‑4 score, scale 10)
Who Should Care
What To Try In 7 Days
Generate and save FP KV cache for the [BOS] token from your FP model and prepend it to your quantized model during inference.
If you deploy a chat/SFT model, generate FP KV for the system prompt instead of just [BOS]; measure MMLU/MT‑Bench before/after.
Run a 128‑sample calibration pass to fine‑tune the stored KV (20 epochs, AdamW lr=2e‑4) — takes ~10 min on a single device for 7B models.
Optimization Features
Token Efficiency
- protects early tokens that disproportionately affect attention
Infra Optimization
- works with existing group/ per-head quantizers (group size 128)
Model Optimization
- weight quantization
- activation quantization
- KV cache mixed precision
System Optimization
- keeps small FP16 prefix; rest of KV/cache can be quantized
Training Optimization
- lightweight calibration of stored KV (20 epochs, 128 samples)
Inference Optimization
- zero additional runtime overhead (precomputed KV prefix)
- faster prefill stage when KV prefix is provided
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation covers PPL, MMLU, commonsense QA and MT‑Bench but not all LLM abilities (e.g., very long contexts).
- INTACTKV requires a full‑precision model to generate the KV prefix (though the FP model can be discarded afterward).
- Keeping INTACTKV in FP16 increases memory slightly; quantizing the stored KV for activation quantization can incur small accuracy loss.
When Not To Use
- You cannot run a full‑precision model even once to generate the KV prefix.
- Your application uses no stable early prompt/pivot tokens (randomized or adversarial prefixes).
- Tasks that rely on very long contexts and where pivot tokens shift beyond the prefix (unknown effect).
Failure Modes
- Pivot tokens differ across deployment prompts; stored KV may not cover the actual pivot tokens and provide limited benefit.
- When activation quantization forces all KV to low bits, quantizing INTACTKV may reduce its benefit (though loss is small in reported tests).
- Overfitting INTACTKV to the calibration set could harm generalization if the calibration data is not similar to production inputs.
Core Entities
Models
- LLaMA
- LLaMA-2
- LLaMA-3
- Vicuna-v1.3
- Vicuna-v1.5
- OPT-6.7B
- Mistral-7B
Metrics
- PPL
- Accuracy
- MT-bench GPT-4 score
- MSE (quantization loss)
Datasets
- C4
- WikiText2
- Pile
- ShareGPT
- MMLU
- MT-bench
- OBQA
- WinoGrande
- ARC
- BoolQ
- HellaSwag
- LAMBADA
Benchmarks
- Perplexity (PPL)
- MMLU
- Commonsense QA
- MT-bench

