Keep early 'pivot' tokens' KV cache full‑precision to cut quantization error and restore LLM accuracy

March 2, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

2

Authors

Ruikang Liu, Haoli Bai, Haokun Lin, Yuening Li, Han Gao, Zhengzhuo Xu, Lu Hou, Jun Yao, Chun Yuan

Why It Matters For Business

A tiny precomputed full‑precision KV prefix fixes a large fraction of quality loss from 3–4 bit quantization, enabling cheaper LLM serving while keeping near‑full performance.

Summary TLDR

Quantization of LLMs (3–4 bit weights/activations) often fails because very large activation outliers appear on a few early tokens ("pivot tokens" like [BOS], ",", "."). IntactKV stores the full‑precision key/value (KV) cache for those early tokens and feeds it as a prefix to the quantized model. This simple plugin costs almost no runtime overhead, can be calibrated with a tiny dataset (128 samples, ~10–20 minutes per 7–13B model), and consistently reduces quantization loss across AWQ/GPTQ/OmniQuant/QuaRot on LLaMA/Vicuna/others. Examples: MT‑Bench GPT‑4 score for a 13B Vicuna improved from 5.17→5.34 and to 5.44 after calibration (INT3); INT4 weight+activation results match or approach full‑

Problem Statement

Low-bit quantization reduces memory and compute but breaks LLM quality because extreme activation outliers appear on a few initial tokens (pivot tokens). Those tokens attract attention (attention sinks) and small quantization distortions there propagate widely, hurting generation and downstream accuracy.
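The propagation effect described above can be illustrated with a toy single-query attention. This is a minimal sketch with made-up scores and values, not taken from the paper: token 0 stands in for a pivot token that absorbs almost all attention, and the same small distortion is applied once to its value and once to an ordinary token's value.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, values):
    """Attention output for one query: softmax-weighted sum of scalar values."""
    return sum(w * v for w, v in zip(softmax(scores), values))

# Hypothetical numbers: token 0 plays the pivot token (attention sink).
scores = [8.0, 0.0, 0.0, 0.0, 0.0]
values = [0.6, 0.2, 0.9, 0.4, 0.7]
eps = 0.1  # same-size distortion, applied to one value at a time

def output_error(index):
    """Absolute change in the attention output when values[index] is distorted."""
    perturbed = list(values)
    perturbed[index] += eps
    return abs(attend(scores, perturbed) - attend(scores, values))

sink_weight = softmax(scores)[0]
err_pivot = output_error(0)
err_other = output_error(3)
print(f"attention on pivot token: {sink_weight:.4f}")
print(f"output error, distortion on pivot token: {err_pivot:.5f}")
print(f"output error, distortion on token 3:     {err_other:.5f}")
```

Because the output error scales with the attention weight of the distorted token, an error on the sink token passes almost unattenuated into the output, while the same error on an ordinary token is nearly invisible.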

Main Contribution

Discovery and analysis of token‑specific activation outliers ("pivot tokens") that cause attention sinks and high quantization sensitivity.

IntactKV: generate the pivot tokens' KV cache losslessly with the full‑precision model and prepend it to the quantized model at inference, at no extra runtime cost.

Calibration of IntactKV as a small set of trainable KV parameters that compensate for residual quantization error (cheap: 128 samples, ~10–20 minutes for 7–13B models).

A mathematical bound showing that removing pivot‑token errors lowers the attention output error, backed by broad empirical gains across quantizers, models, and benchmarks.
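The prefix mechanism in the contributions above can be sketched with a toy scalar attention. Everything here is an illustrative assumption: the cache contents, the `quantize` rounding function (a crude stand-in for a real low-bit quantizer), and the helper name `intact_cache` are hypothetical and not the authors' code.

```python
import math

def quantize(x, step=0.25):
    """Round to a uniform grid; a crude stand-in for a low-bit quantizer."""
    return round(x / step) * step

def attend(query, cache):
    """Single-query attention over a cache of scalar (key, value) pairs."""
    scores = [query * k for k, _ in cache]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, (_, v) in zip(exps, cache))

def intact_cache(fp_cache, quant_cache, n_pivot):
    """Full-precision entries for the first n_pivot tokens, quantized for the rest."""
    return fp_cache[:n_pivot] + quant_cache[n_pivot:]

# Hypothetical toy cache; token 0 is the pivot token with an outlier key.
fp_cache = [(8.1, 0.6), (0.3, 0.2), (-0.4, 0.9), (0.1, 0.4), (0.45, 0.7)]
quant_cache = [(quantize(k), quantize(v)) for k, v in fp_cache]

query = 1.0
reference = attend(query, fp_cache)  # full-precision output
err_all_quantized = abs(attend(query, intact_cache(fp_cache, quant_cache, 0)) - reference)
err_pivot_intact = abs(attend(query, intact_cache(fp_cache, quant_cache, 1)) - reference)
print(f"error, all KV quantized:     {err_all_quantized:.5f}")
print(f"error, pivot KV kept intact: {err_pivot_intact:.5f}")
```

In this toy setup, keeping only the first token's key and value in full precision removes most of the output error, because that token carries nearly all of the attention weight. The prefix is computed once ahead of time, so the only change at inference is which cache entries the model starts from.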

Key Findings

Pivot tokens produce very large activation peaks and create attention sinks that dominate attention scores.

Numbers: activation peaks >1e3 at pivot channels

Keeping more pivot tokens' KV cache lossless monotonically reduces quantization MSE in attention and transformer outputs.

Small but consistent downstream accuracy gains across multiple tasks and quantizers.

Numbers: Vicuna‑v1.3‑13B MMLU (5‑shot) avg improved +0.49pp (48.56→49.05) with AWQ+INTACTKV

Calibrating IntactKV recovers generation quality judged by GPT‑4 on MT‑Bench.

Numbers: Vicuna‑v1.5‑13B MT‑Bench: AWQ 5.17 → +INTACTKV+Cal 5.44 (score +0.27)

IntactKV improves weight+activation quantization and helps reach new SOTA in some settings.

Numbers: INT4 average PPL improvements: OmniQuant −1.07; QuaRot −0.31

Results

PPL (C4)

Value: AWQ+INTACTKV reduces PPL compared to AWQ in INT3/INT4 settings

Baseline: AWQ alone

MMLU (5-shot average)

Value: Vicuna‑v1.3‑13B AWQ 48.56 → +INTACTKV 49.05

Baseline: AWQ (INT3-group128)

Commonsense QA (avg acc)

Value: Vicuna‑v1.3‑13B AWQ avg 64.56 → +INTACTKV 65.02

Baseline: AWQ (INT3-group128)

MT‑Bench (GPT‑4 score, scale 10)

Value: Vicuna‑v1.5‑13B AWQ 5.17 → +INTACTKV+Cal 5.44

Baseline: AWQ (INT3-group128)

Who Should Care

What To Try In 7 Days

Generate and save FP KV cache for the [BOS] token from your FP model and prepend it to your quantized model during inference.

If you deploy a chat/SFT model, generate FP KV for the system prompt instead of just [BOS]; measure MMLU/MT‑Bench before/after.

Run a 128‑sample calibration pass to fine‑tune the stored KV (20 epochs, AdamW, lr=2e‑4); this takes ~10 min on a single device for 7B models.
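The calibration step can be pictured as fitting the stored KV so that the quantized model's outputs track the full-precision ones. The sketch below is a deliberately tiny stand-in with several assumptions: it tunes a single scalar pivot value, uses a mean-squared-error loss against full-precision outputs (the loss is assumed here, not stated in this summary), swaps AdamW for plain gradient descent with a larger toy learning rate, and uses made-up sample numbers.

```python
def loss(p, samples):
    """Mean squared error between quantized-model outputs and FP targets."""
    return sum((w * p + (1.0 - w) * rest_q - target) ** 2
               for w, rest_q, target in samples) / len(samples)

def calibrate(pivot_value, samples, epochs=20, lr=0.5):
    """Tune the stored pivot value with plain gradient descent.

    Each sample is (w, rest_q, target): the attention weight on the pivot
    token, the quantized contribution of the other tokens, and the FP output.
    """
    p = pivot_value
    for _ in range(epochs):
        grad = 0.0
        for w, rest_q, target in samples:
            out = w * p + (1.0 - w) * rest_q
            grad += 2.0 * (out - target) * w  # d/dp of the squared error
        p -= lr * grad / len(samples)
    return p

# Hypothetical calibration set built from an FP pivot value of 0.6.
fp_pivot = 0.6
raw = [(0.90, 0.40, 0.50), (0.95, 0.30, 0.50), (0.99, 0.70, 0.75)]  # (w, rest_fp, rest_q)
samples = [(w, rest_q, w * fp_pivot + (1.0 - w) * rest_fp) for w, rest_fp, rest_q in raw]

loss_before = loss(fp_pivot, samples)      # start from the intact FP value
calibrated = calibrate(fp_pivot, samples)
loss_after = loss(calibrated, samples)
print(f"pivot value: {fp_pivot} -> {calibrated:.4f}")
print(f"loss: {loss_before:.2e} -> {loss_after:.2e}")
```

The point of the design is that calibration starts from the lossless KV and only nudges it, so the trainable state stays as small as the prefix itself. In a real setup the parameters would be the per-layer key and value tensors of the prefix, trained with the optimizer settings quoted above.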

Optimization Features

Token Efficiency

  • protects early tokens that disproportionately affect attention

Infra Optimization

  • works with existing group/per-head quantizers (group size 128)

Model Optimization

  • weight quantization
  • activation quantization
  • KV cache mixed precision

System Optimization

  • keeps small FP16 prefix; rest of KV/cache can be quantized

Training Optimization

  • lightweight calibration of stored KV (20 epochs, 128 samples)

Inference Optimization

  • zero additional runtime overhead (precomputed KV prefix)
  • faster prefill stage when KV prefix is provided

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation covers PPL, MMLU, commonsense QA and MT‑Bench but not all LLM abilities (e.g., very long contexts).
  • INTACTKV requires a full‑precision model to generate the KV prefix (though the FP model can be discarded afterward).
  • Keeping INTACTKV in FP16 increases memory slightly; quantizing the stored KV for activation quantization can incur small accuracy loss.
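To put a rough number on the "slight" memory increase, the size of an FP16 KV prefix can be estimated from the model shape. This is a back-of-the-envelope sketch assuming a LLaMA-7B-style configuration (32 layers, hidden size 4096, standard multi-head attention); the 32-token system prompt length is a hypothetical example.

```python
def kv_bytes_per_token(num_layers, hidden_size, bytes_per_value=2):
    """Bytes needed to cache one token's keys and values across all layers."""
    return 2 * num_layers * hidden_size * bytes_per_value  # 2 = key + value

# Assumed LLaMA-7B-style shape; FP16 means 2 bytes per stored value.
per_token = kv_bytes_per_token(num_layers=32, hidden_size=4096)

# [BOS] only versus a hypothetical 32-token system prompt.
for prefix_len in (1, 32):
    mib = prefix_len * per_token / 2**20
    print(f"{prefix_len:>3} tokens: {mib:.1f} MiB")
```

Under these assumptions each prefix token costs 0.5 MiB in FP16, which is small next to the multi-gigabyte weights of a 7B model even at low bit widths.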

When Not To Use

  • You cannot run a full‑precision model even once to generate the KV prefix.
  • Your application uses no stable early prompt/pivot tokens (randomized or adversarial prefixes).
  • Tasks that rely on very long contexts, where pivot tokens may fall outside the stored prefix (effect unknown).

Failure Modes

  • Pivot tokens differ across deployment prompts; the stored KV may not cover the actual pivot tokens and then provides limited benefit.
  • When activation quantization forces all KV to low bits, quantizing INTACTKV may reduce its benefit (though loss is small in reported tests).
  • Overfitting INTACTKV to the calibration set could harm generalization if the calibration data is not similar to production inputs.

Core Entities

Models

  • LLaMA
  • LLaMA-2
  • LLaMA-3
  • Vicuna-v1.3
  • Vicuna-v1.5
  • OPT-6.7B
  • Mistral-7B

Metrics

  • PPL
  • Accuracy
  • MT-bench GPT-4 score
  • MSE (quantization loss)

Datasets

  • C4
  • WikiText2
  • Pile
  • ShareGPT
  • MMLU
  • MT-bench
  • OBQA
  • WinoGrande
  • ARC
  • BoolQ
  • HellaSwag
  • LAMBADA

Benchmarks

  • Perplexity (PPL)
  • MMLU
  • Commonsense QA
  • MT-bench