Store more tokens at lower bit precision to shrink the KV cache and often improve long-context accuracy

December 17, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Jiebin Zhang, Dawei Zhu, Yifan Song, Wenhao Wu, Chuqiao Kuang, Xiaoguang Li, Lifeng Shang, Qun Liu, Sujian Li

Why It Matters For Business

You can cut KV-cache memory and often improve long-context accuracy by storing more tokens at lower precision. This reduces GPU memory cost for long inputs and enables longer effective context without model changes.

Summary TLDR

The paper studies KV cache compression for long-context LLM inference and introduces "quantized pruning": prune less-important tokens, then quantize the retained tokens. Key finding: under the same KV-cache memory budget, storing more tokens at lower precision (e.g., 4× tokens at 4-bit) often beats storing fewer tokens at higher precision (e.g., 1× tokens at 16-bit). Results hold across Llama and Mistral models, across datasets (LongBench, RULER, Needle-in-a-Haystack), and for many pruning/quantization methods. Very low bits (2-bit) usually collapse performance. Code: https://github.com/zhzihao/QPruningKV
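The equal-budget comparison above can be checked with simple arithmetic. The sketch below is illustrative only: `kv_cache_bytes` is a hypothetical helper, the default shape values (32 layers, 8 KV heads, head dimension 128) are assumptions chosen to resemble an 8B-class model, and quantization metadata such as scales and zero-points is ignored.

```python
def kv_cache_bytes(num_tokens, bits, num_layers=32, num_kv_heads=8, head_dim=128):
    """Approximate KV-cache payload in bytes (assumed shapes, no quantization metadata)."""
    # Keys and values: two tensors per layer, each num_tokens x num_kv_heads x head_dim.
    elements = 2 * num_layers * num_kv_heads * head_dim * num_tokens
    return elements * bits // 8


baseline = kv_cache_bytes(num_tokens=512, bits=16)      # 1x tokens at 16-bit
more_tokens = kv_cache_bytes(num_tokens=2048, bits=4)   # 4x tokens at 4-bit

# Both settings occupy the same payload, so the comparison is budget-neutral.
print(baseline, more_tokens, baseline == more_tokens)
```

Under these assumptions, quartering the bit width frees exactly enough room for four times as many tokens.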

Problem Statement

KV cache memory grows with context length and becomes a bottleneck for long-context inference. Existing methods compress either tokens (pruning) or numeric precision (quantization) separately. The paper asks whether combining both—trading precision for more tokens—yields a better memory vs. accuracy trade-off.

Main Contribution

Propose quantized pruning: prune tokens then quantize the preserved KV states to meet fixed memory budgets.

Empirically show storing more tokens at lower precision (e.g., 4× tokens at 4-bit) often outperforms fewer tokens at higher precision across budgets and models.

Analyze task types, input lengths, model scale, quantization strategies, and layer-wise effects to give practical guidance for KV cache compression.
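A minimal sketch of the prune-then-quantize idea, not the authors' implementation. It assumes a per-token importance score is already available (for example from attention weights), keeps the top-k tokens, and applies plain asymmetric uniform quantization with one group per token. All function names and the toy data are hypothetical.

```python
def prune_top_k(scores, k):
    """Indices of the k highest-scoring tokens, returned in original order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def quantize_group(values, bits):
    """Asymmetric uniform quantization of one group to integer codes."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale + zero for c in codes]


def quantized_pruning(kv_rows, scores, keep, bits):
    """Step 1: prune to `keep` tokens. Step 2: quantize each retained row."""
    kept = prune_top_k(scores, keep)
    return {i: quantize_group(kv_rows[i], bits) for i in kept}


# Toy data: four tokens with four channels each, plus made-up importance scores.
kv_rows = [[0.1, -0.4, 0.9, 0.3], [1.2, 0.0, -0.7, 0.5],
           [0.2, 0.2, -0.1, 0.8], [-1.0, 0.6, 0.4, 0.0]]
scores = [0.05, 0.60, 0.10, 0.25]
cache = quantized_pruning(kv_rows, scores, keep=2, bits=4)
print(sorted(cache))  # indices of the retained tokens
```

Real pruning methods such as SnapKV or PyramidKV and quantizers such as KIVI use more elaborate scoring and grouping; the sketch only shows the order of operations.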

Key Findings

Keeping more tokens at lower precision often beats keeping fewer tokens at full precision.

Numbers (example): Llama-3 on RULER-8k: 512 tokens @ 16-bit = 67.5 vs. 2048 tokens @ 4-bit = 82.2 (+14.7)

4-bit quantization on pruned tokens is feasible; 2-bit usually collapses quality.

Numbers: Table 1: many methods keep performance at 4-bit; 2-bit shows large drops (e.g., LongBench StreamingLLM 16-bit 32.1 -> 2‑

Retrieval-style tasks gain the most from trading precision for token count.

Numbers: RULER-8k: Llama-3 scores rise from 67.5 (512 tokens @ 16-bit) to 82.2 (2048 tokens @ 4-bit).

Quantized pruning is stable across pruning algorithms, quantizers, and model scales.

Numbers: Works with SnapKV/PyramidKV and with KIVI/FlexGen variants; consistent gains on Llama and Mistral (Figures 3 and 4).

Intermediate transformer layers are more sensitive to token-precision reallocation than initial/final layers.

Numbers: Layer-wise swaps to higher precision and fewer tokens cause larger drops when applied to middle layers (Figure 5).
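The layer-wise finding can be pictured as a per-layer allocation plan in which a block of layers is switched back to the fewer-tokens, higher-precision setting while the total budget stays fixed. The sketch below is a hypothetical illustration of such a plan: the layer count, block boundaries, and helper names are assumptions, and it models only the bookkeeping, not the accuracy effect.

```python
def layer_plan(num_layers, swapped, low=(2048, 4), high=(512, 16)):
    """Per-layer (tokens, bits); layers listed in `swapped` use the high-precision setting."""
    return [high if i in swapped else low for i in range(num_layers)]


def plan_bits(plan):
    """Total token-bits across layers, a proxy for the KV-cache budget."""
    return sum(tokens * bits for tokens, bits in plan)


num_layers = 32  # assumed depth
early = layer_plan(num_layers, set(range(0, 8)))
middle = layer_plan(num_layers, set(range(12, 20)))
late = layer_plan(num_layers, set(range(24, 32)))

# Every plan spends the same budget, so any score difference between them
# would come from where the swap is applied, not from how much memory is used.
print(plan_bits(early), plan_bits(middle), plan_bits(late))
```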

Results

RULER-8k score

Value: 2048 tokens @ 4-bit = 82.2

Baseline: 512 tokens @ 16-bit = 67.5

LongBench score

Value: 2048 tokens @ 4-bit = 41.3

Baseline: 512 tokens @ 16-bit = 40.3

Robustness to quantization

Value: 4-bit: minimal drop; 2-bit: large collapse

Baseline: 16-bit
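One intuition for why 2-bit behaves so differently from 4-bit is that uniform quantization error grows quickly as the number of levels shrinks. The sketch below measures reconstruction error on synthetic Gaussian values. It is an assumption-laden toy: it reports numeric error only, says nothing about task accuracy, and does not reproduce the paper's quantizers.

```python
import random


def quant_error(values, bits):
    """Mean absolute error after asymmetric uniform quantization of one group."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / ((1 << bits) - 1) or 1.0
    recon = [round((v - lo) / scale) * scale + lo for v in values]
    return sum(abs(a - b) for a, b in zip(values, recon)) / len(values)


random.seed(0)
group = [random.gauss(0.0, 1.0) for _ in range(1024)]  # synthetic stand-in for KV values

errors = {bits: quant_error(group, bits) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

With only four levels at 2-bit, each value can land far from its nearest code, which is consistent with the collapse reported above.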

Who Should Care

What To Try In 7 Days

Run PyramidKV or SnapKV with KIVI quantization and compare 512@16-bit vs 1024@8-bit vs 2048@4-bit under your budget.

Focus tests on retrieval-style tasks (QA, search) where token coverage matters most.

Avoid 2-bit quantization in production experiments; start at 4-bit and 8-bit for safety and accuracy checks.
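The comparison suggested above can be organized as a small sweep over equal-budget settings. In this sketch, `sweep` and `placeholder_evaluate` are hypothetical names; the placeholder returns nothing and would need to be replaced by a real benchmark run with your chosen pruning and quantization stack.

```python
def sweep(evaluate, base_tokens=512, base_bits=16, candidate_bits=(16, 8, 4)):
    """Evaluate each (tokens, bits) setting that fits the same token-bit budget."""
    budget = base_tokens * base_bits
    results = {}
    for bits in candidate_bits:
        tokens = budget // bits  # lower precision buys proportionally more tokens
        results[(tokens, bits)] = evaluate(tokens, bits)
    return results


def placeholder_evaluate(tokens, bits):
    """Stand-in for a real benchmark call; returns no score."""
    return None


configs = list(sweep(placeholder_evaluate))
print(configs)  # [(512, 16), (1024, 8), (2048, 4)]
```

2-bit is left out of `candidate_bits` by default, in line with the advice to start at 4-bit and 8-bit.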

Optimization Features

Token Efficiency

  • trade precision for token coverage

Infra Optimization

  • reduce KV memory footprint

System Optimization

  • memory budget allocation

Inference Optimization

  • KV Cache Optimization
  • Quantization
  • Token Budgeting
  • Context Compression
  • Layer-wise allocation
  • Group-size tuning
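Group size matters because group-wise quantizers typically store a scale and a zero-point per group, which adds overhead on top of the nominal bit width. The sketch below assumes a 16-bit scale and a 16-bit zero-point per group; both are assumptions, and actual formats vary by quantizer.

```python
def effective_bits(bits, group_size, scale_bits=16, zero_bits=16):
    """Average stored bits per value, including per-group metadata (assumed format)."""
    return bits + (scale_bits + zero_bits) / group_size


for group_size in (32, 64, 128):
    print(group_size, effective_bits(4, group_size))
# 32 5.0
# 64 4.5
# 128 4.25
```

Smaller groups track local value ranges more closely but cost more metadata, so the nominal "4× tokens at 4-bit" figure is an upper bound once this overhead is counted.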

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Paper only explores token and precision dimensions; head and layer compression combinations remain open.
  • Current implementation has dequantization overhead that blocks full runtime speedups.
  • Very low-bit (2-bit) quantization often causes large accuracy drops.
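The dequantization overhead mentioned above comes partly from unpacking: sub-byte codes are stored several to a byte and must be extracted before use. The following is a plain-Python sketch of 4-bit packing and unpacking with hypothetical helper names; production kernels do this on the GPU, usually fused with the attention computation.

```python
def pack_int4(codes):
    """Pack 4-bit codes (0..15) two per byte, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    return bytes((codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
                 for i in range(0, len(codes), 2))


def unpack_int4(packed, count):
    """Recover the first `count` 4-bit codes from packed bytes."""
    out = []
    for byte in packed:
        out.append(byte & 0xF)  # low nibble
        out.append(byte >> 4)   # high nibble
    return out[:count]


codes = [0, 15, 7, 8, 3]
packed = pack_int4(codes)
print(len(packed), unpack_int4(packed, len(codes)))  # 3 [0, 15, 7, 8, 3]
```

Every read of the cache pays this extra unpack-and-rescale step, which is why memory savings do not automatically become latency savings.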

When Not To Use

  • When you require extreme numeric fidelity per token (sensitive generation tasks).
  • When dequantization latency would dominate end-to-end throughput and cannot be optimized.
  • When the memory target can only be met with 2-bit quantization; quality often collapses at that precision.

Failure Modes

  • Aggressive 2-bit quantization causes drastic performance collapse.
  • Head-level pruning methods that are incompatible with the chosen quantizer may degrade more under low precision.
  • Inefficient dequantization on CPU/GPU can keep memory savings from translating into speed gains.

Core Entities

Models

  • Llama-3-8B-Instruct
  • Mistral-7B-Instruct-v0.2
  • Llama3-70B
  • Llama3.2-3B
  • Llama3.2-1B

Metrics

  • LongBench score
  • RULER score
  • NIAH score

Datasets

  • LongBench
  • Needle-in-a-Haystack
  • RULER

Benchmarks

  • LongBench
  • RULER
  • Needle-in-a-Haystack