Compress KV cache by keeping semantic chunks (not single tokens) to save memory and speed up long-context LLMs

February 1, 2025 · 8 min read

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

1

Authors

Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Yue Liu, Bo Li, Xuming Hu, Xiaowen Chu

Links

Abstract / PDF

Why It Matters For Business

ChunkKV lowers GPU memory and speeds long-context LLM serving by keeping semantically coherent chunks and reusing indices across layers; this reduces infrastructure cost and improves latency-sensitive applications.

Summary TLDR

ChunkKV groups consecutive tokens into semantic chunks and evicts or keeps entire chunks from the KV cache. This preserves meaning better than token-level pruning, reduces GPU memory footprint, and speeds inference. Across LongBench, NIAH, GSM8K and JailbreakV, ChunkKV yields notably smaller accuracy drops at aggressive compression, enables a training-free layer-wise index reuse trick, and cuts latency by up to 20.7% and boosts throughput by up to 26.5% versus a full KV cache baseline.
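The chunk-level keep/evict decision described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes per-token importance scores are already available (for example, attention mass from the most recent tokens), and it assumes a chunk's score is simply the sum of its token scores. The function name `select_kept_tokens` and its parameters are hypothetical.

```python
from typing import List


def select_kept_tokens(token_scores: List[float], chunk_size: int,
                       keep_chunks: int, window: int) -> List[int]:
    """Return the sorted token positions to keep in the KV cache."""
    n = len(token_scores)
    prefix_len = max(n - window, 0)  # tokens that compete for eviction

    # Split the prefix into consecutive fixed-size chunks (the last may be shorter).
    chunks = [range(start, min(start + chunk_size, prefix_len))
              for start in range(0, prefix_len, chunk_size)]

    # Assumption: a chunk's score is the sum of its token scores.
    chunk_scores = [sum(token_scores[i] for i in chunk) for chunk in chunks]

    # Pick the top-k chunks, then restore positional order.
    ranked = sorted(range(len(chunks)), key=lambda j: chunk_scores[j], reverse=True)
    top = sorted(ranked[:keep_chunks])

    kept = [i for j in top for i in chunks[j]]
    kept.extend(range(prefix_len, n))  # always keep the most recent `window` tokens
    return kept


scores = [0.1, 0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.9, 0.0, 0.0]
print(select_kept_tokens(scores, chunk_size=2, keep_chunks=2, window=2))
# → [2, 3, 6, 7, 8, 9]
```

Note that tokens 2 and 3 survive together even though a token-level top-k could have kept only one of them; that is the property the summary attributes to chunk-level selection.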

Problem Statement

KV caches consume a large share of GPU memory for long prompts, and token-level eviction breaks up semantic units, leaving fragmented context and lower accuracy. The paper asks: can we compress KV caches while preserving linguistic meaning, so that accuracy stays high under aggressive compression?

Main Contribution

ChunkKV: treat consecutive tokens as semantic chunks and compress by selecting whole chunks instead of isolated tokens

Layer-wise index reuse: reuse preserved indices across nearby transformer layers to cut compression overhead without retraining

Empirical study: tests on LongBench, Needle-In-A-HayStack (NIAH), GSM8K and JailbreakV across multiple open models, and ablations for chunk size and reuse depth
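Layer-wise index reuse can be pictured as grouping layers and running selection only once per group. The sketch below is one plausible reading of the idea, assuming the first layer in each group of `reuse_depth` layers computes the indices and the remaining layers in that group share them; `plan_layer_indices` and `toy_select` are hypothetical names, and the placeholder selector stands in for real chunk scoring.

```python
from typing import Callable, Dict, List


def plan_layer_indices(num_layers: int, reuse_depth: int,
                       select_fn: Callable[[int], List[int]]) -> Dict[int, List[int]]:
    """Run selection on the first layer of each group and reuse it for the rest."""
    if reuse_depth < 1:
        raise ValueError("reuse_depth must be >= 1")
    kept: Dict[int, List[int]] = {}
    current: List[int] = []
    for layer in range(num_layers):
        if layer % reuse_depth == 0:
            current = select_fn(layer)  # fresh selection for this group
        kept[layer] = current           # later layers in the group share it
    return kept


calls: List[int] = []


def toy_select(layer: int) -> List[int]:
    calls.append(layer)          # record which layers paid the selection cost
    return [layer, layer + 1]    # placeholder indices, not real scores


plan = plan_layer_indices(num_layers=6, reuse_depth=2, select_fn=toy_select)
print(calls)  # → [0, 2, 4]
```

With a reuse depth of 2, selection runs on half of the layers, which is where the reduced compression overhead would come from under this reading.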

Key Findings

Chunk-level compression preserves semantics and reduces accuracy loss versus token-level methods.

Numbers: up to +8.7% precision at the same compression ratio (paper abstract)

ChunkKV increases similarity of preserved indices across adjacent layers.

Numbers: Jaccard similarity of 57.74% for ChunkKV vs 27.95% for SnapKV (LLaMA-3-8B, Table 2)

Layer-wise index reuse cuts inference overhead while keeping accuracy nearly intact.

Numbers: latency −20.7% and throughput +26.5% vs FullKV; reuse cost down ~20% (Table 8, Sec. 4.3)

Chunk size 10 is a robust default across tasks and models.

Numbers: best or stable performance for chunk sizes 5–20; chosen default = 10 (Table 10, Sec. 4.4)

ChunkKV gives strong latency/throughput and overall inference time vs KV quantization baselines.

Numbers: total generation time 164.66s vs 226.52s for KIVI 2-bit (−27.3% overall time), with faster TTFT and TPOT (Table 11)
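The index-similarity finding relies on Jaccard similarity between the sets of kept positions in adjacent layers. A minimal sketch of that measurement follows, assuming each layer's kept indices are available as a list of integers; the averaging over adjacent pairs is an assumption about how a single per-model figure might be produced, and the toy data is illustrative only.

```python
from typing import Iterable, Sequence


def jaccard(a: Iterable[int], b: Iterable[int]) -> float:
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def mean_adjacent_jaccard(per_layer: Sequence[Sequence[int]]) -> float:
    """Average Jaccard similarity over consecutive layer pairs."""
    if len(per_layer) < 2:
        raise ValueError("need at least two layers")
    pairs = list(zip(per_layer, per_layer[1:]))
    return sum(jaccard(x, y) for x, y in pairs) / len(pairs)


# Toy kept-index sets for three layers (not real model output).
layers = [[0, 1, 2, 3], [2, 3, 4, 5], [2, 3, 4, 5]]
print(mean_adjacent_jaccard(layers))
```

The higher this value, the less is lost when a later layer borrows an earlier layer's indices, which is why the similarity result supports index reuse.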

Results

throughput (tokens/s)

Value: +26.5% (max) with ChunkKV_reuse vs FullKV

Baseline: FullKV

latency

Value: −20.7% (max) with ChunkKV_reuse vs FullKV

Baseline: FullKV

Accuracy

Value: up to +8.7% (precision) vs prior methods at the same compression ratio

Baseline: state-of-the-art token-level compression methods

index similarity (Jaccard) between adjacent layers

Value: 57.74% (ChunkKV) vs 27.95% (SnapKV) on LLaMA-3-8B

Baseline: SnapKV

total generation time (end-to-end)

Value: 164.66s (ChunkKV) vs 226.52s (KIVI 2-bit) => −27.3%

Baseline: KIVI 2-bit quantization
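The percentage figures in these results are relative changes against the stated baseline. As a quick check of the end-to-end number, the short sketch below recomputes it from the two reported times; the helper name `relative_change_pct` is hypothetical.

```python
def relative_change_pct(value: float, baseline: float) -> float:
    """Signed percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


chunkkv_s = 164.66     # total generation time reported for ChunkKV
kivi_2bit_s = 226.52   # total generation time reported for KIVI 2-bit

print(round(relative_change_pct(chunkkv_s, kivi_2bit_s), 1))  # → -27.3
```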

Who Should Care

What To Try In 7 Days

Implement chunk eviction with chunk size = 10 and an observe window w ∈ {4,8,16,32}

Enable layer-wise index reuse with reuse depth = 2 and measure latency/throughput versus FullKV

Run a quick A/B test on a critical long-context task (e.g., document QA) to compare accuracy against FullKV at your target compression ratio
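One way to organize the week is to enumerate the runs up front. The sketch below builds a small experiment grid from the settings listed above; the dictionary keys, the `build_runs` helper, and the convention that a reuse depth of 1 means "no reuse" are all assumptions made for illustration, not part of any released tooling.

```python
from itertools import product
from typing import Dict, List

CHUNK_SIZE = 10                   # default suggested by the ablations
OBSERVE_WINDOWS = [4, 8, 16, 32]  # candidate values of w
REUSE_DEPTHS = [1, 2]             # assumption: depth 1 means no index reuse


def build_runs(compression_ratio: float) -> List[Dict[str, object]]:
    """Return one FullKV baseline run plus every ChunkKV configuration."""
    runs: List[Dict[str, object]] = [{"method": "FullKV"}]
    for window, depth in product(OBSERVE_WINDOWS, REUSE_DEPTHS):
        runs.append({
            "method": "ChunkKV",
            "chunk_size": CHUNK_SIZE,
            "observe_window": window,
            "reuse_depth": depth,
            "compression_ratio": compression_ratio,
        })
    return runs


for run in build_runs(compression_ratio=0.10):
    print(run)
```

Keeping the uncompressed baseline in the same list makes it harder to forget to measure it under identical hardware and batch settings.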

Agent Features

Memory

  • KV cache eviction (chunk-level)

Tool Use

  • layer-wise index reuse

Optimization Features

Token Efficiency

  • keeps recent w tokens plus top-k chunks

Infra Optimization

  • reduces GPU memory footprint by compressing KV cache to target ratios (e.g., 10%)
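To get a feel for what a 10% target ratio means in bytes, a back-of-the-envelope estimate helps. The sketch below uses the standard KV cache size formula (keys plus values, per layer, per KV head). The architecture numbers are assumptions taken from the publicly documented LLaMA-3-8B configuration (32 layers, 8 KV heads, head dimension 128) with 16-bit elements; the estimate ignores the retained observe window and any allocator overhead.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2, batch: int = 1) -> int:
    """Size of keys + values across all layers for `seq_len` cached tokens."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem * batch


# Assumed LLaMA-3-8B-style configuration, fp16/bf16 cache, 32K-token prompt.
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768)
compressed = int(full * 0.10)  # target ratio of 10%

gib = 1024 ** 3
print(f"full: {full / gib:.2f} GiB, at 10%: {compressed / gib:.2f} GiB")
```

Under these assumptions the full cache for a single 32K-token sequence is 4 GiB, so the saving scales directly with batch size and context length.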

System Optimization

  • CUDA kernels and memory-aware selection
  • FlashAttention-2 for inference

Inference Optimization

  • KV cache eviction by semantic chunks
  • layer-wise index reuse to reduce compression overhead
  • vectorized chunk scoring and single-pass masking

Reproducibility

Data URLs

  • LongBench
  • Needle-In-A-HayStack (NIAH)
  • GSM8K
  • JailbreakV

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • ChunkKV may lose fine-grained token fidelity needed in legal or biomedical text where every token matters (paper I Limitations).
  • Fixed-size chunks are a simple heuristic; adaptive boundaries might help but add runtime cost.
  • Index reuse depth can hurt performance for some models or math tasks if used too aggressively (see reuse ablations).

When Not To Use

  • When exact token-level fidelity is needed (legal/medical verbatim extraction).
  • When the model is extremely sensitive to small context perturbations and cannot tolerate evictions.
  • When you cannot modify the prefilling pipeline or need to keep a full KV cache for downstream tooling.

Failure Modes

  • Overly large chunks are too coarse to retain task-relevant fine detail, which reduces accuracy (chunk size 30 showed drops).
  • Reusing indices across too many layers can sharply degrade math reasoning accuracy on some models (see reuse ablation).
  • Domain-specific short facts spread across tokens may be split across chunks and partially lost if chunk boundaries misalign.

Core Entities

Models

  • LLaMA-3-8B-Instruct
  • LLaMA-3.1-8B-Instruct
  • Mistral-7B-Instruct
  • Qwen2-7B-Instruct
  • DeepSeek-R1-Distill-Llama-8B

Metrics

  • throughput (tokens/s)
  • latency (s)
  • Accuracy
  • Jaccard similarity (%)
  • Time to First Token (TTFT, s)
  • Time Per Output Token (TPOT, ms/token)
  • total generation time (s)
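The serving metrics above can all be derived from per-token timestamps. The sketch below assumes you can record a request start time and the wall-clock time at which each output token was emitted; it uses a common definition of TPOT (mean gap between output tokens after the first), which may differ in detail from how the paper measured it. The function name `latency_metrics` is hypothetical.

```python
from typing import Dict, List


def latency_metrics(request_start: float, token_times: List[float]) -> Dict[str, float]:
    """Compute TTFT, TPOT, total time and throughput from token timestamps (seconds)."""
    if not token_times:
        raise ValueError("need at least one output token")
    n = len(token_times)
    ttft_s = token_times[0] - request_start
    total_s = token_times[-1] - request_start
    # Mean inter-token gap after the first token, in milliseconds.
    tpot_ms = (token_times[-1] - token_times[0]) / (n - 1) * 1000.0 if n > 1 else 0.0
    return {
        "ttft_s": ttft_s,
        "tpot_ms": tpot_ms,
        "total_s": total_s,
        "throughput_tok_per_s": n / total_s,
    }


# Toy timestamps: first token after 0.5 s, then one token every 0.1 s.
print(latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8, 0.9]))
```

Reporting TTFT and TPOT separately is useful here because cache compression mainly adds work at prefill time, while a smaller cache mainly helps during decoding.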

Datasets

  • LongBench
  • Needle-In-A-HayStack (NIAH)
  • GSM8K
  • JailbreakV

Benchmarks

  • LongBench
  • NIAH
  • GSM8K
  • JailbreakV