Compress KV cache by keeping semantic chunks (not single tokens) to save memory and speed up long-context LLMs

February 1, 2025 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The method is a practical, training-free eviction strategy with clear implementation steps and consistent gains across public benchmarks; gains are strongest for retrieval and long-context QA and require tuning of chunk size and reuse depth for best trade-offs.

Citations: 1

Evidence Strength: 0.78

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Yue Liu, Bo Li, Xuming Hu, Xiaowen Chu

Links

Abstract / PDF / Data

Why It Matters For Business

ChunkKV lowers GPU memory and speeds long-context LLM serving by keeping semantically coherent chunks and reusing indices across layers; this reduces infrastructure cost and improves latency-sensitive applications.

Who Should Care

Summary TLDR

ChunkKV groups consecutive tokens into semantic chunks and evicts or keeps entire chunks from the KV cache. This preserves meaning better than token-level pruning, reduces GPU memory footprint, and speeds inference. Across LongBench, NIAH, GSM8K and JailbreakV, ChunkKV yields notably smaller accuracy drops at aggressive compression, enables a training-free layer-wise index reuse trick, and cuts latency by up to 20.7% and boosts throughput by up to 26.5% versus a full KV cache baseline.

Problem Statement

KV caches consume a large share of GPU memory for long prompts, and token-level eviction breaks semantic units, causing fragmented context and lower accuracy. The paper asks whether KV caches can be compressed while preserving linguistic meaning, so that accuracy stays high under aggressive compression.

Main Contribution

ChunkKV: treat consecutive tokens as semantic chunks and compress by selecting whole chunks instead of isolated tokens

Layer-wise index reuse: reuse preserved indices across nearby transformer layers to cut compression overhead without retraining
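
The chunk-level selection described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_chunks`, the per-token `scores` input (for example, attention mass received from the most recent tokens), and the choice to score a chunk by summing its token scores are all hypothetical.

```python
from typing import List


def select_chunks(scores: List[float], chunk_size: int, window: int, budget: int) -> List[int]:
    """Return the sorted token indices to keep in the KV cache.

    scores     -- hypothetical per-token importance scores, one per cached token
    chunk_size -- number of consecutive tokens treated as one chunk
    window     -- number of most recent tokens that are always kept
    budget     -- target number of tokens to keep in total
    """
    n = len(scores)
    if budget >= n:
        return list(range(n))  # nothing needs to be evicted

    window = min(window, n)
    prefix_len = n - window                  # tokens eligible for eviction
    recent = list(range(prefix_len, n))      # always-kept recent tokens

    # How many whole chunks fit in the budget left after the recent window.
    k = max(budget - window, 0) // chunk_size

    # Split the prefix into fixed-size chunks (the last one may be shorter).
    chunks = [(s, min(s + chunk_size, prefix_len)) for s in range(0, prefix_len, chunk_size)]

    # Rank chunks by summed token score (assumed aggregation) and keep the top k.
    ranked = sorted(chunks, key=lambda c: sum(scores[c[0]:c[1]]), reverse=True)
    kept = [i for start, end in ranked[:k] for i in range(start, end)]
    return sorted(kept + recent)


if __name__ == "__main__":
    toy_scores = [0.1, 0.1, 0.1, 0.9, 0.8, 0.7, 0.0, 0.0, 0.0, 0.5, 0.2, 0.2]
    print(select_chunks(toy_scores, chunk_size=3, window=2, budget=8))
    # [3, 4, 5, 9, 10, 11]
```

Because whole chunks are kept or dropped, the kept set can come in slightly under the budget when the last selected chunk is short, as in the toy run above.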

Key Findings

Chunk-level compression preserves semantics and reduces accuracy loss versus token-level methods.

Numbers: up to +8.7% precision at the same compression ratio (paper abstract)

Practical Use: If you must compress to a small KV cache, prefer chunk-based selection to keep task accuracy higher.

Evidence Ref: Abstract; Section 4 (GSM8K, LongBench comparisons)

ChunkKV increases similarity of preserved indices across adjacent layers.

Numbers: Jaccard similarity: ChunkKV 57.74% vs SnapKV 27.95% (LLaMA-3-8B, Table 2)

Practical Use: High cross-layer index similarity lets you reuse indices across layers cheaply, saving compute time.

Evidence Ref: Table 2; Section 3.3
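
To make the Jaccard figure concrete, the sketch below computes the similarity between the sets of preserved token indices of two layers, and averages it over adjacent layer pairs. The helper names and the averaging over adjacent pairs are assumptions for illustration; the paper's exact aggregation may differ.

```python
from typing import List, Sequence


def jaccard(a: Sequence[int], b: Sequence[int]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two index collections."""
    sa, sb = set(a), set(b)
    union = sa | sb
    # Two empty sets are treated as identical.
    return len(sa & sb) / len(union) if union else 1.0


def mean_adjacent_jaccard(per_layer: List[Sequence[int]]) -> float:
    """Average Jaccard similarity between each layer and the next one."""
    if len(per_layer) < 2:
        raise ValueError("need at least two layers")
    pairs = list(zip(per_layer, per_layer[1:]))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy preserved indices for three layers (made-up values).
    layers = [[0, 1, 2, 3], [0, 1, 2, 3], [2, 3, 4, 5]]
    print(round(mean_adjacent_jaccard(layers), 4))
```

A value near 1.0 means adjacent layers keep nearly the same tokens, which is the condition that makes index reuse cheap.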

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
throughput (tokens/s) | +26.5% (max) with ChunkKV_reuse vs FullKV | FullKV | +26.5% | Table 8 (8192 input, 1024 output) | ChunkKV_reuse throughput improvement up to 26.5% over FullKV (Table 8) | Table 8
latency | −20.7% (max) with ChunkKV_reuse vs FullKV | FullKV | −20.7% | Table 8 (8192 input, 1024 output) | ChunkKV_reuse latency reduced by up to 20.7% (Table 8) | Table 8
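
As a quick sanity check on how the two reported deltas relate, the sketch below converts a latency reduction into the throughput gain it would imply if throughput were exactly the inverse of latency. That inverse relationship is an assumption made here for illustration, and the helper name is hypothetical. The paper's two maxima need not come from the same run.

```python
def implied_throughput_gain(latency_cut_pct: float) -> float:
    """Throughput gain (%) implied by a latency cut (%), assuming throughput = 1 / latency."""
    if not 0 <= latency_cut_pct < 100:
        raise ValueError("latency cut must be in [0, 100)")
    remaining = 1.0 - latency_cut_pct / 100.0
    return (1.0 / remaining - 1.0) * 100.0


if __name__ == "__main__":
    print(round(implied_throughput_gain(20.7), 1))  # 26.1
```

Under that assumption a 20.7% latency cut implies roughly a 26.1% throughput gain, which is close to the reported 26.5% maximum.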

What To Try In 7 Days

Implement chunk eviction with chunk size = 10 and an observe window w ∈ {4,8,16,32}

Enable layer-wise index reuse with reuse depth = 2 and measure latency/throughput versus FullKV

Run a quick A/B test on a critical long-context task (e.g., document QA) to compare accuracy at your target compression ratio
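
Layer-wise index reuse with a given reuse depth can be sketched as follows. This is an illustrative scheme under assumptions: the function name `indices_with_reuse`, the `select_fn` callback, and the rule "recompute on every `reuse_depth`-th layer, otherwise reuse the previous result" are hypothetical simplifications rather than the paper's exact procedure.

```python
from typing import Callable, List


def indices_with_reuse(
    num_layers: int,
    reuse_depth: int,
    select_fn: Callable[[int], List[int]],
) -> List[List[int]]:
    """Return the kept token indices for every layer.

    select_fn(layer) is a hypothetical callback that runs the (expensive)
    chunk selection for one layer. It is only called on layers
    0, reuse_depth, 2 * reuse_depth, ...; the layers in between reuse
    the most recently computed indices.
    """
    if reuse_depth < 1:
        raise ValueError("reuse_depth must be >= 1")
    per_layer: List[List[int]] = []
    cached: List[int] = []
    for layer in range(num_layers):
        if layer % reuse_depth == 0:
            cached = select_fn(layer)   # recompute on group boundaries
        per_layer.append(cached)        # otherwise reuse the cached indices
    return per_layer


if __name__ == "__main__":
    calls = []

    def toy_select(layer: int) -> List[int]:
        calls.append(layer)
        return [layer, layer + 1]

    kept = indices_with_reuse(num_layers=6, reuse_depth=2, select_fn=toy_select)
    print(calls)  # [0, 2, 4]
    print(kept)
```

With reuse depth 2 the selection runs on half of the layers, which is where the overhead saving comes from. Setting `reuse_depth=1` disables reuse and gives a baseline for the latency comparison.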

Agent Features

Memory
KV cache eviction (chunk-level)
Tool Use
layer-wise index reuse

Optimization Features

Token Efficiency
keeps recent w tokens plus top-k chunks
Infra Optimization
reduces GPU memory footprint by compressing KV cache to target ratios (e.g., 10%)
System Optimization
CUDA kernels and memory-aware selection
FlashAttention-2 for inference
Inference Optimization
KV cache eviction by semantic chunks
layer-wise index reuse to reduce compression overhead
vectorized chunk scoring and single-pass masking
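
To get a feel for what a 10% target ratio means in memory terms, the following back-of-the-envelope sketch estimates KV cache size per sequence. The default shape values (32 layers, 8 KV heads, head dimension 128, 2 bytes per element for fp16/bf16) are assumptions chosen to resemble an 8B LLaMA-3-style model with grouped-query attention; substitute the values from your own model config.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int = 32,      # assumed model depth
    num_kv_heads: int = 8,     # assumed number of key/value heads (GQA)
    head_dim: int = 128,       # assumed per-head dimension
    bytes_per_elem: int = 2,   # fp16 / bf16
) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2 accounts for storing both keys and values.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * num_tokens


if __name__ == "__main__":
    full_tokens = 8192
    ratio = 0.10                              # target compression ratio
    kept_tokens = int(full_tokens * ratio)    # tokens remaining after eviction

    gib = 2 ** 30
    print(f"full cache:       {kv_cache_bytes(full_tokens) / gib:.3f} GiB")
    print(f"compressed cache: {kv_cache_bytes(kept_tokens) / gib:.3f} GiB")
```

Under these assumptions a full 8192-token cache takes 1 GiB per sequence, so the savings scale directly with batch size and context length.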

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

LongBench
Needle-In-A-HayStack (NIAH)
GSM8K
JailbreakV

Risks & Boundaries

Limitations

ChunkKV may lose fine-grained token fidelity needed in legal or biomedical text where every token matters (paper I Limitations).

Fixed-size chunks are a simple heuristic; adaptive boundaries might help but add runtime cost.

When Not To Use

When exact token-level fidelity is needed (legal/medical verbatim extraction).

When the model is extremely sensitive to small context perturbations and cannot tolerate evictions.

Failure Modes

Overly large chunk size fragments task-relevant fine detail and reduces accuracy (chunk size 30 showed drops).

Reusing indices across too many layers can sharply degrade math reasoning accuracy on some models (see reuse ablation).

Core Entities

Models

LLaMA-3-8B-Instruct
LLaMA-3.1-8B-Instruct
Mistral-7B-Instruct
Qwen2-7B-Instruct
DeepSeek-R1-Distill-Llama-8B

Metrics

throughput (tokens/s)
latency (s)
Accuracy
Jaccard similarity (%)
Time to First Token (TTFT, s)
Token Processing Time (TPOT, ms/token)
total generation time (s)

Datasets

LongBench
Needle-In-A-HayStack (NIAH)
GSM8K
JailbreakV

Benchmarks

LongBench
NIAH
GSM8K
JailbreakV