Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
1
Why It Matters For Business
ChunkKV lowers GPU memory and speeds long-context LLM serving by keeping semantically coherent chunks and reusing indices across layers; this reduces infrastructure cost and improves latency-sensitive applications.
Summary TLDR
ChunkKV groups consecutive tokens into semantic chunks and evicts or keeps entire chunks from the KV cache. This preserves meaning better than token-level pruning, reduces GPU memory footprint, and speeds inference. Across LongBench, NIAH, GSM8K and JailbreakV, ChunkKV yields notably smaller accuracy drops at aggressive compression, enables a training-free layer-wise index reuse trick, and cuts latency by up to 20.7% and boosts throughput by up to 26.5% versus a full KV cache baseline.
Problem Statement
KV caches consume a large share of GPU memory for long prompts, and token-level eviction breaks semantic units, which fragments the context and lowers accuracy. The paper asks: can KV caches be compressed while preserving linguistic meaning, so that accuracy stays high under aggressive compression?
Main Contribution
ChunkKV: treat consecutive tokens as semantic chunks and compress by selecting whole chunks instead of isolated tokens
Layer-wise index reuse: reuse preserved indices across nearby transformer layers to cut compression overhead without retraining
Empirical study: tests on LongBench, Needle-In-A-HayStack (NIAH), GSM8K and JailbreakV across multiple open models, and ablations for chunk size and reuse depth
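The chunk-selection idea in the first contribution can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `select_chunks`, the `keep_ratio` parameter, and the use of a plain list of per-token importance scores are all assumptions made here. In practice the scores would come from attention weights, which this summary does not specify in detail.

```python
def select_chunks(token_scores, chunk_size=10, keep_ratio=0.1, window=8):
    """Return sorted token indices to keep in the KV cache.

    token_scores: one importance score per prompt token (hypothetical input,
    e.g. attention mass received from the last `window` tokens).
    """
    n = len(token_scores)
    window = min(window, n)
    prefix_len = n - window  # tokens that are eligible for eviction
    # Token budget left for chunks once the recent window is reserved.
    budget = max(int(n * keep_ratio) - window, 0)

    # Score each fixed-size chunk by the sum of its token scores.
    chunks = []
    for start in range(0, prefix_len, chunk_size):
        end = min(start + chunk_size, prefix_len)
        chunks.append((sum(token_scores[start:end]), start, end))

    # Take chunks in descending score order and stop at the first chunk
    # that would exceed the budget.
    kept = []
    for _, start, end in sorted(chunks, key=lambda c: c[0], reverse=True):
        if len(kept) + (end - start) > budget:
            break
        kept.extend(range(start, end))

    # The most recent `window` tokens are always kept.
    kept.extend(range(prefix_len, n))
    return sorted(kept)
```

Because whole chunks are kept or dropped, the surviving indices form contiguous runs rather than isolated tokens, which is the property the paper credits for preserving meaning.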
Key Findings
Chunk-level compression preserves semantics and reduces accuracy loss versus token-level methods.
ChunkKV increases similarity of preserved indices across adjacent layers.
Layer-wise index reuse cuts inference overhead while keeping accuracy nearly intact.
Chunk size 10 is a robust default across tasks and models.
ChunkKV improves latency, throughput, and overall inference time relative to KV quantization baselines.
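The index-similarity finding is measured with Jaccard similarity between the sets of preserved indices in adjacent layers. A small sketch of that measurement is shown below; the function names and the list-of-lists input format are assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two index collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def adjacent_layer_similarity(kept_per_layer):
    """Similarity between each layer's kept indices and the next layer's."""
    return [jaccard(x, y) for x, y in zip(kept_per_layer, kept_per_layer[1:])]
```

Values close to 1.0 between neighbouring layers suggest that one layer's indices can stand in for the next, which is the motivation for layer-wise index reuse.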
Results
Metrics reported: throughput (tokens/s), latency, accuracy, index similarity (Jaccard) between adjacent layers, and total generation time (end-to-end).
Who Should Care
What To Try In 7 Days
Implement chunk eviction with chunk size = 10 and an observation window w ∈ {4,8,16,32}
Enable layer-wise index reuse with reuse depth = 2 and measure latency/throughput versus FullKV
Run a quick A/B test on a critical long-context task (e.g., document QA) to compare accuracy at your target compression ratio
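For the reuse step, one plausible reading of "reuse depth = 2" is that indices are computed on every second layer and copied to the layer that follows. The sketch below assumes that reading; `select_for_layer` is a hypothetical callback standing in for whatever chunk selection you run per layer.

```python
def indices_with_reuse(num_layers, select_for_layer, reuse_depth=2):
    """Return the kept-index list for every layer.

    Selection runs only on layers 0, reuse_depth, 2 * reuse_depth, ...;
    the layers in between reuse the most recently computed indices.
    """
    if reuse_depth < 1:
        raise ValueError("reuse_depth must be at least 1")
    kept_per_layer = []
    current = None
    for layer in range(num_layers):
        if layer % reuse_depth == 0:
            current = select_for_layer(layer)  # the expensive scoring step
        kept_per_layer.append(current)
    return kept_per_layer
```

With `reuse_depth=1` every layer computes its own indices, which gives a convenient baseline when measuring how much accuracy the reuse costs.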
Agent Features
Memory
- KV cache eviction (chunk-level)
Tool Use
- layer-wise index reuse
Optimization Features
Token Efficiency
- keeps recent w tokens plus top-k chunks
Infra Optimization
- reduces GPU memory footprint by compressing KV cache to target ratios (e.g., 10%)
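To see what a 10% target ratio means in bytes, the standard KV cache size formula can be evaluated directly. The model shape used in the usage note (32 layers, 8 KV heads, head dimension 128, 16-bit values) is an assumption chosen to resemble an 8B-parameter Llama-style model, not a figure taken from this summary.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV cache size in bytes.

    The leading factor of 2 accounts for storing one key tensor and one
    value tensor per layer.
    """
    return (2 * batch_size * num_layers * num_kv_heads
            * head_dim * seq_len * bytes_per_value)


def compressed_kv_cache_bytes(seq_len, keep_ratio, **shape):
    """Size after keeping only int(seq_len * keep_ratio) tokens per layer."""
    return kv_cache_bytes(int(seq_len * keep_ratio), **shape)
```

Under the assumed shape, `kv_cache_bytes(32768, 32, 8, 128)` comes to 4 GiB for a single 32K-token sequence, and keeping 10% of the tokens brings that to roughly 0.4 GiB.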
System Optimization
- CUDA kernels and memory-aware selection
- FlashAttention-2 for inference
Inference Optimization
- KV cache eviction by semantic chunks
- layer-wise index reuse to reduce compression overhead
- vectorized chunk scoring and single-pass masking
Reproducibility
Data Urls
- LongBench
- Needle-In-A-HayStack (NIAH)
- GSM8K
- JailbreakV
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- ChunkKV may lose fine-grained token fidelity needed in legal or biomedical text where every token matters (see Limitations, section I of the paper).
- Fixed-size chunks are a simple heuristic; adaptive boundaries might help but add runtime cost.
- Index reuse depth can hurt performance for some models or math tasks if used too aggressively (see reuse ablations).
When Not To Use
- When exact token-level fidelity is needed (legal/medical verbatim extraction).
- When the model is extremely sensitive to small context perturbations and cannot tolerate evictions.
- When you cannot modify the prefilling pipeline or need to keep a full KV cache for downstream tooling.
Failure Modes
- Overly large chunks are too coarse to isolate task-relevant fine detail, which reduces accuracy (chunk size 30 showed drops).
- Reusing indices across too many layers can sharply degrade math reasoning accuracy on some models (see reuse ablation).
- Domain-specific short facts spread across tokens may be split across chunks and partially lost if chunk boundaries misalign.
Core Entities
Models
- LLaMA-3-8B-Instruct
- LLaMA-3.1-8B-Instruct
- Mistral-7B-Instruct
- Qwen2-7B-Instruct
- DeepSeek-R1-Distill-Llama-8B
Metrics
- throughput (tokens/s)
- latency (s)
- Accuracy
- Jaccard similarity (%)
- Time to First Token (TTFT, s)
- Time Per Output Token (TPOT, ms/token)
- total generation time (s)
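The timing metrics above can be derived from per-token timestamps. The helper below is a sketch under one common convention, in which TPOT averages the gaps between output tokens after the first; the function name and the input format are assumptions, and the paper's exact measurement setup may differ.

```python
def latency_metrics(request_start, token_times):
    """Compute TTFT, TPOT, total time, and throughput for one request.

    request_start: time the request was issued, in seconds.
    token_times: time at which each output token was emitted, in seconds.
    """
    if not token_times:
        raise ValueError("need at least one output token")
    n = len(token_times)
    ttft = token_times[0] - request_start
    total = token_times[-1] - request_start
    # Average gap between consecutive output tokens, in milliseconds.
    if n > 1:
        tpot_ms = 1000.0 * (token_times[-1] - token_times[0]) / (n - 1)
    else:
        tpot_ms = 0.0
    throughput = n / total if total > 0 else float("inf")
    return {
        "ttft_s": ttft,
        "tpot_ms": tpot_ms,
        "total_s": total,
        "throughput_tok_per_s": throughput,
    }
```

Timestamps from `time.perf_counter()` work well as inputs, since only differences between them are used.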
Datasets
- LongBench
- Needle-In-A-HayStack (NIAH)
- GSM8K
- JailbreakV
Benchmarks
- LongBench
- NIAH
- GSM8K
- JailbreakV

