Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
Measuring per-layer, data-dependent compressibility shows where KV-cache compression can be applied safely, which in turn helps cut memory bandwidth and GPU costs.
Summary TLDR
KV-CoRE uses an incremental SVD on cached key/value activations to measure how far a language model's KV-cache can be reduced without large quality loss. It introduces a simple compressibility metric, Normalized Effective Rank (NER), and a robustness metric, Normalized Delta-Perplexity (ND-PPL). Across multiple open models and datasets, keys are more compressible than values, and compressibility varies by layer and language. NER correlates with perplexity and human-aligned GPT-scores, making it a practical diagnostic for per-layer, data-aware KV-cache compression.
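A minimal sketch of how an effective-rank score of this kind can be computed from singular values. It assumes the common entropy-based effective rank (the exponential of the Shannon entropy of the normalized singular values) divided by the number of singular values. The paper's exact normalization may differ, and the function names are illustrative.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank (assumed definition, see lead-in)."""
    total = sum(singular_values)
    # Zero singular values contribute nothing to the entropy, so skip them.
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def normalized_effective_rank(singular_values):
    """Effective rank divided by the full dimension, so the score lies in (0, 1]."""
    return effective_rank(singular_values) / len(singular_values)


# A flat spectrum uses every direction equally; a peaked one uses a single direction.
flat = normalized_effective_rank([1.0, 1.0, 1.0, 1.0])
peaked = normalized_effective_rank([1.0, 0.0, 0.0, 0.0])
```

Under this assumed definition, a score near 1 means the layer spreads energy across all directions (hard to compress), while a score near `1 / dim` means almost all energy sits in one direction.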
Problem Statement
KV-caches reduce recomputation but grow memory-bandwidth cost as context expands. Existing compression methods often ignore that key/value activations depend on data and vary across layers. We need a dataset-level, layer-wise, data-aware way to measure how much KV caches can be compressed without hurting model quality.
Main Contribution
KV-CoRE: an incremental, dataset-level SVD method that computes optimal low-rank approximations of cached keys/values with low memory overhead.
NER (Normalized Effective Rank): a compact, per-layer metric that predicts compressibility and correlates with downstream performance.
ND-PPL: a dataset-agnostic robustness metric that quantifies end-to-end perplexity change under KV truncation.
A large-scale empirical benchmark across multiple open LLMs, domains, and 15 languages, exposing layer-wise and language-dependent compressibility patterns.
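The first contribution can be sketched as follows, under the assumption that "incremental" means accumulating a Gram (uncentered covariance) matrix batch by batch and eigendecomposing it once at the end. The class name `IncrementalGram` and the synthetic data are hypothetical; this is an illustration, not the authors' code.

```python
import numpy as np


class IncrementalGram:
    """Accumulates X^T X over batches so the full activation matrix is never stored."""

    def __init__(self, dim):
        self.gram = np.zeros((dim, dim), dtype=np.float64)
        self.count = 0

    def update(self, batch):
        # batch: (num_tokens, dim) cached key or value activations for one layer.
        batch = np.asarray(batch, dtype=np.float64)
        self.gram += batch.T @ batch
        self.count += batch.shape[0]

    def svd(self):
        # The eigenvectors of the symmetric Gram matrix are the right singular
        # vectors of the stacked data; its eigenvalues are the squared singular values.
        eigvals, eigvecs = np.linalg.eigh(self.gram)
        order = np.argsort(eigvals)[::-1]  # eigh returns ascending order
        singular_values = np.sqrt(np.clip(eigvals[order], 0.0, None))
        return singular_values, eigvecs[:, order]


# Synthetic stand-in for activations streamed in six batches.
rng = np.random.default_rng(0)
data = rng.standard_normal((600, 16))
acc = IncrementalGram(dim=16)
for chunk in np.array_split(data, 6):
    acc.update(chunk)
singular_values, right_vectors = acc.svd()
```

Memory stays at one `dim x dim` matrix per layer regardless of how many tokens are processed, which is the trade-off behind the covariance-size limitation noted later.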
Key Findings
Keys are substantially more compressible than values across models and datasets.
NER strongly predicts performance degradation from truncation.
Compressibility varies by layer: middle layers often use their rank more fully, while early and late layers are more compressible.
Models with larger KV capacity can under-use it, leaving their caches more compressible.
Low-resource languages often show 'rank collapse' with low NER, especially in values.
Results
NER (keys vs values)
Correlation NER vs ND-PPL
Layer-wise pattern
Model capacity vs compressibility
Who Should Care
What To Try In 7 Days
Run KV-CoRE's incremental SVD on a batch of your inference logs to get per-layer NER.
Rank layers by NER and trial stronger key-side truncation on low-NER layers.
Measure ND-PPL and a small GPT-based quality check to validate compression before deployment.
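The last two steps can be sketched as below. Two things are assumptions rather than details from the paper: the rank-selection heuristic (kept rank proportional to NER, with a floor) and the ND-PPL formula (relative perplexity increase over the uncompressed baseline). The NER values are placeholders, not measured results, and all function names are hypothetical.

```python
import math


def plan_key_ranks(ner_by_layer, full_dim, min_rank=8):
    """Visit lowest-NER layers first; kept rank scales with NER (illustrative heuristic)."""
    plan = {}
    for layer, ner in sorted(ner_by_layer.items(), key=lambda item: item[1]):
        plan[layer] = min(full_dim, max(min_rank, math.ceil(ner * full_dim)))
    return plan


def nd_ppl(ppl_base, ppl_truncated):
    # Assumed definition: relative perplexity increase caused by truncation.
    return (ppl_truncated - ppl_base) / ppl_base


def passes_gate(ppl_base, ppl_truncated, budget=0.05):
    """Accept a compression plan only if the assumed ND-PPL stays within budget."""
    return nd_ppl(ppl_base, ppl_truncated) <= budget


# Placeholder per-layer NER values for a hypothetical four-layer model.
example_ner = {0: 0.20, 1: 0.55, 2: 0.80, 3: 0.35}
plan = plan_key_ranks(example_ner, full_dim=128)
```

The `budget` threshold is a deployment choice; pairing it with a small GPT-based quality check, as the list suggests, guards against cases where perplexity alone misses a regression.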
Optimization Features
Token Efficiency
- reduces KV memory/bandwidth per token by caching low-dim projections
Infra Optimization
- reduces HBM pressure and memory transfer during autoregressive decoding
Model Optimization
- data-dependent low-rank truncation of KV projections
- SVD-based recovery of optimal compression matrices
System Optimization
- bandwidth reduction for long-context decoding
Inference Optimization
- store and transmit down-projected (low-dim) keys during decoding
- per-layer variable compression ratios
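To illustrate storing down-projected keys, the sketch below caches keys in a rank-`r` basis and computes attention logits by projecting the query into the same basis. The synthetic, nearly low-rank keys and all variable names are assumptions for illustration; real activations would come from the model, and the projection would come from a dataset-level SVD rather than from the cached keys themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, rank, tokens = 32, 8, 200

# Synthetic keys lying close to a rank-8 subspace (stand-in for real activations).
basis = rng.standard_normal((rank, dim))
keys = rng.standard_normal((tokens, rank)) @ basis
keys += 0.01 * rng.standard_normal((tokens, dim))

# The top-r right singular vectors define the down-projection.
_, _, vt = np.linalg.svd(keys, full_matrices=False)
proj = vt[:rank].T  # shape (dim, rank)

# What would be cached: (tokens, rank) instead of (tokens, dim).
compressed_keys = keys @ proj

# Attention logits: project the query once, then take dot products in the small space.
query = rng.standard_normal(dim)
exact_scores = keys @ query
approx_scores = compressed_keys @ (proj.T @ query)

max_err = np.max(np.abs(exact_scores - approx_scores))
```

Here the cache shrinks by a factor of `dim / rank`, and the logit error depends only on how much key energy falls outside the retained subspace.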
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires access to representative tokens to compute dataset-level SVD; results depend on that dataset.
- Covariance matrix size grows with KV dimension (m_h * d_h), which can be heavy for very large heads/dims.
- Benchmark assesses compressibility and PPL/GPT-based quality, not all downstream task metrics.
- GPT-score evaluation was limited (100 instructions), so human-aligned quality estimates are approximate.
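The covariance-size limitation above can be made concrete with simple arithmetic. The head count and head dimension below are a hypothetical configuration, and float64 accumulation is an assumption.

```python
def covariance_bytes(num_kv_heads, head_dim, bytes_per_element=8):
    """Memory for one (m_h * d_h) x (m_h * d_h) covariance matrix."""
    dim = num_kv_heads * head_dim
    return dim * dim * bytes_per_element


# Hypothetical configuration: 8 KV heads of dimension 128, float64 accumulation.
size = covariance_bytes(num_kv_heads=8, head_dim=128)
size_mib = size / 2**20  # 8.0 MiB per matrix
```

The cost is quadratic in `m_h * d_h`, and one such matrix is needed per layer for keys and another for values, so doubling the KV dimension quadruples the footprint.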
When Not To Use
- You lack representative dataset logs or cannot share tokens for SVD analysis.
- KV dimension is so large that covariance computation is infeasible on available hardware.
- You need guaranteed zero-loss compression for safety-critical outputs.
Failure Modes
- Over-compressing middle layers that actually store crucial high-rank signals.
- Interpreting low NER as universally safe compression when it may reflect under-training or poor tokenization.
- Model-specific behavior may break the NER vs PPL correlation on unseen tasks.
Core Entities
Models
- Qwen3-4B
- Qwen3-8B
- Gemma-2B
- Gemma-7B
- Mistral-7B
- Phi-3-mini-128k-instruct
- LLaMA-2-7B
Metrics
- Normalized Effective Rank (NER)
- Normalized Delta-Perplexity (ND-PPL)
- Perplexity (PPL)
- GPT-score
Datasets
- Alpaca
- MedAlpaca
- CodeAlpaca
- WizardCoder
- FunctionCall
- VisR-Bench multilingual split
Benchmarks
- KV-CoRE benchmark (this paper)
Context Entities
Models
- LLaMA-2-7B (used for comparison)
Metrics
- Effective rank (erank)
Datasets
- VisR-Bench (15 languages)

