Overview
The method is simple, gradient-free, and tested on multiple open models and datasets. Its correlations with perplexity (PPL) and GPT-scores support practical utility, though integrating it into a production pipeline requires engineering work for large KV dimensions.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/4
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
Measuring per-layer, data-dependent compressibility reduces KV-cache memory bandwidth and GPU costs while guiding where to apply compression safely.
Summary TLDR
KV-CoRE runs an incremental SVD on cached key/value activations to measure how far a language model's KV-cache can be reduced without large quality loss. It introduces a simple metric, Normalized Effective Rank (NER), and a robustness metric, ND-PPL. Across multiple open models and datasets, keys are more compressible than values, compressibility varies by layer and language, and NER correlates with perplexity and human-aligned GPT-scores, which makes it a practical diagnostic for per-layer, data-aware KV-cache compression.
Problem Statement
KV-caches reduce recomputation but grow memory-bandwidth cost as context expands. Existing compression methods often ignore that key/value activations depend on data and vary across layers. We need a dataset-level, layer-wise, data-aware way to measure how much KV caches can be compressed without hurting model quality.
Main Contribution
KV-CoRE: an incremental, dataset-level SVD method that computes optimal low-rank approximations of cached keys/values with low memory overhead.
NER (Normalized Effective Rank): a compact, per-layer metric that predicts compressibility and correlates with downstream performance.
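The two contributions above can be sketched together. This is a minimal illustration, not the paper's implementation: it assumes the incremental SVD is done by accumulating the Gram matrix X^T X over batches of cached activations (consistent with the covariance-size limitation noted under Limitations), and it assumes NER is the entropy-based effective rank divided by the full dimension, since this summary does not give the formula. The names `StreamingSpectrum` and `normalized_effective_rank` are hypothetical.

```python
import numpy as np


class StreamingSpectrum:
    """Accumulates X^T X batch by batch, so the full activation matrix
    (tokens x dim) is never held in memory at once."""

    def __init__(self, dim: int):
        self.gram = np.zeros((dim, dim), dtype=np.float64)
        self.n_tokens = 0

    def update(self, batch: np.ndarray) -> None:
        # batch: (tokens, dim) keys or values of ONE layer,
        # with all heads concatenated along the last axis (assumption).
        batch = np.asarray(batch, dtype=np.float64)
        self.gram += batch.T @ batch
        self.n_tokens += batch.shape[0]

    def singular_values(self) -> np.ndarray:
        # Eigenvalues of X^T X are the squared singular values of X.
        eigvals = np.linalg.eigvalsh(self.gram)      # ascending order
        eigvals = np.clip(eigvals, 0.0, None)[::-1]  # drop tiny negatives, descending
        return np.sqrt(eigvals)


def normalized_effective_rank(sigma: np.ndarray, eps: float = 1e-12) -> float:
    # ASSUMED definition: exp(Shannon entropy of the normalized singular
    # values) divided by the dimension, giving a score in (0, 1].
    sigma = np.asarray(sigma, dtype=np.float64)
    p = sigma / (sigma.sum() + eps)
    p = p[p > eps]                                   # skip zero-mass directions
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy) / sigma.size)
```

Under these assumptions, a layer whose activations occupy only a few directions scores close to 0, and a layer that spreads energy evenly over all directions scores close to 1. One accumulator per layer and per side (keys, values) would be needed.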
Key Findings
Keys are substantially more compressible than values across models and datasets.
NER predicts performance degradation from truncation: its Pearson correlation with ND-PPL is r=0.88 for values and r=0.64 for keys.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| NER (keys vs values) | Example: NER-K=0.424 vs NER-V=0.717 | uncompressed KV vectors | — | Qwen3-4B averaged over multi-domain English datasets (Table 1) | Table 1 reports layer-averaged NER for keys and values; keys show consistently lower NER. | Table 1; Section 4.2.1 |
| Correlation NER vs ND-PPL | values r=0.88, keys r=0.64 (Pearson) | — | — | Dataset-level across evaluated datasets (Figure 5) | Scatter plots and Pearson r reported for NER vs ND-PPL. | Figure 5; Section 4.4 |
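To reproduce the kind of correlation reported in the second row on your own measurements, a plain Pearson coefficient is enough. The sketch below is generic statistics rather than anything specific to KV-CoRE; `pearson_r` is a hypothetical helper, and the inputs would be one NER value and one ND-PPL value per dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

With only a handful of datasets the coefficient is noisy, so it is best read next to the scatter plot and not as a standalone number.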
What To Try In 7 Days
Run KV-CoRE's incremental SVD on a batch of your inference logs to get per-layer NER.
Rank layers by NER and trial stronger key-side truncation on low-NER layers.
Measure ND-PPL and a small GPT-based quality check to validate compression before deployment.
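The ranking and truncation steps above might look like the following sketch. Everything here is an assumption made for illustration: the energy-threshold rule for picking a rank is a common heuristic and is not taken from the paper, and the names `rank_layers_by_ner`, `rank_for_energy`, and `truncate` are hypothetical. The `basis` argument stands for the right singular vectors of a layer's key (or value) activations, stored as columns in descending order of singular value.

```python
import numpy as np


def rank_layers_by_ner(ner_per_layer: dict) -> list:
    # Lowest NER first: these layers are the first candidates for stronger truncation.
    return sorted(ner_per_layer, key=ner_per_layer.get)


def rank_for_energy(sigma, energy: float = 0.95) -> int:
    # Smallest r whose top-r squared singular values hold at least
    # `energy` of the total (heuristic, NOT from the source).
    sq = np.sort(np.asarray(sigma, dtype=np.float64))[::-1] ** 2
    cum = np.cumsum(sq) / sq.sum()
    r = int(np.searchsorted(cum, energy)) + 1
    return min(r, sq.size)


def truncate(x: np.ndarray, basis: np.ndarray, r: int) -> np.ndarray:
    # Project activations onto the top-r directions and reconstruct.
    top = basis[:, :r]
    return (x @ top) @ top.T
```

A cautious trial would apply `truncate` only to the keys of the lowest-NER layers first, then re-measure ND-PPL and the GPT-based check before widening the set of layers.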
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Requires access to representative tokens to compute dataset-level SVD; results depend on that dataset.
Covariance matrix size grows with KV dimension (m_h * d_h), which can be heavy for very large heads/dims.
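A quick back-of-the-envelope check helps decide whether the covariance limitation applies to a given model. The sketch below assumes m_h is the number of KV heads and d_h the per-head dimension (the summary does not define them), and that one dense square matrix is kept per layer and per side (keys, values). `covariance_footprint` is a hypothetical helper.

```python
def covariance_footprint(num_kv_heads: int, head_dim: int, bytes_per_element: int = 4) -> dict:
    """Rough size of one (m_h * d_h) x (m_h * d_h) covariance matrix."""
    dim = num_kv_heads * head_dim
    return {
        "dim": dim,
        # Dense square matrix; fp32 by default, pass 8 for fp64.
        "matrix_bytes": dim * dim * bytes_per_element,
        # A dense eigendecomposition scales roughly cubically in dim.
        "eig_cost_order": dim ** 3,
    }
```

For example, 8 KV heads of dimension 128 give a 1024 x 1024 matrix, about 4 MiB in fp32. Memory grows quadratically and eigendecomposition time roughly cubically with the concatenated dimension, which is where very wide KV configurations start to hurt.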
When Not To Use
You lack representative dataset logs or cannot share tokens for SVD analysis.
KV dimension is so large that covariance computation is infeasible on available hardware.
Failure Modes
Over-compressing middle layers that actually store crucial high-rank signals.
Interpreting low NER as universally safe compression when it may reflect under-training or poor tokenization.

