KV-CoRE: an SVD-based tool and benchmark that measures how compressible LLM KV-caches are, per layer and per dataset.

February 5, 2026 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method is simple, gradient-free, and tested on multiple open models and datasets. Correlations with PPL and GPT-scores support its practical utility, though integrating it into a production pipeline will need engineering work for large KV dimensions.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Jian Chen, Zhuoran Wang, Jiayu Qin, Ming Li, Meng Wang, Changyou Chen, Yin Chen, Qizhen Weng, Yirui Liu

Why It Matters For Business

Measuring per-layer, data-dependent compressibility shows where KV-cache compression can be applied safely, which in turn can lower memory-bandwidth and GPU costs.

Summary TLDR

KV-CoRE uses an incremental SVD on cached key/value activations to measure how much the KV-cache of a language model can be reduced without large quality loss. It introduces a simple metric, Normalized Effective Rank (NER), and a robustness metric, Normalized Delta-Perplexity (ND-PPL). Across multiple open models and datasets, keys are more compressible than values, compressibility varies by layer and language, and NER correlates with perplexity and human-aligned GPT-scores. This makes NER a practical diagnostic for per-layer, data-aware KV-cache compression.

Problem Statement

KV-caches reduce recomputation but grow memory-bandwidth cost as context expands. Existing compression methods often ignore that key/value activations depend on data and vary across layers. We need a dataset-level, layer-wise, data-aware way to measure how much KV caches can be compressed without hurting model quality.

Main Contribution

KV-CoRE: an incremental, dataset-level SVD method that computes optimal low-rank approximations of cached keys/values with low memory overhead.

NER (Normalized Effective Rank): a compact, per-layer metric that predicts compressibility and correlates with downstream performance.
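
The two contributions can be pictured together with a minimal sketch. It assumes the incremental SVD can be emulated by accumulating a Gram matrix X^T X over batches of cached vectors and eigendecomposing it once at the end, and it assumes NER is the entropy-based effective rank of the normalized singular values divided by the full dimension. The class and function names are hypothetical, and the paper's exact update rule and normalization may differ.

```python
import numpy as np


class StreamingKVSpectrum:
    """Accumulates X^T X over batches of cached key (or value) vectors.

    ASSUMPTION: this Gram-matrix accumulation stands in for the paper's
    incremental SVD; only a (dim x dim) matrix is kept in memory.
    """

    def __init__(self, dim: int):
        self.dim = dim
        self.gram = np.zeros((dim, dim), dtype=np.float64)
        self.n_tokens = 0

    def update(self, batch: np.ndarray) -> None:
        # batch has shape (tokens, dim); tokens may differ between calls.
        batch = np.asarray(batch, dtype=np.float64)
        self.gram += batch.T @ batch
        self.n_tokens += batch.shape[0]

    def singular_values(self) -> np.ndarray:
        # Eigenvalues of X^T X are the squared singular values of X.
        eigvals = np.linalg.eigvalsh(self.gram)       # ascending order
        eigvals = np.clip(eigvals, 0.0, None)[::-1]   # descending, no tiny negatives
        return np.sqrt(eigvals)

    def projection(self, rank: int) -> np.ndarray:
        # Top-`rank` right singular vectors of X, as columns of a (dim, rank) matrix.
        _, eigvecs = np.linalg.eigh(self.gram)        # ascending order
        return eigvecs[:, ::-1][:, :rank]


def normalized_effective_rank(sigma) -> float:
    """exp(entropy of normalized singular values) / dim.

    ASSUMPTION: the distribution is built from singular values, not from
    their squares; the paper may define it differently.
    """
    sigma = np.asarray(sigma, dtype=np.float64)
    total = sigma.sum()
    if total <= 0.0:
        return 0.0
    p = sigma / total
    p = p[p > 0.0]                                    # 0 * log(0) is treated as 0
    entropy = -np.sum(p * np.log(p))
    return float(np.exp(entropy) / sigma.size)
```

Under these assumptions a flat spectrum gives an NER of 1.0 (nothing to compress), while a spectrum with a single nonzero singular value gives 1/dim.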

Key Findings

Keys are substantially more compressible than values across models and datasets.

Numbers: For example, Qwen3-4B averages NER-K = 0.424 vs. NER-V = 0.717 (Table 1).

Practical Use: Prioritize key-side compression first; you can shrink key vectors more aggressively with less performance risk.

Evidence Ref: Table 1; Section 4.2.1

NER strongly predicts performance degradation from truncation.

Numbers: Pearson correlation between NER and ND-PPL on the evaluated datasets is r = 0.88 for values and r = 0.64 for keys.

Practical Use: Compute per-layer NER to estimate which layers and datasets will tolerate compression before testing full inference.

Evidence Ref: Figure 5; Section 4.4
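
To reproduce this kind of correlation check on your own measurements, a small sketch like the following may help. It assumes ND-PPL is the relative perplexity increase after truncation, which is a guess at the normalization and not the paper's confirmed definition; the function names are hypothetical.

```python
import numpy as np


def nd_ppl(ppl_base: float, ppl_compressed: float) -> float:
    # ASSUMPTION: "normalized delta" means relative increase over the
    # uncompressed perplexity; the paper may normalize differently.
    return (ppl_compressed - ppl_base) / ppl_base


def pearson_r(x, y) -> float:
    """Pearson correlation between two equal-length sequences."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))
```

Feed `pearson_r` one NER value and one ND-PPL value per dataset (or per layer), computed at the same truncation setting, and compare the result for keys and for values separately.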

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| NER (keys vs values) | Example: NER-K=0.424 vs NER-V=0.717 | uncompressed KV vectors | | Qwen3-4B averaged over multi-domain English datasets (Table 1) | Table 1 reports layer-averaged NER for keys and values; keys show consistently lower NER. | Table 1; Section 4.2.1 |
| Correlation NER vs ND-PPL | values r=0.88, keys r=0.64 (Pearson) | | | Dataset-level across evaluated datasets (Figure 5) | Scatter plots and Pearson r reported for NER vs ND-PPL. | Figure 5; Section 4.4 |

What To Try In 7 Days

Run KV-CoRE's incremental SVD on a batch of your inference logs to get per-layer NER.

Rank layers by NER and trial stronger key-side truncation on low-NER layers.

Measure ND-PPL and a small GPT-based quality check to validate compression before deployment.
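
The ranking and truncation steps above could be prototyped roughly as follows. This is a sketch under stated assumptions: the energy-retention threshold is a common heuristic chosen here for illustration, not the paper's selection rule, and both function names are hypothetical.

```python
import numpy as np


def rank_for_energy(sigma, keep_energy: float = 0.95) -> int:
    """Smallest rank whose leading singular values keep `keep_energy`
    of the total squared-singular-value energy (illustrative heuristic)."""
    energy = np.sort(np.asarray(sigma, dtype=np.float64) ** 2)[::-1]
    cumulative = np.cumsum(energy) / energy.sum()
    rank = int(np.searchsorted(cumulative, keep_energy)) + 1
    return min(rank, energy.size)  # guard against floating-point round-off


def layers_to_compress_first(layer_ner: dict, count: int) -> list:
    """Layer ids sorted by ascending NER, i.e. most compressible first."""
    return sorted(layer_ner, key=layer_ner.get)[:count]
```

A pilot might start with the few lowest-NER layers on the key side only, pick a rank per layer with `rank_for_energy`, and then validate with ND-PPL and a quality check before widening the rollout.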

Optimization Features

Token Efficiency: reduces KV memory/bandwidth per token by caching low-dim projections
Infra Optimization: reduces HBM pressure and memory transfer during autoregressive decoding
Model Optimization: data-dependent low-rank truncation of KV projections; SVD-based recovery of optimal compression matrices
System Optimization: bandwidth reduction for long-context decoding
Inference Optimization: store and transmit down-projected (low-dim) keys during decoding; per-layer variable compression ratios
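
As an illustration of storing down-projected keys, the sketch below caches keys in a low-dimensional basis and projects the query into the same basis before taking dot products. It is a simplified single-head example that ignores positional encodings, softmax scaling, and value-side handling; the function names are hypothetical and the projection is fit with a plain SVD on calibration keys.

```python
import numpy as np


def fit_key_projection(calibration_keys: np.ndarray, rank: int) -> np.ndarray:
    # calibration_keys: (tokens, dim). Rows of vt are right singular vectors,
    # so the returned matrix has orthonormal columns and shape (dim, rank).
    _, _, vt = np.linalg.svd(calibration_keys, full_matrices=False)
    return vt[:rank].T


def compress_keys(keys: np.ndarray, proj: np.ndarray) -> np.ndarray:
    # What would be written to the cache: (tokens, rank) instead of (tokens, dim).
    return keys @ proj


def attention_logits(query: np.ndarray, low_dim_keys: np.ndarray,
                     proj: np.ndarray) -> np.ndarray:
    # (P^T q) . (P^T k) approximates q . k when the keys lie close to span(P).
    return low_dim_keys @ (proj.T @ query)
```

When the keys truly occupy a `rank`-dimensional subspace, the logits match the uncompressed ones up to round-off; otherwise the discarded singular directions determine the approximation error.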

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Requires access to representative tokens to compute dataset-level SVD; results depend on that dataset.

Covariance matrix size grows with KV dimension (m_h * d_h), which can be heavy for very large heads/dims.
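
A quick back-of-the-envelope helper can show how the covariance footprint scales. It assumes one dense (m_h * d_h) x (m_h * d_h) matrix per layer for keys and another for values, stored in 4-byte floats by default; the function names and the per-layer layout are assumptions for illustration.

```python
def covariance_bytes(num_kv_heads: int, head_dim: int,
                     bytes_per_element: int = 4) -> int:
    """Memory for one dense covariance matrix of side m_h * d_h."""
    dim = num_kv_heads * head_dim
    return dim * dim * bytes_per_element


def calibration_footprint_bytes(num_layers: int, num_kv_heads: int,
                                head_dim: int, bytes_per_element: int = 4) -> int:
    # ASSUMPTION: one matrix for keys and one for values in every layer.
    per_matrix = covariance_bytes(num_kv_heads, head_dim, bytes_per_element)
    return 2 * num_layers * per_matrix
```

For a hypothetical configuration with 8 KV heads of dimension 128, one matrix takes 4 MiB in 4-byte floats, and 32 such layers take 256 MiB in total. Memory grows quadratically with m_h * d_h, and a dense symmetric eigendecomposition grows cubically in compute.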

When Not To Use

You lack representative dataset logs or cannot share tokens for SVD analysis.

KV dimension is so large that covariance computation is infeasible on available hardware.

Failure Modes

Over-compressing middle layers that actually store crucial high-rank signals.

Interpreting low NER as universally safe compression when it may reflect under-training or poor tokenization.

Core Entities

Models

Qwen3-4B, Qwen3-8B, Gemma-2B, Gemma-7B, Mistral-7B, Phi-3mini-128k-instruct, LLaMA-2-7B

Metrics

Normalized Effective Rank (NER), Normalized Delta-Perplexity (ND-PPL), Perplexity (PPL), GPT-score

Datasets

Alpaca, MedAlpaca, CodeAlpaca, WizardCoder, FunctionCall, VisR-Bench multilingual split

Benchmarks

KV-CoRE benchmark (this paper)

Context Entities

Models

LLaMA-2-7B (used for comparison)

Metrics

Effective rank (erank)

Datasets

VisR-Bench (15 languages)