KV-CoRE: an SVD-based tool and benchmark that measures how compressible LLM KV-caches are, per layer and per dataset.

Overview

Decision Snapshot: Ready For Pilot

The method is simple, gradient-free, and tested on multiple open models and datasets. Correlations with PPL and GPT-scores support its practical utility, though integrating it into a production pipeline needs engineering work for large KV dimensions.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Jian Chen, Zhuoran Wang, Jiayu Qin, Ming Li, Meng Wang, Changyou Chen, Yin Chen, Qizhen Weng, Yirui Liu

Links

Abstract / PDF

Why It Matters For Business

Measuring per-layer, data-dependent compressibility shows where KV-cache compression can be applied safely, which in turn can reduce KV-cache memory bandwidth and GPU costs.

Who Should Care

ML Engineer, Engineering Lead, Product Manager, Data Scientist

Summary TLDR

KV-CoRE uses an incremental SVD on cached key/value activations to measure how much the KV-cache of a language model can be reduced without large quality loss. It introduces a simple metric, Normalized Effective Rank (NER), and a robustness metric, Normalized Delta-Perplexity (ND-PPL). Across multiple open models and datasets, keys are more compressible than values, compressibility varies by layer and language, and NER correlates with perplexity and human-aligned GPT-scores. This makes NER a practical diagnostic for per-layer, data-aware KV-cache compression.

Problem Statement

KV-caches reduce recomputation but grow memory-bandwidth cost as context expands. Existing compression methods often ignore that key/value activations depend on data and vary across layers. We need a dataset-level, layer-wise, data-aware way to measure how much KV caches can be compressed without hurting model quality.

Main Contribution

KV-CoRE: an incremental, dataset-level SVD method that computes optimal low-rank approximations of cached keys/values with low memory overhead.

NER (Normalized Effective Rank): a compact, per-layer metric that predicts compressibility and correlates with downstream performance.
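The two contributions above can be illustrated with a short sketch. It is not the authors' implementation. It assumes that the incremental step accumulates a Gram matrix over token batches, and that NER is the entropy-based effective rank (erank) of the singular values divided by the dimension. This summary does not spell out either formula, so the function names and the exact normalization are assumptions.

```python
import numpy as np

def accumulate_gram(batches, dim):
    """Accumulate X^T X over batches of shape (tokens, dim).

    Only a (dim, dim) matrix is kept in memory, so the full stack of
    cached activations never has to be materialized at once.
    """
    gram = np.zeros((dim, dim), dtype=np.float64)
    for x in batches:
        x = np.asarray(x, dtype=np.float64)
        gram += x.T @ x
    return gram

def singular_values_from_gram(gram):
    """Singular values of the stacked data, in descending order."""
    eigvals = np.linalg.eigvalsh(gram)           # ascending order
    eigvals = np.clip(eigvals, 0.0, None)[::-1]  # drop tiny negatives, flip
    return np.sqrt(eigvals)

def normalized_effective_rank(singular_values, eps=1e-12):
    """ASSUMED definition: exp(entropy of normalized singular values) / dim."""
    s = np.asarray(singular_values, dtype=np.float64)
    p = s / (s.sum() + eps)
    p = p[p > eps]                               # treat 0 * log(0) as 0
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy) / len(s))

# Usage on synthetic data standing in for one layer's cached keys.
rng = np.random.default_rng(0)
key_batches = [rng.normal(size=(50, 8)) for _ in range(4)]
gram = accumulate_gram(key_batches, dim=8)
ner_k = normalized_effective_rank(singular_values_from_gram(gram))
```

Under this assumed definition, NER lies between 1/dim (all energy in one direction) and 1 (energy spread evenly), so lower values indicate more compressible activations.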

Key Findings

Keys are substantially more compressible than values across models and datasets.

Numbers: Example: Qwen3-4B avg NER-K = 0.424 vs NER-V = 0.717 (Table 1).

Practical Use: Prioritize key-side compression first: you can shrink key vectors more aggressively with less performance risk.

Evidence Ref: Table 1; Section 4.2.1

NER strongly predicts performance degradation from truncation.

Numbers: Correlation between NER and ND-PPL on evaluated datasets: values r = 0.88, keys r = 0.64.

Practical Use: Compute per-layer NER to estimate which layers and datasets will tolerate compression before testing full inference.

Evidence Ref: Figure 5; Section 4.4
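The r values quoted above are Pearson correlations. To run the same check on your own measurements, a minimal sketch could look like the following. The helper name `pearson_r` is hypothetical, and the numbers in the usage lines are toy placeholders, not results from the paper.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if x.shape != y.shape or x.size < 2:
        raise ValueError("need two equal-length sequences with >= 2 points")
    return float(np.corrcoef(x, y)[0, 1])

# Toy placeholders: one NER and one ND-PPL measurement per dataset.
toy_ner = [1.0, 2.0, 3.0, 4.0]
toy_nd_ppl = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(toy_ner, toy_nd_ppl)  # perfectly linear toy data
```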

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
NER (keys vs values)	Example: NER-K=0.424 vs NER-V=0.717	uncompressed KV vectors	—	Qwen3-4B averaged over multi-domain English datasets (Table 1)	Table 1 reports layer-averaged NER for keys and values; keys show consistently lower NER.	Table 1; Section 4.2.1
Correlation NER vs ND-PPL	values r=0.88, keys r=0.64 (Pearson)	—	—	Dataset-level across evaluated datasets (Figure 5)	Scatter plots and Pearson r reported for NER vs ND-PPL.	Figure 5; Section 4.4

What To Try In 7 Days

Run KV-CoRE's incremental SVD on a batch of your inference logs to get per-layer NER.

Rank layers by NER and trial stronger key-side truncation on low-NER layers.

Measure ND-PPL and a small GPT-based quality check to validate compression before deployment.
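The ranking step above can be sketched as follows. The rank-budget rule here is a hypothetical heuristic (keep roughly NER times the dimension) and is not taken from the paper. It only illustrates how per-layer NER values might be turned into per-layer truncation targets, which still need validation with ND-PPL and a quality check.

```python
import math

def rank_layers_by_ner(ner_by_layer):
    """Return layer ids sorted from lowest NER (most compressible) to highest."""
    return sorted(ner_by_layer, key=ner_by_layer.get)

def proposed_rank(ner, dim, safety=1.0):
    """HYPOTHETICAL heuristic: keep about ner * dim directions, at least 1."""
    return min(dim, max(1, math.ceil(ner * dim * safety)))

# Toy per-layer key NER values (placeholders, not measured results).
toy_ner_k = {0: 0.70, 1: 0.40, 2: 0.50}
order = rank_layers_by_ner(toy_ner_k)
budgets = {layer: proposed_rank(toy_ner_k[layer], dim=128) for layer in order}
```

A `safety` factor above 1.0 keeps more directions than the heuristic suggests, which is a conservative starting point for a first trial.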

Optimization Features

Token Efficiency

reduces KV memory/bandwidth per token by caching low-dim projections

Infra Optimization

reduces HBM pressure and memory transfer during autoregressive decoding

Model Optimization

data-dependent low-rank truncation of KV projections

SVD-based recovery of optimal compression matrices

System Optimization

bandwidth reduction for long-context decoding

Inference Optimization

store and transmit down-projected (low-dim) keys during decoding

per-layer variable compression ratios
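To make the down-projection idea concrete, here is a minimal sketch of storing keys in a rank-r subspace. It assumes the projection is built from the top-r eigenvectors of a Gram matrix of cached keys. By the Eckart-Young theorem, this gives the best rank-r approximation in Frobenius norm for the data the matrix was computed on. The function names are hypothetical and the details of the paper's compression matrices may differ.

```python
import numpy as np

def top_r_projection(gram, r):
    """Return a (dim, r) matrix whose columns are the top-r eigenvectors."""
    _, eigvecs = np.linalg.eigh(gram)   # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :r]      # reorder to descending, keep r

def compress(x, proj):
    """Down-project rows of x from dim to r coordinates."""
    return x @ proj

def reconstruct(z, proj):
    """Map r-dimensional codes back to the original dimension."""
    return z @ proj.T

# Synthetic keys with exact rank 4 in a 16-dimensional space.
rng = np.random.default_rng(0)
keys = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 16))
proj = top_r_projection(keys.T @ keys, r=4)

small_keys = compress(keys, proj)       # shape (200, 4): what gets cached
query = rng.normal(size=(1, 16))
scores_full = query @ keys.T
scores_low = compress(query, proj) @ small_keys.T
```

Because the projected query is scored directly against the projected keys, attention logits can be computed without reconstructing the full-size keys. On real activations the match is approximate, and the error depends on how much energy lies outside the kept directions.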

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Requires access to representative tokens to compute dataset-level SVD; results depend on that dataset.

Covariance matrix size grows with KV dimension (m_h * d_h), which can be heavy for very large heads/dims.
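A quick back-of-the-envelope calculation shows how the covariance cost scales. This sketch reads m_h as the number of KV heads and d_h as the per-head dimension, and assumes float64 storage. Both readings are assumptions, since this summary does not define the symbols, and the head counts below are illustrative rather than taken from any evaluated model.

```python
def covariance_bytes(m_h, d_h, bytes_per_element=8):
    """Memory for one dense (m_h * d_h) x (m_h * d_h) covariance matrix."""
    dim = m_h * d_h
    return dim * dim * bytes_per_element

def to_mib(num_bytes):
    """Convert bytes to mebibytes."""
    return num_bytes / (1024 ** 2)

# Illustrative configurations (assumed, not from the paper).
small = to_mib(covariance_bytes(m_h=8, d_h=128))    # dim 1024
large = to_mib(covariance_bytes(m_h=32, d_h=128))   # dim 4096
```

The cost is quadratic in m_h * d_h, so quadrupling the KV dimension multiplies the per-matrix memory by 16, and that amount is needed separately for keys and values in every layer being analyzed.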

When Not To Use

You lack representative dataset logs or cannot share tokens for SVD analysis.

KV dimension is so large that covariance computation is infeasible on available hardware.

Failure Modes

Over-compressing middle layers that actually store crucial high-rank signals.

Interpreting low NER as universally safe compression when it may reflect under-training or poor tokenization.

Core Entities

Models

Qwen3-4B, Qwen3-8B, Gemma-2B, Gemma-7B, Mistral-7B, Phi-3-mini-128k-instruct, LLaMA-2-7B

Metrics

Normalized Effective Rank (NER), Normalized Delta-Perplexity (ND-PPL), Perplexity (PPL), GPT-score

Datasets

Alpaca, MedAlpaca, CodeAlpaca, WizardCoder, FunctionCall, VisR-Bench multilingual split

Benchmarks

KV-CoRE benchmark (this paper)

Context Entities

Models

LLaMA-2-7B (used for comparison)

Metrics

Effective rank (erank)

Datasets

VisR-Bench (15 languages)
