KV-CoRE: an SVD-based tool and benchmark that measures how compressible LLM KV-caches are, per layer and per dataset.

February 5, 2026 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Jian Chen, Zhuoran Wang, Jiayu Qin, Ming Li, Meng Wang, Changyou Chen, Yin Chen, Qizhen Weng, Yirui Liu

Why It Matters For Business

Measuring per-layer, data-dependent compressibility shows where KV-cache compression can be applied safely, which helps reduce memory bandwidth and GPU costs.

Summary TLDR

KV-CoRE uses an incremental SVD on cached key/value activations to measure how much the KV-cache of a language model can be reduced without large quality loss. It introduces a simple compressibility metric, Normalized Effective Rank (NER), and a robustness metric, Normalized Delta-Perplexity (ND-PPL). Across multiple open models and datasets, keys are more compressible than values, compressibility varies by layer and language, and NER correlates with perplexity and human-aligned GPT-scores. This makes NER a practical diagnostic for per-layer, data-aware KV-cache compression.
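To make the NER idea concrete, here is a minimal sketch that assumes the common definition of effective rank (the exponential of the Shannon entropy of the normalized singular values), divided by the number of singular values. The paper's exact normalization may differ, and the function names are hypothetical.

```python
import math


def effective_rank(singular_values):
    """Effective rank: exp of the Shannon entropy of the normalized spectrum.

    Assumption: entropy is taken over singular values (not their squares).
    """
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("spectrum must contain at least one positive value")
    entropy = 0.0
    for s in singular_values:
        p = s / total
        if p > 0:  # treat 0 * log(0) as 0
            entropy -= p * math.log(p)
    return math.exp(entropy)


def normalized_effective_rank(singular_values):
    """Hypothetical NER: effective rank divided by the full dimension."""
    return effective_rank(singular_values) / len(singular_values)
```

Under this definition a flat spectrum gives an NER of 1 (every direction is used equally), while a spectrum dominated by one direction gives an NER close to 1 divided by the dimension.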

Problem Statement

KV-caches avoid recomputation, but their memory-bandwidth cost grows as the context gets longer. Existing compression methods often ignore that key/value activations depend on the data and vary across layers. We need a dataset-level, layer-wise, data-aware way to measure how much KV-caches can be compressed without hurting model quality.

Main Contribution

KV-CoRE: an incremental, dataset-level SVD method that computes optimal low-rank approximations of cached keys/values with low memory overhead.

NER (Normalized Effective Rank): a compact, per-layer metric that predicts compressibility and correlates with downstream performance.

ND-PPL: a dataset-agnostic robustness metric that quantifies end-to-end perplexity change under KV truncation.

A large-scale empirical benchmark across multiple open LLMs, domains, and 15 languages, exposing layer-wise and language-dependent compressibility patterns.
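The incremental, dataset-level SVD can be pictured as accumulating a Gram (uncentered covariance) matrix batch by batch and eigendecomposing it once at the end. The sketch below illustrates that general technique and is not the authors' implementation. The class name `IncrementalKVSpectrum`, its methods, and the choice of an uncentered Gram matrix are assumptions.

```python
import numpy as np


class IncrementalKVSpectrum:
    """Accumulates X^T X over batches of cached key (or value) vectors."""

    def __init__(self, dim):
        self.gram = np.zeros((dim, dim), dtype=np.float64)
        self.count = 0

    def update(self, batch):
        # batch has shape (tokens, dim); only the dim x dim Gram matrix is kept,
        # so memory does not grow with the number of tokens seen.
        batch = np.asarray(batch, dtype=np.float64)
        self.gram += batch.T @ batch
        self.count += batch.shape[0]

    def decompose(self):
        # Eigenvalues of X^T X are the squared singular values of the stacked X;
        # eigenvectors are its right singular vectors.
        eigvals, eigvecs = np.linalg.eigh(self.gram)  # ascending order
        order = np.argsort(eigvals)[::-1]
        eigvals = np.clip(eigvals[order], 0.0, None)  # guard tiny negatives
        return np.sqrt(eigvals), eigvecs[:, order]

    def projection(self, rank):
        # Top-`rank` right singular vectors: by the Eckart-Young theorem,
        # projecting onto them gives the best rank-`rank` approximation
        # of the stacked activations in Frobenius norm.
        _, basis = self.decompose()
        return basis[:, :rank]
```

Calling `update` once per batch of cached activations and `decompose` at the end yields a singular-value spectrum that can be fed into an effective-rank computation.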

Key Findings

Keys are substantially more compressible than values across models and datasets.

Numbers: for example, Qwen3-4B averages NER-K=0.424 vs NER-V=0.717 (Table 1).

NER predicts performance degradation from truncation, strongly for values and moderately for keys.

Numbers: correlation between NER and ND-PPL on the evaluated datasets is r=0.88 for values and r=0.64 for keys.

Compressibility varies by layer: middle layers often use their rank more fully, while early and late layers are more compressible.

Numbers: layer-wise NER plots show consistent peaks in middle layers (Figure 2; Appendix B.1).

Larger KV capacity can be under-used and thus more compressible.

Numbers: Gemma-7B avg NER-K≈0.337 vs Gemma-2B avg NER-K≈0.597 (Table 1).

Low-resource languages often show 'rank collapse' with low NER, especially in values.

Numbers: for example, Arabic NER-K=0.337, NER-V=0.582 vs Czech NER-K=0.383 (Table 1 multilingual).

Results

NER (keys vs values)

Value: NER-K=0.424 vs NER-V=0.717 (example: Qwen3-4B)

Baseline: uncompressed KV vectors

Correlation NER vs ND-PPL

Value: values r=0.88, keys r=0.64 (Pearson)

Layer-wise pattern

Value: middle layers higher NER; early/late layers lower

Model capacity vs compressibility

Value: Gemma-7B NER-K≈0.337 vs Gemma-2B NER-K≈0.597

Who Should Care

What To Try In 7 Days

Run KV-CoRE's incremental SVD on a batch of your inference logs to get per-layer NER.

Rank layers by NER and trial stronger key-side truncation on low-NER layers.

Measure ND-PPL and a small GPT-based quality check to validate compression before deployment.
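The ranking step above could be sketched as follows. The NER values are made-up placeholders, not results from the paper. The `keep_rank` heuristic (keep roughly NER times the head dimension, plus a safety margin) is a hypothetical starting point that still needs validation against ND-PPL and a quality check.

```python
import math


def rank_layers_for_truncation(ner_by_layer, top_k):
    """Return the `top_k` layer ids with the lowest NER (most compressible first)."""
    ordered = sorted(ner_by_layer, key=ner_by_layer.get)
    return ordered[:top_k]


def keep_rank(ner, head_dim, margin=1.25, min_rank=8):
    """Hypothetical heuristic: keep about NER * head_dim dimensions, with a margin."""
    proposed = math.ceil(ner * head_dim * margin)
    return max(min_rank, min(head_dim, proposed))


# Placeholder per-layer key NER values, for illustration only.
example_ner_k = {0: 0.31, 1: 0.58, 2: 0.72, 3: 0.44}

candidates = rank_layers_for_truncation(example_ner_k, top_k=2)
plan = {layer: keep_rank(example_ner_k[layer], head_dim=128) for layer in candidates}
```

Here `plan` maps each candidate layer to a trial key rank. Treat those ranks as hypotheses and confirm them with an end-to-end perplexity measurement before deployment.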

Optimization Features

Token Efficiency

  • reduces KV memory/bandwidth per token by caching low-dim projections

Infra Optimization

  • reduces HBM pressure and memory transfer during autoregressive decoding

Model Optimization

  • data-dependent low-rank truncation of KV projections
  • SVD-based recovery of optimal compression matrices

System Optimization

  • bandwidth reduction for long-context decoding

Inference Optimization

  • store and transmit down-projected (low-dim) keys during decoding
  • per-layer variable compression ratios
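As a rough illustration of storing down-projected keys, the sketch below uses synthetic keys that lie close to a low-rank subspace, caches them in that subspace, and compares attention logits against the uncompressed version. The sizes, the synthetic data, and the variable names are all assumptions. In practice the basis would come from a dataset-level SVD rather than from the keys being compressed.

```python
import numpy as np

rng = np.random.default_rng(0)
head_dim, rank, n_tokens = 64, 16, 32

# Synthetic keys: a rank-16 signal plus small noise (illustrative only).
latent = rng.normal(size=(n_tokens, rank))
mixing = rng.normal(size=(rank, head_dim))
keys = latent @ mixing + 1e-3 * rng.normal(size=(n_tokens, head_dim))

# Orthonormal basis for the top-`rank` right singular directions.
_, _, vt = np.linalg.svd(keys, full_matrices=False)
basis = vt[:rank].T  # shape (head_dim, rank)

# Cache only the low-dimensional projection of each key.
compressed_keys = keys @ basis  # shape (n_tokens, rank)

# At decode time, project the query once and score against compressed keys.
query = rng.normal(size=(head_dim,))
exact_scores = keys @ query
approx_scores = compressed_keys @ (basis.T @ query)

max_abs_error = float(np.max(np.abs(exact_scores - approx_scores)))
```

The cache here holds 16 numbers per key instead of 64. The logits stay close only because the synthetic keys are nearly rank 16; real layers with high NER would lose more.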

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires access to representative tokens to compute dataset-level SVD; results depend on that dataset.
  • Covariance matrix size grows with KV dimension (m_h * d_h), which can be heavy for very large heads/dims.
  • Benchmark assesses compressibility and PPL/GPT-based quality, not all downstream task metrics.
  • GPT-score evaluation was limited (100 instructions), so human-aligned quality estimates are approximate.
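The covariance-size limitation can be made concrete with a little arithmetic. The sketch below reads m_h as the number of KV heads and d_h as the per-head dimension, which is an assumption about the notation. The example configurations are hypothetical.

```python
def covariance_bytes(num_kv_heads, head_dim, bytes_per_element=4):
    """Memory for one dense (m_h * d_h) x (m_h * d_h) covariance matrix."""
    dim = num_kv_heads * head_dim
    return dim * dim * bytes_per_element


# Hypothetical configurations, per layer and per key/value side:
small = covariance_bytes(num_kv_heads=8, head_dim=128)    # fp32
large = covariance_bytes(num_kv_heads=32, head_dim=128, bytes_per_element=8)  # fp64

small_mib = small / 2**20
large_mib = large / 2**20
```

Because the cost is quadratic in the KV dimension, quadrupling the number of heads multiplies the matrix size by sixteen, before counting one matrix per layer for keys and another for values.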

When Not To Use

  • You lack representative dataset logs or cannot share tokens for SVD analysis.
  • KV dimension is so large that covariance computation is infeasible on available hardware.
  • You need guaranteed zero-loss compression for safety-critical outputs.

Failure Modes

  • Over-compressing middle layers that actually store crucial high-rank signals.
  • Interpreting low NER as universally safe compression when it may reflect under-training or poor tokenization.
  • Model-specific behavior may break the NER vs PPL correlation on unseen tasks.

Core Entities

Models

  • Qwen3-4B
  • Qwen3-8B
  • Gemma-2B
  • Gemma-7B
  • Mistral-7B
  • Phi-3-mini-128k-instruct
  • LLaMA-2-7B

Metrics

  • Normalized Effective Rank (NER)
  • Normalized Delta-Perplexity (ND-PPL)
  • Perplexity (PPL)
  • GPT-score

Datasets

  • Alpaca
  • MedAlpaca
  • CodeAlpaca
  • WizardCoder
  • FunctionCall
  • VisR-Bench multilingual split

Benchmarks

  • KV-CoRE benchmark (this paper)

Context Entities

Models

  • LLaMA-2-7B (used for comparison)

Metrics

  • Effective rank (erank)

Datasets

  • VisR-Bench (15 languages)