Compress KV cache by low-rank SVD on KV weight matrices with a layerwise progressive rule

Overview

Decision Snapshot: Needs Validation

The method is simple and plug-and-play, and is demonstrated on three LLaMA variants and four tasks. Theoretical bounds and a fast one-time SVD support deployment, but the evaluation is limited to those models and tasks.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 60%

Authors

Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, Yelong Shen

Why It Matters For Business

LoRC cuts KV cache memory by roughly 40–45% on the evaluated LLaMA models with an average accuracy drop below 1%, lowering GPU cost and enabling larger batches or longer contexts on the same hardware.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager, Data Scientist

Summary TLDR

LoRC compresses the KV cache by applying truncated SVD to the attention key/value weight matrices and choosing per-layer compressed dimensions with a progressive rule based on cumulative condition numbers. The method is plug-and-play (no retraining) and works with both MHA and GQA attention. On the evaluated LLaMA variants it shrinks the KV cache to roughly 55–60% of its original size (a 40–45% memory reduction) while keeping the average task performance loss below 1%.
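The weight-level idea can be illustrated with a minimal NumPy sketch. It assumes a column-vector convention (k = W_k x), toy random weights, and a key matrix built to be exactly low-rank so that the final check is exact. Real key/value weights are only approximately low-rank, and all names here (`W_k_small`, `W_q_small`, and so on) are hypothetical rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, rank = 64, 16, 8

# Toy weights, column-vector convention: q = W_q @ x, k = W_k @ x.
# ASSUMPTION: W_k is constructed with exact rank 8 so truncation is lossless.
W_q = rng.standard_normal((d_head, d_model))
W_k = rng.standard_normal((d_head, rank)) @ rng.standard_normal((rank, d_model))

# Truncated SVD of the key weights: W_k ~= U_r @ diag(S_r) @ Vt_r.
U, S, Vt = np.linalg.svd(W_k, full_matrices=False)
U_r = U[:, :rank]
W_k_small = S[:rank, None] * Vt[:rank]  # (rank, d_model): emits compressed keys
W_q_small = U_r.T @ W_q                 # (rank, d_model): query absorbs U_r

x_query = rng.standard_normal(d_model)
x_keys = rng.standard_normal((d_model, 5))  # hidden states of 5 cached tokens

# Attention logits with the full key cache (d_head values per token) ...
full_scores = (W_q @ x_query) @ (W_k @ x_keys)

# ... and with the compressed cache (rank values per token).
cached_keys = W_k_small @ x_keys
small_scores = (W_q_small @ x_query) @ cached_keys

print(cached_keys.shape)
print(np.allclose(full_scores, small_scores))
```

In this toy setup each cached key shrinks from 16 to 8 values, and the logits match because the discarded singular values are zero. With real weights the truncation introduces an approximation error that depends on how quickly the singular values decay.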

Problem Statement

KV cache memory grows with sequence length and batch size and becomes a bottleneck for serving LLMs. Existing fixes either change attention during training or drop tokens at test time; both require model changes or task-specific tuning. We need a simple, post-hoc compression method that reduces KV cache memory without retraining and that avoids amplifying errors across layers.
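As a rough illustration of that scaling, the sketch below uses the standard back-of-the-envelope estimate for a dense KV cache. The configuration values are hypothetical and chosen only to show the linear growth; they are not the settings behind the numbers reported in the paper.

```python
def kv_cache_bytes(num_layers, kv_dim, seq_len, batch_size, bytes_per_value=2):
    """Estimate the bytes needed to cache keys and values for every token.

    kv_dim is the per-token key (or value) width: num_kv_heads * head_dim.
    The leading factor of 2 counts keys and values separately.
    """
    return 2 * num_layers * kv_dim * seq_len * batch_size * bytes_per_value


# ASSUMPTION: an illustrative configuration, not a specific model.
base = kv_cache_bytes(num_layers=32, kv_dim=1024, seq_len=2048, batch_size=8)
longer = kv_cache_bytes(num_layers=32, kv_dim=1024, seq_len=4096, batch_size=8)
bigger_batch = kv_cache_bytes(num_layers=32, kv_dim=1024, seq_len=2048, batch_size=16)

print(base / 2**30)          # cache size in GiB
print(longer / base)         # doubling sequence length doubles the cache
print(bigger_batch / base)   # doubling batch size doubles it as well
```

Because `kv_dim` enters the product linearly, reducing the cached width per layer (which is what a low-rank projection does) reduces memory in the same proportion.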

Main Contribution

A post-hoc, weight-level KV cache compression method using low-rank (truncated SVD) approximation of key and value weight matrices.

A progressive layerwise compression strategy that sets per-layer compressed dimensions using cumulative condition numbers to limit error amplification from shallow layers.
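One way to picture a progressive allocation is sketched below. This is not the paper's exact formula: it assumes the cumulative condition number of a layer is the product of the key and value condition numbers from that layer through the last one, and it assumes a simple linear interpolation in log space between hypothetical bounds `d_min` and `d_max`, so that layers with a larger cumulative value (the shallow ones) keep more dimensions.

```python
import numpy as np


def log_cumulative_condition(key_weights, value_weights):
    """Log of the product of cond(W_k) * cond(W_v) from each layer to the last."""
    per_layer = np.array([
        np.log(np.linalg.cond(wk)) + np.log(np.linalg.cond(wv))
        for wk, wv in zip(key_weights, value_weights)
    ])
    # Summing logs from the last layer backwards avoids overflow in the product.
    return np.cumsum(per_layer[::-1])[::-1]


def allocate_dims(log_cum_cond, d_min, d_max):
    """ASSUMED rule: interpolate linearly between d_min and d_max."""
    lo, hi = log_cum_cond.min(), log_cum_cond.max()
    if hi == lo:
        return np.full(len(log_cum_cond), d_max, dtype=int)
    sensitivity = (log_cum_cond - lo) / (hi - lo)  # 1 = most sensitive layer
    return np.rint(d_min + sensitivity * (d_max - d_min)).astype(int)


# Toy stand-ins for per-layer key/value projection weights.
rng = np.random.default_rng(0)
num_layers, d_head, d_model = 6, 16, 64
key_weights = [rng.standard_normal((d_head, d_model)) for _ in range(num_layers)]
value_weights = [rng.standard_normal((d_head, d_model)) for _ in range(num_layers)]

dims = allocate_dims(
    log_cumulative_condition(key_weights, value_weights), d_min=4, d_max=16
)
print(dims)  # one compressed dimension per layer, shallow to deep
```

Since every condition number is at least 1, the cumulative value never increases with depth, so this sketch assigns the first layer `d_max` and the last layer `d_min`, with the layers in between decreasing monotonically.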

Key Findings

LoRC shrinks the KV cache to about 55–60% of its original size (a 40–45% reduction) while keeping the average performance drop under 1% on evaluated tasks.

Numbers: 55%–60% compression ratio (cache 40–45% smaller); avg perf drop <1%

Practical Use: You can cut KV memory by roughly 40–45% for LLaMA-like models and retain similar task accuracy on common benchmarks.

Evidence Ref: Table 2; Sec 6.4

Example per-model reduction: the LLaMA-2-13B KV cache shrinks from 50G to 27.5G (55% compression ratio, i.e. a 45% reduction) with a 0.47% average performance drop.

Numbers: 50G→27.5G (55% compression ratio); 0.47% drop

Practical Use: For a 13B LLaMA at batch size 64 and sequence length 2048, expect about 22.5G reclaimed while keeping accuracy nearly intact.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
KV cache size (LLaMA-2-13B)	50G → 27.5G	50G	−45%	batch size 64, seq len 2048 (Table 2)	Table 2 reports reduction to 27.5G at 55% compression ratio	Table 2
KV cache size (LLaMA-3-Instruct-8B)	8G → 4.8G	8G	−40%	batch size 64, seq len 2048 (Table 2)	Table 2 reports reduction to 4.8G at 60% compression ratio	Table 2
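The deltas in the table follow directly from the reported sizes. The short check below recomputes them and shows that the 55% and 60% compression ratios correspond to the fraction of the cache that is retained, not the fraction removed. The helper name `retained_and_saved` is hypothetical.

```python
def retained_and_saved(before_gb, after_gb):
    """Return (fraction of cache kept, fraction of cache freed)."""
    retained = after_gb / before_gb
    return retained, 1.0 - retained


# Sizes as reported in the table above (batch size 64, sequence length 2048).
reported = {
    "LLaMA-2-13B": (50.0, 27.5),
    "LLaMA-3-Instruct-8B": (8.0, 4.8),
}

for name, (before, after) in reported.items():
    retained, saved = retained_and_saved(before, after)
    print(f"{name}: retained {retained:.0%}, saved {saved:.0%}, "
          f"freed {before - after:.1f}G")
```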

What To Try In 7 Days

Run per-layer SVD on your model weights (one-time) to measure singular value decay and per-layer low-rank structure.

Apply LoRC with conservative d_min/d_max and the paper's cumulative-condition threshold to preserve shallow layers.

Benchmark memory savings and task accuracy on 1–2 core workloads (e.g., summarization and QA) to tune thresholds.
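For the first step, a minimal sketch of measuring per-layer low-rank structure is shown below. It assumes the projection weights have already been loaded as NumPy arrays; the layer names, the toy matrices, and the 95% energy threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np


def energy_rank(weight, energy=0.95):
    """Smallest rank whose singular values hold `energy` of the squared spectrum."""
    s = np.linalg.svd(weight, compute_uv=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    # First index where the cumulative share reaches the target, as a 1-based rank.
    rank = int(np.searchsorted(cumulative, energy)) + 1
    return min(rank, len(s))


# Toy stand-ins for real key projection weights (hypothetical layer names).
rng = np.random.default_rng(0)
toy_layers = {
    "layer_0.k_proj": rng.standard_normal((16, 64)),                                # full rank
    "layer_1.k_proj": rng.standard_normal((16, 4)) @ rng.standard_normal((4, 64)),  # rank 4
}

for name, weight in toy_layers.items():
    print(name, energy_rank(weight))
```

A layer whose energy rank is far below its full width has fast singular value decay and is a better candidate for aggressive truncation than one whose spectrum is nearly flat.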

Agent Features

Memory

reduces KV weight/cache memory

Tool Use

SVD, weight-level compression

Optimization Features

Token Efficiency

no token eviction needed

Infra Optimization

lower GPU memory usage enables larger batches/longer context

Model Optimization

low-rank SVD on KV weight matrices

update query/output matrices to absorb left singular vectors

System Optimization

one-time SVD preprocessing (fast)

Training Optimization

no retraining required

Inference Optimization

reduced KV cache size per layer

supports MHA and GQA without model change

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Compressing early (shallow) layers can amplify errors and greatly reduce accuracy.

Experiments are limited to LLaMA variants with MHA/GQA and four tasks; other models/tasks untested.

When Not To Use

When you must aggressively compress the first few layers; LoRC recommends keeping shallow layers mostly intact.

If you need guarantees on worst-case outputs for safety-critical systems without further validation.

Failure Modes

Uniform compression across layers causes catastrophic drops (example: 68% drop on LLaMA-3-70B shallow-block compression).

Improper thresholding may skip compression where it’s safe or compress sensitive layers too much.

Core Entities

Models

LLaMA-2-13B, LLaMA-3-Instruct-8B, LLaMA-3-Instruct-70B

Metrics

KV cache size (GB), Compression ratio, Accuracy

Datasets

BoolQ, XSum, OpenBookQA, GSM8K

Benchmarks

ROUGE, Accuracy
