Compress KV cache by low-rank SVD on KV weight matrices with a layerwise progressive rule

October 4, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

Method is simple and plug-and-play, demonstrated on three LLaMA variants and four tasks; theoretical bounds and fast SVD support deployment, but evaluations are limited to those models and tasks.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 60%

Authors

Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, Yelong Shen


Why It Matters For Business

LoRC cuts KV cache memory by roughly 40–45% on the evaluated LLaMA models with under 1% average accuracy loss on the reported benchmarks, lowering GPU cost and enabling larger batches or longer contexts on the same hardware.

Summary TLDR

LoRC compresses the KV cache by applying truncated SVD to the attention key/value weight matrices and choosing per-layer compressed dimensions with a progressive rule based on cumulative condition numbers. The method is plug-and-play (no retraining) and works with both MHA and GQA attention. At the reported 55–60% compression ratios, the KV cache shrinks to 55–60% of its original size (a 40–45% memory reduction) on LLaMA variants, while average task performance loss stays below 1% on the evaluated benchmarks.
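A minimal sketch of the weight-level idea for a single attention head, assuming weights are stored in an (out, in) layout so that keys are computed as `w_k @ x`. The function name `compress_key_projection` and the synthetic weights are illustrative assumptions, not the paper's code, and the sketch ignores positional encodings and multi-head bookkeeping. It shows that truncating the SVD of the key weights and folding the left singular vectors into the query weights yields a narrower cached key while approximately preserving attention scores.

```python
import numpy as np

def compress_key_projection(w_k, w_q, rank):
    """Rank-`rank` compression of one head's key projection (hypothetical helper).

    w_k, w_q: arrays of shape (d_head, d_model), i.e. (out, in) layout.
    Returns (w_k_small, w_q_small), both of shape (rank, d_model).
    """
    u, s, vt = np.linalg.svd(w_k, full_matrices=False)
    u_r, s_r, vt_r = u[:, :rank], s[:rank], vt[:rank, :]
    w_k_small = s_r[:, None] * vt_r   # produces the compressed keys that get cached
    w_q_small = u_r.T @ w_q           # query absorbs the left singular vectors
    return w_k_small, w_q_small

rng = np.random.default_rng(0)
d_model, d_head, rank = 64, 16, 8

# Synthetic key weights that are close to rank 8 (an assumption for the demo).
w_k = rng.normal(size=(d_head, rank)) @ rng.normal(size=(rank, d_model))
w_k += 0.01 * rng.normal(size=(d_head, d_model))
w_q = rng.normal(size=(d_head, d_model))
x = rng.normal(size=(d_model, 5))     # 5 token hidden states, one per column

w_k_small, w_q_small = compress_key_projection(w_k, w_q, rank)

scores_full = (w_q @ x).T @ (w_k @ x)                # uses d_head-wide keys
scores_small = (w_q_small @ x).T @ (w_k_small @ x)   # uses rank-wide keys
rel_err = np.linalg.norm(scores_full - scores_small) / np.linalg.norm(scores_full)
print(f"cached key width: {d_head} -> {rank}, relative score error: {rel_err:.4f}")
```

The same construction would apply to the value weights, with the output projection absorbing the corresponding factor. How well real model weights tolerate a given rank has to be measured rather than assumed.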

Problem Statement

KV cache memory grows with sequence length and batch size and becomes a bottleneck for serving LLMs. Existing fixes either change attention during training or drop tokens at test time; both require model changes or task-specific tuning. We need a simple, post-hoc compression method that reduces KV cache memory without retraining and that avoids amplifying errors across layers.
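To make the scaling concrete, here is a back-of-the-envelope estimate of KV cache size. The formula counts one key and one value vector per token per layer; the configuration values below are hypothetical and are not taken from the paper, whose own accounting may differ.

```python
def kv_cache_bytes(num_layers, batch_size, seq_len, kv_dim, bytes_per_value=2):
    """Estimate KV cache size in bytes.

    kv_dim is the total key (or value) width per token per layer,
    i.e. num_kv_heads * head_dim. The leading 2 counts keys and values.
    """
    return 2 * num_layers * batch_size * seq_len * kv_dim * bytes_per_value

# Hypothetical 32-layer model with 4096-wide keys/values in 16-bit precision.
base = kv_cache_bytes(num_layers=32, batch_size=8, seq_len=1024, kv_dim=4096)
longer = kv_cache_bytes(num_layers=32, batch_size=8, seq_len=2048, kv_dim=4096)

print(base / 1024**3, "GiB at seq_len=1024")
print(longer / 1024**3, "GiB at seq_len=2048")
```

Memory is linear in every factor, so shrinking `kv_dim` per layer (which is what a low-rank projection does) reduces the cache by the same proportion in that layer.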

Main Contribution

A post-hoc, weight-level KV cache compression method using low-rank (truncated SVD) approximation of key and value weight matrices.

A progressive layerwise compression strategy that sets per-layer compressed dimensions using cumulative condition numbers to limit error amplification from shallow layers.
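One way to picture the progressive rule is sketched below. It assumes that the cumulative condition number of a layer is the product of the key and value weight condition numbers from that layer through the last layer, and it maps the log of that quantity linearly onto the range between `d_min` and `d_max`. The linear mapping and both function names are assumptions made for illustration; the paper's exact rule may differ.

```python
import numpy as np

def log_cumulative_condition(kv_weights):
    """kv_weights: list of (w_k, w_v) pairs ordered from shallow to deep.

    Returns, per layer, the log of the product of condition numbers
    from that layer through the final layer.
    """
    per_layer = np.array([
        np.log(np.linalg.cond(w_k)) + np.log(np.linalg.cond(w_v))
        for w_k, w_v in kv_weights
    ])
    # Reverse cumulative sum: layer l accumulates layers l, l+1, ..., last.
    return np.cumsum(per_layer[::-1])[::-1]

def progressive_dims(log_cum, d_min, d_max):
    """Assumed mapping: larger cumulative condition number -> keep more dimensions."""
    lo, hi = log_cum.min(), log_cum.max()
    if hi == lo:
        return np.full(len(log_cum), d_max, dtype=int)
    t = (log_cum - lo) / (hi - lo)
    return np.rint(d_min + t * (d_max - d_min)).astype(int)

rng = np.random.default_rng(0)
# Six synthetic layers with (out, in) = (32, 64) key and value weights.
layers = [(rng.normal(size=(32, 64)), rng.normal(size=(32, 64))) for _ in range(6)]

log_cum = log_cumulative_condition(layers)
dims = progressive_dims(log_cum, d_min=8, d_max=32)
print(dims)  # shallow layers keep more dimensions than deep layers
```

Because every condition number is at least 1, the cumulative value never increases with depth, so this mapping leaves the shallowest layer at `d_max` and compresses the deepest layer to `d_min`.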

Key Findings

At compression ratios of 55–60%, LoRC shrinks the KV cache to 55–60% of its original size (a 40–45% reduction) while keeping the average performance drop under 1% on the evaluated tasks.

Numbers: 55%–60% compression ratio; avg perf drop <1%

Practical Use: You can cut KV memory by roughly 40–45% for LLaMA-like models and retain similar task accuracy on common benchmarks.

Evidence Ref: Table 2; Sec 6.4

Example per-model reduction: LLaMA-2-13B KV cache goes from 50G to 27.5G (55% of the original size, a 45% reduction) with a 0.47% average drop.

Numbers: 50G→27.5G (55% compression ratio); 0.47% drop

Practical Use: For a 13B LLaMA, expect tens of gigabytes reclaimed on common batch/seq settings while keeping accuracy nearly intact.

Evidence Ref: Table 2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| KV cache size (LLaMA-2-13B) | 50G → 27.5G | 50G | −45% | batch size 64, seq len 2048 (Table 2) | Table 2 reports reduction to 27.5G at 55% compression ratio | Table 2 |
| KV cache size (LLaMA-3-Instruct-8B) | 8G → 4.8G | 8G | −40% | batch size 64, seq len 2048 (Table 2) | Table 2 reports reduction to 4.8G at 60% compression ratio | Table 2 |
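The compression ratio and the delta in these rows describe the same change from two directions: the ratio is the share of memory that remains, the delta is the share removed. A small check of that arithmetic, using only the sizes reported above (the helper names are illustrative):

```python
def retained_percent(before_gb, after_gb):
    """Share of the original cache that remains (the 'compression ratio')."""
    return 100.0 * after_gb / before_gb

def reduction_percent(before_gb, after_gb):
    """Share of the original cache that was removed (the 'delta')."""
    return 100.0 * (before_gb - after_gb) / before_gb

for name, before, after in [("LLaMA-2-13B", 50.0, 27.5),
                            ("LLaMA-3-Instruct-8B", 8.0, 4.8)]:
    print(name,
          f"retained {retained_percent(before, after):.0f}%",
          f"reduced {reduction_percent(before, after):.0f}%")
```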

What To Try In 7 Days

Run per-layer SVD on your model weights (one-time) to measure singular value decay and per-layer low-rank structure.

Apply LoRC with conservative d_min/d_max and the paper's cumulative-condition threshold to preserve shallow layers.

Benchmark memory savings and task accuracy on 1–2 core workloads (e.g., summarize and QA) to tune thresholds.
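For the first step, a simple way to quantify singular value decay is to ask how many singular values are needed to capture a given share of a weight matrix's squared spectrum. The sketch below does that for one matrix; the `rank_for_energy` name and the energy threshold are assumptions chosen for illustration and are not the paper's selection rule.

```python
import numpy as np

def rank_for_energy(weight, energy=0.95):
    """Smallest rank whose leading singular values hold `energy` of the squared spectrum."""
    s = np.linalg.svd(weight, compute_uv=False)   # singular values, descending
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    rank = int(np.searchsorted(cumulative, energy)) + 1
    return min(rank, len(s))

rng = np.random.default_rng(0)

# A synthetic matrix of exact rank 3 stands in for a strongly low-rank layer.
low_rank = rng.normal(size=(16, 3)) @ rng.normal(size=(3, 64))
# A dense Gaussian matrix stands in for a layer with slow spectral decay.
dense = rng.normal(size=(16, 64))

print("low-rank layer:", rank_for_energy(low_rank, 0.999))
print("dense layer:   ", rank_for_energy(dense, 0.999))
```

Running this over each layer's key and value weights gives a per-layer profile that can inform how conservative d_min and d_max should be.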

Agent Features

Memory
reduces KV weight/cache memory
Tool Use
SVD; weight-level compression

Optimization Features

Token Efficiency
no token eviction needed
Infra Optimization
lower GPU memory usage enables larger batches/longer context
Model Optimization
low-rank SVD on KV weight matrices; update query/output matrices to absorb left singular vectors
System Optimization
one-time SVD preprocessing (fast)
Training Optimization
no retraining required
Inference Optimization
reduced KV cache size per layer; supports MHA and GQA without model change

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Compressing early (shallow) layers can amplify errors and greatly reduce accuracy.

Experiments are limited to LLaMA variants with MHA/GQA and four tasks; other models/tasks untested.

When Not To Use

When you must aggressively compress the first few layers; LoRC recommends keeping shallow layers mostly intact.

If you need guarantees on worst-case outputs for safety-critical systems without further validation.

Failure Modes

Uniform compression across layers causes catastrophic drops (example: 68% drop on LLaMA-3-70B shallow-block compression).

Improper thresholding may skip compression where it’s safe or compress sensitive layers too much.

Core Entities

Models

LLaMA-2-13B, LLaMA-3-Instruct-8B, LLaMA-3-Instruct-70B

Metrics

KV cache size (GB), Compression ratio, Accuracy

Datasets

BoolQ, XSum, OpenBookQA, GSM8K

Benchmarks

ROUGE, Accuracy