Compress KV cache by low-rank SVD on KV weight matrices with a layerwise progressive rule

October 4, 2024 · 7 min

Overview

Production Readiness

0.8

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

1

Authors

Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, Yelong Shen

Why It Matters For Business

LoRC cuts KV cache memory by roughly 40–45% on the LLaMA models tested, with an average accuracy loss under 1%, which lowers GPU cost and allows larger batches or longer contexts on the same hardware.

Summary TLDR

LoRC compresses the KV cache by applying truncated SVD to the attention key/value weight matrices and choosing per-layer compressed dimensions with a progressive rule based on cumulative condition numbers. The method is plug-and-play (no retraining) and works with both MHA and GQA attention. On LLaMA variants it shrinks the KV cache to ~55–60% of its original size (a 40–45% memory reduction) while keeping the average task performance loss below 1% on the evaluated benchmarks.
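
To make the weight-level idea concrete, here is a minimal NumPy sketch of the key side only. It is not the paper's implementation. It assumes a single attention head, the row-vector convention `k = x @ w_k`, and no positional encoding applied between the projection and the dot product. The function name `compress_key_projection` and all shapes are hypothetical.

```python
import numpy as np

def compress_key_projection(w_q, w_k, rank):
    """Return (w_q_new, w_k_new) so that q_new @ k_new.T approximates q @ k.T.

    Assumption: q = x @ w_q and k = x @ w_k for a single head, with nothing
    (such as a rotary embedding) applied between projection and dot product.
    """
    u, s, vt = np.linalg.svd(w_k, full_matrices=False)
    u_r, s_r, vt_r = u[:, :rank], s[:rank], vt[:rank, :]
    w_k_new = u_r * s_r        # (d_model, rank): projects tokens to the cached low-dim keys
    w_q_new = w_q @ vt_r.T     # (d_model, rank): the query absorbs the remaining SVD factor
    return w_q_new, w_k_new

rng = np.random.default_rng(0)
d_model, d_head, rank = 64, 16, 8
w_q = rng.standard_normal((d_model, d_head))
w_k = rng.standard_normal((d_model, d_head))
x = rng.standard_normal((5, d_model))      # 5 token embeddings

w_q_new, w_k_new = compress_key_projection(w_q, w_k, rank)
cached_keys = x @ w_k_new                  # shape (5, rank) instead of (5, d_head)
scores_full = (x @ w_q) @ (x @ w_k).T
scores_low = (x @ w_q_new) @ cached_keys.T
rel_err = np.linalg.norm(scores_full - scores_low) / np.linalg.norm(scores_full)
print(cached_keys.shape, f"relative score error: {rel_err:.3f}")
```

Random Gaussian weights have little low-rank structure, so the error printed here is pessimistic; the point is only that the cached tensor shrinks from `d_head` to `rank` columns with no extra matrix multiply at inference time. Which SVD factor counts as "left" or "right" depends on whether the weights are stored transposed.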

Problem Statement

KV cache memory grows with sequence length and batch size and becomes a bottleneck for serving LLMs. Existing fixes either change attention during training or drop tokens at test time; both require model changes or task-specific tuning. We need a simple, post-hoc compression method that reduces KV cache memory without retraining and that avoids amplifying errors across layers.
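
The scaling described above can be seen with a back-of-the-envelope estimate. The sketch below uses the standard accounting for a dense decoder (one key tensor and one value tensor per layer); the function name and the configuration values are hypothetical and are not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV cache size; bytes_per_elem=2 assumes fp16/bf16 storage."""
    # Leading factor of 2: one tensor for keys and one for values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Hypothetical configuration, chosen only for illustration.
base = kv_cache_bytes(num_layers=40, num_kv_heads=40, head_dim=128, seq_len=4096, batch_size=8)
doubled = kv_cache_bytes(num_layers=40, num_kv_heads=40, head_dim=128, seq_len=8192, batch_size=8)
print(f"{base / 2**30:.1f} GiB -> {doubled / 2**30:.1f} GiB when the sequence length doubles")
```

The estimate is linear in both `seq_len` and `batch_size`, so doubling either one doubles the cache. Reducing `head_dim` per layer, as a low-rank projection does, shrinks it by the same proportion.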

Main Contribution

A post-hoc, weight-level KV cache compression method using low-rank (truncated SVD) approximation of key and value weight matrices.

A progressive layerwise compression strategy that sets per-layer compressed dimensions using cumulative condition numbers to limit error amplification from shallow layers.

Theoretical error bounds for single-layer approximation and error propagation through a deep network, guiding conservative compression in sensitive layers.

Empirical results on LLaMA variants (8B, 13B, 70B) across four tasks, showing the KV cache compressed to ~55–60% of its original size with minimal performance loss and fast SVD runtime.
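
The progressive layerwise strategy can be sketched as follows. This is an illustrative rule under stated assumptions, not the paper's exact formula: each layer's sensitivity is taken as the sum of log condition numbers of the key and value weights from that layer to the last one, and layers with a larger cumulative value keep more dimensions, interpolated linearly between `d_min` and `d_max`. The function name `progressive_dims` is hypothetical.

```python
import numpy as np

def progressive_dims(key_weights, value_weights, d_min, d_max):
    """Pick a retained dimension per layer from cumulative condition numbers.

    Assumption (illustrative only): larger cumulative log condition number
    means a more sensitive layer, which therefore keeps more dimensions.
    """
    log_kappa = np.array([
        np.log(np.linalg.cond(wk)) + np.log(np.linalg.cond(wv))
        for wk, wv in zip(key_weights, value_weights)
    ])
    # Entry l is the sum over layers l..L, i.e. the log of the cumulative product.
    cumulative = np.cumsum(log_kappa[::-1])[::-1]
    span = cumulative.max() - cumulative.min()
    frac = (cumulative - cumulative.min()) / span if span > 0 else np.ones_like(cumulative)
    return np.rint(d_min + frac * (d_max - d_min)).astype(int)

rng = np.random.default_rng(0)
num_layers, d_model, d_head = 6, 32, 16
key_weights = [rng.standard_normal((d_model, d_head)) for _ in range(num_layers)]
value_weights = [rng.standard_normal((d_model, d_head)) for _ in range(num_layers)]
dims = progressive_dims(key_weights, value_weights, d_min=8, d_max=16)
print(dims)   # retained dimension per layer, shallowest layer first
```

Because every condition number is at least 1, the cumulative value never increases with depth, so under this rule the shallowest layer keeps `d_max` dimensions and the deepest keeps `d_min`.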

Key Findings

LoRC shrinks the KV cache to about 55–60% of its original size (a 40–45% reduction) while keeping the average performance drop under 1% on evaluated tasks.

Numbers: compressed to 55%–60% of original size; avg perf drop <1%

Example per-model reduction: LLaMA-2-13B KV cache from 50 GB to 27.5 GB (55% of the original size) with a 0.47% average drop.

Numbers: 50 GB → 27.5 GB (55% of original); 0.47% drop

Compressing shallow layers naively causes large accuracy loss (up to 68% drop on LLaMA-3-70B when compressing early blocks).

Numbers: up to 68.0% accuracy drop

SVD for all layers in the largest tested model (LLaMA-3-70B, 80 layers) runs quickly: ~40 seconds.

Numbers: all-layer SVD in 40s for 70B

Results

KV cache size (LLaMA-2-13B)

Value: 50 GB → 27.5 GB

Baseline: 50 GB

KV cache size (LLaMA-3-Instruct-8B)

Value: 8 GB → 4.8 GB

Baseline: 8 GB

KV cache size (LLaMA-3-Instruct-70B)

Value: 20 GB → 11 GB

Baseline: 20 GB

Average performance drop (across 4 tasks)

Value: <1%

Baseline: full-cache model
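
A quick arithmetic check of what the listed cache sizes imply. The sketch only divides the numbers reported above; the variable names are illustrative.

```python
# (full cache, compressed cache) as listed in the results above
results = {
    "LLaMA-2-13B": (50.0, 27.5),
    "LLaMA-3-Instruct-8B": (8.0, 4.8),
    "LLaMA-3-Instruct-70B": (20.0, 11.0),
}

retained_by_model = {}
for model, (full, compressed) in results.items():
    retained = compressed / full           # fraction of the original cache still stored
    retained_by_model[model] = retained
    print(f"{model}: retains {retained:.0%} of the cache, saves {1 - retained:.0%}")
```

On these figures the compressed cache is 55–60% of the original size, which corresponds to a 40–45% memory saving.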

What To Try In 7 Days

Run per-layer SVD on your model weights (one-time) to measure singular value decay and per-layer low-rank structure.

Apply LoRC with conservative d_min/d_max and the paper's cumulative-condition threshold to preserve shallow layers.

Benchmark memory savings and task accuracy on 1–2 core workloads (e.g., summarization and QA) to tune thresholds.
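
For the first step, one simple way to measure singular value decay is to ask how many singular values are needed to capture a given share of the squared spectrum. The sketch below is a generic diagnostic, not a procedure from the paper; `rank_for_energy`, the 95% target, and the stand-in weight matrices are all hypothetical.

```python
import numpy as np

def rank_for_energy(weight, energy=0.95):
    """Smallest rank whose singular values hold `energy` of the squared spectrum."""
    s = np.linalg.svd(weight, compute_uv=False)       # singular values, descending
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    rank = int(np.searchsorted(cumulative, energy)) + 1
    return min(rank, len(s))                          # guard against float round-off

rng = np.random.default_rng(0)
# Hypothetical stand-ins for per-layer key projection weights.
layers = {
    "layer_0": rng.standard_normal((64, 16)),                                # no low-rank structure
    "layer_1": rng.standard_normal((64, 3)) @ rng.standard_normal((3, 16)),  # exactly rank 3
}
ranks = {name: rank_for_energy(w, energy=0.95) for name, w in layers.items()}
print(ranks)
```

Layers whose required rank is far below the head dimension are natural candidates for stronger compression. The energy target is a tuning knob and should be validated against task accuracy rather than trusted on its own.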

Agent Features

Memory

  • reduces KV weight/cache memory

Tool Use

  • SVD
  • weight-level compression

Optimization Features

Token Efficiency

  • no token eviction needed

Infra Optimization

  • lower GPU memory usage enables larger batches/longer context

Model Optimization

  • low-rank SVD on KV weight matrices
  • update query/output matrices to absorb left singular vectors
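
The value/output side of this absorption can be sketched in NumPy. This is a minimal illustration under assumptions, not the paper's code: a single head, the row-vector convention `v = x @ w_v`, and a hypothetical helper `compress_value_projection`. With this convention the factor folded into the output matrix is `vt`; under a transposed storage convention the same factor would be named differently.

```python
import numpy as np

def compress_value_projection(w_v, w_o, rank):
    """Return (w_v_new, w_o_new) so that attn @ (x @ w_v_new) @ w_o_new
    approximates attn @ (x @ w_v) @ w_o (single head, row-vector convention)."""
    u, s, vt = np.linalg.svd(w_v, full_matrices=False)
    w_v_new = u[:, :rank] * s[:rank]    # (d_model, rank): produces the cached low-dim values
    w_o_new = vt[:rank, :] @ w_o        # (rank, d_model): output matrix absorbs the other factor
    return w_v_new, w_o_new

rng = np.random.default_rng(0)
d_model, d_head, rank, n_tokens = 64, 16, 8, 5
w_v = rng.standard_normal((d_model, d_head))
w_o = rng.standard_normal((d_head, d_model))
x = rng.standard_normal((n_tokens, d_model))
logits = rng.standard_normal((n_tokens, n_tokens))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # row-wise softmax

w_v_new, w_o_new = compress_value_projection(w_v, w_o, rank)
cached_values = x @ w_v_new             # shape (n_tokens, rank)
out_full = attn @ (x @ w_v) @ w_o
out_low = attn @ cached_values @ w_o_new
print(cached_values.shape, np.linalg.norm(out_full - out_low) / np.linalg.norm(out_full))
```

Since the extra factor is merged into `w_o` once, offline, the per-token cost at inference does not grow; only the cached value width changes from `d_head` to `rank`.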

System Optimization

  • one-time SVD preprocessing (fast)

Training Optimization

  • no retraining required

Inference Optimization

  • reduced KV cache size per layer
  • supports MHA and GQA without model change

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Compressing early (shallow) layers can amplify errors and greatly reduce accuracy.
  • Experiments are limited to LLaMA variants with MHA/GQA and four tasks; other models/tasks untested.
  • Requires setting per-model thresholds (cumulative condition number), which the paper tuned per model.

When Not To Use

  • When you must aggressively compress the first few layers; LoRC recommends keeping shallow layers mostly intact.
  • If you need guarantees on worst-case outputs for safety-critical systems without further validation.
  • When the model architecture differs substantially from tested MHA/GQA implementations without verification.

Failure Modes

  • Uniform compression across layers causes catastrophic drops (example: a 68% drop on LLaMA-3-70B when shallow blocks are compressed).
  • Improper thresholding may skip compression where it’s safe or compress sensitive layers too much.
  • Edge cases where activation Lipschitz constants are large can increase error amplification beyond theoretical bounds.

Core Entities

Models

  • LLaMA-2-13B
  • LLaMA-3-Instruct-8B
  • LLaMA-3-Instruct-70B

Metrics

  • KV cache size (GB)
  • Compression ratio
  • Accuracy

Datasets

  • BoolQ
  • XSum
  • OpenBookQA
  • GSM8K

Benchmarks

  • ROUGE
  • Accuracy