Cut KV-cache memory 80–95% with a light fine-tune using low-rank channel shrinking

September 16, 20248 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.85

Citation Count

0

Authors

Luning Wang, Shiyao Li, Xuefei Ning, Zhihang Yuan, Shengen Yan, Guohao Dai, Yu Wang

Links

Abstract / PDF

Why It Matters For Business

CSKV cuts KV-cache memory by ~80% (95% with QAT), enabling much longer context per GPU and lower serving costs with only a short fine-tune.

Summary TLDR

CSKV compresses transformer KV caches by shrinking the channel dimension with low-rank factors and a bi-branch cache that keeps recent tokens full-precision. With SVD/ASVD initialization and short layer-wise fine-tuning, CSKV cuts KV memory by ~80% while preserving long-context performance on LongEval/LongBench/LVEval. Combined with 4-bit quantization-aware training (QAT) it reaches ~95% total KV reduction with minor accuracy loss. Training cost per 7B model is small (≈90 min on one A100-80G).

Problem Statement

KV cache memory grows linearly with sequence length and quickly becomes the memory bottleneck in long-context tasks (e.g., 200k tokens → ~100GB KV vs 14GB weights). Existing training-free compressions (pruning/quant) hit accuracy limits; retraining-heavy methods compress more but need large training budgets. We need a middle ground: large KV savings with small retraining.

Main Contribution

Bi-branch KV cache: keep most-recent tokens in full precision and store older tokens in low-rank (compressed) channel features.

SVD/ASVD initialization plus layer-wise reconstruction fine-tuning to recover performance with minimal training.

Demonstrated ≈80% KV memory reduction while keeping long-context ability on multiple benchmarks.

Shows compatibility with 4-bit quantization (QAT) to achieve up to ≈95% total KV compression.

Key Findings

CSKV reduces KV cache memory by about 80% while preserving long-context accuracy.

Numbers80% KV compression → Avg. accuracies ~0.90–0.94 on LongEval subsets (Table 1).

Combining CSKV with 4-bit quantization-aware training yields up to ~95% total KV compression with small loss.

Numbers80% + 4-bit QAT → 95% total compression and Avg.Acc ≈0.90 (Table 5).

SVD-based initialization (ASVD) and layer-wise reconstruction fine-tuning are critical; random init fails.

NumbersAt 80% compression, random init yields near-zero accuracy; ASVD init yields Avg.Acc ≈0.92 after fine-tuning (Appendix).

A small full-precision window (≈32 tokens) preserves most local info; larger windows give diminishing returns.

NumbersAccuracy rises quickly to 0.92 at window=32 and plateaus after ~128 (Table 3).

Results

KV memory reduction

Value≈80% reduction in KV cache size

Baseline0% compression (full-precision KV)

Total compression with quantization

Value≈95% total KV reduction (with 4-bit QAT)

Baseline0% compression

Training cost for 7B models

Value≈90 minutes on a single NVIDIA A100-80G

Baselineretraining entire model (much higher)

Who Should Care

What To Try In 7 Days

Run ASVD initialization and CSKV layer-wise fine-tune on a dev 7B model for 1 epoch (≈90 min on A100) and measure KV memory drop.

Set full-precision window ≈32 tokens and test accuracy vs memory to tune the window size.

Combine CSKV with 4-bit QAT on compressed caches if you need >90% KV shrinkage.

Optimization Features

Token Efficiency

  • Preserve recent tokens; compress historical tokens

Infra Optimization

  • Reduces GPU memory footprint for long-context serving

Model Optimization

  • KV Cache Optimization
  • Low-rank decomposition (channel shrinking)

System Optimization

  • Compatible with 4-bit QAT for further memory reduction

Training Optimization

  • SVD/ASVD initialization
  • Layer-wise reconstruction fine-tuning (MSE loss)

Inference Optimization

  • Bi-branch cache: small full-precision window + compressed history

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Compression ratios per layer are manually chosen; no automatic assignment implemented.
  • Evaluations focus on 7B models; behavior on much larger models is untested.
  • PTQ on compressed representations fails; QAT needed for safe quantization.

When Not To Use

  • When you cannot run any fine-tuning or calibration (ASVD + short training required).
  • When zero tolerance for any accuracy loss is required.
  • When operating on models or hardware configurations not validated by the paper (beyond tested 7B setups).

Failure Modes

  • Using random initialization for low-rank factors causes training divergence and model collapse.
  • Excessive compression of values (vs keys) harms accuracy more; key/value budget must be tuned.
  • Applying naive post-training quantization (PTQ) on compressed activations can break performance.

Core Entities

Models

  • LongChat-7B-v1.5-32k
  • Mistral-7B-Instruct-v0.2

Metrics

  • Accuracy
  • Memory compression ratio

Datasets

  • Pile (scaled subset)
  • LongEval
  • LongBench
  • LVEval

Benchmarks

  • LongEval
  • LongBench
  • LVEval

Context Entities

Models

  • LLaMA-2-7B (mentioned for KV size example)

Metrics

  • Accuracy

Datasets

  • The Pile (used to sample data for singular value analysis)

Benchmarks

  • MMLU (singular-value removal experiment mentioned)