Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.85
Citation Count
0
Why It Matters For Business
CSKV cuts KV-cache memory by ~80% (~95% when combined with 4-bit QAT), enabling much longer context per GPU and lower serving costs with only a short fine-tune.
Summary TLDR
CSKV compresses transformer KV caches by shrinking the channel dimension with low-rank factors and a bi-branch cache that keeps recent tokens full-precision. With SVD/ASVD initialization and short layer-wise fine-tuning, CSKV cuts KV memory by ~80% while preserving long-context performance on LongEval/LongBench/LVEval. Combined with 4-bit quantization-aware training (QAT) it reaches ~95% total KV reduction with minor accuracy loss. Training cost per 7B model is small (≈90 min on one A100-80G).
Problem Statement
KV cache memory grows linearly with sequence length and quickly becomes the memory bottleneck in long-context tasks (e.g., 200k tokens → ~100GB of KV cache vs 14GB of weights). Existing training-free compression methods (pruning, quantization) hit accuracy limits; retraining-heavy methods compress more but need large training budgets. The goal is a middle ground: large KV savings with little retraining.
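The ~100GB figure can be checked with a back-of-the-envelope calculation. The sketch below assumes LLaMA-2-7B's shape (32 layers, hidden size 4096, full multi-head attention) and 16-bit storage; the function name is illustrative and not taken from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers=32, hidden_size=4096, bytes_per_value=2):
    """Bytes needed to cache keys and values for `num_tokens` tokens.

    Defaults assume a LLaMA-2-7B-style model with full multi-head attention
    (no grouped-query sharing) and fp16 storage.
    """
    per_token = 2 * num_layers * hidden_size * bytes_per_value  # 2 = keys + values
    return num_tokens * per_token


kv_gb = kv_cache_bytes(200_000) / 1e9
weights_gb = 7e9 * 2 / 1e9  # 7B parameters at 2 bytes each
print(f"KV cache: {kv_gb:.1f} GB, weights: {weights_gb:.1f} GB")
# prints: KV cache: 104.9 GB, weights: 14.0 GB
```

Under these assumptions each token costs 512 KiB of cache, so the cache overtakes the weights after roughly 27k tokens.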
Main Contribution
Bi-branch KV cache: keep most-recent tokens in full precision and store older tokens in low-rank (compressed) channel features.
SVD/ASVD initialization plus layer-wise reconstruction fine-tuning to recover performance with minimal training.
Demonstrates ≈80% KV memory reduction while retaining long-context ability on multiple benchmarks (LongEval, LongBench, LVEval).
Shows compatibility with 4-bit quantization (QAT) to achieve up to ≈95% total KV compression.
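Channel shrinking can be pictured as splitting a projection matrix into two thin factors and caching only the narrow intermediate result. The NumPy sketch below uses a plain truncated SVD (not ASVD) on a random matrix; `svd_init`, the sizes, and the random weights are assumptions chosen only to make the shapes concrete, so the approximation error here says nothing about real models.

```python
import numpy as np


def svd_init(weight, rank):
    """Split `weight` (d_in x d_out) into A (d_in x rank) and B (rank x d_out)
    via truncated SVD, so that A @ B is the best rank-`rank` approximation."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    sqrt_s = np.sqrt(s[:rank])
    a = u[:, :rank] * sqrt_s          # scale the columns of U
    b = sqrt_s[:, None] * vt[:rank]   # scale the rows of V^T
    return a, b


rng = np.random.default_rng(0)
d_model, rank, num_tokens = 64, 16, 10

w_k = rng.standard_normal((d_model, d_model))   # stand-in key projection
a, b = svd_init(w_k, rank)

x = rng.standard_normal((num_tokens, d_model))  # stand-in hidden states
compressed = x @ a            # what the cache would store: (num_tokens, rank)
keys_approx = compressed @ b  # rebuilt on demand: (num_tokens, d_model)

print(compressed.shape, keys_approx.shape)  # (10, 16) (10, 64)
print("channels kept:", rank / d_model)     # channels kept: 0.25
```

Storing `compressed` instead of the full keys is what shrinks the channel dimension; the second factor `b` is applied only when the keys are needed for attention.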
Key Findings
CSKV reduces KV cache memory by about 80% while preserving long-context accuracy.
Combining CSKV with 4-bit quantization-aware training yields up to ~95% total KV compression with small loss.
SVD-based initialization (ASVD) and layer-wise reconstruction fine-tuning are critical; random initialization fails.
A small full-precision window (≈32 tokens) preserves most local info; larger windows give diminishing returns.
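The ~80% and ~95% figures are consistent with simple arithmetic if roughly 20% of the channels are kept and the bit width drops from 16 to 4. The sketch below is an illustrative calculation under those assumptions; it ignores the full-precision window and any quantization metadata such as scales and zero points, and the function name is hypothetical.

```python
def kv_reduction(channel_keep_ratio, bits, baseline_bits=16):
    """Fraction of the baseline KV cache removed when only
    `channel_keep_ratio` of the channels are kept at `bits` bits per value."""
    remaining = channel_keep_ratio * bits / baseline_bits
    return 1.0 - remaining


print(f"{kv_reduction(0.2, 16):.0%}")  # channel shrinking only: 80%
print(f"{kv_reduction(0.2, 4):.0%}")   # channel shrinking + 4-bit values: 95%
```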
Results
KV memory reduction: ≈80%
Total compression with quantization: ≈95% (with 4-bit QAT)
Training cost for 7B models: ≈90 min on one A100-80G
Who Should Care
What To Try In 7 Days
Run ASVD initialization and CSKV layer-wise fine-tune on a dev 7B model for 1 epoch (≈90 min on A100) and measure KV memory drop.
Set full-precision window ≈32 tokens and test accuracy vs memory to tune the window size.
Combine CSKV with 4-bit QAT on compressed caches if you need >90% KV shrinkage.
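When tuning the window size, it helps to estimate how much of the saving the full-precision window gives back at a given sequence length. The sketch below is a rough estimate that assumes each token is counted in exactly one branch and that all values share one bit width; `effective_reduction` and the example sizes are hypothetical.

```python
def effective_reduction(seq_len, window, keep_ratio):
    """KV reduction when the newest `window` tokens stay uncompressed and
    every older token keeps only `keep_ratio` of its channels."""
    recent = min(window, seq_len)
    older = seq_len - recent
    remaining = (recent + older * keep_ratio) / seq_len
    return 1.0 - remaining


for seq_len in (1_000, 32_000):
    for window in (0, 32, 256):
        r = effective_reduction(seq_len, window, keep_ratio=0.2)
        print(f"seq_len={seq_len:>6} window={window:>3} reduction={r:.1%}")
```

For long sequences a window of a few dozen tokens costs almost nothing in memory, so the window size is mainly an accuracy knob.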
Optimization Features
Token Efficiency
- Preserve recent tokens; compress historical tokens
Infra Optimization
- Reduces GPU memory footprint for long-context serving
Model Optimization
- KV Cache Optimization
- Low-rank decomposition (channel shrinking)
System Optimization
- Compatible with 4-bit QAT for further memory reduction
Training Optimization
- SVD/ASVD initialization
- Layer-wise reconstruction fine-tuning (MSE loss)
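A minimal sketch of the layer-wise idea: initialize the two low-rank factors from an SVD, then minimize the MSE between the original layer output and the low-rank output on calibration activations. Everything here is an assumption for illustration: plain SVD stands in for ASVD, full-batch gradient descent in NumPy stands in for a real training loop, and the activations and weights are synthetic.

```python
import numpy as np


def svd_init(weight, rank):
    """Truncated-SVD factors A (d_in x rank) and B (rank x d_out)."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    sqrt_s = np.sqrt(s[:rank])
    return u[:, :rank] * sqrt_s, sqrt_s[:, None] * vt[:rank]


def reconstruction_loss(x, w, a, b):
    """MSE between the original output x @ w and the low-rank x @ a @ b."""
    return float(np.mean((x @ a @ b - x @ w) ** 2))


def finetune_layer(x, w, a, b, lr=0.05, steps=200):
    """Gradient descent on the reconstruction MSE for a single layer."""
    a, b = a.copy(), b.copy()
    target = x @ w
    for _ in range(steps):
        hidden = x @ a                      # compressed features
        err = hidden @ b - target
        scale = 2.0 / err.size              # derivative of the mean of squares
        grad_a = scale * (x.T @ err @ b.T)
        grad_b = scale * (hidden.T @ err)
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


rng = np.random.default_rng(0)
n, d, rank = 256, 32, 8
channel_scale = np.linspace(0.5, 3.0, d)         # uneven activation magnitudes
x = rng.standard_normal((n, d)) * channel_scale  # synthetic calibration activations
w = rng.standard_normal((d, d)) / np.sqrt(d)     # stand-in projection weight

a0, b0 = svd_init(w, rank)
a1, b1 = finetune_layer(x, w, a0, b0)
loss_before = reconstruction_loss(x, w, a0, b0)
loss_after = reconstruction_loss(x, w, a1, b1)
print(f"MSE before: {loss_before:.4f}, after: {loss_after:.4f}")
```

Because each layer is fitted against its own original output, the layers can be trained independently and no end-to-end loss is needed in this sketch.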
Inference Optimization
- Bi-branch cache: small full-precision window + compressed history
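The bi-branch cache can be sketched as a sliding window of exact keys plus a growing list of compressed features for everything older. The class below is a toy single-head, keys-only illustration; the class name, the eviction policy, and the choice to compute the compressed feature when a token arrives are assumptions, not the paper's implementation.

```python
from collections import deque

import numpy as np


class BiBranchKeyCache:
    """Newest `window` tokens keep full keys; older tokens keep only
    rank-r compressed features that are expanded with `b` on read."""

    def __init__(self, w_k, a, b, window):
        self.w_k, self.a, self.b = w_k, a, b
        self.window = window
        self.recent = deque()  # (full_key, compressed_feature) pairs
        self.history = []      # compressed features of older tokens

    def append(self, x):
        """Add the hidden state `x` (shape (d,)) of a new token."""
        self.recent.append((x @ self.w_k, x @ self.a))
        if len(self.recent) > self.window:
            _, old_compressed = self.recent.popleft()  # the full key is dropped
            self.history.append(old_compressed)

    def keys(self):
        """All keys, oldest first: reconstructed history, then exact recent keys."""
        parts = []
        if self.history:
            parts.append(np.stack(self.history) @ self.b)
        if self.recent:
            parts.append(np.stack([k for k, _ in self.recent]))
        if not parts:
            return np.empty((0, self.w_k.shape[1]))
        return np.concatenate(parts, axis=0)


rng = np.random.default_rng(0)
d, r, window, n = 16, 4, 3, 8
w_k = rng.standard_normal((d, d))              # stand-in key projection
u, s, vt = np.linalg.svd(w_k, full_matrices=False)
a, b = u[:, :r] * s[:r], vt[:r]                # truncated-SVD factors

cache = BiBranchKeyCache(w_k, a, b, window)
xs = rng.standard_normal((n, d))
for x in xs:
    cache.append(x)

exact = xs @ w_k
approx = cache.keys()
print(len(cache.history), len(cache.recent))           # 5 3
print(np.allclose(approx[-window:], exact[-window:]))  # True: recent keys are exact
```

A values cache would follow the same pattern with its own factors, which is where the key/value budget split comes in.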
Reproducibility
Code URLs
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Compression ratios per layer are manually chosen; no automatic assignment implemented.
- Evaluations focus on 7B models; behavior on much larger models is untested.
- Post-training quantization (PTQ) of the compressed representations fails; QAT is needed to quantize them safely.
When Not To Use
- When you cannot run any fine-tuning or calibration (ASVD + short training required).
- When no accuracy loss can be tolerated.
- When operating on models or hardware configurations not validated by the paper (beyond tested 7B setups).
Failure Modes
- Using random initialization for low-rank factors causes training divergence and model collapse.
- Compressing values too aggressively hurts accuracy more than compressing keys, so the split of the compression budget between keys and values must be tuned.
- Applying naive post-training quantization (PTQ) on compressed activations can break performance.
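For context on the PTQ failure mode: quantization-aware training generally inserts a quantize-then-dequantize step into the forward pass so the model can adapt to the rounding error during training. The sketch below shows only that forward step for 4-bit values; the per-row asymmetric min/max scheme and the function name are assumptions, not necessarily the quantizer used in the paper.

```python
import numpy as np


def fake_quantize(x, bits=4):
    """Quantize each row of `x` to `bits`-bit codes and map back to floats.

    Uses asymmetric uniform quantization with a per-row min/max range.
    Returns the dequantized array and the integer codes.
    """
    levels = 2 ** bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels  # guard against constant rows
    codes = np.clip(np.round((x - lo) / scale), 0, levels)
    return codes * scale + lo, codes.astype(np.uint8)


rng = np.random.default_rng(0)
features = rng.standard_normal((4, 16))  # stand-in compressed KV features
dequantized, codes = fake_quantize(features, bits=4)
print("max abs error:", np.abs(dequantized - features).max())
print("code range:", codes.min(), "to", codes.max())
```

Applying this step once after training is the PTQ setting; running it inside the fine-tuning loop is the QAT setting.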
Core Entities
Models
- LongChat-7B-v1.5-32k
- Mistral-7B-Instruct-v0.2
Metrics
- Accuracy
- Memory compression ratio
Datasets
- Pile (scaled subset)
- LongEval
- LongBench
- LVEval
Benchmarks
- LongEval
- LongBench
- LVEval
Context Entities
Models
- LLaMA-2-7B (mentioned for KV size example)
Metrics
- Accuracy
Datasets
- The Pile (used to sample data for singular value analysis)
Benchmarks
- MMLU (singular-value removal experiment mentioned)

