Overview
CSKV is evaluated on two 7B models and three long-context benchmarks (LongEval, LongBench, LVEval). Code is released and fine-tuning is cheap (≈90 min per 7B model on one A100-80G), but results are limited to 7B models and selected datasets.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 85%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
CSKV cuts KV-cache memory by ~80% (~95% when combined with 4-bit QAT), which allows much longer contexts per GPU and lower serving costs after only a short fine-tune.
Summary TLDR
CSKV compresses transformer KV caches by shrinking the channel dimension with low-rank factors and a bi-branch cache that keeps recent tokens full-precision. With SVD/ASVD initialization and short layer-wise fine-tuning, CSKV cuts KV memory by ~80% while preserving long-context performance on LongEval/LongBench/LVEval. Combined with 4-bit quantization-aware training (QAT) it reaches ~95% total KV reduction with minor accuracy loss. Training cost per 7B model is small (≈90 min on one A100-80G).
Problem Statement
KV-cache memory grows linearly with sequence length and quickly becomes the memory bottleneck in long-context tasks (e.g., at 200k tokens the KV cache needs ~100GB, versus 14GB for the model weights). Existing training-free compression methods (pruning, quantization) hit accuracy limits; retraining-heavy methods compress more but need large training budgets. The gap is a middle ground: large KV savings with only light retraining.
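The 200k-token figure can be checked with a short back-of-the-envelope calculation. This is a sketch under assumptions the summary does not state: a LLaMA-2-7B-like shape (32 layers, hidden size 4096, standard multi-head attention) and 16-bit storage. The function name `kv_cache_bytes` is illustrative.

```python
def kv_cache_bytes(seq_len, n_layers=32, hidden_size=4096, bytes_per_value=2):
    """KV-cache size in bytes for one sequence.

    Assumed shape (not stated in the summary): 32 layers, hidden size 4096,
    fp16 values, no grouped-query attention.
    """
    # Factor 2 covers keys and values; each layer stores one
    # hidden_size-wide vector of each kind per token.
    return 2 * n_layers * hidden_size * bytes_per_value * seq_len


per_token = kv_cache_bytes(1)
total = kv_cache_bytes(200_000)
weights = 7_000_000_000 * 2  # 7B parameters at 2 bytes each

print(f"{per_token / 2**20:.2f} MiB per token")   # 0.50 MiB per token
print(f"{total / 1e9:.1f} GB at 200k tokens")     # 104.9 GB at 200k tokens
print(f"{weights / 1e9:.0f} GB of weights")       # 14 GB of weights
```

Under these assumptions the cache is roughly 7x larger than the weights at 200k tokens, which matches the order of magnitude quoted above.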
Main Contribution
Bi-branch KV cache: keep most-recent tokens in full precision and store older tokens in low-rank (compressed) channel features.
SVD/ASVD initialization plus layer-wise reconstruction fine-tuning to recover performance with minimal training.
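The two contributions can be illustrated together with a minimal NumPy sketch. It assumes plain truncated SVD (not the activation-aware ASVD variant), omits the layer-wise fine-tuning step, and handles keys only. The names `svd_init` and `BiBranchKeyCache` are hypothetical and the storage layout is a simplification, not the released implementation.

```python
from collections import deque

import numpy as np


def svd_init(W, rank):
    """Factor W (d_in x d_out) into A (d_in x rank) @ B (rank x d_out)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(S[:rank])
    A = U[:, :rank] * root           # down-projection, applied before caching
    B = root[:, None] * Vt[:rank]    # up-projection, applied when reading
    return A, B


class BiBranchKeyCache:
    """Recent tokens keep full keys; older tokens keep rank-sized features."""

    def __init__(self, W, rank, window):
        self.W = W
        self.A, self.B = svd_init(W, rank)
        self.window = window
        self.old = []          # compressed features of tokens outside the window
        self.recent = deque()  # (compressed, full key) for tokens inside it

    def append(self, x):
        # Compute both branches once, so eviction never needs the input again.
        self.recent.append((x @ self.A, x @ self.W))
        if len(self.recent) > self.window:
            compressed, _ = self.recent.popleft()  # full key is dropped here
            self.old.append(compressed)

    def keys(self):
        """Return all keys in token order (assumes a non-empty cache)."""
        parts = []
        if self.old:
            parts.append(np.stack(self.old) @ self.B)  # restore older keys
        if self.recent:
            parts.append(np.stack([k for _, k in self.recent]))
        return np.concatenate(parts, axis=0)
```

With `rank` below the true rank of `W`, the restored older keys are only approximate; that gap is what the reconstruction fine-tuning described above is meant to close.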
Key Findings
CSKV reduces KV cache memory by about 80% while preserving long-context accuracy.
Combining CSKV with 4-bit quantization-aware training yields up to ~95% total KV compression with small loss.
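The step from ~80% to ~95% follows from multiplying the two savings. The sketch below assumes a 16-bit baseline (the summary only says "full-precision") and ignores the small full-precision window and any quantization metadata; `remaining_fraction` is an illustrative name.

```python
def remaining_fraction(channel_keep, bits, baseline_bits=16):
    """Fraction of the original KV memory left after both compressions.

    channel_keep: share of channels kept by the low-rank step (0.2 = 80% cut).
    bits: storage width after quantization; baseline_bits is assumed to be 16.
    """
    return channel_keep * bits / baseline_bits


low_rank_only = remaining_fraction(0.2, 16)
with_4bit = remaining_fraction(0.2, 4)

print(f"low-rank only: {1 - low_rank_only:.0%} reduction")    # 80% reduction
print(f"low-rank + 4-bit: {1 - with_4bit:.0%} reduction")     # 95% reduction
```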
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| KV memory reduction | ≈80% reduction in KV cache size | 0% compression (full-precision KV) | −80% KV memory | General (measured across models in paper) | Abstract, Table 1: achieves 80% compression while keeping long-context ability | Abstract / Table 1 |
| Total compression with quantization | ≈95% total KV reduction (with 4-bit QAT) | 0% compression | −95% KV memory | LongEval / LongBench / LVEval (evaluated jointly) | Table 5 shows 80% + 4-bit QAT → 95% and Avg.Acc ≈0.90 | Table 5 |
What To Try In 7 Days
Run ASVD initialization and CSKV layer-wise fine-tuning on a dev 7B model for 1 epoch (≈90 min on an A100) and measure the drop in KV memory.
Start with a full-precision window of ≈32 tokens, then sweep the window size and compare accuracy against memory.
Combine CSKV with 4-bit QAT on compressed caches if you need >90% KV shrinkage.
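Before running the window sweep, the memory side of the trade-off can be estimated analytically. This sketch assumes the window tokens are stored at full width and all older tokens at a fixed channel fraction; `bibranch_fraction` and the example sequence length are illustrative, not values from the paper.

```python
def bibranch_fraction(seq_len, window, channel_keep):
    """Fraction of full KV memory used by a bi-branch cache.

    The most recent `window` tokens count at full width; older tokens count
    at `channel_keep` of full width (0.2 corresponds to 80% compression).
    """
    recent = min(window, seq_len)
    older = seq_len - recent
    return (recent + older * channel_keep) / seq_len


# Sweep a few window sizes at a hypothetical 32k-token context.
for window in (0, 32, 256, 2048):
    frac = bibranch_fraction(32_000, window, channel_keep=0.2)
    print(f"window={window:>5}: {frac:.4f} of full KV memory")
```

At long contexts a 32-token window adds well under one percentage point of memory, so the window size can be tuned mainly for accuracy.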
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Risks & Boundaries
Limitations
Compression ratios per layer are manually chosen; no automatic assignment implemented.
Evaluations focus on 7B models; behavior on much larger models is untested.
When Not To Use
When you cannot run any fine-tuning or calibration (ASVD + short training required).
When no accuracy loss at all can be tolerated.
Failure Modes
Using random initialization for low-rank factors causes training divergence and model collapse.
Compressing values too aggressively harms accuracy more than compressing keys; the key/value budget split must be tuned.
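One simple way to tune the key/value split is to enumerate candidate splits at a fixed overall budget and evaluate each. The sketch below only generates the candidates. It assumes keys and values have the same uncompressed size, so the overall keep ratio is the mean of the two; `kv_budget_splits` is a hypothetical helper and the grid step is arbitrary.

```python
def kv_budget_splits(avg_keep_pct=20, step_pct=5, min_keep_pct=5):
    """List (key_keep, value_keep) percentages whose mean is avg_keep_pct.

    Integer percentages keep the grid exact; assumes keys and values have
    equal uncompressed size, so the mean equals the overall keep ratio.
    """
    total = 2 * avg_keep_pct
    splits = []
    for key_keep in range(min_keep_pct, total - min_keep_pct + 1, step_pct):
        splits.append((key_keep, total - key_keep))
    return splits


for key_keep, value_keep in kv_budget_splits():
    print(f"keys keep {key_keep}% / values keep {value_keep}%")
```

Given the failure mode above, splits that keep more of the value channels than the key channels are the natural ones to evaluate first.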

