Overview
Production Readiness
0.8
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
1
Why It Matters For Business
LoRC roughly halves KV cache memory on the tested LLaMA models while keeping average task performance loss below 1%, lowering GPU cost and enabling larger batches or longer contexts on the same hardware.
Summary TLDR
LoRC compresses the KV cache by applying truncated SVD to attention key/value weight matrices and choosing per-layer compressed dimensions with a progressive rule based on cumulative condition numbers. The method is plug-and-play (no retraining), works with MHA and GQA attention, and achieves ~55–60% KV memory reduction on LLaMA variants while keeping average task performance loss below 1% on evaluated benchmarks.
Problem Statement
KV cache memory grows with sequence length and batch size and becomes a bottleneck for serving LLMs. Existing fixes either change attention during training or drop tokens at test time; both require model changes or task-specific tuning. We need a simple, post-hoc compression method that reduces KV cache memory without retraining and that avoids amplifying errors across layers.
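To make the scaling concrete, the minimal sketch below estimates KV cache size from model shape. The function name and the example configuration (40 layers, 40 KV heads of dimension 128, fp16) are illustrative assumptions, not figures taken from the paper.

```python
def kv_cache_bytes(num_layers, batch_size, seq_len, num_kv_heads, head_dim,
                   bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # The factor 2 covers one key vector and one value vector per token.
    per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_elem
    return num_layers * batch_size * seq_len * per_token_per_layer


# Hypothetical configuration, chosen only for illustration.
size = kv_cache_bytes(num_layers=40, batch_size=8, seq_len=4096,
                      num_kv_heads=40, head_dim=128)
print(f"{size / 2**30:.1f} GiB")
```

Because the size is linear in both batch_size and seq_len, doubling either doubles the cache. Shrinking the cached per-head dimension reduces the cache by the same proportion.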
Main Contribution
A post-hoc, weight-level KV cache compression method using low-rank (truncated SVD) approximation of key and value weight matrices.
A progressive layerwise compression strategy that sets per-layer compressed dimensions using cumulative condition numbers to limit error amplification from shallow layers.
Theoretical error bounds for single-layer approximation and error propagation through a deep network, guiding conservative compression in sensitive layers.
Empirical results on LLaMA variants (8B, 13B, 70B) across four tasks showing ~55–60% KV memory reduction with minimal performance loss and fast SVD runtime.
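The progressive strategy can be illustrated with a small sketch. It assumes that a layer's cumulative condition number is the product of its own condition number and those of all deeper layers, and that the retained dimension is interpolated linearly (in log space) between d_min and d_max. Both choices, and the function names, are assumptions made for illustration; the paper's exact rule may differ.

```python
import math


def cumulative_log_condition(cond_numbers):
    """Suffix sums of log condition numbers (layer l plus all deeper layers)."""
    out, running = [], 0.0
    for kappa in reversed(cond_numbers):
        running += math.log(kappa)
        out.append(running)
    return out[::-1]


def allocate_dims(cond_numbers, d_min, d_max):
    """Give layers with larger cumulative condition numbers more dimensions."""
    cum = cumulative_log_condition(cond_numbers)
    lo, hi = min(cum), max(cum)
    span = hi - lo
    dims = []
    for c in cum:
        # If all layers tie, fall back to the conservative choice d_max.
        frac = (c - lo) / span if span > 0 else 1.0
        dims.append(round(d_min + frac * (d_max - d_min)))
    return dims


# Hypothetical per-layer condition numbers, shallowest layer first.
kappas = [4.0, 3.0, 2.0, 2.0]
dims = allocate_dims(kappas, d_min=64, d_max=128)
print(dims)
```

In this toy run the shallowest layer keeps d_max and the deepest layer is cut to d_min, which matches the intent of compressing sensitive early layers conservatively.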
Key Findings
LoRC reduces KV cache size by about 55–60% while keeping average performance drop under 1% on evaluated tasks.
Example per-model result: the LLaMA-2-13B KV cache shrinks from 50 GB to 27.5 GB (55% of its original size) with a 0.47% average performance drop.
Compressing shallow layers naively causes large accuracy loss (up to 68% drop on LLaMA-3-70B when compressing early blocks).
SVD for all layers in the largest tested model (LLaMA-3-70B, 80 layers) runs quickly: ~40 seconds.
Results
KV cache size (LLaMA-2-13B)
KV cache size (LLaMA-3-Instruct-8B)
KV cache size (LLaMA-3-Instruct-70B)
Average performance drop (across 4 tasks)
Who Should Care
What To Try In 7 Days
Run per-layer SVD on your model weights (one-time) to measure singular value decay and per-layer low-rank structure.
Apply LoRC with conservative d_min/d_max and the paper's cumulative-condition threshold to preserve shallow layers.
Benchmark memory savings and task accuracy on 1–2 core workloads (e.g., summarization and QA) to tune thresholds.
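The first step above can be sketched with NumPy as follows. The helper name rank_for_energy and the 99% energy threshold are assumptions for illustration, and the synthetic matrix stands in for a real key or value projection weight; loading actual weights depends on your framework and is not shown.

```python
import numpy as np


def rank_for_energy(weight, energy=0.95):
    """Smallest rank whose singular values hold `energy` of the squared spectrum."""
    s = np.linalg.svd(weight, compute_uv=False)  # sorted in descending order
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    rank = int(np.searchsorted(cumulative, energy)) + 1
    return min(rank, len(s))  # guard against floating-point round-off


# Synthetic stand-in: a rank-8 matrix plus small noise.
rng = np.random.default_rng(0)
low_rank = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 64))
noisy = low_rank + 1e-3 * rng.standard_normal((64, 64))
print(rank_for_energy(noisy, energy=0.99))
```

Running this per layer on the key and value projection weights gives a rough picture of which layers have fast singular value decay and are therefore candidates for stronger compression.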
Agent Features
Memory
- reduces KV weight/cache memory
Tool Use
- SVD
- weight-level compression
Optimization Features
Token Efficiency
- no token eviction needed
Infra Optimization
- lower GPU memory usage enables larger batches/longer context
Model Optimization
- low-rank SVD on KV weight matrices
- update query/output matrices to absorb left singular vectors
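A minimal single-head sketch of the key side of this idea is shown below. It assumes weights are stored as (d_head, d_model) and applied as x @ w.T, and it ignores softmax scaling and positional encodings. The function name is hypothetical, and this is not the paper's implementation; the value/output side would follow the same pattern.

```python
import numpy as np


def compress_key_projection(w_q, w_k, rank):
    """Return (w_q_new, w_k_new), each of shape (rank, d_model).

    w_k is approximated by its truncated SVD, U_r diag(s_r) Vt_r.
    """
    u, s, vt = np.linalg.svd(w_k, full_matrices=False)
    u_r, s_r, vt_r = u[:, :rank], s[:rank], vt[:rank, :]
    w_k_new = s_r[:, None] * vt_r  # produces the compressed keys to cache
    w_q_new = u_r.T @ w_q          # query absorbs the left singular vectors
    return w_q_new, w_k_new


rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 32, 16, 5
x = rng.standard_normal((n_tokens, d_model))
w_q = rng.standard_normal((d_head, d_model))
w_k = rng.standard_normal((d_head, d_model))

scores_ref = (x @ w_q.T) @ (x @ w_k.T).T

# At full rank the attention scores are reproduced exactly (up to round-off).
w_q_full, w_k_full = compress_key_projection(w_q, w_k, rank=d_head)
scores_full = (x @ w_q_full.T) @ (x @ w_k_full.T).T
print(np.allclose(scores_ref, scores_full))

# At a lower rank, the cached key per token shrinks from d_head to rank.
w_q_small, w_k_small = compress_key_projection(w_q, w_k, rank=8)
print((x @ w_k_small.T).shape)
```

The cache then stores x @ w_k_new.T instead of the full keys, so per-token key storage drops from d_head to rank entries while the query-side cost stays a single matrix multiply.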
System Optimization
- one-time SVD preprocessing (fast)
Training Optimization
- no retraining required
Inference Optimization
- reduced KV cache size per layer
- supports MHA and GQA without model change
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Compressing early (shallow) layers can amplify errors and greatly reduce accuracy.
- Experiments are limited to LLaMA variants with MHA/GQA and four tasks; other models/tasks untested.
- Requires a per-model threshold on the cumulative condition number, which the paper tuned separately for each model.
When Not To Use
- When you must aggressively compress the first few layers; LoRC recommends keeping shallow layers mostly intact.
- If you need guarantees on worst-case outputs for safety-critical systems without further validation.
- When the model architecture differs substantially from tested MHA/GQA implementations without verification.
Failure Modes
- Uniform compression across layers causes catastrophic drops (example: 68% drop on LLaMA-3-70B shallow-block compression).
- Improper thresholding may skip compression where it’s safe or compress sensitive layers too much.
- Edge cases where activation Lipschitz constants are large can increase error amplification beyond theoretical bounds.
Core Entities
Models
- LLaMA-2-13B
- LLaMA-3-Instruct-8B
- LLaMA-3-Instruct-70B
Metrics
- KV cache size (GB)
- Compression ratio
- Accuracy
Datasets
- BoolQ
- XSum
- OpenBookQA
- GSM8K
Benchmarks
- ROUGE
- Accuracy

