Overview
The method is simple and plug-and-play, and is demonstrated on three LLaMA variants and four tasks. Theoretical bounds and fast SVD support deployment, but evaluations are limited to those models and tasks.
Citations: 1
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 80%
Production readiness: 80%
Novelty: 60%
Why It Matters For Business
LoRC cuts KV cache memory by roughly 40–45% on the evaluated LLaMA variants with under 1% average accuracy loss, lowering GPU cost and enabling larger batches or longer contexts on the same hardware.
Who Should Care
Summary TLDR
LoRC compresses the KV cache by applying truncated SVD to the attention key/value weight matrices and choosing per-layer compressed dimensions with a progressive rule based on cumulative condition numbers. The method is plug-and-play (no retraining), works with MHA and GQA attention, and shrinks the KV cache to about 55–60% of its original size (a 40–45% memory reduction) on LLaMA variants while keeping average task performance loss below 1% on evaluated benchmarks.
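As a rough illustration of the weight-level idea, the NumPy sketch below factors a key projection with truncated SVD, caches the low-rank projection of the hidden states, and folds the remaining factor into the query side. The matrix sizes, the synthetic spectrum, and the helper name `truncated_svd_factors` are assumptions made for the example, and positional encodings and multi-head structure are ignored; it is not the paper's implementation.

```python
import numpy as np

def truncated_svd_factors(weight, rank):
    """Return (a, b) such that a @ b is the best rank-`rank` approximation of `weight`."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # (d_model, rank): maps hidden states into the compressed space
    b = vt[:rank, :]            # (rank, d_k): maps compressed keys back to the full key space
    return a, b

rng = np.random.default_rng(0)
d_model, d_k, rank, seq_len = 64, 64, 16, 10

# Synthetic key weight with fast-decaying singular values (an assumption for illustration).
q_left, _ = np.linalg.qr(rng.standard_normal((d_model, d_model)))
q_right, _ = np.linalg.qr(rng.standard_normal((d_k, d_k)))
spectrum = 0.5 ** np.arange(d_k)
w_k = (q_left * spectrum) @ q_right.T

a, b = truncated_svd_factors(w_k, rank)
x = rng.standard_normal((seq_len, d_model))  # hidden states for seq_len tokens
query = rng.standard_normal((1, d_k))        # one query vector

full_keys = x @ w_k        # (seq_len, d_k): what an uncompressed cache stores
compressed_keys = x @ a    # (seq_len, rank): what the compressed cache stores

scores_full = query @ full_keys.T
# The factor b is applied to the query, so full keys are never rebuilt.
scores_lowrank = (query @ b.T) @ compressed_keys.T

rel_error = np.linalg.norm(scores_full - scores_lowrank) / np.linalg.norm(scores_full)
cache_ratio = compressed_keys.size / full_keys.size
print(f"cache ratio: {cache_ratio:.2f}, relative score error: {rel_error:.2e}")
```

Real attention weights usually decay far more slowly than this synthetic spectrum, which is why the achievable rank has to be measured per layer rather than assumed.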
Problem Statement
KV cache memory grows with sequence length and batch size and becomes a bottleneck for serving LLMs. Existing fixes either change attention during training or drop tokens at test time; both require model changes or task-specific tuning. We need a simple, post-hoc compression method that reduces KV cache memory without retraining and that avoids amplifying errors across layers.
Main Contribution
A post-hoc, weight-level KV cache compression method using low-rank (truncated SVD) approximation of key and value weight matrices.
A progressive layerwise compression strategy that sets per-layer compressed dimensions using cumulative condition numbers to limit error amplification from shallow layers.
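The progressive strategy can be sketched in a few lines. The version below is a hypothetical rule, not the paper's exact formula: it assumes the cumulative condition number of a layer is the product of the condition numbers from that layer to the last one, and it interpolates linearly on a log scale between assumed bounds `d_min` and `d_max`, so that layers whose errors pass through more downstream layers keep more dimensions.

```python
import math

def cumulative_condition_numbers(layer_condition_numbers):
    """For each layer, multiply the condition numbers from that layer to the last (assumed definition)."""
    cumulative = []
    running = 1.0
    for kappa in reversed(layer_condition_numbers):
        running *= kappa
        cumulative.append(running)
    return cumulative[::-1]

def progressive_dims(layer_condition_numbers, d_min, d_max):
    """Assign each layer a compressed dimension in [d_min, d_max] (hypothetical interpolation rule)."""
    logs = [math.log(c) for c in cumulative_condition_numbers(layer_condition_numbers)]
    lo, hi = min(logs), max(logs)
    dims = []
    for value in logs:
        # t is 1 for the most error-sensitive layer and 0 for the least sensitive one.
        t = 0.0 if hi == lo else (value - lo) / (hi - lo)
        dims.append(int(round(d_min + t * (d_max - d_min))))
    return dims

# Made-up per-layer condition numbers, listed from the shallowest to the deepest layer.
example_dims = progressive_dims([10.0, 5.0, 2.0, 4.0], d_min=64, d_max=128)
print(example_dims)
```

With this rule the shallowest layer keeps `d_max` dimensions and the deepest layer is compressed to `d_min`, matching the guidance to leave shallow layers mostly intact.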
Key Findings
LoRC compresses the KV cache to about 55–60% of its original size (a 40–45% memory reduction) while keeping the average performance drop under 1% on evaluated tasks.
Example per-model result: LLaMA-2-13B KV cache shrinks from 50G to 27.5G (55% of the original size) with a 0.47% average drop.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| KV cache size (LLaMA-2-13B) | 50G → 27.5G | 50G | −45% | batch size 64, seq len 2048 (Table 2) | Table 2 reports reduction to 27.5G at 55% compression ratio | Table 2 |
| KV cache size (LLaMA-3-Instruct-8B) | 8G → 4.8G | 8G | −40% | batch size 64, seq len 2048 (Table 2) | Table 2 reports reduction to 4.8G at 60% compression ratio | Table 2 |
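The relationship between the "compression ratio" and the Delta column can be checked with a few lines of arithmetic. This sketch reads the compression ratio as the retained fraction of the cache, which is what the table's own numbers imply; the helper name `kv_cache_change` is made up for the example.

```python
def kv_cache_change(before_gb, after_gb):
    """Return (retained fraction, reduction fraction) for a cache size change."""
    retained = after_gb / before_gb
    return retained, 1.0 - retained

# Sizes taken from the two table rows above.
rows = {
    "LLaMA-2-13B": (50.0, 27.5),
    "LLaMA-3-Instruct-8B": (8.0, 4.8),
}

for model, (before, after) in rows.items():
    retained, reduction = kv_cache_change(before, after)
    print(f"{model}: retained {retained:.0%}, reduction {reduction:.0%}")
# LLaMA-2-13B: retained 55%, reduction 45%
# LLaMA-3-Instruct-8B: retained 60%, reduction 40%
```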
What To Try In 7 Days
Run per-layer SVD on your model weights (one-time) to measure singular value decay and per-layer low-rank structure.
Apply LoRC with conservative d_min/d_max and the paper's cumulative-condition threshold to preserve shallow layers.
Benchmark memory savings and task accuracy on 1–2 core workloads (e.g., summarization and QA) to tune thresholds.
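The first step above, measuring singular value decay per layer, might look like the following NumPy sketch. The 95% energy target, the helper names, and the `layer_i.k_proj` labels are assumptions for illustration, and the random matrices are stand-ins for weights loaded from a real checkpoint.

```python
import numpy as np

def rank_for_energy(weight, energy=0.95):
    """Smallest rank whose singular values hold `energy` of the total squared spectral mass."""
    s = np.linalg.svd(weight, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    rank = int(np.searchsorted(cumulative, energy)) + 1
    return min(rank, len(s))

def condition_number(weight):
    """Ratio of the largest to the smallest singular value."""
    s = np.linalg.svd(weight, compute_uv=False)
    return float(s[0] / s[-1])

rng = np.random.default_rng(0)
# Stand-in weights: approximately rank-32 matrices plus small noise.
layers = {
    f"layer_{i}.k_proj": rng.standard_normal((128, 32)) @ rng.standard_normal((32, 128))
    + 0.01 * rng.standard_normal((128, 128))
    for i in range(3)
}

for name, weight in layers.items():
    print(name, "rank@95%:", rank_for_energy(weight), "cond:", round(condition_number(weight), 1))
```

Layers whose 95% rank is far below the full dimension are natural candidates for stronger compression, while the condition numbers feed the per-layer threshold choice.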
Agent Features
Memory
Tool Use
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Risks & Boundaries
Limitations
Compressing early (shallow) layers can amplify errors and greatly reduce accuracy.
Experiments are limited to LLaMA variants with MHA/GQA and four tasks; other models/tasks untested.
When Not To Use
When you must aggressively compress the first few layers; LoRC recommends keeping shallow layers mostly intact.
If you need guarantees on worst-case outputs for safety-critical systems without further validation.
Failure Modes
Uniform compression across layers causes catastrophic drops (for example, a 68% drop when the shallow blocks of LLaMA-3-70B are compressed).
Improper thresholding may skip compression where it’s safe or compress sensitive layers too much.

