Overview
Clear system design and measurements show large latency and memory wins on long-context workloads; the results depend on a Triton/INT8 implementation and on specific GPU/network mixes.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 70%
Why It Matters For Business
If you serve LLMs with separate prefill and decode GPUs to cut costs, HACK can reduce latency and network costs by computing directly on the compressed KV cache. It is most useful for long-context services where KV transfer dominates latency.
Who Should Care
Summary TLDR
HACK is a system-level method that quantizes the Key-Value (KV) cache (asymmetric 2-bit quantization, applied per partition) and performs matrix multiplications directly on the quantized data (homomorphic quantization). This avoids costly per-iteration KV dequantization and uses GPU INT8 paths to accelerate both prefill and decode. On long-context workloads and common cloud GPU mixes, HACK cuts end-to-end job completion time (JCT) by up to 70.9% vs a disaggregated baseline and by up to 52.3% vs prior KV-quantization methods, while keeping accuracy loss around 0.5–2.7% depending on partitioning.
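As a rough illustration of what asymmetric 2-bit quantization per partition means, the sketch below maps each partition of values to integer codes 0..3 plus a scale and a zero point. The function names and the min/max calibration rule are assumptions made for this example; they are not taken from HACK's kernels.

```python
def quantize_partition(values, bits=2):
    """Asymmetric (affine) quantization of one partition: x ~ scale * code + zero."""
    levels = (1 << bits) - 1                      # 3 for 2-bit codes (0..3)
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Round to the nearest level and clamp to the valid code range.
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_partition(codes, scale, zero):
    """Reconstruct approximate values from codes, scale and zero point."""
    return [scale * c + zero for c in codes]


def quantize_vector(values, partition_size=64, bits=2):
    """Split a vector into fixed-size partitions and quantize each one independently."""
    return [
        quantize_partition(values[i:i + partition_size], bits)
        for i in range(0, len(values), partition_size)
    ]


vals = [0.0, 0.1, 0.5, 0.9, 1.5, 3.0]
codes, scale, zero = quantize_partition(vals)
print(codes, scale, zero)
print(dequantize_partition(codes, scale, zero))
```

Each partition carries its own scale and zero point, which is why smaller partitions track the data more closely at the cost of more metadata.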
Problem Statement
Disaggregated LLM serving splits prompt prefill (compute-heavy) from token decode (memory-heavy). Sending large KV caches across the network, repeated per-iteration dequantization, and memory access delays create major slowdowns for long contexts. Existing quantization reduces transfer size but introduces heavy dequantization costs and does not cut compute time.
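To get a feel for why KV transfer can dominate at long context, the following back-of-the-envelope sketch estimates the FP16 KV cache size of one sequence and the time to ship it between nodes. The defaults (80 layers, 8 KV heads, head dimension 128) are taken from the public Llama-3.1 70B configuration; the 32,000-token prompt and the 10 Gbit/s link are hypothetical values chosen only for illustration.

```python
def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes of KV cache for one sequence; the factor 2 covers both K and V."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


def transfer_seconds(num_bytes, gbit_per_s):
    """Ideal transfer time over a link of the given bandwidth (no protocol overhead)."""
    return num_bytes * 8 / (gbit_per_s * 1e9)


size = kv_cache_bytes(tokens=32_000)            # hypothetical prompt length
print(f"FP16 KV cache: {size / 2**30:.2f} GiB")
print(f"Transfer at 10 Gbit/s: {transfer_seconds(size, 10):.2f} s")
```

Real transfers will be slower than this ideal figure, so the estimate is best read as a lower bound on the network cost.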
Main Contribution
A homomorphic quantization method that multiplies the quantized matrices directly and then recovers an approximation of the full-precision output, avoiding per-iteration KV dequantization.
Two runtime optimizations: store partition sums to avoid repeated summation (summation elimination) and keep the trailing partial V-block in FP16 to avoid repeated requantization.
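The idea behind multiplying on quantized data and reusing stored partition sums can be sketched for a single partition. Assuming the standard affine form x ≈ scale · code + zero on both operands (the paper's exact formulation may differ), the dot product expands into one integer-only product plus correction terms that depend only on the sums of the codes. All names below are hypothetical.

```python
def dequant(codes, scale, zero):
    """Reference path: reconstruct approximate values from one quantized partition."""
    return [scale * c + zero for c in codes]


def homomorphic_dot(qa, qb, sum_b=None):
    """Dot product of two quantized partitions without dequantizing either one.

    qa and qb are (codes, scale, zero) tuples. sum_b is an optional cached sum
    of qb's codes, so it does not have to be recomputed on every call.
    """
    (ca, sa, za), (cb, sb, zb) = qa, qb
    n = len(ca)
    int_dot = sum(x * y for x, y in zip(ca, cb))   # integer-only multiply-accumulate
    sum_a = sum(ca)
    if sum_b is None:
        sum_b = sum(cb)
    # Expansion of sum((sa*ca + za) * (sb*cb + zb)) over the partition.
    return sa * sb * int_dot + sa * zb * sum_a + za * sb * sum_b + n * za * zb


q_part = ([3, 0, 2, 1], 0.5, -1.0)     # made-up quantized query partition
k_part = ([1, 1, 0, 3], 0.25, 0.5)     # made-up quantized key partition
k_sum = sum(k_part[0])                 # computed once when K is quantized, then reused

approx = homomorphic_dot(q_part, k_part, sum_b=k_sum)
reference = sum(a * b for a, b in zip(dequant(*q_part), dequant(*k_part)))
print(approx, reference)
```

The two printed values agree because the expansion is algebraically exact for the quantized operands; the only approximation is the quantization itself. In a real kernel the integer product is the part that would run on the GPU's INT8 path.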
Key Findings
KV transmission can be a major part of latency in disaggregated setups.
Existing KV quantization adds costly dequantization work during decode.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Max JCT reduction vs disaggregated baseline | 70.9% | disaggregated LLM inference baseline | — | Llama-3.1 70B, Cocktail, V100 prefill (reported max) | §7.2, Fig.12 reports up to 70.9% reduction vs baseline | §7.2, Fig.12 |
| Max JCT reduction vs KVQuant/CacheGen | 52.3% | KVQuant (or CacheGen) | — | Llama-3.1 70B, Cocktail, A100/A10G mixes (reported max) | Abstract; §7.2: up to 52.3% vs state-of-the-art KV quantization | Abstract; §7.2 |
What To Try In 7 Days
Benchmark your current disaggregated pipeline and measure KV transfer as a percentage of JCT.
Run vLLM+FlashAttention2 with HACK on a representative long-context workload and compare JCT vs current quantization.
Tune partition size (Π=64 default) to trade a small accuracy drop for throughput gains.
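The first two steps above come down to two simple ratios: the share of JCT spent in each stage, and the JCT reduction of one run relative to another. A minimal helper might look like the following; the stage names and timings are placeholders, not measurements from the paper.

```python
def stage_shares(stage_seconds):
    """Return each stage's fraction of total job completion time (JCT)."""
    total = sum(stage_seconds.values())
    return {name: secs / total for name, secs in stage_seconds.items()}


def jct_reduction(baseline_jct, candidate_jct):
    """Fractional JCT reduction of a candidate run relative to a baseline run."""
    return 1.0 - candidate_jct / baseline_jct


# Placeholder timings in seconds; replace with your own measurements.
timings = {"prefill": 2.0, "kv_transfer": 5.0, "decode": 3.0}
shares = stage_shares(timings)
print(f"KV transfer share of JCT: {shares['kv_transfer']:.0%}")
print(f"JCT reduction: {jct_reduction(10.0, 6.0):.0%}")
```

If the KV transfer share is small on your workload, the expected benefit from compressing the KV cache is correspondingly limited.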
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
2-bit quantization forces small partition sizes for accuracy, which can increase JCT; Π=64 is the default trade-off.
The current implementation uses Triton and converts 2-bit values to INT8 before multiplying, which adds runtime overhead; direct INT4 support would help.
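One part of the partition-size trade-off can be made concrete by counting storage per element. The sketch below assumes each partition stores one FP16 scale and one FP16 zero point (32 bits of metadata) next to its 2-bit codes; this layout is an assumption for illustration, and any additional per-partition data, such as stored partition sums, would add to it.

```python
def effective_bits(partition_size, code_bits=2, meta_bits=32):
    """Average bits stored per element, including per-partition metadata."""
    return code_bits + meta_bits / partition_size


for size in (32, 64, 128):
    bits = effective_bits(size)
    print(f"partition={size:>3}  bits/elem={bits:.2f}  compression vs FP16={16 / bits:.2f}x")
```

Under these assumptions, halving the partition size from 64 to 32 raises the footprint from 2.5 to 3 bits per element, which is one reason smaller partitions cost more even though they improve accuracy.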
When Not To Use
Very short-context workloads where sequence length ≤ ~30 tokens (benefits shrink).
When strict zero-loss accuracy is required (HACK causes ~0.5–2.7% accuracy loss vs full FP16).
Failure Modes
Accumulation of quantization error during long decodes if partitioning or requantization is mishandled.
Reduced effectiveness on GPUs without INT8 accelerated paths (e.g., older tensor cores).

