Overview
Production Readiness
0.7
Novelty Score
0.7
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
If you serve LLMs on separate prefill and decode GPUs to cut costs, HACK can reduce latency and network costs by computing directly on the compressed KV cache. It is most useful for long-context services where KV transfer dominates latency.
Summary TLDR
HACK is a system-level method that quantizes the Key-Value (KV) cache (asymmetric 2-bit quantization, applied per partition) and performs matrix multiplications directly on the quantized data (homomorphic quantization). This avoids costly per-iteration KV dequantization and uses GPU INT8 paths to accelerate both prefill and decode. On long-context workloads and common cloud GPU mixes, HACK cuts end-to-end job completion time (JCT) by up to 70.9% vs a disaggregated baseline and by up to 52.3% vs prior KV-quantization methods, while keeping accuracy loss at around 0.5–2.7%, depending on partitioning.
Problem Statement
Disaggregated LLM serving splits prompt prefill (compute-heavy) from token decode (memory-heavy). Sending large KV cache across the network, repeated per-iteration dequantization, and memory access delays create major slowdowns for long contexts. Existing quantization reduces transfer size but reintroduces heavy dequantization costs and does not cut compute time.
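To make the transfer cost concrete, the following back-of-the-envelope sketch estimates the KV cache size for one sequence and the idealized time to move it between prefill and decode GPUs. The model shape (80 layers, 8 KV heads, head dimension 128, resembling a Llama-3.1-70B-style configuration), the 32,768-token context, and the 25 Gbit/s link are assumptions chosen for illustration, not figures from the source. Quantization metadata (scales, zero points, partition sums) is ignored.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value=16):
    """Size of the K and V tensors for one sequence, ignoring quantization metadata."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len  # 2 = K and V
    return values * bits_per_value // 8


def transfer_seconds(num_bytes, gbit_per_s):
    """Idealized wire time at a given link speed (no protocol overhead)."""
    return num_bytes * 8 / (gbit_per_s * 1e9)


# Assumed shape and link speed; replace with your own deployment's numbers.
fp16 = kv_cache_bytes(80, 8, 128, seq_len=32_768)
int2 = kv_cache_bytes(80, 8, 128, seq_len=32_768, bits_per_value=2)
print(f"FP16 KV:  {fp16 / 2**30:.2f} GiB, {transfer_seconds(fp16, 25):.2f} s at 25 Gbit/s")
print(f"2-bit KV: {int2 / 2**30:.2f} GiB, {transfer_seconds(int2, 25):.2f} s at 25 Gbit/s")
```

Under these assumptions the raw payload shrinks by 8x when values go from 16 bits to 2 bits, which is why KV quantization matters most when the prefill side has limited network bandwidth.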
Main Contribution
Homomorphic quantization method that multiplies on quantized matrices and then approximates the true output, avoiding per-iteration KV dequantization.
Two runtime optimizations: store partition sums to avoid repeated summation (summation elimination) and keep the trailing partial V-block in FP16 to avoid repeated requantization.
Integration into FlashAttention-2 and vLLM with Triton kernels; extensive trace-driven experiments across models, datasets, and GPU instance types; open-sourced code.
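The idea behind multiplying on quantized data can be illustrated with the standard identity for affine (asymmetric) quantization. If each partition stores integer codes with a scale and zero point, so that a_i ~= s_a * qa_i + z_a and b_i ~= s_b * qb_i + z_b, then the dot product over a partition of n elements expands into one integer dot product plus three cheap correction terms that only need the per-partition code sums. The sketch below is a minimal pure-Python illustration of that identity, not the authors' Triton kernel; the function names, the dictionary layout, and the 8-bit width for the first operand are assumptions made for the example.

```python
import random


def quantize(xs, bits, rng):
    """Asymmetric stochastic quantization of one partition: x ~= scale * code + zero."""
    zero, hi = min(xs), max(xs)
    levels = (1 << bits) - 1
    scale = (hi - zero) / levels if hi > zero else 1.0
    codes = []
    for x in xs:
        t = (x - zero) / scale
        low = int(t)  # t >= 0, so int() floors
        # Round up with probability equal to the fractional part (unbiased).
        code = low + (rng.random() < t - low)
        codes.append(min(code, levels))
    # The code sum is computed once here and reused by every later multiplication.
    return {"codes": codes, "scale": scale, "zero": zero, "sum": sum(codes)}


def homomorphic_dot(qa, qb):
    """Approximate sum(a_i * b_i) from integer codes plus per-partition scalars."""
    n = len(qa["codes"])
    int_dot = sum(x * y for x, y in zip(qa["codes"], qb["codes"]))  # integer-only work
    return (qa["scale"] * qb["scale"] * int_dot
            + qa["scale"] * qb["zero"] * qa["sum"]
            + qb["scale"] * qa["zero"] * qb["sum"]
            + n * qa["zero"] * qb["zero"])


def partitioned_dot(a, b, partition=64, bits_a=8, bits_b=2, seed=0):
    """Quantize both vectors per partition and accumulate the homomorphic dot products."""
    rng = random.Random(seed)
    total = 0.0
    for start in range(0, len(a), partition):
        qa = quantize(a[start:start + partition], bits_a, rng)
        qb = quantize(b[start:start + partition], bits_b, rng)
        total += homomorphic_dot(qa, qb)
    return total


data = random.Random(0)
a = [data.gauss(0, 1) for _ in range(256)]
b = [data.gauss(0, 1) for _ in range(256)]
exact = sum(x * y for x, y in zip(a, b))
print(f"exact={exact:.3f} approx={partitioned_dot(a, b):.3f}")
```

The result equals what dequantize-then-multiply would produce, but the only per-element work is the integer dot product, which is the part that maps onto INT8 matmul hardware. Storing `sum` alongside the codes corresponds to the summation-elimination idea: the cached operand's sum never has to be recomputed during decode.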
Key Findings
KV transmission can be a major part of latency in disaggregated setups.
Existing KV quantization adds costly dequantization work during decode.
HACK replaces dequantization with a lightweight approximation and runs multiplications on quantized data.
End-to-end speedups are large, especially for long contexts and low-bandwidth prefill GPUs.
Memory footprint is lower than the baseline, although HACK stores small extra structures (partition sums and an FP16 last-V buffer) for speed.
Results
Max JCT reduction vs disaggregated baseline
up to 70.9%
Max JCT reduction vs KVQuant/CacheGen
up to 52.3%
Dequantization vs HACK approximation overhead
KV compression (approx.)
Accuracy
about 0.5–2.7% loss vs full FP16, depending on partitioning
Who Should Care
What To Try In 7 Days
Benchmark current disaggregated pipeline and measure KV transfer percent of JCT.
Run vLLM+FlashAttention2 with HACK on a representative long-context workload and compare JCT vs current quantization.
Tune partition size (Π=64 default) to trade a small accuracy drop for throughput gains.
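Before tuning the partition size on a real model, it can help to see the trade-off on synthetic data. The sketch below quantizes Gaussian values to 2 bits with per-partition asymmetric stochastic rounding and reports the round-trip error next to the effective bits per value. The Gaussian data, the candidate sizes, and the assumption that each partition stores an FP16 scale and an FP16 zero point (32 metadata bits) are illustrative choices, not measurements from the source.

```python
import random


def roundtrip_error(values, partition, bits=2, seed=0):
    """Mean absolute error after per-partition asymmetric stochastic quantization."""
    rng = random.Random(seed)
    levels = (1 << bits) - 1
    err = 0.0
    for start in range(0, len(values), partition):
        block = values[start:start + partition]
        zero, hi = min(block), max(block)
        scale = (hi - zero) / levels if hi > zero else 1.0
        for x in block:
            t = (x - zero) / scale
            # Stochastic rounding: round up with probability equal to the fraction.
            code = min(int(t) + (rng.random() < t - int(t)), levels)
            err += abs(x - (scale * code + zero))
    return err / len(values)


def effective_bits(partition, bits=2, metadata_bits=32):
    """Bits per value including per-partition scale and zero point (assumed FP16 each)."""
    return bits + metadata_bits / partition


data_rng = random.Random(42)
values = [data_rng.gauss(0.0, 1.0) for _ in range(16_384)]
for partition in (32, 64, 128, 256):
    err = roundtrip_error(values, partition)
    print(f"partition={partition:4d} error={err:.4f} bits/value={effective_bits(partition)}")
```

Smaller partitions narrow each block's value range and so reduce error, but they add metadata and more per-partition correction work. On a real deployment the same sweep would be run against task accuracy and JCT rather than synthetic error.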
Optimization Features
Token Efficiency
- stores quantized KV to reduce transfer bandwidth
Infra Optimization
- better utilization of cheaper prefill GPUs by reducing network load
Model Optimization
- compute on quantized matrices (homomorphic quantization)
- asymmetric stochastic 2-bit quantization per partition
System Optimization
- summation elimination: cache partition sums
- requantization elimination: keep last V-block in FP16
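One way to picture requantization elimination: V is partitioned along the token dimension, so every newly decoded token lands in the last, still-incomplete partition. If that partition were quantized immediately, its minimum and maximum could change with each token and force a requantization. The toy cache below keeps the incomplete tail unquantized and seals a block exactly once, when it is full. It is a pure-Python sketch under stated assumptions: the class and method names are hypothetical, plain Python floats stand in for FP16, nearest rounding replaces stochastic rounding for brevity, and the attention probabilities stay in floating point.

```python
class VCache:
    """Toy V cache for one head: full blocks are quantized once, the tail is not."""

    def __init__(self, partition=64, bits=2):
        self.partition = partition
        self.levels = (1 << bits) - 1
        self.blocks = []  # sealed blocks: (codes, scales, zeros)
        self.tail = []    # fewer than `partition` rows, kept unquantized
        self.quantize_calls = 0

    def append(self, row):
        """Add one token's V vector; seal the tail once it holds a full partition."""
        self.tail.append(list(row))
        if len(self.tail) == self.partition:
            self.blocks.append(self._quantize(self.tail))
            self.tail = []

    def _quantize(self, rows):
        """Per-channel asymmetric quantization over the tokens of one block."""
        self.quantize_calls += 1
        scales, zeros = [], []
        for col in zip(*rows):
            lo, hi = min(col), max(col)
            zeros.append(lo)
            scales.append((hi - lo) / self.levels if hi > lo else 1.0)
        codes = [[round((x - z) / s) for x, s, z in zip(row, scales, zeros)]
                 for row in rows]
        return codes, scales, zeros

    def weighted_sum(self, probs):
        """Compute sum_t probs[t] * V[t] without dequantizing sealed blocks."""
        dim = len(self.tail[0]) if self.tail else len(self.blocks[0][1])
        out, t = [0.0] * dim, 0
        for codes, scales, zeros in self.blocks:
            p = probs[t:t + self.partition]
            p_sum = sum(p)  # shared by every channel of this block
            for d in range(dim):
                acc = sum(w * row[d] for w, row in zip(p, codes))
                out[d] += scales[d] * acc + zeros[d] * p_sum
            t += self.partition
        for w, row in zip(probs[t:], self.tail):  # unquantized tail: plain arithmetic
            for d in range(dim):
                out[d] += w * row[d]
        return out


cache = VCache(partition=4)
for step in range(10):
    cache.append([float(step), 1.0])
print(cache.quantize_calls, len(cache.tail), cache.weighted_sum([0.1] * 10))
```

With ten tokens and a partition of four, only two blocks are ever quantized and two tokens wait in the tail. The price is the small unquantized buffer, which matches the extra-memory item listed under Limitations.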
Inference Optimization
- avoid per-iteration KV dequantization
- use GPU INT8 paths for quantized matmul
- kernel fusion (QKV gen + quant + attn)
Reproducibility
Data Urls
- https://arxiv.org (arXiv dataset reference)
- https://github.com/Cocktail-benchmark (Cocktail reference)
- IMDb dataset (public)
- HumanEval (public)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- 2-bit quantization forces small partition sizes for accuracy, which can increase JCT; Π=64 is the default trade-off.
- Current implementation uses Triton and converts 2-bit -> INT8, adding runtime overhead; INT4 direct support would help.
- Some GPUs (e.g., V100) lack INT8 matmul acceleration, reducing HACK's prefill speedups.
- Small extra memory needed: ~2.2%–2.7% for partition sums and 0.24%–0.51% for FP16 last-V buffer.
When Not To Use
- Very short-context workloads where sequence length ≤ ~30 tokens (benefits shrink).
- When strict zero-loss accuracy is required (HACK causes ~0.5–2.7% accuracy loss vs full FP16).
- On hardware that cannot accelerate INT8 matmul efficiently (little to no speedup).
Failure Modes
- Accumulation of quantization error during long decodes if partitioning or requantization is mishandled.
- Reduced effectiveness on GPUs without INT8 accelerated paths (e.g., older tensor cores).
- A poorly tuned partition size can sacrifice an unacceptable amount of accuracy for throughput.
Core Entities
Models
- Mistral-v0.3-7B
- Phi-3-14B
- Yi-34B
- Llama-3.1-70B
- Falcon-180B
Metrics
- Job Completion Time (JCT)
- Peak GPU memory usage
- KV transmission time ratio
- Dequantization / approximation time ratio
- Accuracy
Datasets
- Cocktail (IR, long-context)
- arXiv (long documents)
- HumanEval (code)
- IMDb (classification)
Benchmarks
- ROUGE-1
- Edit Similarity (normalized Levenshtein)

