Compute directly on 2-bit KV cache to cut network, memory and compute time for disaggregated LLMs

February 5, 2025 · 8 min

Overview

Decision Snapshot: Ready For Pilot

Clear system design and measurements show large latency and memory wins on long-context workloads; results rely on a Triton/INT8 implementation and specific GPU/network mixes.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Zeyu Zhang, Haiying Shen, Shay Vargaftik, Ran Ben Basat, Michael Mitzenmacher, Minlan Yu

Why It Matters For Business

If you serve LLMs with separate prefill and decode GPUs (to cut costs), HACK can cut latency and network costs by computing directly on the compressed KV cache. It is most useful for long-context services where KV transfer dominates latency.

Summary TLDR

HACK is a system-level method that quantizes the Key-Value (KV) cache with asymmetric 2-bit quantization per partition and performs matrix multiplications directly on the quantized data (homomorphic quantization). This avoids costly per-iteration KV dequantization and uses GPU INT8 paths to accelerate both prefill and decode. On long-context workloads and common cloud GPU mixes, HACK cuts end-to-end job completion time (JCT) by up to 70.9% vs a disaggregated baseline and up to 52.3% vs prior KV-quantization methods, while keeping accuracy loss around 0.5–2.7% depending on partitioning.
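To make "asymmetric 2-bit quantization per partition" concrete, here is a minimal NumPy sketch. It is a generic textbook scheme, not the paper's kernel: the function names, the 1-D layout, and the optional stochastic rounding switch are assumptions made for illustration.

```python
import numpy as np

def quantize_2bit(x, partition=64, rng=None):
    """Asymmetric 2-bit quantization with one (scale, zero) pair per partition.

    Rounds to nearest by default; pass a NumPy Generator as `rng` for
    stochastic rounding. Hypothetical helper, not the paper's code.
    """
    x = np.asarray(x, dtype=np.float32)
    assert x.size % partition == 0, "pad the input to a multiple of the partition size"
    blocks = x.reshape(-1, partition)
    zero = blocks.min(axis=1, keepdims=True)            # per-partition offset
    span = blocks.max(axis=1, keepdims=True) - zero
    scale = np.where(span > 0, span / 3.0, 1.0)         # 2 bits give levels 0..3
    t = (blocks - zero) / scale                         # values mapped into [0, 3]
    if rng is None:
        q = np.rint(t)
    else:
        low = np.floor(t)
        q = low + (rng.random(t.shape) < (t - low))     # round up with prob = fractional part
    q = np.clip(q, 0, 3).astype(np.uint8)
    return q, scale.astype(np.float32), zero.astype(np.float32)

def dequantize(q, scale, zero):
    """Rebuild an approximation of the original values from codes and metadata."""
    return (q.astype(np.float32) * scale + zero).reshape(-1)

x = np.random.default_rng(0).normal(size=256).astype(np.float32)
q, scale, zero = quantize_2bit(x, partition=64)
print("max abs error:", float(np.abs(dequantize(q, scale, zero) - x).max()))
```

With round-to-nearest, the per-value error is bounded by half the partition's scale, which is why smaller partitions (tighter value ranges) tend to preserve accuracy better.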

Problem Statement

Disaggregated LLM serving splits prompt prefill (compute-heavy) from token decode (memory-heavy). Sending large KV cache across the network, repeated per-iteration dequantization, and memory access delays create major slowdowns for long contexts. Existing quantization reduces transfer size but reintroduces heavy dequantization costs and does not cut compute time.

Main Contribution

A homomorphic quantization method that performs matrix multiplication directly on quantized matrices and then approximates the full-precision output, avoiding per-iteration KV dequantization.

Two runtime optimizations: store partition sums to avoid repeated summation (summation elimination) and keep the trailing partial V-block in FP16 to avoid repeated requantization.
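The idea behind the first contribution can be sketched with a standard identity for asymmetric quantization. If a ≈ sa·qa + za and b ≈ sb·qb + zb over a partition of n values, the dot product expands into one integer product plus correction terms that need only the code sums. The helper names below are hypothetical, and quantizing both operands to the same number of levels is an assumption; the source does not spell out the exact formulation.

```python
import numpy as np

def quantize(x, levels=4):
    """Asymmetric quantization of one partition to integer codes 0..levels-1."""
    zero = float(x.min())
    span = float(x.max()) - zero
    scale = span / (levels - 1) if span > 0 else 1.0
    q = np.clip(np.rint((x - zero) / scale), 0, levels - 1).astype(np.int32)
    return q, scale, zero

def homomorphic_dot(qa, sa, za, qb, sb, zb):
    """Approximate a.b from integer codes without rebuilding a or b in floating point."""
    n = qa.size
    int_dot = int(qa @ qb)                       # integer-only product (INT8-friendly)
    sum_a, sum_b = int(qa.sum()), int(qb.sum())  # code sums; sum_b can be cached per partition
    return sa * sb * int_dot + sa * zb * sum_a + sb * za * sum_b + n * za * zb

rng = np.random.default_rng(0)
a, b = rng.normal(size=64), rng.normal(size=64)
qa, sa, za = quantize(a)
qb, sb, zb = quantize(b)

via_codes = homomorphic_dot(qa, sa, za, qb, sb, zb)
via_dequant = float((sa * qa + za) @ (sb * qb + zb))  # dequantize first, then multiply
print(via_codes, via_dequant, float(a @ b))
```

The first two printed values agree up to floating-point rounding, so nothing is lost relative to dequantize-then-multiply; the gap to the third value is the quantization error itself. The `sum_b` term depends only on cached codes, which is what makes storing partition sums worthwhile.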

Key Findings

KV transmission can be a major part of latency in disaggregated setups.

Numbers: KV transmission up to 42.2% of JCT (measured)

Practical Use: If you run prefill and decode on separate (cheaper) GPUs, compressing KV transfers is necessary to avoid large network-driven delays.

Evidence Ref: §2, Observation 1

Existing KV quantization adds costly dequantization work during decode.

Numbers: Dequantization overhead up to 37.9% of JCT (CacheGen/KVQuant)

Practical Use: Simple quantize-then-dequantize pipelines can trade network savings for heavy CPU/GPU work at each decode step; avoid designs that require full per-iteration dequantize.

Evidence Ref: §2.2, Fig.2-4
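The KV-transfer finding above can be sanity-checked with back-of-envelope arithmetic. The sketch below estimates KV cache size and transfer time for one sequence. The model shape (80 layers, 8 KV heads, head dimension 128, as commonly cited for Llama-3.1-70B), the 32,000-token prompt, and the 25 Gbit/s link are all assumptions for illustration, and the 2-bit case ignores per-partition metadata.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bits_per_value=16.0):
    """Bytes needed to hold keys and values for one sequence."""
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len  # 2 = keys + values
    return values * bits_per_value / 8

def transfer_seconds(n_bytes, gbit_per_s):
    """Idealized wire time, ignoring protocol overhead and contention."""
    return n_bytes * 8 / (gbit_per_s * 1e9)

# Assumed shape and workload; substitute your own model config and link speed.
shape = dict(n_layers=80, n_kv_heads=8, head_dim=128)
seq_len, link_gbit = 32_000, 25

for label, bits in (("FP16", 16), ("2-bit codes", 2)):
    size = kv_cache_bytes(seq_len, bits_per_value=bits, **shape)
    print(f"{label:12s} {size / 1e9:6.2f} GB  {transfer_seconds(size, link_gbit):5.2f} s")
```

Comparing the estimated transfer time against measured JCT gives a quick read on whether KV transmission is worth compressing for a given deployment.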

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Max JCT reduction vs disaggregated baseline | 70.9% | disaggregated LLM inference baseline | | Llama-3.1 70B, Cocktail, V100 prefill (reported max) | §7.2, Fig.12 reports up to 70.9% reduction vs baseline | §7.2, Fig.12 |
| Max JCT reduction vs KVQuant/CacheGen | 52.3% | KVQuant (or CacheGen) | | Llama-3.1 70B, Cocktail, A100/A10G mixes (reported max) | Abstract; §7.2: up to 52.3% vs state-of-the-art KV quantization | Abstract; §7.2 |

What To Try In 7 Days

Benchmark current disaggregated pipeline and measure KV transfer percent of JCT.

Run vLLM+FlashAttention2 with HACK on a representative long-context workload and compare JCT vs current quantization.

Tune partition size (Π=64 default) to trade a small accuracy drop for throughput gains.
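When tuning the partition size, it helps to see how per-partition metadata eats into the compression ratio. The sketch below assumes each partition stores an FP16 scale and an FP16 zero point (32 bits of metadata); that layout is an assumption, and any cached partition sums would add further overhead.

```python
def effective_bits(partition, code_bits=2, meta_bits=32):
    """Average stored bits per value: codes plus per-partition metadata.

    meta_bits=32 assumes one FP16 scale and one FP16 zero point per partition.
    """
    return code_bits + meta_bits / partition

for partition in (32, 64, 128):
    bits = effective_bits(partition)
    print(f"partition={partition:4d}  bits/value={bits:.2f}  compression vs FP16={16 / bits:.1f}x")
```

Under these assumptions, smaller partitions buy accuracy at the cost of more bytes to transfer and more metadata to process, which is the trade-off the Π=64 default is balancing.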

Optimization Features

Token Efficiency: stores quantized KV to reduce transfer bandwidth
Infra Optimization: better utilization of cheaper prefill GPUs by reducing network load
Model Optimization: compute on quantized matrices (homomorphic quantization); asymmetric stochastic 2-bit quantization per partition
System Optimization: summation elimination (cache partition sums); requantization elimination (keep last V-block in FP16)
Inference Optimization: avoid per-iteration KV dequantization; use GPU INT8 paths for quantized matmul; kernel fusion (QKV gen + quant + attn)
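The two system optimizations listed above can be illustrated with a small cache structure. This is one plausible reading, not the paper's implementation: the class name, the token-axis partitioning of V, and the per-column scale and zero point are assumptions. New tokens accumulate in an FP16 tail; a block is quantized once when it fills, and its code sums are stored at the same moment.

```python
import numpy as np

class QuantizedVCache:
    """Hypothetical per-head V cache, partitioned along the token axis."""

    def __init__(self, head_dim, partition=64):
        self.partition = partition
        self.codes, self.scales, self.zeros, self.sums = [], [], [], []
        # Partial trailing block stays in FP16, so it is never requantized.
        self.tail = np.empty((0, head_dim), dtype=np.float16)

    def append(self, v):
        """Add one token's value vector; quantize only when a partition fills up."""
        self.tail = np.vstack([self.tail, np.asarray(v, dtype=np.float16)[None, :]])
        if len(self.tail) < self.partition:
            return
        block = self.tail.astype(np.float32)
        zero = block.min(axis=0)
        span = block.max(axis=0) - zero
        scale = np.where(span > 0, span / 3.0, 1.0)           # 2-bit levels 0..3
        q = np.clip(np.rint((block - zero) / scale), 0, 3).astype(np.uint8)
        self.codes.append(q)
        self.scales.append(scale)
        self.zeros.append(zero)
        self.sums.append(q.sum(axis=0, dtype=np.int32))       # computed once, reused each decode step
        self.tail = self.tail[:0]                             # start a fresh FP16 tail

cache = QuantizedVCache(head_dim=8)
rng = np.random.default_rng(0)
for _ in range(70):
    cache.append(rng.normal(size=8))
print(len(cache.codes), "quantized block(s),", len(cache.tail), "FP16 tail token(s)")
```

Without the FP16 tail, every new token would change the trailing block's min and max and force the whole block to be requantized, compounding rounding error across decode steps.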

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

2-bit quantization forces small partition sizes for accuracy, which can increase JCT; Π=64 is the default trade-off.

Current implementation uses Triton and converts 2-bit -> INT8, adding runtime overhead; INT4 direct support would help.

When Not To Use

Very short-context workloads (sequence lengths of roughly 30 tokens or fewer), where the benefits shrink.

When strict zero-loss accuracy is required (HACK causes ~0.5–2.7% accuracy loss vs full FP16).

Failure Modes

Accumulation of quantization error during long decodes if partitioning/requantization is mishandled.

Reduced effectiveness on GPUs without INT8 accelerated paths (e.g., older tensor cores).

Core Entities

Models

Mistral-v0.3-7B, Phi-3-14B, Yi-34B, Llama-3.1-70B, Falcon-180B

Metrics

Job Completion Time (JCT); Peak GPU memory usage; KV transmission time ratio; Dequantization / approximation time ratio; Accuracy

Datasets

Cocktail (IR, long-context); arXiv (long documents); HumanEval (code); IMDb (classification)

Benchmarks

ROUGE-1; Edit Similarity (normalized Levenshtein)