Compute directly on 2-bit KV cache to cut network, memory and compute time for disaggregated LLMs

February 5, 2025 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

0

Authors

Zeyu Zhang, Haiying Shen, Shay Vargaftik, Ran Ben Basat, Michael Mitzenmacher, Minlan Yu

Links

Abstract / PDF

Why It Matters For Business

If you serve LLMs with separate prefill and decode GPUs to cut costs, HACK can reduce latency and network cost by computing directly on the compressed KV cache. The gain is largest for long-context services, where KV transfer dominates latency.

Summary TLDR

HACK is a system-level method that quantizes the Key-Value (KV) cache with asymmetric 2-bit quantization applied per partition, then performs matrix multiplications directly on the quantized data (homomorphic quantization). This avoids costly per-iteration KV dequantization and uses GPU INT8 paths to accelerate both prefill and decode. On long-context workloads and common cloud GPU mixes, HACK cuts end-to-end job completion time (JCT) by up to 70.9% vs a disaggregated baseline and by up to 52.3% vs prior KV-quantization methods, while keeping accuracy loss at around 0.5–2.7%, depending on partition size.

Problem Statement

Disaggregated LLM serving splits prompt prefill (compute-heavy) from token decode (memory-heavy). For long contexts, three costs create major slowdowns: sending a large KV cache across the network, dequantizing the KV cache again on every decode iteration, and memory access delays. Existing KV quantization reduces transfer size but adds heavy dequantization costs and does not cut compute time.

Main Contribution

A homomorphic quantization method that runs matrix multiplication directly on quantized matrices and then approximates the full-precision output, avoiding per-iteration KV dequantization.
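
To make the idea concrete, here is a minimal pure-Python sketch of one standard way to multiply on quantized data. It assumes each partition is stored as 2-bit codes plus a minimum and a scale (asymmetric quantization), and it uses deterministic rounding so the result is reproducible. The function names are illustrative and are not taken from the HACK code base; the actual kernels may differ in detail.

```python
def quantize(values, bits=2):
    """Asymmetric quantization of one partition: x ~= lo + scale * code."""
    levels = (1 << bits) - 1              # 3 for 2-bit codes
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0     # guard against a constant partition
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    return [lo + scale * c for c in codes]


def homomorphic_dot(qa, qb):
    """Dot product of two quantized partitions without dequantizing them.

    Expands sum((lo_a + s_a*a_i) * (lo_b + s_b*b_i)) into one integer dot
    product plus correction terms built from the code sums.
    """
    (a, lo_a, s_a), (b, lo_b, s_b) = qa, qb
    n = len(a)
    int_dot = sum(x * y for x, y in zip(a, b))   # the part an integer matmul can do
    return (s_a * s_b * int_dot
            + s_a * lo_b * sum(a)
            + s_b * lo_a * sum(b)
            + n * lo_a * lo_b)


# Illustrative inputs (assumed, not from the paper).
q = [0.3, -1.2, 0.8, 2.0]
k = [1.5, 0.1, -0.7, 0.4]
qq, qk = quantize(q), quantize(k)

# Same answer as dequantize-then-multiply, but only integer products are needed.
reference = sum(x * y for x, y in zip(dequantize(*qq), dequantize(*qk)))
approx = homomorphic_dot(qq, qk)
```

The only per-element work is the integer dot product; everything else is a handful of scalar corrections per partition, which is why the approximation step can be much cheaper than dequantizing the whole cache.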

Two runtime optimizations: store partition sums to avoid repeated summation (summation elimination) and keep the trailing partial V-block in FP16 to avoid repeated requantization.
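
The two optimizations can be pictured with a small bookkeeping sketch. It tracks a single V channel along the sequence axis, seals a partition only when it is full, and stores the sum of its codes at that moment. The class name, the `quantize_calls` counter, and the use of plain Python floats as a stand-in for FP16 are all assumptions made for illustration.

```python
class ChannelCache:
    """One V channel along the sequence axis, split into partitions of `size` tokens."""

    def __init__(self, size=64, bits=2):
        self.size = size
        self.levels = (1 << bits) - 1
        self.sealed = []         # (codes, lo, scale, code_sum) per full partition
        self.tail = []           # trailing partial partition, kept unquantized
        self.quantize_calls = 0  # instrumentation for the example only

    def append(self, value):
        # New tokens land in the unquantized tail, so a partial partition is
        # never quantized, dequantized, and quantized again as it grows.
        self.tail.append(value)
        if len(self.tail) == self.size:
            self._seal()

    def _seal(self):
        lo, hi = min(self.tail), max(self.tail)
        scale = (hi - lo) / self.levels or 1.0
        codes = [round((v - lo) / scale) for v in self.tail]
        # Summation elimination: the code sum is computed once, here, and
        # stored next to the codes for later decode steps to reuse.
        self.sealed.append((codes, lo, scale, sum(codes)))
        self.tail = []
        self.quantize_calls += 1


# Small partition size so the behaviour is visible with ten tokens.
cache = ChannelCache(size=4)
for i in range(10):
    cache.append(float(i % 5))
```

After ten tokens the cache holds two sealed partitions and a two-token tail, and each value has been quantized at most once. The stored code sums are the quantities that the correction terms of a quantized dot product need, so later decode steps can read them instead of re-summing.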

Integration into FlashAttention-2 and vLLM with Triton kernels; extensive trace-driven experiments across models, datasets, and GPU instance types; open-sourced code.

Key Findings

KV transmission can be a major part of latency in disaggregated setups.

Numbers: KV transmission up to 42.2% of JCT (measured)

Existing KV quantization adds costly dequantization work during decode.

Numbers: Dequantization overhead up to 37.9% of JCT (CacheGen/KVQuant)

HACK replaces dequantization with a lightweight approximation and runs multiplications on quantized data.

Numbers: Approximation overhead 1.53%–3.18% of JCT vs dequantization 17.2%–30.4% (other methods)

End-to-end speedups are large, especially for long contexts and low-bandwidth prefill GPUs.

Numbers: JCT reduced up to 70.9% vs baseline and up to 52.3% vs KVQuant (reported maxima)

Memory footprint improves vs baseline but stores a small extra structure for speed.

Numbers: Peak GPU memory on decode reduced 13.9%–33.6% vs baseline; HACK adds ~2.2%–2.7% memory for partition sums and 0.24%–0.51% for the FP16 last-V buffer

Results

Max JCT reduction vs disaggregated baseline

Value: 70.9%

Baseline: disaggregated LLM inference baseline

Max JCT reduction vs KVQuant/CacheGen

Value: 52.3%

Baseline: KVQuant (or CacheGen)

Dequantization vs HACK approximation overhead

Value: dequantization 17.2%–30.4% vs approximation 1.53%–3.18% of JCT

Baseline: CacheGen/KVQuant dequantization

KV compression (approx.)

Value: KV stored at ~15% of original size (~85% reduction)

Baseline: FP16 baseline
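
A quick back-of-the-envelope check shows where a figure near 15% can come from. The sketch below assumes 2-bit codes plus two FP16 metadata values (a minimum and a scale) per partition; that storage layout is an assumption for illustration, and it leaves out the extra partition sums and the FP16 tail buffer reported separately.

```python
def kv_size_ratio(partition=64, code_bits=2, meta_bits=16, meta_values=2, base_bits=16):
    """Quantized size divided by FP16 size for one partition, metadata included."""
    quantized_bits = partition * code_bits + meta_values * meta_bits
    return quantized_bits / (partition * base_bits)


# Smaller partitions carry more metadata per value, so they compress less.
ratios = {p: kv_size_ratio(p) for p in (32, 64, 128)}
print(ratios[64])   # 0.15625, i.e. about 15.6% of the FP16 size
```

Under these assumptions the partition size of 64 lands close to the reported ~15%, and the same function makes the storage side of the partition-size trade-off easy to see.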

Accuracy

Value: ≈0.76%–1.56% (range across models/datasets)

Baseline: full-precision baseline

Who Should Care

What To Try In 7 Days

Benchmark current disaggregated pipeline and measure KV transfer percent of JCT.

Run vLLM + FlashAttention-2 with HACK on a representative long-context workload and compare JCT against your current quantization method.

Tune partition size (Π=64 default) to trade a small accuracy drop for throughput gains.
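
Before running a full benchmark, a rough estimate of KV transfer time can tell you whether the first step is worth the effort. The sketch below computes the KV cache size for one sequence and the time to move it over a link. The model shape, prompt length, and link speed are hypothetical placeholders, and the 2.5 bits per value figure assumes 2-bit codes plus FP16 minimum and scale per partition of 64.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits_per_value=16):
    """Size of keys plus values for one sequence."""
    values = 2 * layers * kv_heads * head_dim * tokens   # 2 = keys and values
    return values * bits_per_value / 8


def transfer_seconds(num_bytes, gbit_per_s):
    """Ideal wire time, ignoring protocol overhead and contention."""
    return num_bytes * 8 / (gbit_per_s * 1e9)


def share_of_jct(transfer_s, jct_s):
    """Fraction of job completion time spent moving the KV cache."""
    return transfer_s / jct_s


# Hypothetical example: an 80-layer model with 8 KV heads of dimension 128,
# a 32k-token prompt, and a 25 Gbit/s link. Substitute your own numbers.
fp16_bytes = kv_cache_bytes(80, 8, 128, 32_768)
two_bit_bytes = kv_cache_bytes(80, 8, 128, 32_768, bits_per_value=2.5)

fp16_seconds = transfer_seconds(fp16_bytes, 25)
two_bit_seconds = transfer_seconds(two_bit_bytes, 25)
```

Dividing the measured transfer time by your measured JCT with `share_of_jct` gives the percentage the first step asks for; if it is small, KV quantization is unlikely to be the main lever.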

Optimization Features

Token Efficiency

  • stores quantized KV to reduce transfer bandwidth

Infra Optimization

  • better utilization of cheaper prefill GPUs by reducing network load

Model Optimization

  • compute on quantized matrices (homomorphic quantization)
  • asymmetric stochastic 2-bit quantization per partition
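
As an illustration of the second bullet, here is a minimal sketch of asymmetric quantization with stochastic rounding for one partition. Each value is rounded up with probability equal to its fractional position between two codes, which makes the dequantized value correct on average. The function name and the `rng` parameter are illustrative assumptions; the paper's exact scheme may differ.

```python
import random


def stochastic_quantize(values, bits=2, rng=random):
    """Quantize one partition to `bits`-bit codes with stochastic rounding."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0     # guard against a constant partition
    codes = []
    for v in values:
        x = (v - lo) / scale              # position on the code grid, in [0, levels]
        floor = min(int(x), levels)
        frac = x - floor
        # Round up with probability equal to the fractional part, so the
        # expected dequantized value equals the input.
        code = floor + (1 if rng.random() < frac else 0)
        codes.append(min(code, levels))   # clamp against floating-point overshoot
    return codes, lo, scale


# Average many independent quantizations to see that the error cancels out.
rng = random.Random(0)
values = [0.0, 0.4, 1.0, 3.0]
trials = 20_000
mean = [0.0] * len(values)
for _ in range(trials):
    codes, lo, scale = stochastic_quantize(values, rng=rng)
    for i, c in enumerate(codes):
        mean[i] += (lo + scale * c) / trials
```

Values that already sit on the code grid are reproduced exactly, while an off-grid value such as 0.4 is sometimes stored as 0 and sometimes as 1, averaging out close to 0.4.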

System Optimization

  • summation elimination: cache partition sums
  • requantization elimination: keep last V-block in FP16

Inference Optimization

  • avoid per-iteration KV dequantization
  • use GPU INT8 paths for quantized matmul
  • kernel fusion (QKV gen + quant + attn)

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • 2-bit quantization forces small partition sizes for accuracy, which can increase JCT; Π=64 is the default trade-off.
  • Current implementation uses Triton and converts 2-bit -> INT8, adding runtime overhead; INT4 direct support would help.
  • Some GPUs (e.g., V100) lack INT8 matmul acceleration, reducing HACK's prefill speedups.
  • Small extra memory needed: ~2.2%–2.7% for partition sums and 0.24%–0.51% for FP16 last-V buffer.
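
The 2-bit to INT8 conversion mentioned in the limitations above can be pictured as a pack and unpack step. The sketch below stores four 2-bit codes per byte and expands them back to one integer per element. The bit layout (first code in the low bits) and the function names are assumptions for illustration, not the layout used by the HACK kernels.

```python
def pack_2bit(codes):
    """Pack 2-bit codes (0..3) four to a byte, first code in the low bits."""
    packed = bytearray((len(codes) + 3) // 4)
    for i, c in enumerate(codes):
        packed[i // 4] |= (c & 0b11) << (2 * (i % 4))
    return bytes(packed)


def unpack_to_int8(packed, count):
    """Expand packed codes to one small integer per element.

    This is the extra work a kernel does when it has to feed 2-bit data
    into an INT8 multiply path.
    """
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(count)]


codes = [3, 0, 1, 2, 2]
packed = pack_2bit(codes)
restored = unpack_to_int8(packed, len(codes))
```

Five codes fit in two bytes here, and unpacking restores the original list. The shift-and-mask per element is the runtime overhead the limitation refers to.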

When Not To Use

  • Very short-context workloads where sequence length ≤ ~30 tokens (benefits shrink).
  • When strict zero-loss accuracy is required (HACK causes ~0.5–2.7% accuracy loss vs full FP16).
  • On hardware that cannot accelerate INT8 matmul efficiently (little to no speedup).

Failure Modes

  • Accumulation of quantization error during long decodes if partitioning or requantization is mishandled.
  • Reduced effectiveness on GPUs without INT8 accelerated paths (e.g., older tensor cores).
  • Incorrect partition size tuning can trade unacceptable accuracy for throughput.

Core Entities

Models

  • Mistral-v0.3-7B
  • Phi-3-14B
  • Yi-34B
  • Llama-3.1-70B
  • Falcon-180B

Metrics

  • Job Completion Time (JCT)
  • Peak GPU memory usage
  • KV transmission time ratio
  • Dequantization / approximation time ratio
  • Accuracy

Datasets

  • Cocktail (IR, long-context)
  • arXiv (long documents)
  • HumanEval (code)
  • IMDb (classification)

Benchmarks

  • ROUGE-1
  • Edit Similarity (normalized Levenshtein)