Compute directly on 2-bit KV cache to cut network, memory and compute time for disaggregated LLMs

Overview

Decision Snapshot: Ready For Pilot

Clear system design and measurements show large latency and memory wins on long-context workloads; results rely on Triton/INT8 implementation and specific GPU/network mixes.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Zeyu Zhang, Haiying Shen, Shay Vargaftik, Ran Ben Basat, Michael Mitzenmacher, Minlan Yu

Why It Matters For Business

If you serve LLMs with separate prefill and decode GPUs (to cut costs), HACK can cut latency and network costs by computing directly on the compressed KV cache. It is most useful for long-context services where KV transfer dominates latency.

Who Should Care

Product Manager, CTO, ML Engineer, Engineering Lead, Founder

Summary TLDR

HACK is a system-level method that quantizes the Key-Value (KV) cache with asymmetric 2-bit quantization applied per partition and performs matrix multiplications directly on the quantized data (homomorphic quantization). This avoids costly per-iteration KV dequantization and uses GPU INT8 paths to accelerate both prefill and decode. On long-context workloads and common cloud GPU mixes, HACK cuts end-to-end job completion time (JCT) by up to 70.9% vs. a disaggregated baseline and up to 52.3% vs. prior KV-quantization methods, while keeping accuracy loss around 0.5–2.7% depending on partitioning.

Problem Statement

Disaggregated LLM serving splits prompt prefill (compute-heavy) from token decode (memory-heavy). Sending large KV caches across the network, repeated per-iteration dequantization, and memory access delays create major slowdowns for long contexts. Existing quantization reduces transfer size but introduces heavy dequantization costs and does not cut compute time.

Main Contribution

A homomorphic quantization method that performs matrix multiplication directly on quantized matrices and then approximates the full-precision output, avoiding per-iteration KV dequantization.

Two runtime optimizations: store partition sums to avoid repeated summation (summation elimination) and keep the trailing partial V-block in FP16 to avoid repeated requantization.
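The idea of multiplying on quantized data and reusing stored partition sums can be illustrated with a small sketch. It is not the HACK kernel: it assumes both operands use the same per-partition asymmetric 2-bit scheme, it uses round-to-nearest instead of stochastic rounding so the result is deterministic, and it runs in pure Python rather than on a GPU INT8 path. The names `QuantPartition`, `quantize_partition`, and `homomorphic_dot` are hypothetical.

```python
from dataclasses import dataclass
from typing import List

MAX_CODE = 3  # 2-bit codes take the values 0, 1, 2, 3


@dataclass
class QuantPartition:
    codes: List[int]  # one 2-bit code per element
    scale: float      # real-valued distance between adjacent codes
    zero: float       # real value represented by code 0 (the partition minimum)
    code_sum: int     # cached once at quantization time, reused on every multiply


def quantize_partition(values: List[float]) -> QuantPartition:
    """Asymmetric 2-bit quantization with round-to-nearest (assumption: not stochastic)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / MAX_CODE if hi > lo else 1.0
    codes = [min(MAX_CODE, max(0, round((v - lo) / scale))) for v in values]
    return QuantPartition(codes, scale, lo, sum(codes))


def dequantize(p: QuantPartition) -> List[float]:
    """Reference path only: expand codes back to real values."""
    return [p.scale * c + p.zero for c in p.codes]


def homomorphic_dot(a: QuantPartition, b: QuantPartition) -> float:
    """Dot product of two quantized partitions without dequantizing either one.

    Expands sum((sa*qa + za) * (sb*qb + zb)) into one integer dot product
    plus correction terms that only need the cached code sums.
    """
    n = len(a.codes)
    int_dot = sum(x * y for x, y in zip(a.codes, b.codes))  # integer-only work
    return (a.scale * b.scale * int_dot
            + a.scale * b.zero * a.code_sum
            + a.zero * b.scale * b.code_sum
            + n * a.zero * b.zero)
```

In this sketch the only per-element work is the integer dot product; the scales, zero points, and cached code sums are applied once per partition, which is the role the stored partition sums play in summation elimination.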

Key Findings

KV transmission can be a major part of latency in disaggregated setups.

Numbers: KV transmission up to 42.2% of JCT (measured)

Practical Use: If you run prefill and decode on separate (cheaper) GPUs, compressing KV transfers is necessary to avoid large network-driven delays.

Evidence Ref: §2, Observation 1
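To see why KV transfer can weigh so heavily, a back-of-the-envelope size estimate helps. The sketch below is an illustration, not a measurement from the paper: the model shape, context length, and link speed are hypothetical, and the 4 bytes of metadata per partition (for example, an FP16 scale plus an FP16 zero point) is an assumed layout.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, tokens: int,
                   bits_per_value: int, partition_size: int = 0,
                   metadata_bytes_per_partition: int = 0) -> int:
    """Approximate size of the K and V tensors for one sequence, in bytes."""
    values = 2 * layers * kv_heads * head_dim * tokens  # factor 2: K and V
    payload = (values * bits_per_value + 7) // 8
    if partition_size <= 0:
        return payload
    partitions = -(-values // partition_size)  # ceiling division
    return payload + partitions * metadata_bytes_per_partition


def transfer_seconds(num_bytes: int, link_gbps: float) -> float:
    """Ideal wire time at a given link speed, ignoring protocol overhead."""
    return num_bytes * 8 / (link_gbps * 1e9)


# Hypothetical model shape and context length, for illustration only.
shape = dict(layers=80, kv_heads=8, head_dim=128, tokens=16_000)
fp16 = kv_cache_bytes(**shape, bits_per_value=16)
int2 = kv_cache_bytes(**shape, bits_per_value=2, partition_size=64,
                      metadata_bytes_per_partition=4)
print(f"FP16:  {fp16 / 1e9:.2f} GB, {transfer_seconds(fp16, 25.0):.2f} s at 25 Gbps")
print(f"2-bit: {int2 / 1e9:.2f} GB, {transfer_seconds(int2, 25.0):.2f} s at 25 Gbps")
```

With these assumed numbers the 2-bit cache is 6.4 times smaller than FP16 rather than 8 times, because the per-partition metadata adds to the 2-bit payload.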

Existing KV quantization adds costly dequantization work during decode.

Numbers: Dequantization overhead up to 37.9% of JCT (CacheGen/KVQuant)

Practical Use: Simple quantize-then-dequantize pipelines can trade network savings for heavy CPU/GPU work at each decode step; avoid designs that require full per-iteration dequantization.

Evidence Ref: §2.2, Fig.2-4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Max JCT reduction vs disaggregated baseline	70.9%	disaggregated LLM inference baseline	—	Llama-3.1 70B, Cocktail, V100 prefill (reported max)	§7.2, Fig.12 reports up to 70.9% reduction vs baseline	§7.2, Fig.12
Max JCT reduction vs KVQuant/CacheGen	52.3%	KVQuant (or CacheGen)	—	Llama-3.1 70B, Cocktail, A100/A10G mixes (reported max)	Abstract; §7.2: up to 52.3% vs state-of-the-art KV quantization	Abstract; §7.2

What To Try In 7 Days

Benchmark current disaggregated pipeline and measure KV transfer percent of JCT.

Run vLLM+FlashAttention2 with HACK on a representative long-context workload and compare JCT vs current quantization.

Tune partition size (Π=64 default) to trade a small accuracy drop for throughput gains.
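For the first step above, a small helper can turn per-stage timings into shares of JCT. This is a minimal sketch under stated assumptions: the stage names and the timings are hypothetical, and collecting real timings from your serving stack is left to whatever profiling you already use.

```python
from typing import Dict


def jct_breakdown(stage_seconds: Dict[str, float]) -> Dict[str, float]:
    """Return each stage's share of total job completion time (values sum to 1)."""
    total = sum(stage_seconds.values())
    if total <= 0:
        raise ValueError("total job completion time must be positive")
    return {stage: seconds / total for stage, seconds in stage_seconds.items()}


# Hypothetical timings for one request, in seconds.
timings = {"prefill": 2.0, "kv_transfer": 1.5, "dequantization": 0.5, "decode": 4.0}
shares = jct_breakdown(timings)
for stage, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{stage:>15}: {share:6.1%}")
```

If the `kv_transfer` and `dequantization` shares are small on your own workload, the gains reported for HACK are less likely to carry over.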

Optimization Features

Token Efficiency

stores quantized KV to reduce transfer bandwidth

Infra Optimization

better utilization of cheaper prefill GPUs by reducing network load

Model Optimization

compute on quantized matrices (homomorphic quantization)

asymmetric stochastic 2-bit quantization per partition

System Optimization

summation elimination: cache partition sums

requantization elimination: keep last V-block in FP16
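Requantization elimination can be pictured as a buffer that holds the trailing partial partition in full precision and quantizes it only once it is full. The sketch below is one way to express that idea, not the paper's implementation: it tracks a single channel, uses Python floats in place of FP16, uses round-to-nearest rather than stochastic rounding, and the names `TailBufferedChannel` and `quantize_2bit` are hypothetical.

```python
from typing import List, Tuple

Sealed = Tuple[List[int], float, float]  # (2-bit codes, scale, zero point)


def quantize_2bit(values: List[float]) -> Sealed:
    """Asymmetric 2-bit quantization of one full partition (round-to-nearest)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 if hi > lo else 1.0
    codes = [min(3, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


class TailBufferedChannel:
    """One channel of V along the sequence axis, with an unquantized tail."""

    def __init__(self, partition_size: int) -> None:
        self.partition_size = partition_size
        self.sealed: List[Sealed] = []  # full partitions, quantized exactly once
        self.tail: List[float] = []     # partial partition kept in full precision
        self.quantize_calls = 0

    def append(self, value: float) -> None:
        """Add one decoded token's value; quantize only when the tail fills up."""
        self.tail.append(value)
        if len(self.tail) == self.partition_size:
            self.sealed.append(quantize_2bit(self.tail))
            self.quantize_calls += 1
            self.tail = []


channel = TailBufferedChannel(partition_size=4)
for step in range(10):
    channel.append(0.1 * step)
print(len(channel.sealed), len(channel.tail), channel.quantize_calls)
```

Here ten appended tokens trigger two quantization calls. A scheme that requantized the partial partition after every new token would instead redo that work at each decode step, because the partition's minimum and maximum can change with each value.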

Inference Optimization

avoid per-iteration KV dequantization

use GPU INT8 paths for quantized matmul

kernel fusion (QKV gen + quant + attn)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://anonymous.4open.science/r/HKVQ

Data URLs

https://arxiv.org (arXiv dataset reference)

https://github.com/Cocktail-benchmark (Cocktail reference)

IMDb dataset (public)

HumanEval (public)

Risks & Boundaries

Limitations

2-bit quantization forces small partition sizes for accuracy, which can increase JCT; Π=64 is the default trade-off.

Current implementation uses Triton and converts 2-bit -> INT8, adding runtime overhead; INT4 direct support would help.

When Not To Use

Very short-context workloads where sequence length ≤ ~30 tokens (benefits shrink).

When strict zero-loss accuracy is required (HACK causes ~0.5–2.7% accuracy loss vs full FP16).

Failure Modes

Accumulation of quantization error during long decodes if partitioning/requantization is mishandled.

Reduced effectiveness on GPUs without INT8 accelerated paths (e.g., older tensor cores).

Core Entities

Models

Mistral-v0.3-7B, Phi-3-14B, Yi-34B, Llama-3.1-70B, Falcon-180B

Metrics

Job Completion Time (JCT), Peak GPU memory usage, KV transmission time ratio, Dequantization / approximation time ratio, Accuracy

Datasets

Cocktail (IR, long-context); arXiv (long documents); HumanEval (code); IMDb (classification)

Benchmarks

ROUGE-1, Edit Similarity (normalized Levenshtein)
