Move KV-cache fetching and decompression off GPUs to SmartNICs to eliminate interference

September 21, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.55

Cost Impact Score

0.6

Citation Count

0

Authors

Xingyu Xiang, Raj Joshi, Yuhan Liu, Jiayi Yao, Chenxingyu Zhao, Junchen Jiang, Yang Zhou, Eddie Kohler, Minlan Yu

Links

Abstract / PDF

Why It Matters For Business

If you serve LLMs over limited network links or on low-bandwidth GPU instances, offloading KV-cache fetch and decompression to SmartNICs can cut per-token latency and improve throughput without changing compression code.

Summary TLDR

ShadowServe offloads KV-cache network fetch and decompression from host GPU/CPU to a SmartNIC data plane. It uses an asynchronous control plane, a chunked pipeline across SmartNIC resources, and minimal-copy memory management to avoid GPU/CPU interference. In low-bandwidth setups (≤20 Gbps) ShadowServe lowers first-token latency and per-token decoding latency and raises throughput versus GPU-decompression baselines; SmartNIC memory limits its peak fetching rate in high-bandwidth scenarios.

Problem Statement

Distributed prefix caching helps avoid expensive LLM prefill work by fetching precomputed KV caches. But fetching compressed KV caches and decompressing them on the GPU or host CPU creates heavy interference with model compute or overloads CPUs. The paper asks: can we fetch and decompress without causing that interference while still keeping SmartNIC resource use efficient?

Main Contribution

Identify and measure strong bidirectional interference when the GPU decompresses KV caches concurrently with model decoding.

Design ShadowServe: separate host control plane and a SmartNIC-only data plane that fetches, decompresses, dequantizes, and DMA-copies KV cache into GPU memory.

Introduce a chunked pipeline and occupancy-based minimal-copy memory management to parallelize work on constrained SmartNIC cores and accelerators.

Implement a prototype on NVIDIA BlueField-3 and show lower per-token latency and improved throughput in low-bandwidth settings, plus analysis of SmartNIC bottlenecks.
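The chunked pipeline idea can be illustrated with a small host-side sketch. This is not the authors' implementation and does not touch SmartNIC hardware: threads stand in for SmartNIC cores and accelerators, `zlib` stands in for the hardware decompressor, a `bytearray` stands in for GPU memory, and the names `CHUNK_SIZE`, `QUEUE_DEPTH`, `make_chunks`, and `run_pipeline` are hypothetical. It only shows how bounded queues let the fetch, decompress, dequantize, and copy stages work on different chunks at the same time.

```python
import queue
import threading
import zlib

CHUNK_SIZE = 4096   # hypothetical chunk size in bytes
QUEUE_DEPTH = 4     # assumed bound on in-flight chunks between stages
_DONE = object()    # sentinel that marks the end of the chunk stream


def make_chunks(payload, chunk_size=CHUNK_SIZE):
    """Split a payload into independently compressed chunks."""
    return [zlib.compress(payload[i:i + chunk_size])
            for i in range(0, len(payload), chunk_size)]


def run_stage(work, inbox, outbox=None):
    """Apply `work` to each chunk from `inbox`; forward results if `outbox` is set."""
    while True:
        item = inbox.get()
        if item is _DONE:
            if outbox is not None:
                outbox.put(_DONE)
            return
        result = work(item)
        if outbox is not None:
            outbox.put(result)


def run_pipeline(compressed_chunks):
    """Stream chunks through decompress -> dequantize -> copy stages."""
    q_net, q_raw, q_deq = (queue.Queue(QUEUE_DEPTH) for _ in range(3))
    destination = bytearray()  # stands in for GPU memory

    def dequantize(raw):
        # Placeholder: a real system would rescale quantized values here.
        return raw

    def copy_out(chunk):
        # Stands in for a DMA write into GPU memory.
        destination.extend(chunk)

    stages = [
        threading.Thread(target=run_stage, args=(zlib.decompress, q_net, q_raw)),
        threading.Thread(target=run_stage, args=(dequantize, q_raw, q_deq)),
        threading.Thread(target=run_stage, args=(copy_out, q_deq)),
    ]
    for stage in stages:
        stage.start()

    # "Fetch" stage: hand chunks to the pipeline in arrival order.
    for chunk in compressed_chunks:
        q_net.put(chunk)
    q_net.put(_DONE)

    for stage in stages:
        stage.join()
    return bytes(destination)
```

Each stage is a single FIFO worker, so chunk order is preserved, and the bounded queues cap how much buffer memory is in flight, which matters when the buffers live on a memory-constrained device.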

Key Findings

Offloading decompression to the SmartNIC cuts per-output-token latency under load.

Numbers: 1.06–2.19× lower loaded TPOT across configs; up to 2.2× reported

ShadowServe reduces time-to-first-token (TTFT) in low-bandwidth setups (≤20 Gbps).

Numbers: 1.20–1.38× lower unloaded TTFT below 20 Gbps; example 502.2 ms vs 600.5 ms

Overall throughput improves for many low-bandwidth or long-output settings.

Numbers: Up to 1.35× higher throughput reported; example 1.78 vs 1.51 req/s in one test

SmartNIC memory subsystem creates a new bottleneck when stages run together.

Numbers: Network standalone 37.3 Gbps → actual 20.6 Gbps when pipelined (−45%); Deflate/DMA slowdown 24–27%

Results

unloaded TTFT

Value: ShadowServe 502.2 ms vs CacheGen-Async 600.5 ms

Baseline: CacheGen-Async (GPU decompression)

loaded TPOT

Value: ShadowServe 41.8 ms vs CacheGen-Async 52.0 ms

Baseline: CacheGen-Async

maximum throughput

Value: ShadowServe 1.78 req/s vs CacheGen-Async 1.51 req/s

Baseline: CacheGen-Async

SmartNIC pipeline network throughput

Value: Standalone network 37.3 Gbps → actual 20.6 Gbps with full pipeline

Baseline: standalone microbench
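As a quick sanity check, the ratios implied by the example values in this section can be recomputed directly. The variable names are illustrative only; the numbers are the ones listed above.

```python
# Ratios implied by the example values in the Results section.
ttft_ratio = 600.5 / 502.2            # baseline TTFT / ShadowServe TTFT
tpot_ratio = 52.0 / 41.8              # baseline TPOT / ShadowServe TPOT
throughput_ratio = 1.78 / 1.51        # ShadowServe req/s / baseline req/s
network_drop = (37.3 - 20.6) / 37.3   # fractional loss once stages run together

print(f"unloaded TTFT:  {ttft_ratio:.2f}x lower")
print(f"loaded TPOT:    {tpot_ratio:.2f}x lower")
print(f"max throughput: {throughput_ratio:.2f}x higher")
print(f"network drop:   {network_drop:.1%}")
```

These examples come from individual configurations, so they sit inside the ranges quoted under Key Findings rather than at the reported maxima.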

Who Should Care

What To Try In 7 Days

Measure TPOT and TTFT with and without local KV-cache decompression to see the interference.

If you have SmartNICs with decompression + P2P DMA, prototype offloading a fetch+decompress path for compressed KV chunks.

Enable asynchronous fetching in your serving scheduler and add a background fetch queue to hide I/O latency.
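A background fetch queue of the kind described in the last step might look like the following minimal sketch. It assumes an `asyncio`-based scheduler; `fetch_kv_cache`, `fetch_worker`, and `serve` are hypothetical names, and the `asyncio.sleep` call stands in for the real network fetch. It is not taken from ShadowServe or from any particular serving framework.

```python
import asyncio


async def fetch_kv_cache(request_id, delay_s):
    """Hypothetical remote fetch; the sleep stands in for network I/O."""
    await asyncio.sleep(delay_s)
    return f"kv-{request_id}"


async def fetch_worker(pending, ready):
    """Background task: fetch KV caches and hand finished requests back."""
    while True:
        request_id, delay_s = await pending.get()
        kv = await fetch_kv_cache(request_id, delay_s)
        await ready.put((request_id, kv))
        pending.task_done()


async def serve(requests, num_workers=2):
    """Queue all fetches, then admit requests as their KV caches arrive."""
    pending = asyncio.Queue()
    ready = asyncio.Queue()
    workers = [asyncio.create_task(fetch_worker(pending, ready))
               for _ in range(num_workers)]
    for request in requests:
        pending.put_nowait(request)

    admitted = []
    for _ in requests:
        request_id, _kv = await ready.get()
        # A real scheduler would add the request to the decode batch here.
        admitted.append(request_id)

    # Shut the background workers down once every request is admitted.
    for worker in workers:
        worker.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return admitted
```

Calling `asyncio.run(serve([("a", 0.05), ("b", 0.01)]))` admits requests in the order their fetches complete rather than the order they were submitted, so a slow fetch does not block requests whose KV caches are already available.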

Optimization Features

Infra Optimization

  • SmartNIC data plane offload
  • use of hardware decompression accelerators

System Optimization

  • minimal-copy memory management
  • resource partitioning on SmartNIC
  • peer-to-peer DMA

Inference Optimization

  • decompression offload
  • asynchronous fetching
  • chunked pipelining

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Performance limited by SmartNIC memory/cache bandwidth; TTFT stops improving above ~20 Gbps on BlueField-3 (§6.3).
  • Requires SmartNICs with hardware decompression and peer-to-peer DMA; not useful on hosts without such hardware.
  • Prototype does not implement chunked prefill or partial hits; those are noted as orthogonal and future work (§4.1).
  • ShadowServe does not change compression algorithms or ratios; it only offloads existing compress/decompress steps.

When Not To Use

  • Environments with very high network bandwidth (>40 Gbps) where SmartNICs become the bottleneck.
  • Deployments without SmartNICs that support decompression accelerators and P2P DMA.
  • Workloads dominated by extremely long outputs (>128 tokens) where fetch stalls dominate and both systems converge.

Failure Modes

  • Inter-stage memory contention on SmartNIC reduces network throughput and increases TTFT.
  • If SmartNIC buffer sizes or chunk size are misconfigured, DMA/scatter overhead increases and benefits shrink.
  • GPU kernel scheduling behavior (streams/priorities) can non-deterministically affect interference and throughput.
  • On-demand memory registration (if not using minimal-copy) causes large TTFT spikes.

Core Entities

Models

  • Llama-8B
  • Mistral-7B

Metrics

  • TTFT
  • TPOT
  • throughput
  • fetch latency
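
For readers who want to reproduce the first two metrics on their own serving stack, the sketch below derives them from per-token timestamps. The definitions are a common convention and an assumption here, not taken from the paper: TTFT is the time from request submission to the first output token, and TPOT is the mean gap between consecutive output tokens after the first. The function names are hypothetical.

```python
def ttft_ms(request_sent_s, token_times_s):
    """Time to first token in milliseconds."""
    if not token_times_s:
        raise ValueError("need at least one output token")
    return (token_times_s[0] - request_sent_s) * 1000.0


def tpot_ms(token_times_s):
    """Mean time per output token (after the first) in milliseconds."""
    if len(token_times_s) < 2:
        raise ValueError("need at least two output tokens")
    span_s = token_times_s[-1] - token_times_s[0]
    return span_s / (len(token_times_s) - 1) * 1000.0
```

Collecting these two numbers with and without concurrent KV-cache decompression on the GPU is enough to see whether the interference described in this summary shows up in a given deployment.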

Datasets

  • TriviaQA
  • NarrativeQA
  • LongBench

Benchmarks

  • LongBench