Move KV-cache fetching and decompression off GPUs to SmartNICs to eliminate interference

September 21, 2025 | 7 min

Overview

Decision Snapshot: Needs Validation

The system is implemented and evaluated on BlueField-3 with clear gains in low-bandwidth setups; gains depend on availability of SmartNIC features and on-device memory bandwidth.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 55%

Authors

Xingyu Xiang, Raj Joshi, Yuhan Liu, Jiayi Yao, Chenxingyu Zhao, Junchen Jiang, Yang Zhou, Eddie Kohler, Minlan Yu

Why It Matters For Business

If you serve LLMs over limited network links or on low-bandwidth GPU instances, offloading KV-cache fetch and decompression to SmartNICs can cut per-token latency and improve throughput without changing compression code.

Who Should Care

Summary TLDR

ShadowServe offloads KV-cache network fetch and decompression from host GPU/CPU to a SmartNIC data plane. It uses an asynchronous control plane, a chunked pipeline across SmartNIC resources, and minimal-copy memory management to avoid GPU/CPU interference. In low-bandwidth setups (≤20 Gbps) ShadowServe lowers first-token latency and per-token decoding latency and raises throughput versus GPU-decompression baselines; SmartNIC memory limits its peak fetching rate in high-bandwidth scenarios.

Problem Statement

Distributed prefix caching helps avoid expensive LLM prefill work by fetching precomputed KV caches. But fetching compressed KV caches and decompressing them on the GPU or host CPU creates heavy interference with model compute or overloads CPUs. The paper asks: can we fetch and decompress without causing that interference while still keeping SmartNIC resource use efficient?

Main Contribution

Identify and measure strong bidirectional interference when the GPU decompresses KV caches concurrently with model decoding.

Design ShadowServe: a host control plane kept separate from a SmartNIC-only data plane that fetches, decompresses, dequantizes, and DMA-copies the KV cache into GPU memory.
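
The data-plane stages named above (fetch, decompress, dequantize, copy) can be pictured as a chunked pipeline in which each stage works on a different chunk at the same time. The following is only a host-side sketch of that idea, not ShadowServe's implementation: threads stand in for SmartNIC resources, `zlib` stands in for the hardware decompression engine, appending to a list stands in for the DMA copy, and every function name and the `depth` parameter are hypothetical.

```python
import queue
import threading
import zlib

_DONE = object()  # end-of-stream marker passed from stage to stage


def _run_stage(fn, inbox, outbox):
    """Apply fn to each chunk from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(fn(item))


def _feed(chunks, inbox):
    """Stand-in for the network fetch stage: push chunks as they 'arrive'."""
    for chunk in chunks:
        inbox.put(chunk)
    inbox.put(_DONE)


def run_pipeline(chunks, stages, depth=2):
    """Stream chunks through stages, one thread per stage, in arrival order."""
    # Bounded queues give back-pressure, so a slow stage caps buffer usage.
    queues = [queue.Queue(maxsize=depth) for _ in range(len(stages) + 1)]
    threads = [threading.Thread(target=_feed, args=(chunks, queues[0]))]
    threads += [
        threading.Thread(target=_run_stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)  # stand-in for the DMA copy into GPU memory
    for t in threads:
        t.join()
    return results


def dequantize(raw, scale=0.5):
    """Toy dequantization: map each byte to a float with a fixed scale."""
    return [b * scale for b in raw]


if __name__ == "__main__":
    compressed = [zlib.compress(bytes([i, i + 1])) for i in range(0, 6, 2)]
    print(run_pipeline(compressed, [zlib.decompress, dequantize]))
```

Because each stage is a single thread reading from a FIFO queue, chunks leave the pipeline in the order they entered, and the small queue depth mirrors the constraint that on-device buffers are limited.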

Key Findings

Offloading decompression to the SmartNIC cuts per-output-token latency under load.

Numbers: 1.06× to 2.19× lower loaded TPOT across configurations; up to 2.2× reported

Practical Use: If your workload is sensitive to per-token latency (TPOT), offload decompression to a SmartNIC to reduce decode slowdown from GPU multitasking.

Evidence Ref: Figures 10 and 11; §6.2

ShadowServe reduces time-to-first-token (TTFT) in low-bandwidth setups (≤20 Gbps).

Numbers: 1.20× to 1.38× lower unloaded TTFT below 20 Gbps; example 502.2 ms vs 600.5 ms

Practical Use: On cloud instances or low-bandwidth GPUs, use SmartNIC offload to get the first token faster when KV-cache fetching dominates latency.

Evidence Ref: Figure 9; §6.2.1 and §6.2.2
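
To make the two metrics in these findings concrete, the sketch below computes TTFT and TPOT from per-token timestamps and a speedup ratio from two latencies. It assumes the common definitions (TTFT is the time from request arrival to the first output token; TPOT is the mean gap between subsequent output tokens), which the summary does not spell out, and the function names are hypothetical.

```python
def ttft_ms(request_start_s, token_times_s):
    """Time to first token, in milliseconds."""
    return (token_times_s[0] - request_start_s) * 1000.0


def tpot_ms(token_times_s):
    """Mean time per output token after the first, in milliseconds."""
    if len(token_times_s) < 2:
        raise ValueError("need at least two output tokens to compute TPOT")
    span_s = token_times_s[-1] - token_times_s[0]
    return span_s / (len(token_times_s) - 1) * 1000.0


def speedup(baseline_ms, candidate_ms):
    """How many times lower the candidate latency is than the baseline."""
    return baseline_ms / candidate_ms


if __name__ == "__main__":
    # TTFT example quoted above: 600.5 ms baseline vs 502.2 ms with offload.
    print(round(speedup(600.5, 502.2), 2))
```

Applied to the quoted TTFT example, the ratio 600.5 / 502.2 rounds to 1.20, which matches the lower end of the reported range.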

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Unloaded TTFT | ShadowServe 502.2 ms vs CacheGen-Async 600.5 ms | CacheGen-Async (GPU decompression) | 1.20× to 1.38× faster below 20 Gbps | Llama-8B + NarrativeQA, low-bandwidth (≤20 Gbps) | Figure 9, §6.2.1 | §6.2.1 |
| Loaded TPOT | ShadowServe 41.8 ms vs CacheGen-Async 52.0 ms | CacheGen-Async | 1.06× to 2.19× lower across settings; example 1.26× here | Llama-8B + NarrativeQA, 20 Gbps, output length 32 | Figure 9, Figure 11a, §6.2.1 | §6.2.1 |

What To Try In 7 Days

Measure TPOT and TTFT with and without local KV-cache decompression to quantify the interference.

If you have SmartNICs with decompression + P2P DMA, prototype offloading a fetch+decompress path for compressed KV chunks.

Enable asynchronous fetching in your serving scheduler and add a background fetch queue to hide I/O latency.
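
The background fetch queue from the last step can be prototyped without any SmartNIC. The sketch below is an assumption-laden illustration, not ShadowServe's scheduler: `fetch_kv` is a hypothetical stand-in that sleeps instead of touching the network, and the "decode" log entries stand in for real model iterations. It shows only the control-flow idea that decoding keeps running while fetches complete in the background, and a request is admitted once its KV cache is ready.

```python
import asyncio


async def fetch_kv(request_id, delay_s):
    """Hypothetical stand-in for a remote KV-cache fetch."""
    await asyncio.sleep(delay_s)
    return f"kv-{request_id}"


async def fetch_worker(pending, ready):
    """Drain the fetch queue in the background and publish finished fetches."""
    while True:
        request_id, delay_s = await pending.get()
        kv = await fetch_kv(request_id, delay_s)
        await ready.put((request_id, kv))
        pending.task_done()


async def serve(requests, step_s=0.01):
    """Run simulated decode steps while fetches complete asynchronously."""
    pending, ready = asyncio.Queue(), asyncio.Queue()
    worker = asyncio.create_task(fetch_worker(pending, ready))
    for request in requests:
        pending.put_nowait(request)

    log, admitted = [], 0
    while admitted < len(requests):
        await asyncio.sleep(step_s)  # one simulated decode iteration
        log.append("decode")
        while not ready.empty():  # admit every request whose KV cache arrived
            request_id, _kv = ready.get_nowait()
            log.append(f"admit:{request_id}")
            admitted += 1

    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return log


if __name__ == "__main__":
    print(asyncio.run(serve([("a", 0.03), ("b", 0.0)])))
```

A single worker fetches requests one at a time in submission order; a real prototype would likely bound the queue and run several fetches concurrently.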

Optimization Features

Infra Optimization: SmartNIC data plane offload; use of hardware decompression accelerators
System Optimization: minimal-copy memory management; resource partitioning on SmartNIC; peer-to-peer DMA
Inference Optimization: decompression offload; asynchronous fetching; chunked pipelining

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Performance limited by SmartNIC memory/cache bandwidth; TTFT stops improving above ~20 Gbps on BlueField-3 (§6.3).

Requires SmartNICs with hardware decompression and peer-to-peer DMA; not useful on hosts without such hardware.

When Not To Use

Environments with very high network bandwidth (>40 Gbps) where SmartNICs become the bottleneck.

Deployments without SmartNICs that support decompression accelerators and P2P DMA.

Failure Modes

Inter-stage memory contention on SmartNIC reduces network throughput and increases TTFT.

If SmartNIC buffer sizes or chunk size are misconfigured, DMA/scatter overhead increases and benefits shrink.

Core Entities

Models

Llama-8B, Mistral-7B

Metrics

TTFT, TPOT, throughput, fetch latency

Datasets

TriviaQA, NarrativeQA, LongBench

Benchmarks

LongBench