Move KV-cache fetching and decompression off GPUs to SmartNICs to eliminate interference

Overview

Decision Snapshot: Needs Validation

The system is implemented and evaluated on BlueField-3 with clear gains in low-bandwidth setups; gains depend on availability of SmartNIC features and on-device memory bandwidth.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 55%

Authors

Xingyu Xiang, Raj Joshi, Yuhan Liu, Jiayi Yao, Chenxingyu Zhao, Junchen Jiang, Yang Zhou, Eddie Kohler, Minlan Yu

Links

Abstract / PDF

Why It Matters For Business

If you serve LLMs over limited network links or on low-bandwidth GPU instances, offloading KV-cache fetch and decompression to SmartNICs can cut per-token latency and improve throughput without changing compression code.

Who Should Care

Product Manager, Engineering Lead, ML Engineer, CTO, Founder

Summary TLDR

ShadowServe offloads KV-cache network fetch and decompression from host GPU/CPU to a SmartNIC data plane. It uses an asynchronous control plane, a chunked pipeline across SmartNIC resources, and minimal-copy memory management to avoid GPU/CPU interference. In low-bandwidth setups (≤20 Gbps) ShadowServe lowers first-token latency and per-token decoding latency and raises throughput versus GPU-decompression baselines; SmartNIC memory limits its peak fetching rate in high-bandwidth scenarios.
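The idea behind asynchronous fetching can be illustrated with a small scheduling sketch. This is not ShadowServe's control plane; `fetch_kv_cache`, `serve`, the delays, and the step time are hypothetical stand-ins. It only shows the pattern: fetches run in the background, and a request joins the decode loop once its KV cache has arrived, so the loop never blocks on the network.

```python
import asyncio


async def fetch_kv_cache(request_id: str, delay_s: float) -> str:
    # Hypothetical stand-in for a remote KV-cache fetch; the sleep models network time.
    await asyncio.sleep(delay_s)
    return f"kv:{request_id}"


async def serve(requests: dict, decode_steps: int = 20, step_s: float = 0.01):
    # Start every fetch as a background task instead of awaiting it inline.
    pending = {rid: asyncio.create_task(fetch_kv_cache(rid, delay))
               for rid, delay in requests.items()}
    admitted = []  # (decode step at admission, request id)
    for step in range(decode_steps):
        # Admit requests whose KV cache has arrived; never wait for the others.
        for rid in [r for r, task in pending.items() if task.done()]:
            pending.pop(rid).result()
            admitted.append((step, rid))
        await asyncio.sleep(step_s)  # stand-in for one decode iteration
    for task in pending.values():
        task.cancel()  # drop fetches that never finished in this toy run
    return admitted


if __name__ == "__main__":
    print(asyncio.run(serve({"a": 0.0, "b": 0.03})))
```

In this toy run, request "a" is admitted on an early step and "b" a few steps later, while the decode loop keeps iterating in between.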

Problem Statement

Distributed prefix caching helps avoid expensive LLM prefill work by fetching precomputed KV caches. But fetching compressed KV caches and decompressing them on the GPU or host CPU creates heavy interference with model compute or overloads CPUs. The paper asks: can we fetch and decompress without causing that interference while still keeping SmartNIC resource use efficient?

Main Contribution

Identify and measure strong bidirectional interference when the GPU decompresses KV caches concurrently with model decoding.

Design ShadowServe: separate host control plane and a SmartNIC-only data plane that fetches, decompresses, dequantizes, and DMA-copies KV cache into GPU memory.
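The data-plane stages listed above (fetch, decompress, dequantize, copy) can be sketched as a chunked pipeline. This is a host-side illustration under stated assumptions, not the SmartNIC implementation: `zlib` stands in for the real KV-cache codec, dequantization is a plain multiply by a hypothetical `scale`, a Python dict stands in for the DMA copy into GPU memory, and threads stand in for SmartNIC resources. The bounded queues model small inter-stage buffers, so only a few chunks are in flight at once.

```python
import queue
import threading
import zlib

DONE = object()  # end-of-stream marker passed from stage to stage


def stage(fn, inbox, outbox):
    """Apply fn to every (index, payload) item until DONE arrives."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        idx, payload = item
        outbox.put((idx, fn(payload)))


def run_pipeline(compressed_chunks, scale, depth=2):
    """Stream chunks through fetch -> decompress -> dequantize -> copy."""
    q_fetched, q_raw, q_ready = (queue.Queue(maxsize=depth) for _ in range(3))

    def fetch():
        # Stand-in for the network fetch: hand over one chunk at a time.
        for i, blob in enumerate(compressed_chunks):
            q_fetched.put((i, blob))
        q_fetched.put(DONE)

    def dequantize(raw):
        # Bytes iterate as ints 0..255; scale them back to floats.
        return [b * scale for b in raw]

    threads = [
        threading.Thread(target=fetch),
        threading.Thread(target=stage, args=(zlib.decompress, q_fetched, q_raw)),
        threading.Thread(target=stage, args=(dequantize, q_raw, q_ready)),
    ]
    for t in threads:
        t.start()

    dest = {}  # stand-in for the destination buffer in GPU memory
    while (item := q_ready.get()) is not DONE:
        idx, values = item
        dest[idx] = values
    for t in threads:
        t.join()
    return [dest[i] for i in range(len(dest))]
```

Because each stage works on a different chunk at the same time, the total time approaches that of the slowest stage rather than the sum of all stages, which is the point of chunking.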

Key Findings

Offloading decompression to the SmartNIC cuts per-output-token latency under load.

Numbers: 1.06–2.19× lower loaded TPOT across configurations (up to roughly 2.2×)

Practical Use: If your workload is sensitive to per-token latency (TPOT), offload decompression to a SmartNIC to reduce the decode slowdown caused by GPU multitasking.

Evidence Ref: Figures 10, 11; §6.2

ShadowServe reduces time-to-first-token (TTFT) in low-bandwidth setups (≤20 Gbps).

Numbers: 1.20–1.38× lower unloaded TTFT below 20 Gbps; example: 502.2 ms vs 600.5 ms

Practical Use: On cloud instances or low-bandwidth GPUs, use SmartNIC offload to get the first token faster when KV cache fetching dominates latency.

Evidence Ref: Figure 9; §6.2.1 and §6.2.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
unloaded TTFT	ShadowServe 502.2 ms vs CacheGen-Async 600.5 ms	CacheGen-Async (GPU decompression)	1.20–1.38× faster below 20 Gbps	Llama-8B + NarrativeQA, low-bandwidth (≤20 Gbps)	Figure 9, §6.2.1	§6.2.1
loaded TPOT	ShadowServe 41.8 ms vs CacheGen-Async 52.0 ms	CacheGen-Async	1.06–2.19× lower across settings; example 1.26× here	Llama-8B + NarrativeQA, 20 Gbps, output len 32	Figure 9, Figure 11a, §6.2.1	§6.2.1
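
As a quick arithmetic check, the speedup factors can be recomputed from the example values in the table as baseline latency divided by ShadowServe latency. The `speedup` helper is only an illustration. Note that the rounded TPOT example values give about 1.24×, slightly below the 1.26× listed in the table, which may reflect rounding or averaging in the source figures.

```python
def speedup(baseline_ms: float, system_ms: float) -> float:
    """Return how many times lower the system latency is than the baseline."""
    return baseline_ms / system_ms


ttft = speedup(600.5, 502.2)  # unloaded TTFT, CacheGen-Async vs ShadowServe
tpot = speedup(52.0, 41.8)    # loaded TPOT, CacheGen-Async vs ShadowServe

print(f"TTFT speedup: {ttft:.2f}x")  # prints "TTFT speedup: 1.20x"
print(f"TPOT speedup: {tpot:.2f}x")  # prints "TPOT speedup: 1.24x"
```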

What To Try In 7 Days

Measure TPOT and TTFT with and without local KV-cache decompression to see the interference.

If you have SmartNICs with decompression + P2P DMA, prototype offloading a fetch+decompress path for compressed KV chunks.

Enable asynchronous fetching in your serving scheduler and add a background fetch queue to hide I/O latency.
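For the measurement step in the list above, a minimal sketch of computing both metrics from per-token timestamps follows. It assumes TTFT is the time from request submission to the first output token and TPOT is the mean gap between consecutive output tokens after the first; the function name and the timestamp source are hypothetical, so adapt them to whatever your serving stack logs.

```python
def ttft_and_tpot(request_start: float, token_times: list) -> tuple:
    """Return (TTFT, TPOT) in the same unit as the input timestamps.

    request_start: time the request was submitted.
    token_times:   arrival time of each output token, in order.
    """
    if not token_times:
        raise ValueError("need at least one output token")
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        return ttft, 0.0  # no inter-token gap to average
    # Average gap between consecutive tokens, excluding the first-token wait.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


print(ttft_and_tpot(0.0, [0.5, 0.75, 1.0]))  # prints (0.5, 0.25)
```

Running the same workload once with decompression on the GPU and once without it, then comparing the two pairs of numbers, gives a first estimate of the interference.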

Optimization Features

Infra Optimization

SmartNIC data plane offload, use of hardware decompression accelerators

System Optimization

minimal-copy memory management, resource partitioning on SmartNIC, peer-to-peer DMA

Inference Optimization

decompression offload, asynchronous fetching, chunked pipelining

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Performance limited by SmartNIC memory/cache bandwidth; TTFT stops improving above ~20 Gbps on BlueField-3 (§6.3).

Requires SmartNICs with hardware decompression and peer-to-peer DMA; not useful on hosts without such hardware.

When Not To Use

Environments with very high network bandwidth (>40 Gbps) where SmartNICs become the bottleneck.

Deployments without SmartNICs that support decompression accelerators and P2P DMA.

Failure Modes

Inter-stage memory contention on SmartNIC reduces network throughput and increases TTFT.

If SmartNIC buffer sizes or chunk size are misconfigured, DMA/scatter overhead increases and benefits shrink.

Core Entities

Models

Llama-8B, Mistral-7B

Metrics

TTFT, TPOT, throughput, fetch latency

Datasets

TriviaQA, NarrativeQA, LongBench

Benchmarks

LongBench


You May Also Want to Read

Skip 25–30% of expensive FFN blocks to speed decoding while keeping knowledge accuracy

KV-CoRE: an SVD-based tool and benchmark that measures how compressible LLM KV-caches are, per layer and per dataset.

Share the common KV cache across LoRA-adapted agents and keep tiny low-rank adapters to cut memory and speed up multi-agent inference.

KV-cache compression breaks attention routing: reachability, a 90% safety cliff, and two failure modes

Use per-token unstructured pruning + a bitmap sparse kernel to cut KV cache to ~45% size and speed decoding up to 2.23×