Overview
ShadowServe is implemented and evaluated on BlueField-3, with clear gains in low-bandwidth setups (≤20 Gbps); the gains depend on the availability of SmartNIC features (hardware decompression, peer-to-peer DMA) and on SmartNIC memory bandwidth.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 55%
Why It Matters For Business
If you serve LLMs over limited network links or on low-bandwidth GPU instances, offloading KV-cache fetch and decompression to SmartNICs can cut per-token latency and improve throughput without changing compression code.
Summary TLDR
ShadowServe offloads KV-cache network fetch and decompression from host GPU/CPU to a SmartNIC data plane. It uses an asynchronous control plane, a chunked pipeline across SmartNIC resources, and minimal-copy memory management to avoid GPU/CPU interference. In low-bandwidth setups (≤20 Gbps) ShadowServe lowers first-token latency and per-token decoding latency and raises throughput versus GPU-decompression baselines; SmartNIC memory limits its peak fetching rate in high-bandwidth scenarios.
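The chunked pipeline can be pictured as a chain of stages joined by bounded queues, so that one chunk is being fetched while an earlier one is decompressed and a still earlier one is copied out. The sketch below is a host-side Python illustration only, not ShadowServe's SmartNIC code: `zlib` stands in for the hardware decompressor, a `bytearray` stands in for GPU memory, and the stage functions, the `run_pipeline` helper, and the `depth` parameter are assumptions made for illustration.

```python
import queue
import threading
import zlib

SENTINEL = None  # marks the end of the chunk stream


def fetch_stage(compressed_chunks, out_q):
    """Stand-in for the network fetch: hands chunks downstream one by one."""
    for chunk in compressed_chunks:
        out_q.put(chunk)
    out_q.put(SENTINEL)


def decompress_stage(in_q, out_q):
    """Stand-in for the decompression engine (zlib here, an assumption)."""
    while (chunk := in_q.get()) is not SENTINEL:
        out_q.put(zlib.decompress(chunk))
    out_q.put(SENTINEL)


def copy_stage(in_q, dest):
    """Stand-in for the DMA copy into GPU memory: appends to a bytearray."""
    while (chunk := in_q.get()) is not SENTINEL:
        dest.extend(chunk)


def run_pipeline(compressed_chunks, depth=2):
    """Run the three stages concurrently; `depth` bounds in-flight chunks."""
    q1 = queue.Queue(maxsize=depth)
    q2 = queue.Queue(maxsize=depth)
    dest = bytearray()
    stages = [
        threading.Thread(target=fetch_stage, args=(compressed_chunks, q1)),
        threading.Thread(target=decompress_stage, args=(q1, q2)),
        threading.Thread(target=copy_stage, args=(q2, dest)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return bytes(dest)
```

The bounded queues are the point of the sketch: they cap how much buffer memory is in flight, which matters on a device with limited memory, and they let a slow stage apply back-pressure to the stages before it.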
Problem Statement
Distributed prefix caching helps avoid expensive LLM prefill work by fetching precomputed KV caches. But fetching compressed KV caches and decompressing them on the GPU or host CPU creates heavy interference with model compute or overloads CPUs. The paper asks: can we fetch and decompress without causing that interference while still keeping SmartNIC resource use efficient?
Main Contribution
Identify and measure strong bidirectional interference when the GPU decompresses KV caches concurrently with model decoding.
Design ShadowServe: separate host control plane and a SmartNIC-only data plane that fetches, decompresses, dequantizes, and DMA-copies KV cache into GPU memory.
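One way to picture the split between an asynchronous control plane and a separate data plane is a scheduler that issues fetches in the background and keeps running decode iterations while they complete. The following is a minimal `asyncio` sketch under that assumption; `fetch_kv_cache` and `serve` are hypothetical names, and the sleeps stand in for real fetch, decompress, and decode work rather than for any ShadowServe API.

```python
import asyncio


async def fetch_kv_cache(request_id, delay_s):
    """Hypothetical data-plane call; the sleep stands in for fetch + decompress."""
    await asyncio.sleep(delay_s)
    return request_id


async def serve(requests, decode_steps=5, step_s=0.01):
    """Keep decoding while KV-cache fetches run in the background.

    `requests` maps request id -> simulated fetch delay in seconds.
    Returns the decode step at which each request's cache became ready.
    """
    pending = {
        asyncio.create_task(fetch_kv_cache(rid, delay))
        for rid, delay in requests.items()
    }
    ready_at = {}
    for step in range(decode_steps):
        await asyncio.sleep(step_s)  # stand-in for one decode iteration
        done = {task for task in pending if task.done()}
        for task in done:
            ready_at[task.result()] = step
        pending -= done
    # Drain any fetch that outlived the decode loop.
    for task in pending:
        ready_at[await task] = decode_steps
    return ready_at
```

The decode loop never blocks on a fetch; it only polls for completed ones between iterations, which is the behavior the asynchronous control plane is meant to provide.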
Key Findings
Offloading decompression to the SmartNIC cuts per-output-token latency under load.
ShadowServe reduces time-to-first-token (TTFT) in low-bandwidth setups (≤20 Gbps).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| unloaded TTFT | ShadowServe 502.2 ms vs CacheGen-Async 600.5 ms | CacheGen-Async (GPU decompression) | 1.20–1.38× faster below 20 Gbps | Llama-8B + NarrativeQA, low-bandwidth (≤20 Gbps) | Figure 9, §6.2.1 | §6.2.1 |
| loaded TPOT | ShadowServe 41.8 ms vs CacheGen-Async 52.0 ms | CacheGen-Async | 1.06–2.19× lower across settings; example 1.26× here | Llama-8B + NarrativeQA, 20 Gbps, output len 32 | Figure 9, Figure 11a, §6.2.1 | §6.2.1 |
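
The Delta column is a ratio of baseline latency to ShadowServe latency. A small helper makes that arithmetic explicit for the two example rows above; `speedup` is a hypothetical name, and because the table's latencies are rounded example values, a recomputed ratio need not match the reported delta exactly.

```python
def speedup(baseline_ms, system_ms):
    """Return how many times lower the system's latency is than the baseline's."""
    if system_ms <= 0:
        raise ValueError("system latency must be positive")
    return baseline_ms / system_ms


# Latencies copied from the two table rows (milliseconds).
ttft_ratio = speedup(600.5, 502.2)  # unloaded TTFT row
tpot_ratio = speedup(52.0, 41.8)    # loaded TPOT row

print(f"TTFT: {ttft_ratio:.2f}x")
print(f"TPOT: {tpot_ratio:.2f}x")
```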
What To Try In 7 Days
Measure TPOT and TTFT with and without local KV-cache decompression to see the interference.
If you have SmartNICs with decompression + P2P DMA, prototype offloading a fetch+decompress path for compressed KV chunks.
Enable asynchronous fetching in your serving scheduler and add a background fetch queue to hide I/O latency.
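For the first step, TTFT and TPOT can be derived from per-token arrival times. The helper below is a minimal sketch that assumes your serving client exposes generated tokens as a Python iterator; `measure_latency` and its `clock` parameter are hypothetical names, not part of any serving framework.

```python
import time


def measure_latency(token_stream, clock=time.perf_counter):
    """Consume a token iterator and return (ttft_s, tpot_s).

    TTFT is the time from the call to the first token. TPOT is the mean gap
    between later tokens; it is None when fewer than two tokens arrive.
    """
    start = clock()
    # Record a timestamp as each token is yielded by the stream.
    stamps = [clock() for _ in token_stream]
    if not stamps:
        raise ValueError("stream produced no tokens")
    ttft = stamps[0] - start
    if len(stamps) > 1:
        tpot = (stamps[-1] - stamps[0]) / (len(stamps) - 1)
    else:
        tpot = None
    return ttft, tpot
```

Running the same prompts once with decompression enabled and once with it disabled, then comparing the two (TTFT, TPOT) pairs, gives a first estimate of how much decompression interferes with decoding.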
Risks & Boundaries
Limitations
Performance limited by SmartNIC memory/cache bandwidth; TTFT stops improving above ~20 Gbps on BlueField-3 (§6.3).
Requires SmartNICs with hardware decompression and peer-to-peer DMA; not useful on hosts without such hardware.
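The first limitation can be read as a simple bottleneck: once the link is faster than the SmartNIC can move data through its memory, extra link bandwidth no longer shortens the fetch. The toy model below illustrates that plateau. It is not a model from the paper: `fetch_time_s` is a hypothetical helper, the 500 MB cache size is invented for illustration, and the 20 Gbps cap is only a rough reading of the "~20 Gbps" figure above.

```python
def fetch_time_s(cache_bytes, link_gbps, nic_cap_gbps):
    """Toy model: the fetch runs at the slower of the link and the SmartNIC cap."""
    effective_gbps = min(link_gbps, nic_cap_gbps)
    return cache_bytes * 8 / (effective_gbps * 1e9)


# Hypothetical inputs: a 500 MB compressed KV cache and a 20 Gbps SmartNIC cap.
for link_gbps in (5, 10, 20, 40, 100):
    ms = fetch_time_s(500e6, link_gbps, nic_cap_gbps=20) * 1e3
    print(f"{link_gbps:>3} Gbps link -> {ms:.1f} ms fetch")
```

In this model the fetch time keeps falling up to the cap and is flat beyond it, which is the shape the limitation describes for TTFT.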
When Not To Use
Environments with very high network bandwidth (>40 Gbps) where SmartNICs become the bottleneck.
Deployments without SmartNICs that support decompression accelerators and P2P DMA.
Failure Modes
Inter-stage memory contention on SmartNIC reduces network throughput and increases TTFT.
If SmartNIC buffer sizes or chunk size are misconfigured, DMA/scatter overhead increases and benefits shrink.

