Overview
Production Readiness
0.7
Novelty Score
0.55
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
If you serve LLMs over limited network links or on low-bandwidth GPU instances, offloading KV-cache fetch and decompression to SmartNICs can cut per-token latency and improve throughput without changing compression code.
Summary TLDR
ShadowServe offloads KV-cache network fetch and decompression from host GPU/CPU to a SmartNIC data plane. It uses an asynchronous control plane, a chunked pipeline across SmartNIC resources, and minimal-copy memory management to avoid GPU/CPU interference. In low-bandwidth setups (≤20 Gbps) ShadowServe lowers first-token latency and per-token decoding latency and raises throughput versus GPU-decompression baselines; SmartNIC memory limits its peak fetching rate in high-bandwidth scenarios.
Problem Statement
Distributed prefix caching helps avoid expensive LLM prefill work by fetching precomputed KV caches. But fetching compressed KV caches and decompressing them on the GPU or host CPU creates heavy interference with model compute or overloads CPUs. The paper asks: can we fetch and decompress without causing that interference while still keeping SmartNIC resource use efficient?
Main Contribution
Identify and measure strong bidirectional interference when the GPU decompresses KV caches concurrently with model decoding.
Design ShadowServe: separate host control plane and a SmartNIC-only data plane that fetches, decompresses, dequantizes, and DMA-copies KV cache into GPU memory.
Introduce a chunked pipeline and occupancy-based minimal-copy memory management to parallelize work on constrained SmartNIC cores and accelerators.
Implement a prototype on NVIDIA BlueField-3 and show lower per-token latency and improved throughput in low-bandwidth settings, plus analysis of SmartNIC bottlenecks.
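The chunked-pipeline idea in the contributions above can be illustrated with a small host-side sketch. This is not ShadowServe's implementation: it assumes plain Python threads in place of SmartNIC cores, `zlib` in place of the hardware decompression accelerator, and a hypothetical fixed-scale 8-bit dequantization. The names `stage`, `run_pipeline`, and `depth` are illustrative only. The point is that bounded queues let different chunks occupy different stages at the same time while capping the buffer space in flight.

```python
import queue
import threading
import zlib

SENTINEL = None  # marks the end of the chunk stream


def stage(fn, inbox, outbox):
    """Apply fn to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            return
        outbox.put(fn(item))


def run_pipeline(compressed_chunks, depth=2):
    """Push chunks through decompress -> dequantize stages concurrently."""
    # Bounded queues cap the number of in-flight chunks between stages,
    # standing in for limited on-NIC buffer space (an assumption).
    q_fetched = queue.Queue(maxsize=depth)
    q_decompressed = queue.Queue(maxsize=depth)
    q_out = queue.Queue(maxsize=depth)

    def decompress(chunk):
        return zlib.decompress(chunk)  # stand-in for a hardware decompressor

    def dequantize(raw, scale=0.5):
        # Hypothetical dequantization: each byte times a fixed scale.
        return [b * scale for b in raw]

    def feeder():
        # Stand-in for the network fetch stage.
        for chunk in compressed_chunks:
            q_fetched.put(chunk)
        q_fetched.put(SENTINEL)

    threads = [
        threading.Thread(target=feeder),
        threading.Thread(target=stage, args=(decompress, q_fetched, q_decompressed)),
        threading.Thread(target=stage, args=(dequantize, q_decompressed, q_out)),
    ]
    for t in threads:
        t.start()

    # The caller drains the final queue; on real hardware this step would
    # be a DMA copy into GPU memory.
    results = []
    while True:
        out = q_out.get()
        if out is SENTINEL:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results
```

With one worker per stage and FIFO queues, chunks leave the pipeline in the order they entered it, so the consumer can place them at sequential offsets.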
Key Findings
Offloading decompression to the SmartNIC cuts per-output-token latency under load.
ShadowServe reduces time-to-first-token (TTFT) in low-bandwidth setups (≤20 Gbps).
Overall throughput improves for many low-bandwidth or long-output settings.
The SmartNIC memory subsystem becomes a new bottleneck when pipeline stages run concurrently.
Results
unloaded TTFT
loaded TPOT
maximum throughput
SmartNIC pipeline network throughput
Who Should Care
What To Try In 7 Days
Measure TPOT and TTFT with and without local KV-cache decompression to quantify interference.
If you have SmartNICs with decompression + P2P DMA, prototype offloading a fetch+decompress path for compressed KV chunks.
Enable asynchronous fetching in your serving scheduler and add a background fetch queue to hide I/O latency.
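For the measurement step above, a minimal sketch of computing both metrics from per-token timestamps follows. It assumes the common definitions (TTFT as the delay from request arrival to the first output token, TPOT as the mean gap between subsequent output tokens); the paper's exact definitions may differ, and the function name `ttft_and_tpot` is hypothetical.

```python
def ttft_and_tpot(request_start, token_times):
    """Return (TTFT, TPOT) in the same time unit as the inputs.

    request_start: time the request arrived.
    token_times:   emission time of each output token, in order.
    TPOT is None when only one token was produced.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        return ttft, None
    # Mean inter-token gap over the decode phase (assumed definition).
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot
```

Running the same workload twice, once with decompression on the GPU and once without it, and comparing the two results gives a first estimate of the interference.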
Optimization Features
Infra Optimization
- SmartNIC data plane offload
- use of hardware decompression accelerators
System Optimization
- minimal-copy memory management
- resource partitioning on SmartNIC
- peer-to-peer DMA
Inference Optimization
- decompression offload
- asynchronous fetching
- chunked pipelining
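Asynchronous fetching can be sketched with `asyncio`: every fetch starts in the background, and a request is admitted only once its KV cache has arrived, so a slow fetch does not block requests behind it. This is an illustrative sketch, not the paper's control plane; `fetch_kv`, `serve`, and the simulated delays are assumptions.

```python
import asyncio


async def fetch_kv(request_id, delay):
    """Hypothetical stand-in for a remote KV-cache fetch."""
    await asyncio.sleep(delay)  # simulated network and decompression time
    return f"kv-{request_id}"


async def serve(requests):
    """Start all fetches in the background; admit requests as they finish.

    requests: list of (request_id, simulated_fetch_delay_seconds).
    Returns request ids in admission order.
    """
    ready = asyncio.Queue()

    async def fetch_then_enqueue(rid, delay):
        kv = await fetch_kv(rid, delay)
        await ready.put((rid, kv))

    tasks = [asyncio.create_task(fetch_then_enqueue(rid, d)) for rid, d in requests]

    admitted = []
    for _ in requests:
        # A real scheduler would keep running decode steps for already
        # admitted requests here instead of only waiting.
        rid, _kv = await ready.get()
        admitted.append(rid)

    await asyncio.gather(*tasks)
    return admitted
```

For example, `asyncio.run(serve([("a", 0.2), ("b", 0.01)]))` admits `"b"` before `"a"`, because the shorter fetch completes first.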
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Performance limited by SmartNIC memory/cache bandwidth; TTFT stops improving above ~20 Gbps on BlueField-3 (§6.3).
- Requires SmartNICs with hardware decompression and peer-to-peer DMA; not useful on hosts without such hardware.
- Prototype does not implement chunked prefill or partial hits; those are noted as orthogonal and future work (§4.1).
- ShadowServe does not change compression algorithms or ratios; it only offloads existing compress/decompress steps.
When Not To Use
- Environments with very high network bandwidth (>40 Gbps) where SmartNICs become the bottleneck.
- Deployments without SmartNICs that support decompression accelerators and P2P DMA.
- Workloads dominated by extremely long outputs (>128 tokens) where fetch stalls dominate and both systems converge.
Failure Modes
- Inter-stage memory contention on SmartNIC reduces network throughput and increases TTFT.
- If SmartNIC buffer sizes or chunk size are misconfigured, DMA/scatter overhead increases and benefits shrink.
- GPU kernel scheduling behavior (streams/priorities) can non-deterministically affect interference and throughput.
- On-demand memory registration (if not using minimal-copy) causes large TTFT spikes.
Core Entities
Models
- Llama-8B
- Mistral-7B
Metrics
- TTFT
- TPOT
- throughput
- fetch latency
Datasets
- TriviaQA
- NarrativeQA
- LongBench
Benchmarks
- LongBench

