Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
Microserving lets ops teams reprogram serving coordination at the router level without restarting engines, cutting tail latency and compute waste on long-input workloads and enabling live tuning of prefill/decode balance.
Summary TLDR
This paper introduces "LLM microserving": a multi-level serving design that exposes three fine-grained REST APIs and a programmable Python router to split and reconfigure LLM inference steps (prefill vs decode) without restarting engines. A unified KV cache (KV = per-token attention state) lets engines transfer and reuse attention state via one-sided GPU writes (NVSHMEM), overlapping communication with compute. On long-input workloads the system reduces average job completion time (JCT) by ~21% and P99 JCT by up to 47% versus data-parallel baselines. The approach is most useful when inputs are long or when prefix reuse is common; gains are smaller on short-chat workloads.
Problem Statement
Current LLM serving systems expose a coarse request-level API with fixed coordination strategies. That makes it hard to try new disaggregation or cache-migration strategies at runtime: changing strategy often requires engine reconfiguration and service restarts. The paper proposes fine-grained sub-request APIs plus a programmable router to enable dynamic, low-disruption reconfiguration and efficient KV (attention-state) migration across GPUs.
Main Contribution
- The LLM microserving design: three fine-grained REST APIs (prep_recv, remote_send, start_generate) for sub-request actions.
- A programmable async Python router that converts request-level calls into custom sub-request workflows, enabling dynamic reconfiguration without restarting engines.
- A unified KV cache interface that supports prefix matching, KV transfer, reuse, and overlapping one-sided GPU communication (NVSHMEM) to reduce recomputation.
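To make the router idea concrete, here is a minimal sketch of a prefill-decode disaggregation workflow built from the three API names. Only the names prep_recv, remote_send, and start_generate come from the paper summary; the argument lists, the return values, and the in-memory `FakeEngine` class are assumptions made for illustration, and a real router would issue REST calls to separate engines instead.

```python
import asyncio


class FakeEngine:
    """In-memory stand-in for a serving engine (hypothetical, not the paper's code)."""

    def __init__(self, name):
        self.name = name
        self.kv = {}  # request_id -> number of prompt tokens with KV resident

    async def prep_recv(self, request_id, prompt, end):
        # Assumed semantics: prepare to receive KV for prompt[:end] and
        # report how many leading tokens are already cached here.
        return min(self.kv.get(request_id, 0), end)

    async def remote_send(self, request_id, prompt, begin, end, dest):
        # Assumed semantics: prefill prompt[begin:end] locally and push
        # the resulting KV entries to the destination engine.
        dest.kv[request_id] = max(dest.kv.get(request_id, 0), end)

    async def start_generate(self, request_id, prompt, begin, max_tokens):
        # Assumed semantics: prefill prompt[begin:], then decode.
        if self.kv.get(request_id, 0) < begin:
            raise RuntimeError("KV for prompt[:begin] has not arrived")
        return [f"tok{i}" for i in range(max_tokens)]


async def disaggregated(prefill, decode, request_id, prompt, max_tokens):
    """Router workflow: prefill on one engine, decode on another."""
    end = len(prompt) - 1  # leave the final prompt token to the decode engine
    matched = await decode.prep_recv(request_id, prompt, end)
    if matched < end:
        # Only the part the decode engine does not already hold is sent.
        await prefill.remote_send(request_id, prompt, matched, end, decode)
    return await decode.start_generate(request_id, prompt, end, max_tokens)


if __name__ == "__main__":
    p, d = FakeEngine("prefill"), FakeEngine("decode")
    print(asyncio.run(disaggregated(p, d, "r1", list(range(10)), 3)))
```

Because the workflow lives in an ordinary async function, swapping in a different coordination strategy means changing router code only, which is the property the summary attributes to the design.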
Key Findings
- Balanced prefill-decode disaggregation cuts tail job time on long inputs.
- Prefill-decode disaggregation reduces mean job time on long inputs.
- KV migration avoids recomputing cached context and speeds up prefill.
- KV transfer can be overlapped with compute; overlap grows with longer context.
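The overlap finding can be reasoned about with a simple timing model. The sketch below is an assumption-laden illustration, not the paper's measurement method: it supposes each layer's KV is sent as soon as that layer finishes computing, and that sends share one link and run one at a time. The function name and the time units are hypothetical.

```python
def pipeline_finish(compute, transfer):
    """Model eager per-layer KV sends.

    compute[i]  : time to compute layer i
    transfer[i] : time to send layer i's KV
    Returns (finish_time, exposed_transfer), where exposed_transfer is the
    time spent after the last layer's compute waiting for sends to finish.
    """
    compute_done = 0.0  # when the current layer's compute ends
    link_free = 0.0     # when the link finishes its previous send
    for c, t in zip(compute, transfer):
        compute_done += c
        start = max(compute_done, link_free)  # wait for data and for the link
        link_free = start + t
    return link_free, link_free - compute_done


if __name__ == "__main__":
    print(pipeline_finish([2, 2, 2], [1, 1, 1]))  # transfer shorter than compute
    print(pipeline_finish([2, 2, 2], [3, 3, 3]))  # transfer longer than compute
```

Under this model, when each send is shorter than the compute of the next layer, only the final layer's send is exposed. When sends are longer than compute, they queue on the link and the exposed time grows with the number of layers.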
Results
P99 job completion time
Mean job completion time
Prefill time with KV migration
Per-layer compute vs transfer
Who Should Care
What To Try In 7 Days
- Prototype a router that issues prep_recv/remote_send/start_generate to reproduce prefill-decode (P/D) disaggregation on an existing inference cluster.
- Measure whether your workload is prefill-heavy; if it is, test a small balance ratio (e.g., 0.2) and compare JCT and TTFT against your baseline.
- Enable one-sided GPU communication (NVSHMEM) if available and benchmark KV transfer overlap against recomputation.
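As a starting point for the balance-ratio experiment, the sketch below splits a prompt between the two engines. The summary does not define the balance ratio, so this assumes it is the fraction of prompt tokens that the decode engine prefills itself; the function name and the rule that the decode engine always keeps at least one token are also assumptions.

```python
def split_prefill(prompt_len, balance_ratio):
    """Return (tokens_for_prefill_engine, tokens_for_decode_engine)."""
    if prompt_len < 1:
        raise ValueError("prompt_len must be at least 1")
    if not 0.0 <= balance_ratio <= 1.0:
        raise ValueError("balance_ratio must be in [0, 1]")
    # Assumption: the decode engine always handles at least the final token.
    decode_part = max(1, round(prompt_len * balance_ratio))
    return prompt_len - decode_part, decode_part


if __name__ == "__main__":
    # A 3000-token prompt with ratio 0.2 leaves 600 tokens to the decode engine.
    print(split_prefill(3000, 0.2))
```

Sweeping the ratio over a few values and recording JCT and TTFT for each gives a direct view of where the balance helps or hurts on a given workload.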
Optimization Features
Token Efficiency
- KV reuse reduces repeated attention computation
Infra Optimization
- One-sided GPU communication via NVSHMEM
- Eager per-layer KV sends to hide latency
System Optimization
- Programmable router for dynamic reconfiguration
- Unified KV cache API for diverse transfer/reuse patterns
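Prefix matching is what lets an engine report how much of a prompt it already holds before any KV is transferred. The sketch below uses a plain per-token trie as a simplified stand-in for a radix tree; the class and method names are hypothetical and the paper's actual data structure may differ.

```python
class PrefixIndex:
    """Per-token trie recording which token sequences have cached KV."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        # Record that KV for this whole token sequence is cached.
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_len(self, tokens):
        # Length of the longest cached prefix of `tokens`.
        node = self.root
        matched = 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched


if __name__ == "__main__":
    index = PrefixIndex()
    index.insert([1, 2, 3, 4])
    # Only tokens after the matched prefix need prefill or transfer.
    print(index.match_len([1, 2, 3, 9, 9]))
```

A router can use the matched length as the starting offset for a send or a local prefill, so the cached prefix is neither recomputed nor retransmitted.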
Inference Optimization
- Prefill-decode disaggregation
- Balanced prefill-decode (partial prefill on decode engine)
- KV migration to avoid recomputation
- Overlap per-layer KV transfer with attention compute
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Gains are workload-dependent: short-chat datasets (ShareGPT) show little or no benefit.
- Requires one-sided GPU communication (NVSHMEM) for best overlap; not all infra supports this.
- KV transfer can become a bottleneck if transfer time exceeds per-layer compute, reducing overlap benefits.
When Not To Use
- Workloads with short inputs and outputs where prefill load is low
- Environments lacking NVSHMEM or one-sided GPU write support
- Small single-GPU deployments where disaggregation adds overhead
Failure Modes
- Excessive KV transfer time that outlasts compute, causing stalls
- Misconfigured prefill-decode balance ratio that increases time-to-first-token (TTFT)
- Router bugs or stale radix-tree mapping causing cache misses and recomputation
Core Entities
Models
- Llama 3.1 8B
Metrics
- Time To First Token (TTFT)
- Time Per Output Token (TPOT)
- Job Completion Time (JCT)
- P99 JCT
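The metrics above can be computed from per-request timestamps. The sketch below is one common formulation, stated as an assumption since the summary does not give formulas: TTFT runs from arrival to the first output token, TPOT is the mean gap between consecutive output tokens, JCT runs from arrival to the last token, and P99 uses the nearest-rank method. Function names are hypothetical.

```python
def request_metrics(arrival, token_times):
    """Return (ttft, tpot, jct) for one request.

    arrival     : time the request arrived
    token_times : non-empty, ascending emit times of each output token
    """
    ttft = token_times[0] - arrival
    jct = token_times[-1] - arrival
    n = len(token_times)
    # Mean gap between output tokens; defined as 0 for a single token.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return ttft, tpot, jct


def p99(values):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


if __name__ == "__main__":
    print(request_metrics(0.0, [0.5, 0.75, 1.0]))
    print(p99(list(range(1, 101))))
```

Reporting both the mean and the P99 of JCT matters here because the summary's headline gains differ between the two.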
Datasets
- ShareGPT
- synthetic (mean input length 3000 tokens, mean output length 100 tokens)

