Overview
The design is implemented end-to-end on MLC-LLM and evaluated on real and synthetic workloads. It shows clear benefits in long-context scenarios; adoption requires one-sided GPU communication primitives (e.g., NVSHMEM) and router engineering.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.90
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/4
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
Microserving lets ops teams reprogram serving coordination at the router level without restarting engines, cutting tail latency and compute waste on long-input workloads and enabling live tuning of prefill/decode balance.
Who Should Care
Summary TLDR
This paper introduces "LLM microserving": a multi-level serving design that exposes three fine-grained REST APIs and a programmable Python router to split and reconfigure LLM inference steps (prefill vs decode) without restarting engines. A unified KV cache (KV = per-token attention state) lets engines transfer and reuse attention state via one-sided GPU writes (NVSHMEM), overlapping communication with compute. On long-input workloads the system reduces average job completion time (JCT) by ~21% and P99 JCT by up to 47% versus data-parallel baselines. The approach is most useful when inputs are long or when prefix reuse is common; gains are smaller on short-chat workloads.
Problem Statement
Current LLM serving systems expose a coarse request-level API with fixed coordination strategies. That makes it hard to try new disaggregation or cache-migration strategies at runtime: changing strategy often requires engine reconfiguration and service restarts. The paper proposes fine-grained sub-request APIs plus a programmable router to enable dynamic, low-disruption reconfiguration and efficient KV (attention-state) migration across GPUs.
Main Contribution
Design of LLM microserving: three fine-grained REST APIs (prep_recv, remote_send, start_generate) for sub-request actions.
Programmable async Python router that converts request-level calls into custom sub-request workflows, enabling dynamic reconfiguration without restarting engines.
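The way a router might turn one request-level call into the three sub-request calls can be sketched as follows. This is a minimal in-process mock, not MLC-LLM's actual API: only the names prep_recv, remote_send, and start_generate come from the summary above. The `MockEngine` class, every parameter, the `kv_addr` handle, and the choice to leave the last prompt token to the decode engine are assumptions made for illustration.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class MockEngine:
    """Hypothetical in-process stand-in for one serving engine."""
    name: str
    kv_cache: dict = field(default_factory=dict)  # seq_id -> tokens whose KV is present
    log: list = field(default_factory=list)       # call trace, for inspection

    async def prep_recv(self, seq_id, prompt, end):
        # Reserve KV space for prompt[:end] and return a handle the sender can target.
        self.log.append(("prep_recv", seq_id, end))
        self.kv_cache[seq_id] = 0
        return {"engine": self.name, "seq_id": seq_id, "slots": end}

    async def remote_send(self, seq_id, prompt, kv_addr, begin, end, peers):
        # Prefill prompt[begin:end] here and push the resulting KV to the receiver.
        self.log.append(("remote_send", seq_id, begin, end))
        receiver = peers[kv_addr["engine"]]
        receiver.kv_cache[seq_id] += end - begin

    async def start_generate(self, seq_id, prompt, begin, max_tokens):
        # Prefill the remaining prompt[begin:] locally, then decode.
        self.log.append(("start_generate", seq_id, begin))
        if self.kv_cache.get(seq_id, 0) != begin:
            raise RuntimeError("KV for prompt[:begin] has not arrived")
        return [f"tok{i}" for i in range(max_tokens)]  # placeholder output tokens


async def pd_disaggregated(prompt, prefill, decode, max_tokens=4, seq_id="req-0"):
    """Router-side workflow for prefill-decode disaggregation (sketch)."""
    peers = {prefill.name: prefill, decode.name: decode}
    split = len(prompt) - 1  # assumption: decode engine handles the last prompt token
    kv_addr = await decode.prep_recv(seq_id, prompt, end=split)
    await prefill.remote_send(seq_id, prompt, kv_addr, begin=0, end=split, peers=peers)
    return await decode.start_generate(seq_id, prompt, begin=split, max_tokens=max_tokens)
```

Calling `asyncio.run(pd_disaggregated(list(range(10)), MockEngine("prefill-0"), MockEngine("decode-0")))` runs the three steps in order. Because the workflow lives entirely in the router function, swapping in a different coordination strategy means changing that function rather than the engines.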
Key Findings
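Balanced prefill-decode disaggregation cuts tail (P99) job completion time on long inputs by up to 47% versus a data-parallel baseline.
Prefill-decode disaggregation reduces mean job completion time on long inputs by up to 21% versus a data-parallel baseline.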
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| P99 job completion time | up to 47% reduction | data-parallel (DP) | −47% | synthetic long-input dataset (mean input 3000 tokens, mean output 100 tokens) | Balanced 1P1D reduces P99 JCT by up to 47% | Figure 11 |
| Mean job completion time | up to 21% reduction | data-parallel (DP) | −21% | synthetic long-input dataset (mean input 3000 tokens, mean output 100 tokens) | Prefill-decode disaggregation lowers mean JCT vs DP | Figure 11 |
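To compute the same kind of deltas on your own measurements, a small helper like the one below may be enough. It is a sketch, not the paper's evaluation code: the function names are hypothetical, and the nearest-rank percentile definition is an assumption (the summary does not say how P99 was computed).

```python
import statistics


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of job completion times."""
    ordered = sorted(samples)
    rank = -(-99 * len(ordered) // 100)  # integer ceil(0.99 * n), 1-indexed
    return ordered[rank - 1]


def relative_delta(baseline, candidate):
    """Signed relative change; negative means the candidate is faster."""
    return (candidate - baseline) / baseline


def compare_jct(baseline_jct, candidate_jct):
    """Mean and P99 JCT deltas of a candidate strategy against a baseline."""
    return {
        "mean_delta": relative_delta(
            statistics.mean(baseline_jct), statistics.mean(candidate_jct)
        ),
        "p99_delta": relative_delta(p99(baseline_jct), p99(candidate_jct)),
    }
```

A `p99_delta` of -0.47 would correspond to the "−47%" entry in the table above. Both lists should come from the same request trace so the comparison is like for like.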
What To Try In 7 Days
Prototype a router that issues prep_recv/remote_send/start_generate to reproduce P/D disaggregation on an existing inference cluster.
Measure whether your workload is prefill-heavy; test a small balance ratio (e.g., 0.2) and compare JCT and time-to-first-token (TTFT) against your current setup.
Enable one-sided GPU communication (NVSHMEM) if available and benchmark KV transfer overlap vs recomputation.
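When experimenting with the balance ratio, it can help to make the prompt split explicit. The sketch below assumes the ratio means the fraction of prompt tokens whose prefill is shifted to the decode engine; the summary does not define it, so treat that reading, and the hypothetical `split_prefill` helper, as assumptions.

```python
def split_prefill(prompt_len: int, balance_ratio: float) -> tuple[int, int]:
    """Return (tokens prefilled on the prefill engine, tokens prefilled on the decode engine).

    Assumption: balance_ratio is the share of prompt tokens moved to the decode engine.
    The decode engine always keeps at least one token so it can produce the first output.
    """
    if prompt_len < 1:
        raise ValueError("prompt_len must be at least 1")
    if not 0.0 <= balance_ratio <= 1.0:
        raise ValueError("balance_ratio must be in [0, 1]")
    decode_side = min(prompt_len, max(1, round(prompt_len * balance_ratio)))
    prefill_side = prompt_len - decode_side
    return prefill_side, decode_side


if __name__ == "__main__":
    # Sweep a few ratios for a 3000-token prompt (the synthetic mean input length).
    for ratio in (0.0, 0.1, 0.2, 0.5):
        print(ratio, split_prefill(3000, ratio))
```

The first element of the result is the token index at which the decode engine would take over prefill, which makes it easy to sweep ratios and record JCT and TTFT for each setting.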
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Gains are workload-dependent: short-chat datasets (ShareGPT) show little or no benefit.
Requires one-sided GPU communication (NVSHMEM) for best overlap; not all infra supports this.
When Not To Use
Workloads with short inputs and outputs where prefill load is low
Environments lacking NVSHMEM or one-sided GPU write support
Failure Modes
Excessive KV transfer time that outlasts compute, causing stalls
Misconfigured prefill-decode balance ratio that increases time-to-first-token (TTFT)
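The first failure mode can be illustrated with a toy pipeline model. It assumes KV is transferred layer by layer, that a layer's transfer starts once that layer's compute and the previous transfer have both finished, and that per-layer times are constant. None of this is taken from the paper; it only shows why transfer time that outlasts compute turns into stalls.

```python
def exposed_transfer_time(num_layers, compute_per_layer, transfer_per_layer):
    """Time spent waiting on KV transfer after all compute has finished (toy model)."""
    compute_done = 0.0   # when the current layer's compute finishes
    transfer_done = 0.0  # when the current layer's KV transfer finishes
    for _ in range(num_layers):
        compute_done += compute_per_layer
        # A transfer needs its layer's KV to exist and the link to be free.
        transfer_done = max(transfer_done, compute_done) + transfer_per_layer
    return transfer_done - compute_done


if __name__ == "__main__":
    # Transfer faster than compute: only the last layer's transfer is exposed.
    print(exposed_transfer_time(32, compute_per_layer=1.0, transfer_per_layer=0.5))
    # Transfer slower than compute: the backlog grows with every layer.
    print(exposed_transfer_time(32, compute_per_layer=1.0, transfer_per_layer=2.0))
```

In this model the exposed time stays at one layer's transfer while transfer is no slower than compute, and grows linearly with depth once it is slower, which is the stall described above.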

