Microserving APIs and unified KV cache to reprogram LLM serving and cut job completion time by up to 47%

December 17, 2024 (7 min read)

Overview

Decision Snapshot: Ready For Pilot

The design is implemented end-to-end on MLC-LLM and evaluated on real and synthetic workloads, showing clear benefits for long-context scenarios; adoption needs one-sided GPU primitives and router engineering.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Hongyi Jin, Ruihang Lai, Charlie F. Ruan, Yingcheng Wang, Todd C. Mowry, Xupeng Miao, Zhihao Jia, Tianqi Chen

Why It Matters For Business

Microserving lets ops teams reprogram serving coordination at the router level without restarting engines, cutting tail latency and compute waste on long-input workloads and enabling live tuning of prefill/decode balance.

Summary TLDR

This paper introduces "LLM microserving": a multi-level serving design that exposes three fine-grained REST APIs and a programmable Python router to split and reconfigure LLM inference steps (prefill vs decode) without restarting engines. A unified KV cache (KV = per-token attention state) lets engines transfer and reuse attention state via one-sided GPU writes (NVSHMEM), overlapping communication with compute. On long-input workloads the system reduces average job completion time (JCT) by ~21% and P99 JCT by up to 47% versus data-parallel baselines. The approach is most useful when inputs are long or when prefix reuse is common; gains are smaller on short-chat workloads.

Problem Statement

Current LLM serving systems expose a coarse request-level API with fixed coordination strategies. That makes it hard to try new disaggregation or cache-migration strategies at runtime: changing strategy often requires engine reconfiguration and service restarts. The paper proposes fine-grained sub-request APIs plus a programmable router to enable dynamic, low-disruption reconfiguration and efficient KV (attention-state) migration across GPUs.

Main Contribution

Design of LLM microserving: three fine-grained REST APIs (prep_recv, remote_send, start_generate) for sub-request actions.

Programmable async Python router that converts request-level calls into custom sub-request workflows, enabling dynamic reconfiguration without restarting engines.
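A minimal sketch of how a router might chain the three calls for prefill-decode disaggregation. The API names come from the list above; the `StubEngine` class, the argument lists, and the choice to leave the final prompt token to the decode engine are assumptions made for illustration and should not be read as the MLC-LLM signatures.

```python
import asyncio


class StubEngine:
    """Hypothetical in-process stand-in for a serving engine; it only records calls."""

    def __init__(self, name):
        self.name = name
        self.kv_len = 0  # number of prompt tokens whose KV entries are present
        self.log = []

    async def prep_recv(self, prompt, end):
        # Reserve KV slots for prompt[:end] and return a handle the sender writes into.
        self.log.append(("prep_recv", end))
        return {"engine": self, "slots": end}

    async def remote_send(self, prompt, kv_handle, begin, end):
        # Prefill prompt[begin:end] and push the resulting KV to the receiver.
        self.log.append(("remote_send", begin, end))
        receiver = kv_handle["engine"]
        receiver.kv_len = max(receiver.kv_len, end)

    async def start_generate(self, prompt, begin, max_tokens):
        # Prefill prompt[begin:] locally, then decode (faked here with placeholder tokens).
        self.log.append(("start_generate", begin))
        assert self.kv_len >= begin, "KV for prompt[:begin] must already be present"
        for i in range(max_tokens):
            yield f"tok{i}"


async def pd_disaggregated(prompt, prefill_engine, decode_engine, max_tokens=3):
    # Assumption: the decode engine prefills the last prompt token itself
    # so that it can produce the first output token.
    end = len(prompt) - 1
    handle = await decode_engine.prep_recv(prompt, end)
    await prefill_engine.remote_send(prompt, handle, 0, end)
    return [tok async for tok in decode_engine.start_generate(prompt, end, max_tokens)]
```

Calling `asyncio.run(pd_disaggregated(list(range(10)), StubEngine("prefill"), StubEngine("decode")))` drives the whole workflow. Because the coordination lives in `pd_disaggregated`, swapping in a different strategy means changing router code only, which is the property the design aims for.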

Key Findings

Balanced prefill-decode disaggregation cuts tail job time on long inputs.

Numbers: P99 JCT reduced up to 47% (synthetic long-input dataset)

Practical Use: If you serve long-input workloads, implement balanced P/D to shift part of prefill to decode engines and reduce tail latency.

Evidence Ref: Figure 11

Prefill-decode disaggregation reduces mean job time on long inputs.

Numbers: Mean JCT reduced up to 21% (synthetic long-input dataset)

Practical Use: Expect up to about 21% lower mean JCT from separating prefill and decode on workloads where prefill dominates compute; short-input workloads will see less.

Evidence Ref: Figure 11

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
P99 job completion time | up to 47% reduction | data-parallel (DP) | −47% (synthetic long-input) | synthetic dataset (mean input 3000, output 100) | Balanced 1P1D reduces P99 JCT by up to 47% | Figure 11
Mean job completion time | up to 21% reduction | data-parallel (DP) | −21% (synthetic long-input) | synthetic dataset (mean input 3000, output 100) | Prefill-decode disaggregation lowers mean JCT vs DP | Figure 11
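
To reproduce deltas like these on your own traces, a small helper along the following lines may be enough. It is a sketch: the nearest-rank percentile and the function names are choices made here, not taken from the paper, and the paper may compute P99 differently.

```python
import statistics


def p99(values):
    """Nearest-rank 99th percentile (one of several common definitions)."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def relative_delta(baseline, candidate):
    """Signed fractional change; a negative value means the candidate is faster."""
    return (candidate - baseline) / baseline


def compare_jct(baseline_jcts, candidate_jcts):
    """Compare per-request job completion times of two serving strategies."""
    return {
        "mean_delta": relative_delta(
            statistics.mean(baseline_jcts), statistics.mean(candidate_jcts)
        ),
        "p99_delta": relative_delta(p99(baseline_jcts), p99(candidate_jcts)),
    }
```

A `p99_delta` of -0.47 would correspond to the 47% reduction reported in the table. Both runs should replay the same request trace, otherwise the comparison mixes workload differences with strategy differences.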

What To Try In 7 Days

Prototype a router that issues prep_recv/remote_send/start_generate to reproduce P/D disaggregation on an existing inference cluster.

Measure which parts of your workload are prefill-heavy; test a small balance ratio (e.g., 0.2) and compare JCT and TTFT against your current setup.

Enable one-sided GPU communication (NVSHMEM) if available and benchmark KV transfer overlap vs recomputation.
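
For the balance-ratio experiment, the split point can be computed as below. This is a sketch under one assumption that the summary does not spell out: `balance_ratio` is taken to mean the fraction of prompt tokens whose prefill is moved onto the decode engine. The function name and return shape are hypothetical.

```python
def split_prefill(prompt_len, balance_ratio):
    """Return ((p_begin, p_end), (d_begin, d_end)) token ranges.

    The first range is prefilled on the prefill engine, the second on the
    decode engine. Assumes balance_ratio is the decode engine's share.
    """
    if prompt_len < 1:
        raise ValueError("prompt_len must be at least 1")
    if not 0.0 <= balance_ratio <= 1.0:
        raise ValueError("balance_ratio must be in [0, 1]")
    decode_share = round(prompt_len * balance_ratio)
    # The decode engine always handles at least the final prompt token,
    # so ratio 0 degenerates to plain prefill-decode disaggregation.
    decode_share = max(1, decode_share)
    split = prompt_len - decode_share
    return (0, split), (split, prompt_len)


print(split_prefill(3000, 0.2))  # ((0, 2400), (2400, 3000))
```

Sweeping the ratio over a few values (for example 0, 0.1, 0.2, 0.3) while recording JCT and TTFT should show where the decode engine starts to become the bottleneck.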

Optimization Features

Token Efficiency: KV reuse reduces repeated attention computation
Infra Optimization: One-sided GPU communication via NVSHMEM; eager per-layer KV sends to hide latency
System Optimization: Programmable router for dynamic reconfiguration; unified KV cache API for diverse transfer/reuse patterns
Inference Optimization: Prefill-decode disaggregation; balanced prefill-decode (partial prefill on decode engine); KV migration to avoid recomputation; overlap of per-layer KV transfer with attention compute

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Gains are workload-dependent: short-chat datasets (ShareGPT) show little or no benefit.

Requires one-sided GPU communication (NVSHMEM) for best overlap; not all infra supports this.

When Not To Use

Workloads with short inputs and outputs where prefill load is low

Environments lacking NVSHMEM or one-sided GPU write support

Failure Modes

Excessive KV transfer time that outlasts compute, causing stalls

Misconfigured P/D balance that increases time-to-first-token (TTFT)
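
The first failure mode can be checked with a back-of-envelope estimate before running anything. The sketch below assumes per-layer overlap, where each layer's KV send must finish within that layer's compute time; the helper names, the link bandwidth, and the per-layer compute times are all hypothetical inputs you would replace with measurements.

```python
def kv_bytes_per_token_per_layer(num_kv_heads, head_dim, bytes_per_elem=2):
    # Factor 2 covers keys and values; bytes_per_elem=2 assumes fp16/bf16.
    return 2 * num_kv_heads * head_dim * bytes_per_elem


def layer_transfer_seconds(num_tokens, bytes_per_token_layer, link_bytes_per_sec):
    # Time to move one layer's KV for num_tokens over the link, ignoring latency.
    return num_tokens * bytes_per_token_layer / link_bytes_per_sec


def stalls(transfer_s, compute_s):
    """True if a per-layer send cannot be hidden behind that layer's compute."""
    return transfer_s > compute_s


# Assumed shape for an 8B-class model with grouped-query attention:
# 8 KV heads, head dimension 128, 16-bit KV entries.
per_token = kv_bytes_per_token_per_layer(num_kv_heads=8, head_dim=128)
t = layer_transfer_seconds(3000, per_token, link_bytes_per_sec=50e9)
```

With these assumed numbers, one layer holds 4096 bytes of KV per token, so a 3000-token prompt moves about 12.3 MB per layer, which takes roughly 0.25 ms on a 50 GB/s link. If the measured per-layer prefill time is shorter than that, the transfer will not be fully hidden.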

Core Entities

Models

Llama3.1 8B

Metrics

Time To First Token (TTFT)

Time Per Output Token (TPOT)

Job Completion Time (JCT)

P99 JCT
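
These metrics can be derived from per-request timestamps. The definitions below follow one common convention and are a sketch; the `RequestTrace` type is hypothetical, and the paper may measure TPOT slightly differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    arrival: float            # time the request reached the router, in seconds
    token_times: List[float]  # emission time of each output token, in seconds


def ttft(trace):
    """Time To First Token: arrival until the first output token."""
    return trace.token_times[0] - trace.arrival


def jct(trace):
    """Job Completion Time: arrival until the last output token."""
    return trace.token_times[-1] - trace.arrival


def tpot(trace):
    """Time Per Output Token: mean gap between consecutive output tokens."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0  # a single token has no inter-token gap
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)
```

Under this convention JCT equals TTFT plus TPOT times the number of tokens after the first, which makes it easy to see whether a change helped the prefill side (TTFT) or the decode side (TPOT).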

Datasets

ShareGPT

synthetic (input mean 3000, output mean 100)