Microserving APIs and unified KV cache to reprogram LLM serving and cut job completion time by up to 47%

Overview

Decision Snapshot: Ready For Pilot

The design is implemented end-to-end on MLC-LLM and evaluated on real and synthetic workloads, showing clear benefits for long-context scenarios; adoption needs one-sided GPU primitives and router engineering.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Hongyi Jin, Ruihang Lai, Charlie F. Ruan, Yingcheng Wang, Todd C. Mowry, Xupeng Miao, Zhihao Jia, Tianqi Chen


Why It Matters For Business

Microserving lets ops teams reprogram serving coordination at the router level without restarting engines, cutting tail latency and compute waste on long-input workloads and enabling live tuning of prefill/decode balance.

Who Should Care

CTO, Engineering Lead, ML Engineer, Product Manager, Data Scientist

Summary TLDR

This paper introduces "LLM microserving": a multi-level serving design that exposes three fine-grained REST APIs and a programmable Python router to split and reconfigure LLM inference steps (prefill vs decode) without restarting engines. A unified KV cache (KV = per-token attention state) lets engines transfer and reuse attention state via one-sided GPU writes (NVSHMEM), overlapping communication with compute. On long-input workloads the system reduces average job completion time (JCT) by ~21% and P99 JCT by up to 47% versus data-parallel baselines. The approach is most useful when inputs are long or when prefix reuse is common; gains are smaller on short-chat workloads.

Problem Statement

Current LLM serving systems expose a coarse request-level API with fixed coordination strategies. That makes it hard to try new disaggregation or cache-migration strategies at runtime: changing strategy often requires engine reconfiguration and service restarts. The paper proposes fine-grained sub-request APIs plus a programmable router to enable dynamic, low-disruption reconfiguration and efficient KV (attention-state) migration across GPUs.

Main Contribution

Design of LLM microserving: three fine-grained REST APIs (prep_recv, remote_send, start_generate) for sub-request actions.

Programmable async Python router that converts request-level calls into custom sub-request workflows, enabling dynamic reconfiguration without restarting engines.
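To make the router idea concrete, the following is a minimal sketch of a prefill-decode workflow built from the three API names above. Only the names prep_recv, remote_send and start_generate come from the paper summary; the MockEngine class, every argument name, the returned address handle, and the choice to leave the final prompt token to the decode engine are assumptions made for illustration, not the MLC-LLM interface.

```python
import asyncio


class MockEngine:
    """In-memory stand-in for a serving engine (hypothetical, not MLC-LLM)."""

    def __init__(self, name):
        self.name = name
        self.kv = {}  # seq_id -> number of prompt tokens whose KV is resident

    async def prep_recv(self, seq_id, prompt, end):
        # Reserve KV space for prompt[:end] and return a made-up address handle.
        self.kv[seq_id] = 0
        return {"engine": self, "seq_id": seq_id, "length": end}

    async def remote_send(self, seq_id, prompt, kv_addr, begin, end):
        # Pretend to prefill prompt[begin:end] and write its KV to the receiver.
        await asyncio.sleep(0)  # placeholder for compute overlapped with transfer
        kv_addr["engine"].kv[kv_addr["seq_id"]] = end

    async def start_generate(self, seq_id, prompt, begin, max_tokens):
        # Prefill the remaining prompt[begin:], then emit placeholder tokens.
        assert self.kv.get(seq_id, 0) >= begin, "KV for prompt[:begin] is missing"
        self.kv[seq_id] = len(prompt)
        for i in range(max_tokens):
            yield f"tok{i}"


async def route_pd_disaggregated(prefill, decode, seq_id, prompt, max_tokens):
    """Router workflow: the decode engine receives KV computed by the prefill engine."""
    split = len(prompt) - 1  # assumption: decode engine processes the last token itself
    kv_addr = await decode.prep_recv(seq_id, prompt, end=split)
    await prefill.remote_send(seq_id, prompt, kv_addr, begin=0, end=split)
    return [
        tok
        async for tok in decode.start_generate(
            seq_id, prompt, begin=split, max_tokens=max_tokens
        )
    ]


if __name__ == "__main__":
    p, d = MockEngine("prefill"), MockEngine("decode")
    print(asyncio.run(route_pd_disaggregated(p, d, "req-1", list(range(8)), 4)))
```

Because the workflow lives entirely in the router function, swapping in a different strategy means writing another coroutine over the same three calls; the engines themselves are untouched, which is the property the paper highlights.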

Key Findings

Balanced prefill-decode disaggregation cuts tail job time on long inputs.

Numbers: P99 JCT reduced up to 47% (synthetic long-input dataset)

Practical Use: If you serve long-input workloads, implement balanced P/D to shift part of prefill to decode engines and reduce tail latency.

Evidence Ref: Figure 11

Prefill-decode disaggregation reduces mean job time on long inputs.

Numbers: Mean JCT reduced up to 21% (synthetic long-input dataset)

Practical Use: Expect a mean JCT reduction of up to about 21% from separating prefill and decode on workloads where prefill dominates compute.

Evidence Ref: Figure 11
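The balanced variant in the first finding amounts to choosing where in the prompt the prefill engine stops and the decode engine takes over. A minimal sketch of that split follows. It assumes the balance ratio means the fraction of prompt tokens prefilled on the decode engine, and that the decode engine always keeps at least one token; both are assumptions of this sketch, and balanced_split is a hypothetical helper name.

```python
def balanced_split(prompt_len: int, balance_ratio: float) -> tuple[range, range]:
    """Return (token range for the prefill engine, token range for the decode engine)."""
    if prompt_len < 1:
        raise ValueError("prompt_len must be at least 1")
    if not 0.0 <= balance_ratio <= 1.0:
        raise ValueError("balance_ratio must be within [0, 1]")
    # Assumption: the decode engine always prefills at least the final token.
    decode_share = max(1, round(prompt_len * balance_ratio))
    split = prompt_len - decode_share
    return range(0, split), range(split, prompt_len)


if __name__ == "__main__":
    prefill_part, decode_part = balanced_split(3000, 0.2)
    print(len(prefill_part), len(decode_part))
```

With a 3000-token prompt and a ratio of 0.2, the prefill engine handles the first 2400 tokens and the decode engine the remaining 600; a ratio of 0 degenerates to plain prefill-decode disaggregation.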

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
P99 job completion time	up to 47% reduction	data-parallel (DP)	−47% (synthetic long-input)	synthetic dataset (mean input 3000, output 100)	Balanced 1P1D reduces P99 JCT by up to 47%	Figure 11
Mean job completion time	up to 21% reduction	data-parallel (DP)	−21% (synthetic long-input)	synthetic dataset (mean input 3000, output 100)	Prefill-decode disaggregation lowers mean JCT vs DP	Figure 11

What To Try In 7 Days

Prototype a router that issues prep_recv/remote_send/start_generate to reproduce P/D disaggregation on an existing inference cluster.

Measure how prefill-heavy your workload is; test a small balance ratio (e.g., 0.2) and compare JCT and TTFT against your current setup.

Enable one-sided GPU communication (NVSHMEM) if available and benchmark KV transfer overlap vs recomputation.
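For the comparison step, a small helper that summarizes per-request timings can keep the pilot honest. The sketch below computes mean and P99 over a list of JCT (or TTFT) samples and the relative change against a baseline run. The function names are hypothetical, the percentile uses the nearest-rank definition (other definitions give slightly different values), and any numbers fed to it are your own measurements, not results from the paper.

```python
import statistics


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of timings."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # integer form of ceil(0.99 * n)
    return ordered[rank - 1]


def relative_change(baseline, candidate):
    """Negative values mean the candidate is faster than the baseline."""
    return (candidate - baseline) / baseline


def compare_runs(baseline_samples, candidate_samples):
    """Relative change in mean and P99 between two runs of the same workload."""
    return {
        "mean": relative_change(
            statistics.fmean(baseline_samples), statistics.fmean(candidate_samples)
        ),
        "p99": relative_change(p99(baseline_samples), p99(candidate_samples)),
    }
```

A result such as {"mean": -0.2, "p99": -0.4} would read as a 20% lower mean and a 40% lower P99 for the candidate configuration.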

Optimization Features

Token Efficiency

KV reuse reduces repeated attention computation

Infra Optimization

One-sided GPU communication via NVSHMEM

Eager per-layer KV sends to hide latency

System Optimization

Programmable router for dynamic reconfiguration

Unified KV cache API for diverse transfer/reuse patterns

Inference Optimization

Prefill-decode disaggregation

Balanced prefill-decode (partial prefill on decode engine)

KV migration to avoid recomputation

Overlap per-layer KV transfer with attention compute

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Gains are workload-dependent: short-chat datasets (ShareGPT) show little or no benefit.

Requires one-sided GPU communication (NVSHMEM) for best overlap; not all infra supports this.

When Not To Use

Workloads with short inputs and outputs where prefill load is low

Environments lacking NVSHMEM or one-sided GPU write support

Failure Modes

Excessive KV transfer time that outlasts compute, causing stalls

Misconfigured PD balance that increases time-to-first-token (TTFT)
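The first failure mode can be checked with back-of-envelope arithmetic before any benchmarking: per-layer KV transfer only hides behind compute if it finishes no later than that layer's compute. The sketch below estimates the per-layer KV payload and compares transfer time with compute time. The helper names are hypothetical, and the link bandwidth and per-layer compute time are values you would measure yourself. The example shape (8 KV heads, head dimension 128, 2-byte values) is taken from the public Llama 3.1 8B configuration rather than from the paper summary.

```python
def kv_bytes_per_token_per_layer(num_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of K plus V stored for one token in one layer."""
    return 2 * num_kv_heads * head_dim * dtype_bytes


def layer_transfer_seconds(tokens: int, bytes_per_token: int, link_bytes_per_s: float) -> float:
    """Time to ship one layer's KV for `tokens` tokens over the link."""
    return tokens * bytes_per_token / link_bytes_per_s


def transfer_stalls(tokens, bytes_per_token, link_bytes_per_s, layer_compute_s) -> bool:
    """True if the per-layer send takes longer than the per-layer compute it overlaps."""
    return layer_transfer_seconds(tokens, bytes_per_token, link_bytes_per_s) > layer_compute_s


if __name__ == "__main__":
    per_token = kv_bytes_per_token_per_layer(num_kv_heads=8, head_dim=128, dtype_bytes=2)
    # Placeholder link speed and compute time; substitute measured values.
    print(per_token, transfer_stalls(3000, per_token, 100e9, 1e-3))
```

For the example shape, each token contributes 4096 bytes per layer, so a 3000-token prompt moves about 12.3 MB per layer. Whether that stalls depends entirely on the measured link bandwidth and layer compute time.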

Core Entities

Models

Llama3.1 8B

Metrics

Time To First Token (TTFT)

Time Per Output Token (TPOT)

Job Completion Time (JCT)

P99 JCT

Datasets

ShareGPT

Synthetic (input mean 3000, output mean 100)

