Microserving APIs and unified KV cache to reprogram LLM serving and cut job completion time by up to 47%

December 17, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Hongyi Jin, Ruihang Lai, Charlie F. Ruan, Yingcheng Wang, Todd C. Mowry, Xupeng Miao, Zhihao Jia, Tianqi Chen

Why It Matters For Business

Microserving lets ops teams reprogram serving coordination at the router level without restarting engines, cutting tail latency and compute waste on long-input workloads and enabling live tuning of prefill/decode balance.

Summary TLDR

This paper introduces "LLM microserving": a multi-level serving design that exposes three fine-grained REST APIs and a programmable Python router to split and reconfigure LLM inference steps (prefill vs decode) without restarting engines. A unified KV cache (KV = per-token attention state) lets engines transfer and reuse attention state via one-sided GPU writes (NVSHMEM), overlapping communication with compute. On long-input workloads the system reduces mean job completion time (JCT) by up to 21% and P99 JCT by up to 47% versus data-parallel baselines. The approach is most useful when inputs are long or when prefix reuse is common; gains are smaller on short-chat workloads.

Problem Statement

Current LLM serving systems expose a coarse request-level API with fixed coordination strategies. That makes it hard to try new disaggregation or cache-migration strategies at runtime: changing strategy often requires engine reconfiguration and service restarts. The paper proposes fine-grained sub-request APIs plus a programmable router to enable dynamic, low-disruption reconfiguration and efficient KV (attention-state) migration across GPUs.

Main Contribution

Design of LLM microserving: three fine-grained REST APIs (prep_recv, remote_send, start_generate) for sub-request actions.

Programmable async Python router that converts request-level calls into custom sub-request workflows, enabling dynamic reconfiguration without restarting engines.

A unified KV cache interface that supports prefix-matching, KV transfer, reuse, and overlapping one-sided GPU communication (NVSHMEM) to reduce recomputation.
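The prefix-matching part of the KV cache interface can be illustrated with a small sketch. This is not the paper's implementation: a plain token trie stands in for the radix tree, and the class and method names (`PrefixCache`, `insert`, `match_length`) are hypothetical. It only shows the idea that tokens already covered by cached KV do not need to be prefilled again.

```python
from typing import Dict, List


class PrefixCache:
    """Toy token trie recording which prompt prefixes have cached KV (hypothetical)."""

    def __init__(self) -> None:
        self._root: Dict[int, dict] = {}

    def insert(self, tokens: List[int]) -> None:
        # Walk down the trie, creating one node per token.
        node = self._root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_length(self, tokens: List[int]) -> int:
        """Return how many leading tokens of `tokens` are already cached."""
        node = self._root
        matched = 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched


cache = PrefixCache()
cache.insert([1, 2, 3, 4])           # KV for an earlier prompt is cached
prompt = [1, 2, 3, 9, 9]             # new prompt shares a 3-token prefix
matched = cache.match_length(prompt)
to_compute = prompt[matched:]        # only these tokens need fresh prefill
```

In a real engine the matched length would decide which KV range to reuse or transfer and which range to compute; the trie here carries no KV data at all.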

Key Findings

Balanced prefill-decode disaggregation cuts tail job time on long inputs.

Numbers: P99 JCT reduced up to 47% (synthetic long-input dataset)

Prefill-decode disaggregation reduces mean job time on long inputs.

Numbers: Mean JCT reduced up to 21% (synthetic long-input dataset)

KV migration avoids recomputing cached context and speeds up prefill.

Numbers: Prefill time 1.7× faster when migrating KV for 1000-token input

KV transfer can be overlapped with compute; overlap grows with longer context.

Numbers: Transfer overlap ratio 15.8%→55.4% as input length 1000→5000 tokens

Results

P99 job completion time

Value: up to 47% reduction

Baseline: data-parallel (DP)

Mean job completion time

Value: up to 21% reduction

Baseline: data-parallel (DP)

Prefill time with KV migration

Value: 1.7× faster (prefill time reduced by about 41%)

Baseline: recompute full KV without migration

Per-layer compute vs transfer

Value: T_layer 1.247–1.564 ms; T_KV_transfer 0.197–0.867 ms
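The per-layer figures above suggest why overlap works: in the reported ranges the KV transfer for a layer is shorter than that layer's compute. A minimal timing model, under stated assumptions, makes this concrete. The function name `pipelined_time`, the 32-layer count, and the pairing of the smallest compute time with the largest transfer time are illustrative choices, not measurements from the paper.

```python
def pipelined_time(num_layers: int, t_layer: float, t_transfer: float) -> float:
    """Finish time (ms) when each layer's KV is sent right after it is computed.

    Simplified model: one compute stream, one transfer link, and the send for
    a layer runs while the next layer computes.
    """
    compute_done = 0.0
    transfer_done = 0.0
    for _ in range(num_layers):
        compute_done += t_layer
        # A send starts once its layer is computed and the link is free.
        transfer_done = max(transfer_done, compute_done) + t_transfer
    return transfer_done


# Pessimistic pairing from the reported ranges (assumed, for illustration).
NUM_LAYERS = 32
T_LAYER_MS = 1.247
T_TRANSFER_MS = 0.867

overlapped = pipelined_time(NUM_LAYERS, T_LAYER_MS, T_TRANSFER_MS)
serial = NUM_LAYERS * (T_LAYER_MS + T_TRANSFER_MS)  # no overlap at all
print(f"overlapped: {overlapped:.3f} ms, serial: {serial:.3f} ms")
```

In this model, as long as the transfer time stays below the per-layer compute time, only the last layer's send is exposed. Once transfer exceeds compute, the link becomes the bottleneck and sends queue up behind each other.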

What To Try In 7 Days

Prototype a router that issues prep_recv/remote_send/start_generate to reproduce P/D disaggregation on an existing inference cluster.

Measure where your workload is prefill-heavy; test a small balance ratio (e.g., 0.2) and compare JCT and TTFT.

Enable one-sided GPU communication (NVSHMEM) if available and benchmark KV transfer overlap vs recomputation.
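As a starting point for the router prototype, the sketch below wires the three sub-request calls into a balanced prefill-decode workflow. It is a hedged illustration only: `StubEngine` is an in-memory stand-in rather than a REST client, the method signatures and the `dest` parameter are assumptions, and `balance_ratio` is assumed to mean the share of prompt tokens the decode engine prefills itself. The prompt is assumed to be non-empty.

```python
import asyncio
from typing import AsyncIterator, List


class StubEngine:
    """In-memory stand-in for an engine (hypothetical; real engines sit behind REST)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.kv_len = 0  # number of prompt tokens whose KV this engine holds

    async def prep_recv(self, prompt: List[int], end: int) -> dict:
        # Reserve KV slots for prompt[:end] and report where to write them.
        return {"kv_addr": f"{self.name}:slot0", "prefix_matched": 0}

    async def remote_send(self, prompt: List[int], kv_addr: str,
                          begin: int, end: int, dest: "StubEngine") -> None:
        # Prefill prompt[begin:end] and push the resulting KV to the receiver.
        dest.kv_len = max(dest.kv_len, end)

    async def start_generate(self, prompt: List[int], begin: int,
                             max_tokens: int) -> AsyncIterator[int]:
        # Prefill prompt[begin:] locally, then decode (dummy token ids here).
        assert self.kv_len >= begin, "KV for prompt[:begin] has not arrived"
        for i in range(max_tokens):
            yield i


async def balanced_pd(prompt: List[int], prefill: StubEngine, decode: StubEngine,
                      balance_ratio: float = 0.2, max_tokens: int = 4) -> List[int]:
    # Tokens left for the decode engine to prefill (assumed meaning of the ratio).
    decode_share = round(len(prompt) * balance_ratio)
    # Keep at least one prompt token on the decode side to start generation from.
    split = min(len(prompt) - decode_share, len(prompt) - 1)
    recv = await decode.prep_recv(prompt, end=split)
    await prefill.remote_send(prompt, recv["kv_addr"],
                              begin=recv["prefix_matched"], end=split, dest=decode)
    return [tok async for tok in
            decode.start_generate(prompt, begin=split, max_tokens=max_tokens)]


prefill_engine = StubEngine("prefill")
decode_engine = StubEngine("decode")
tokens = asyncio.run(balanced_pd(list(range(10)), prefill_engine, decode_engine))
```

Because the strategy lives entirely in `balanced_pd`, changing the ratio or switching to plain disaggregation (ratio 0) is a router-side change; the stub engines are untouched, which mirrors the paper's point about reconfiguring without restarting engines.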

Optimization Features

Token Efficiency

  • KV reuse reduces repeated attention computation

Infra Optimization

  • One-sided GPU communication via NVSHMEM
  • Eager per-layer KV sends to hide latency

System Optimization

  • Programmable router for dynamic reconfiguration
  • Unified KV cache API for diverse transfer/reuse patterns

Inference Optimization

  • Prefill-decode disaggregation
  • Balanced prefill-decode (partial prefill on decode engine)
  • KV migration to avoid recomputation
  • Overlap per-layer KV transfer with attention compute

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Gains are workload-dependent: short-chat datasets (ShareGPT) show little or no benefit.
  • Requires one-sided GPU communication (NVSHMEM) for best overlap; not all infra supports this.
  • KV transfer can become a bottleneck if transfer time exceeds per-layer compute, reducing overlap benefits.

When Not To Use

  • Workloads with short inputs and outputs where prefill load is low
  • Environments lacking NVSHMEM or one-sided GPU write support
  • Small single-GPU deployments where disaggregation adds overhead

Failure Modes

  • Excessive KV transfer time that outlasts compute, causing stalls
  • Misconfigured PD balance that increases time-to-first-token (TTFT)
  • Router bugs or stale radix-tree mapping causing cache misses and recomputation

Core Entities

Models

  • Llama3.1 8B

Metrics

  • Time To First Token (TTFT)
  • Time Per Output Token (TPOT)
  • Job Completion Time (JCT)
  • P99 JCT

Datasets

  • ShareGPT
  • synthetic (mean input length 3000 tokens, mean output length 100 tokens)