MemServe: a MemPool that adds context caching to disaggregated LLM serving, cutting job times and first-token delays

June 25, 2024 | 8 min

Overview

Decision Snapshot: Ready For Pilot

The system is implemented end-to-end and evaluated on an 8×H800 server with Llama2-13B and real workloads. Results show meaningful latency gains, but experiments run on a single machine and use NCCL point-to-point as a prototype network stack.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan

Links

Abstract / PDF

Why It Matters For Business

MemServe lets operators combine caching and disaggregated inference to cut end-to-end latency and tail times for many chat and long-context workloads, lowering hardware cost per request and improving user-perceived responsiveness.

Who Should Care

Summary TLDR

MemServe introduces MemPool, an elastic distributed memory layer that manages GPU HBM and CPU DRAM, plus a global scheduler, so that context caching and disaggregated inference can be used together. MemPool exposes simple APIs (alloc_mem, insert, match, transfer, transfer_with_insert) so inference engines can store, locate, and move KV cache (intermediate attention state). Practical optimizations include block aggregation (huge-page style) to cut network calls and an operator-level cost model for routing vs. transfer decisions. On an 8×H800 server with Llama2-13B, disaggregation with MemPool reduces job completion time (JCT) by up to 42% (P99, ShareGPT), and adding context caching improves JCT by up to a further 29% and time-to-first-token (TTFT) by up to 58%.
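To make the allocate/insert/match flow concrete, here is a minimal Python sketch of a single-node index over fixed-size KV blocks. It is an illustration under stated assumptions, not MemPool's implementation: the class name `ToyMemPool`, the method signatures, and the `BLOCK_TOKENS` block size are all hypothetical, and no real GPU memory or network transfer is involved.

```python
from typing import Dict, List, Tuple

BLOCK_TOKENS = 4  # hypothetical tokens per KV block, kept tiny for the demo


class ToyMemPool:
    """Toy stand-in for a MemPool-like memory + index API (assumed shapes)."""

    def __init__(self, num_blocks: int) -> None:
        self.free: List[int] = list(range(num_blocks))  # ids of free blocks
        # Maps a block-aligned token prefix to the block holding its last chunk.
        self.index: Dict[Tuple[int, ...], int] = {}

    def alloc_mem(self, n: int) -> List[int]:
        """Reserve n blocks; raise if the pool cannot satisfy the request."""
        if n > len(self.free):
            raise MemoryError("pool exhausted")
        taken, self.free = self.free[:n], self.free[n:]
        return taken

    def insert(self, tokens: List[int], blocks: List[int]) -> None:
        """Index each full block of the prompt by the prefix that ends there."""
        full = len(tokens) // BLOCK_TOKENS  # partial trailing block is ignored
        for i, block in enumerate(blocks[:full]):
            self.index[tuple(tokens[: (i + 1) * BLOCK_TOKENS])] = block

    def match(self, tokens: List[int]) -> List[int]:
        """Return the blocks covering the longest cached prefix of tokens."""
        hit: List[int] = []
        for end in range(BLOCK_TOKENS, len(tokens) + 1, BLOCK_TOKENS):
            block = self.index.get(tuple(tokens[:end]))
            if block is None:
                break
            hit.append(block)
        return hit


pool = ToyMemPool(num_blocks=8)
prompt_a = [1, 2, 3, 4, 5, 6, 7, 8]
pool.insert(prompt_a, pool.alloc_mem(2))
prompt_b = [1, 2, 3, 4, 9, 9, 9, 9]  # shares only the first block with prompt_a
reused = pool.match(prompt_b)         # blocks whose KV need not be recomputed
```

A second request that shares a prefix with an earlier one gets back the blocks it can reuse and only needs to prefill the remainder. A dictionary keyed by whole prefixes is used here for brevity; the source mentions radix tree indexing, which stores shared prefixes more compactly.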

Problem Statement

Existing LLM serving systems treat the KV cache as request-local, so context caching (reusing KV across requests) and disaggregated inference (splitting prefill/decode across machines) have conflicting or missing mechanisms for managing and moving KV cache. That prevents combining inter-request and intra-request optimizations and leaves scheduling blind to cross-instance cache locality.

Main Contribution

MemPool: a distributed elastic memory pool that manages GPU HBM and CPU DRAM and exposes memory, index, and transfer APIs for KV cache.

MemServe: a serving system that uses MemPool to combine context caching and disaggregated inference and adds a global scheduler with prompt-tree locality-aware routing.
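The locality-aware routing idea can be sketched as follows: the scheduler keeps one prompt tree per instance and sends a request to the instance with the longest cached prefix. This is a simplified Python illustration, assuming a plain token-level trie instead of a radix tree and a hypothetical load-based tie-break; the names `PromptTree` and `route` are not from the source.

```python
from typing import Dict, List


class PromptTree:
    """Token-level trie recording which prompt prefixes an instance has cached."""

    def __init__(self) -> None:
        self.children: Dict[int, "PromptTree"] = {}

    def insert(self, tokens: List[int]) -> None:
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, PromptTree())

    def match_len(self, tokens: List[int]) -> int:
        """Length of the longest cached prefix of tokens."""
        node, depth = self, 0
        for tok in tokens:
            nxt = node.children.get(tok)
            if nxt is None:
                break
            node, depth = nxt, depth + 1
        return depth


def route(tokens: List[int], trees: Dict[str, PromptTree], load: Dict[str, int]) -> str:
    """Pick the instance with the longest prefix hit; ties go to the lower load."""
    return max(trees, key=lambda name: (trees[name].match_len(tokens), -load[name]))


trees = {"p0": PromptTree(), "p1": PromptTree()}
trees["p0"].insert([1, 2, 3])
trees["p1"].insert([1, 2, 3, 4, 5])
load = {"p0": 0, "p1": 3}                      # hypothetical queue depths
chosen = route([1, 2, 3, 4, 9], trees, load)   # p1 has the longer cached prefix
```

A production scheduler would weigh cache hits against load more carefully than a strict tie-break does; the sketch only shows how a global view of per-instance prefixes turns into a routing decision.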

Key Findings

Disaggregated inference plus MemPool reduces job completion time compared to colocated baseline.

Numbers: JCT improved up to 42% (P99) on ShareGPT

Practical Use: If you run many chat-like requests, split prefill/decode and use MemPool APIs; on the tested ShareGPT workload this cut JCT by 30% on average and by up to 42% at P99.

Evidence Ref: Abstract; §8.3 End-to-End; ShareGPT results

Adding context caching on top of disaggregation provides further speedups.

Numbers: Additional JCT improvement up to 29% and TTFT improvement up to 58% on ShareGPT

Practical Use: Enable context caching (insert/match and transfer_with_insert) to further lower latency and time-to-first-token for workloads with shared prefixes.

Evidence Ref: §8.3 End-to-End; ShareGPT results

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
JCT (average, ShareGPT) | 30% improvement vs PD-colocated (disaggregation, 1P2D) | PD-colocated (vLLM) | −30% | ShareGPT | §8.3 End-to-End; ShareGPT paragraph | Figure 8; §8.3
JCT (P99, ShareGPT) | 42% improvement vs PD-colocated (disaggregation) | PD-colocated (vLLM) | −42% | ShareGPT | Abstract; §8.3 | Figure 8; §8.3

What To Try In 7 Days

Prototype MemPool-style index for KV cache: track prefix-to-KV mappings and measure TTFT gains on your chat logs.

Run a single-server experiment: split prefill/decode on an extra instance and use transfer_with_insert to test reduced JCT.

Apply block aggregation for KV transfers: combine many small blocks into large transfers and measure network API call reduction and throughput.
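For the block-aggregation experiment, one simple way to see the effect on call counts is to coalesce contiguous block ids into ranges before issuing transfers. The Python sketch below is a generic coalescing technique swapped in for illustration; it is not claimed to be how MemServe aggregates blocks, and the function name `coalesce` is hypothetical.

```python
from typing import List, Tuple


def coalesce(block_ids: List[int]) -> List[Tuple[int, int]]:
    """Merge block ids into (start, length) runs of physically contiguous blocks."""
    runs: List[Tuple[int, int]] = []
    for b in sorted(set(block_ids)):
        if runs and runs[-1][0] + runs[-1][1] == b:
            # b directly follows the previous run, so extend that run by one.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((b, 1))
    return runs


blocks = [0, 1, 2, 3, 8, 9, 20]
runs = coalesce(blocks)
calls_before = len(blocks)  # one transfer call per block
calls_after = len(runs)     # one transfer call per contiguous run
```

Here seven per-block calls collapse into three. The more fragmented the allocation, the less coalescing helps, which is why allocating in larger units matters. Note that sorting discards the logical order of blocks within a sequence, so a real transfer path would need to carry that mapping separately.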

Agent Features

Memory: elastic memory pool managing GPU HBM and CPU DRAM; block aggregation (huge-page style) to reduce fragmentation
Planning: global scheduler with prompt-tree-based locality-aware routing
Tool Use: MemPool APIs (alloc_mem, insert, match, transfer, transfer_with_insert); NCCL send/recv for GPU transfers (prototype)
Frameworks: vLLM (adapted); NCCL; radix tree indexing (from SGLang)
Architectures: prefill-only / decode-only / PD-colocated instances
Collaboration: global prompt trees coordinate cache locality across instances

Optimization Features

Token Efficiency: reuse of cached KV to avoid recomputation for shared prefixes
Infra Optimization: memory pool spanning HBM and DRAM with swap_in/swap_out; use of NVLink on DGX H800 for fast HBM transfers (prototype)
System Optimization: block aggregation to reduce small network transfers; transfer_with_insert to avoid extra network round-trips; operator-level cost model for routing vs. transfer decisions
Inference Optimization: disaggregated inference (prefill/decode split); context caching across requests (token-based radix index); sequence parallel support via transfer API
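The routing vs. transfer trade-off can be illustrated with a deliberately simple cost comparison: move cached KV to the target instance only when that is estimated to be cheaper than recomputing it. This Python sketch is an assumption-laden stand-in for the operator-level cost model mentioned above; it treats prefill cost as linear in tokens (a simplification), and every parameter name and value is hypothetical and illustrative, not a measurement from the source.

```python
from dataclasses import dataclass


@dataclass
class CostParams:
    prefill_s_per_token: float  # assumed seconds to recompute KV for one token
    kv_bytes_per_token: int     # assumed KV cache size per token
    link_bytes_per_s: float     # assumed effective transfer bandwidth
    per_call_overhead_s: float  # assumed fixed cost per network call


def transfer_time(tokens: int, calls: int, p: CostParams) -> float:
    """Fixed per-call overhead plus bytes moved divided by bandwidth."""
    return calls * p.per_call_overhead_s + tokens * p.kv_bytes_per_token / p.link_bytes_per_s


def recompute_time(tokens: int, p: CostParams) -> float:
    """Linear prefill estimate; real prefill cost also grows with context length."""
    return tokens * p.prefill_s_per_token


def should_transfer(tokens: int, calls: int, p: CostParams) -> bool:
    """True when moving cached KV is estimated to beat recomputing it."""
    return transfer_time(tokens, calls, p) < recompute_time(tokens, p)


fast = CostParams(1e-4, 800_000, 1e10, 1e-4)  # illustrative fast interconnect
slow = CostParams(1e-4, 800_000, 1e9, 1e-4)   # same workload, 10x slower link
decide_fast = should_transfer(tokens=1000, calls=10, p=fast)
decide_slow = should_transfer(tokens=1000, calls=10, p=slow)
```

With these illustrative numbers the transfer wins on the fast link and loses on the slow one, and the `calls` term shows why reducing the number of network calls shifts the decision toward reuse.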

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Evaluation runs on a single DGX H800 server; cross-machine RDMA/real datacenter behavior not measured.

Current prototype uses NCCL send/recv and sockets; NCCL lacks native gather/scatter and destination addressing.

When Not To Use

Workloads with little or no prompt prefix sharing (low cached ratio), where caching gives little benefit.

Environments without fast interconnects or RDMA where transfer cost outweighs reuse benefits.
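Before committing to context caching, it can help to estimate the cached ratio of your own traffic. The Python sketch below computes an upper bound on it from tokenized prompt logs, assuming an unbounded cache with no eviction; the function name `cached_ratio` and the exact definition (share of prompt tokens that repeat a prefix of an earlier prompt) are assumptions for illustration.

```python
from typing import Iterable, List, Set, Tuple


def cached_ratio(prompts: Iterable[List[int]]) -> float:
    """Fraction of prompt tokens covered by a prefix seen in an earlier prompt."""
    seen: Set[Tuple[int, ...]] = set()  # every prefix observed so far
    total = hit = 0
    for tokens in prompts:
        total += len(tokens)
        # Find the longest prefix of this prompt that an earlier prompt shares.
        reused = 0
        for end in range(1, len(tokens) + 1):
            if tuple(tokens[:end]) not in seen:
                break
            reused = end
        hit += reused
        # Record all prefixes of this prompt for later requests.
        for end in range(1, len(tokens) + 1):
            seen.add(tuple(tokens[:end]))
    return hit / total if total else 0.0


log = [[1, 2, 3, 4], [1, 2, 3, 9], [5, 6]]
ratio = cached_ratio(log)  # 3 of 10 tokens repeat an earlier prefix
```

Storing every prefix as a tuple is quadratic in prompt length, so this is only suitable for small samples; a trie gives the same answer with far less memory. A ratio near zero suggests the workload falls into the first case above.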

Failure Modes

Local evictions can leave the global index stale, which leads to misrouting and failed cache reuse.

Excessive network API calls if memory is fragmented, causing higher latency under load.

Core Entities

Models

Llama2-13B, vLLM (adapted)

Metrics

Job Completion Time (JCT), Time-to-First-Token (TTFT), Time-per-Output-Token (TPOT)

Datasets

ShareGPT, LooGLE, ReAct (with HotpotQA traces)

Benchmarks

LooGLE (long-context benchmark)