MemServe: a MemPool that adds context caching to disaggregated LLM serving, cutting job times and first-token delays

June 25, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

2

Authors

Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan

Links

Abstract / PDF

Why It Matters For Business

MemServe lets operators combine context caching with disaggregated inference to cut end-to-end and tail latency for many chat and long-context workloads, lowering hardware cost per request and improving user-perceived responsiveness.

Summary TLDR

MemServe introduces MemPool, an elastic distributed memory layer that manages GPU HBM and CPU DRAM, plus a global scheduler, so that context caching and disaggregated inference can be used together. MemPool exposes simple APIs (alloc, insert/match, transfer_with_insert) that let inference engines store, locate, and move the KV cache (intermediate attention state). Practical optimizations include block aggregation (huge-page style) to cut network calls and an operator-level cost model for routing vs. transfer decisions. On an 8×H800 server with Llama2-13B on ShareGPT, disaggregation with MemPool reduces P99 job completion time (JCT) by up to 42% versus a PD-colocated baseline, and adding context caching improves average time-to-first-token (TTFT) by up to 58% versus disaggregation alone.

Problem Statement

Existing LLM serving systems treat the KV cache as request-local, so context caching (reusing KV across requests) and disaggregated inference (splitting prefill/decode across machines) have conflicting or missing mechanisms for managing and moving KV cache. That prevents combining inter-request and intra-request optimizations and leaves scheduling blind to cross-instance cache locality.

Main Contribution

MemPool: a distributed elastic memory pool that manages GPU HBM and CPU DRAM and exposes memory, index, and transfer APIs for KV cache.

MemServe: a serving system that uses MemPool to combine context caching and disaggregated inference and adds a global scheduler with prompt-tree locality-aware routing.

Practical optimizations: transfer_with_insert, block aggregation (reduce small-block network calls), and an operator-level cost model to decide routing vs. KV transfer.
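
To make the API surface above concrete, here is a minimal in-process sketch of how alloc_mem, insert, match, and transfer_with_insert could fit together. Only the API names come from the summary; the signatures, the ToyMemPool class, the block size, and the dictionary-based index are all assumptions for illustration (the paper uses a radix-tree index and real GPU/CPU memory).

```python
from dataclasses import dataclass, field

BLOCK_TOKENS = 4  # hypothetical block size, chosen small for readability


@dataclass
class ToyMemPool:
    """In-process stand-in for one instance's pool (not the MemServe code)."""
    name: str
    blocks: dict = field(default_factory=dict)  # address -> KV payload
    index: dict = field(default_factory=dict)   # token-prefix tuple -> address
    next_addr: int = 0

    def alloc_mem(self, n_blocks):
        """Reserve n_blocks fresh addresses and return them."""
        addrs = list(range(self.next_addr, self.next_addr + n_blocks))
        self.next_addr += n_blocks
        return addrs

    def insert(self, tokens, addrs):
        """Index each block by the token prefix that ends at that block."""
        for i, addr in enumerate(addrs):
            self.index[tuple(tokens[: (i + 1) * BLOCK_TOKENS])] = addr

    def match(self, tokens):
        """Return addresses for the longest cached block-aligned prefix."""
        hits = []
        for end in range(BLOCK_TOKENS, len(tokens) + 1, BLOCK_TOKENS):
            addr = self.index.get(tuple(tokens[:end]))
            if addr is None:
                break
            hits.append(addr)
        return hits

    def transfer_with_insert(self, dst, tokens, addrs):
        """Copy blocks to `dst` and index them there in the same call."""
        dst_addrs = dst.alloc_mem(len(addrs))
        for src, tgt in zip(addrs, dst_addrs):
            dst.blocks[tgt] = self.blocks.get(src)
        dst.insert(tokens, dst_addrs)
        return dst_addrs


# Usage: a prefill pool computes KV for a prompt, then hands it to a decode pool.
prefill, decode = ToyMemPool("prefill"), ToyMemPool("decode")
prompt = list(range(10))  # 10 tokens -> 2 full blocks
addrs = prefill.alloc_mem(len(prompt) // BLOCK_TOKENS)
for a in addrs:
    prefill.blocks[a] = f"kv-{a}"
prefill.insert(prompt, addrs)
prefill.transfer_with_insert(decode, prompt, addrs)
```

Folding the index update into the transfer means the receiver can serve a later match on the same prefix without a separate insert call, which is the round-trip saving the summary attributes to transfer_with_insert.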

Key Findings

Disaggregated inference plus MemPool reduces job completion time compared with a PD-colocated baseline.

Numbers: JCT improved by up to 42% (P99) on ShareGPT

Adding context caching on top of disaggregation provides further speedups.

Numbers: Additional JCT improvement of up to 29% (P99) and TTFT improvement of up to 58% (average) on ShareGPT

Block aggregation (combine many small KV blocks) cuts network overhead compared to naive small-block transfers.

Numbers: Network performance improved substantially (Figures 11–12 show large gains; aggregation reduces the number of network calls)
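
The effect of aggregation on call count can be illustrated with a toy counter. The sketch below assumes one network call per transfer unit and merges contiguous block ids into runs; the coalesce and call_counts helpers are hypothetical, and run-merging is only one way to get larger transfer units, not necessarily the memory layout the paper uses.

```python
def coalesce(block_ids):
    """Merge unique block ids into (start, length) runs of contiguous blocks."""
    runs = []
    for b in sorted(block_ids):
        if runs and b == runs[-1][0] + runs[-1][1]:
            # Block extends the current run, so it rides along in the same call.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((b, 1))
    return runs


def call_counts(block_ids):
    """Return (naive, aggregated) network call counts for one transfer."""
    return len(block_ids), len(coalesce(block_ids))


print(call_counts([0, 1, 2, 3, 8, 9, 20]))  # (7, 3)
```

The more contiguous the allocation, the closer the aggregated count gets to one call per request, which is why fragmentation shows up later as a failure mode.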

Operator-level cost model predicts prefill execution time more accurately and scales better across parallel configs than a whole-architecture fit.

Numbers: The operator-level model yields lower prediction error and better scalability than the architecture-level model (Figure 14)
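
A rough sketch of what an operator-level estimate might look like, and how it could feed a transfer-vs-recompute decision. Everything here is assumed for illustration: the OpCoeffs fields, the coefficient values, the two-operator split, and the per-token KV size are not taken from the paper, and a real model would fit coefficients from profiling.

```python
from dataclasses import dataclass


@dataclass
class OpCoeffs:
    """Per-operator coefficients. Values are made up, not measured."""
    linear_per_token: float = 2.0e-5  # seconds per new token (projections, MLP)
    attn_per_pair: float = 1.0e-8     # seconds per (query, key) token pair
    fixed: float = 1.0e-3             # launch and framework overhead


def prefill_time(new_tokens, cached_tokens, coeffs, tp_degree=1):
    """Sum operator costs. Only new tokens are computed, but they attend to
    the full context. Ignores the causal-mask halving for simplicity."""
    linear = coeffs.linear_per_token * new_tokens
    attention = coeffs.attn_per_pair * new_tokens * (cached_tokens + new_tokens)
    return coeffs.fixed + (linear + attention) / tp_degree


def transfer_time(cached_tokens, bytes_per_token, bandwidth_bytes_per_s):
    """Time to move the cached KV for `cached_tokens` over the interconnect."""
    return cached_tokens * bytes_per_token / bandwidth_bytes_per_s


def should_transfer(prompt_tokens, cached_tokens, coeffs, bytes_per_token, bandwidth):
    """True when moving cached KV beats recomputing the whole prompt."""
    recompute = prefill_time(prompt_tokens, 0, coeffs)
    reuse = transfer_time(cached_tokens, bytes_per_token, bandwidth) + prefill_time(
        prompt_tokens - cached_tokens, cached_tokens, coeffs
    )
    return reuse < recompute


# Assumed KV size: 2 (K and V) x 40 layers x 5120 hidden x 2 bytes (fp16).
KV_BYTES_PER_TOKEN = 819_200
print(should_transfer(4000, 3000, OpCoeffs(), KV_BYTES_PER_TOKEN, 100e9))  # True
print(should_transfer(4000, 3000, OpCoeffs(), KV_BYTES_PER_TOKEN, 1e9))    # False
```

Splitting the estimate by operator lets the linear and attention terms scale independently with sequence length and parallel degree, which a single whole-architecture fit cannot express.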

Prompt-tree-based global scheduling boosts cross-session cache reuse and lowers tail TTFT.

Numbers: The prompt-tree policy improved P99 TTFT by 59% vs. intra-session scheduling in a 3P1D test (three prefill instances, one decode instance)
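
The routing idea can be sketched as one prefix tree per instance, with each request sent to the instance holding the longest matching prefix. This is a simplified illustration: the PromptTree and GlobalScheduler classes are hypothetical, the tree is a plain token trie rather than a radix tree, ties fall back to the least-loaded instance (an assumed policy), and the tree is updated at routing time rather than when responses return.

```python
class PromptTree:
    """Token-level trie recording which prompts an instance has served."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def match_len(self, tokens):
        """Length of the longest prefix of `tokens` already in the tree."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, n = node[t], n + 1
        return n


class GlobalScheduler:
    def __init__(self, instance_names):
        self.trees = {name: PromptTree() for name in instance_names}
        self.load = {name: 0 for name in instance_names}

    def route(self, tokens):
        # Prefer the longest cached prefix; break ties by lowest load.
        best = max(
            self.trees,
            key=lambda n: (self.trees[n].match_len(tokens), -self.load[n]),
        )
        self.trees[best].insert(tokens)
        self.load[best] += 1
        return best


gs = GlobalScheduler(["p0", "p1", "p2"])
print(gs.route([1, 2, 3, 4]))        # p0: no cache anywhere, first idle instance
print(gs.route([9, 9, 9]))           # p1: no match, p0 is busier
print(gs.route([1, 2, 3, 4, 5, 6]))  # p0: shares a 4-token prefix
```

Because the trees are keyed on prompt content rather than session id, a request from a different session that shares a system prompt still lands on the instance that already holds that prefix.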

Results

JCT (average, ShareGPT)

Value: 30% improvement vs. PD-colocated (disaggregation, 1P2D)

Baseline: PD-colocated (vLLM)

JCT (P99, ShareGPT)

Value: 42% improvement vs. PD-colocated (disaggregation)

Baseline: PD-colocated (vLLM)

JCT (additional benefit from caching, ShareGPT)

Value: 17% avg and 29% P99 further improvement when adding caching to disaggregation

Baseline: Disaggregated inference without caching

TTFT (avg and P99, ShareGPT)

Value: 58% avg and 45% P99 improvement with caching + disaggregation

Baseline: Disaggregated inference without caching

JCT (LooGLE)

Value: 10.3% avg and 10.8% P99 improvement from disaggregation; caching adds 26.9% avg and 22.5% P99

Baseline: PD-colocated

JCT (ReAct)

Value: Disaggregation increased JCT by 40.8% avg and 53.1% P99; caching recovers and improves by 26.7% avg and 21.4% P99

Baseline: PD-colocated

Who Should Care

What To Try In 7 Days

Prototype MemPool-style index for KV cache: track prefix-to-KV mappings and measure TTFT gains on your chat logs.

Run a single-server experiment: split prefill/decode on an extra instance and use transfer_with_insert to test reduced JCT.

Apply block aggregation for KV transfers: combine many small blocks into large transfers and measure network API call reduction and throughput.
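
Before building an index for the first experiment above, it may be worth estimating how much prefix sharing your logs contain at all. The sketch below replays tokenized prompts in arrival order against an unbounded token trie and reports the fraction of tokens that would have been cache hits. The cached_ratio helper is hypothetical, and the figure is an upper bound: it assumes no eviction and ignores block alignment.

```python
def cached_ratio(prompts):
    """Fraction of prompt tokens an unbounded prefix cache would already hold.

    `prompts` is an iterable of token-id lists in arrival order.
    """
    trie, hit, total = {}, 0, 0
    for tokens in prompts:
        node, matched, matching = trie, 0, True
        for t in tokens:
            if matching and t in node:
                matched += 1       # token extends a previously seen prefix
            else:
                matching = False   # first miss ends the reusable prefix
            node = node.setdefault(t, {})  # record the prompt for later requests
        hit += matched
        total += len(tokens)
    return hit / total if total else 0.0


logs = [[1, 2, 3, 4], [1, 2, 3, 9], [7, 8]]
print(cached_ratio(logs))  # 0.3
```

A low ratio on real traffic suggests caching alone will not move TTFT much, whatever the rest of the stack looks like.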

Agent Features

Memory

  • elastic memory pool managing GPU HBM and CPU DRAM
  • block aggregation (huge-page style) to reduce fragmentation

Planning

  • global scheduler with prompt-tree-based locality-aware routing

Tool Use

  • MemPool APIs (alloc_mem, insert, match, transfer, transfer_with_insert)
  • NCCL send/recv for GPU transfers (prototype)

Frameworks

  • vLLM (adapted)
  • NCCL
  • radix tree indexing (from SGLang)

Architectures

  • prefill-only / decode-only / PD-colocated instances

Collaboration

  • global prompt trees coordinate cache locality across instances

Optimization Features

Token Efficiency

  • reuse of cached KV to avoid recomputation for shared prefixes

Infra Optimization

  • memory pool spanning HBM and DRAM with swap_in/swap_out
  • use of NVLink on DGX H800 for fast HBM transfers (prototype)

System Optimization

  • block aggregation to reduce the number of small network transfers
  • transfer_with_insert to avoid extra network round-trips
  • operator-level cost model for routing vs transfer decisions

Inference Optimization

  • disaggregated inference (prefill/decode split)
  • context caching across requests (token-based radix index)
  • sequence parallel support via transfer API

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Evaluation runs on a single DGX H800 server; cross-machine RDMA/real datacenter behavior not measured.
  • Current prototype uses NCCL send/recv and sockets; NCCL lacks native gather/scatter and destination addressing.
  • The global scheduler (GS) view can be stale: its prompt-tree index updates only when responses pass through the GS, and TTLs are needed.
  • Disaggregation can hurt some workloads (e.g., ReAct) unless combined with caching and careful scheduling.
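
The staleness limitation above can be pictured with a small TTL-guarded index. This is a sketch under assumptions: the TTLIndex class, its method names, and the expiry policy are hypothetical, since the summary only says that TTLs are needed and does not describe the mechanism.

```python
import time


class TTLIndex:
    """Scheduler-side record of which instance holds a prefix, with expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable so tests can control time
        self.entries = {}    # prefix key -> (instance, last_seen)

    def record(self, key, instance):
        """Called when a response for `key` passes through the scheduler."""
        self.entries[key] = (instance, self.clock())

    def lookup(self, key):
        """Return the instance believed to hold `key`, or None if unknown or expired."""
        entry = self.entries.get(key)
        if entry is None:
            return None
        instance, last_seen = entry
        if self.clock() - last_seen > self.ttl:
            del self.entries[key]  # assume the instance has evicted it by now
            return None
        return instance


now = [0.0]
idx = TTLIndex(ttl_seconds=30, clock=lambda: now[0])
idx.record("system-prompt", "p0")
print(idx.lookup("system-prompt"))  # p0
now[0] = 31.0
print(idx.lookup("system-prompt"))  # None
```

A short TTL trades some missed reuse for fewer misroutes to instances that have already evicted the prefix locally.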

When Not To Use

  • Workloads with little or no prompt prefix sharing (low cached ratio), where caching gives little benefit.
  • Environments without fast interconnects or RDMA where transfer cost outweighs reuse benefits.
  • Small models where KV cache sizes and transfer overheads do not amortize across requests.

Failure Modes

  • Stale global index leads to misrouting and failed reuse due to local evictions.
  • Excessive network API calls if memory is fragmented, causing higher latency under load.
  • Instance failure during transfer causes timeouts and possible memory leaks until cluster manager cleans up.

Core Entities

Models

  • Llama2-13B
  • vLLM (adapted)

Metrics

  • Job Completion Time (JCT)
  • Time-to-First-Token (TTFT)
  • Time-per-Output-Token (TPOT)

Datasets

  • ShareGPT
  • LooGLE
  • ReAct (with HotpotQA traces)

Benchmarks

  • LooGLE (long-context benchmark)