MemServe: a MemPool that adds context caching to disaggregated LLM serving, cutting job times and first-token delays

Overview

Decision Snapshot: Ready For Pilot

The system is implemented end-to-end and evaluated on an 8×H800 server with Llama2-13B and real workloads. Results show meaningful latency gains, but experiments run on a single machine and use NCCL point-to-point as a prototype network stack.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan

Links

Abstract / PDF

Why It Matters For Business

MemServe lets operators combine caching and disaggregated inference to cut end-to-end latency and tail times for many chat and long-context workloads, lowering hardware cost per request and improving user-perceived responsiveness.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead

Summary TLDR

MemServe introduces MemPool, an elastic distributed memory layer (manages GPU HBM and CPU DRAM) plus a global scheduler to enable context caching and disaggregated inference together. MemPool exposes simple APIs (alloc, insert/match, transfer_with_insert) so inference engines can store, locate, and move KV cache (intermediate attention state). Practical optimizations include block aggregation (huge-page style) to cut network calls and an operator-level cost model for routing vs. transfer decisions. On an 8×H800 server with Llama2-13B, MemServe reduces job completion time (JCT) and time-to-first-token (TTFT) by up to ~42% and gives consistent gains when caching plus disaggregation are used.
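To make the API surface concrete, here is a minimal single-process sketch of how `alloc_mem`, `insert`, `match`, and `transfer_with_insert` could fit together. It is an illustration under assumptions, not the paper's implementation: the `ToyMemPool` class, the `BLOCK` size of 4 tokens, and the dictionary-based index are all hypothetical stand-ins (the real system indexes tokens with a radix tree and moves GPU memory over the network).

```python
from dataclasses import dataclass, field

BLOCK = 4  # tokens per KV block (hypothetical value chosen for readability)


@dataclass
class ToyMemPool:
    """Illustrative stand-in for one instance's memory pool (not the paper's code)."""
    name: str
    blocks: dict = field(default_factory=dict)  # address -> KV payload
    index: dict = field(default_factory=dict)   # token-prefix tuple -> address
    next_addr: int = 0

    def alloc_mem(self, n):
        """Reserve n block addresses."""
        addrs = list(range(self.next_addr, self.next_addr + n))
        self.next_addr += n
        return addrs

    def insert(self, tokens, addrs):
        """Index each block under the token prefix that ends at that block."""
        for i, addr in enumerate(addrs):
            self.index[tuple(tokens[: (i + 1) * BLOCK])] = addr

    def match(self, tokens):
        """Return addresses for the longest run of cached leading blocks."""
        hits = []
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            addr = self.index.get(tuple(tokens[:end]))
            if addr is None:
                break
            hits.append(addr)
        return hits

    def transfer_with_insert(self, dst, tokens, addrs):
        """Copy blocks to dst and index them there as part of the same call."""
        dst_addrs = dst.alloc_mem(len(addrs))
        for src_addr, dst_addr in zip(addrs, dst_addrs):
            dst.blocks[dst_addr] = self.blocks[src_addr]
        dst.insert(tokens, dst_addrs)
        return dst_addrs
```

In this sketch, a prefill instance would call `insert` after computing KV for a prompt, then `transfer_with_insert` to hand the blocks to a decode instance. A later request sharing the prefix can then call `match` on either instance. Folding the index update into the transfer is what saves the separate round-trip.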

Problem Statement

Existing LLM serving systems treat the KV cache as request-local, so context caching (reusing KV across requests) and disaggregated inference (splitting prefill/decode across machines) have conflicting or missing mechanisms for managing and moving KV cache. That prevents combining inter-request and intra-request optimizations and leaves scheduling blind to cross-instance cache locality.

Main Contribution

MemPool: a distributed elastic memory pool that manages GPU HBM and CPU DRAM and exposes memory, index, and transfer APIs for KV cache.

MemServe: a serving system that uses MemPool to combine context caching and disaggregated inference and adds a global scheduler with prompt-tree locality-aware routing.
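Locality-aware routing can be sketched as "send the request to the instance that already caches the longest prefix of the prompt". The example below is a simplified illustration: it stores each instance's cached sequences in a flat list rather than a prompt tree, and the load-based tie-break is an assumption of this sketch, not something the summary states. The names `route`, `cached`, and `load` are hypothetical.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that sequences a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(prompt, cached, load):
    """Pick the instance with the longest cached prefix; break ties by lower load.

    cached: instance name -> list of cached token sequences
            (a flat stand-in for a per-instance prompt tree)
    load:   instance name -> number of queued requests (assumed tie-break signal)
    """
    def score(inst):
        best = max(
            (shared_prefix_len(prompt, seq) for seq in cached[inst]),
            default=0,
        )
        return (best, -load[inst])

    return max(cached, key=score)
```

A radix tree gives the same answer without scanning every cached sequence, which matters once the number of cached prompts grows.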

Key Findings

Disaggregated inference plus MemPool reduces job completion time compared to colocated baseline.

Numbers: JCT improved by up to 42% (P99) on ShareGPT

Practical Use: If you run many chat-like requests, split prefill/decode and use MemPool APIs to cut JCT by up to ~40% on tested workloads.

Evidence Ref: Abstract; §8.3 End-to-End; ShareGPT results

Adding context caching on top of disaggregation provides further speedups.

Numbers: Additional JCT improvement of up to 29% and TTFT improvement of up to 58% on ShareGPT

Practical Use: Enable context caching (insert/match and transfer_with_insert) to further lower latency and time-to-first-token for workloads with shared prefixes.

Evidence Ref: §8.3 End-to-End; ShareGPT results

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
JCT (average, ShareGPT)	30% improvement vs PD-colocated (disaggregation, 1P2D)	PD-colocated (vLLM)	−30%	ShareGPT	§8.3 End-to-End; ShareGPT paragraph	Figure 8; §8.3
JCT (P99, ShareGPT)	42% improvement vs PD-colocated (disaggregation)	PD-colocated (vLLM)	−42%	ShareGPT	Abstract; §8.3	Figure 8; §8.3

What To Try In 7 Days

Prototype MemPool-style index for KV cache: track prefix-to-KV mappings and measure TTFT gains on your chat logs.

Run a single-server experiment: split prefill/decode on an extra instance and use transfer_with_insert to test reduced JCT.

Apply block aggregation for KV transfers: combine many small blocks into large transfers and measure network API call reduction and throughput.
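For the block-aggregation experiment, one simple way to estimate the saving is to coalesce contiguous block ids into runs and count one network call per run instead of one per block. This is a rough sketch under the assumption that contiguous block ids map to contiguous memory; the `aggregate` helper is hypothetical and the paper's allocator-level aggregation may work differently.

```python
def aggregate(block_ids):
    """Merge block ids into (start, length) runs of contiguous blocks.

    Each run can be sent with a single transfer call instead of one call
    per block. Duplicate ids are ignored.
    """
    runs = []
    for b in sorted(set(block_ids)):
        if runs and runs[-1][0] + runs[-1][1] == b:
            # b directly follows the previous run, so extend that run.
            start, length = runs[-1]
            runs[-1] = (start, length + 1)
        else:
            runs.append((b, 1))
    return runs


def calls_saved(block_ids):
    """Transfer calls avoided by sending runs instead of individual blocks."""
    return len(set(block_ids)) - len(aggregate(block_ids))
```

The more fragmented the block ids, the fewer calls this saves, which is why keeping allocations contiguous matters as much as the merge step itself.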

Agent Features

Memory

elastic memory pool managing GPU HBM and CPU DRAM

block aggregation (huge-page style) to reduce fragmentation

Planning

global scheduler with prompt-tree-based locality-aware routing

Tool Use

MemPool APIs (alloc_mem, insert, match, transfer, transfer_with_insert)

NCCL send/recv for GPU transfers (prototype)

Frameworks

vLLM (adapted)

NCCL

radix tree indexing (from SGLang)

Architectures

prefill-only / decode-only / PD-colocated instances

Collaboration

global prompt trees coordinate cache locality across instances

Optimization Features

Token Efficiency

reuse of cached KV to avoid recomputation for shared prefixes

Infra Optimization

memory pool spanning HBM and DRAM with swap_in/swap_out

use of NVLink on DGX H800 for fast HBM transfers (prototype)
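A two-tier pool with `swap_in`/`swap_out` can be pictured as a small fast tier that spills to a larger slow tier. The sketch below uses plain dictionaries in place of HBM and DRAM and evicts the least recently used block; the LRU policy, the `TwoTierPool` class, and its `put`/`get` methods are assumptions made for illustration, since the summary does not say which eviction policy MemPool uses.

```python
from collections import OrderedDict


class TwoTierPool:
    """Toy HBM/DRAM pool: a bounded fast tier that spills to a slow tier."""

    def __init__(self, hbm_capacity):
        if hbm_capacity < 1:
            raise ValueError("hbm_capacity must be at least 1")
        self.hbm_capacity = hbm_capacity
        self.hbm = OrderedDict()  # block id -> payload, least recently used first
        self.dram = {}            # block id -> payload

    def swap_out(self, block_id):
        """Move a block from the fast tier to the slow tier."""
        self.dram[block_id] = self.hbm.pop(block_id)

    def swap_in(self, block_id):
        """Move a block from the slow tier to the fast tier, evicting if full."""
        payload = self.dram.pop(block_id)
        self._make_room()
        self.hbm[block_id] = payload

    def _make_room(self):
        # Evict least recently used blocks until one slot is free.
        while len(self.hbm) >= self.hbm_capacity:
            victim = next(iter(self.hbm))
            self.swap_out(victim)

    def put(self, block_id, payload):
        """Store a block in the fast tier."""
        if block_id in self.hbm:
            self.hbm[block_id] = payload
            self.hbm.move_to_end(block_id)
            return
        self.dram.pop(block_id, None)  # drop any stale copy in the slow tier
        self._make_room()
        self.hbm[block_id] = payload

    def get(self, block_id):
        """Return a block's payload, swapping it in first if needed."""
        if block_id not in self.hbm:
            self.swap_in(block_id)
        self.hbm.move_to_end(block_id)
        return self.hbm[block_id]
```

Keeping evicted KV in DRAM rather than discarding it means a later prefix hit costs a swap-in instead of a full recomputation.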

System Optimization

block aggregation to reduce small-network transfers

transfer_with_insert to avoid extra network round-trips

operator-level cost model for routing vs transfer decisions
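The routing-versus-transfer decision boils down to comparing the time to move cached KV against the time to recompute it. The sketch below is a deliberately coarse stand-in for the operator-level model: it assumes prefill cost is linear in tokens (real attention cost is not), and every parameter value a caller passes (bandwidth, per-call latency, seconds per token) is a hypothetical input to be measured on your own hardware. The default shape numbers (40 layers, hidden size 5120, 2-byte values) correspond to Llama2-13B in FP16.

```python
def kv_bytes_per_token(layers=40, hidden=5120, dtype_bytes=2):
    """KV cache size per token: one key and one value vector per layer."""
    return 2 * layers * hidden * dtype_bytes


def transfer_seconds(tokens, bandwidth_bytes_per_s, per_call_latency_s, calls):
    """Estimated time to move the KV cache for `tokens` tokens."""
    payload = tokens * kv_bytes_per_token()
    return calls * per_call_latency_s + payload / bandwidth_bytes_per_s


def should_transfer(cached_tokens, prefill_s_per_token,
                    bandwidth_bytes_per_s, per_call_latency_s, calls):
    """True if moving the cached KV is estimated to beat recomputing it.

    Assumes recomputation cost grows linearly with token count, which is a
    simplification made for this sketch.
    """
    recompute = cached_tokens * prefill_s_per_token
    move = transfer_seconds(cached_tokens, bandwidth_bytes_per_s,
                            per_call_latency_s, calls)
    return move < recompute
```

The `calls` term is where block aggregation shows up in the model: fewer, larger transfers shrink the fixed per-call overhead.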

Inference Optimization

disaggregated inference (prefill/decode split)

context caching across requests (token-based radix index)

sequence parallel support via transfer API

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Evaluation runs on a single DGX H800 server; cross-machine RDMA/real datacenter behavior not measured.

Current prototype uses NCCL send/recv and sockets; NCCL lacks native gather/scatter and destination addressing.

When Not To Use

Workloads with little or no prompt prefix sharing (low cached-ratio) — caching gives little benefit.

Environments without fast interconnects or RDMA where transfer cost outweighs reuse benefits.

Failure Modes

Stale global index leads to misrouting and failed reuse due to local evictions.

Excessive network API calls if memory is fragmented, causing higher latency under load.

Core Entities

Models

Llama2-13B

vLLM (adapted)

Metrics

Job Completion Time (JCT)

Time-to-First-Token (TTFT)

Time-per-Output-Token (TPOT)

Datasets

ShareGPT

LooGLE

ReAct (with HotpotQA traces)

Benchmarks

LooGLE (long-context benchmark)
