Overview
Production Readiness
0.7
Novelty Score
0.7
Cost Impact Score
0.8
Citation Count
2
Why It Matters For Business
MemServe lets operators combine caching and disaggregated inference to cut end-to-end latency and tail latency for many chat and long-context workloads, lowering hardware cost per request and improving user-perceived responsiveness.
Summary TLDR
MemServe introduces MemPool, an elastic distributed memory layer that manages GPU HBM and CPU DRAM, plus a global scheduler, so that context caching and disaggregated inference can be used together. MemPool exposes simple APIs (alloc_mem, insert/match, transfer_with_insert) so inference engines can store, locate, and move KV cache (intermediate attention state). Practical optimizations include block aggregation (huge-page style) to cut network calls and an operator-level cost model for routing vs. transfer decisions. On an 8×H800 server with Llama2-13B, MemServe reduces job completion time (JCT) and time-to-first-token (TTFT) by up to ~42% and gives consistent gains when caching and disaggregation are combined.
Problem Statement
Existing LLM serving systems treat the KV cache as request-local, so context caching (reusing KV across requests) and disaggregated inference (splitting prefill/decode across machines) have conflicting or missing mechanisms for managing and moving KV cache. That prevents combining inter-request and intra-request optimizations and leaves scheduling blind to cross-instance cache locality.
Main Contribution
MemPool: a distributed elastic memory pool that manages GPU HBM and CPU DRAM and exposes memory, index, and transfer APIs for KV cache.
MemServe: a serving system that uses MemPool to combine context caching and disaggregated inference and adds a global scheduler with prompt-tree locality-aware routing.
Practical optimizations: transfer_with_insert, block aggregation (reduce small-block network calls), and an operator-level cost model to decide routing vs. KV transfer.
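The API surface named above can be pictured with a toy, in-process pool. This is a minimal sketch, not MemServe's implementation: only the method names alloc_mem, insert, match, and transfer_with_insert come from the summary, while the class ToyMemPool, its signatures, the 4-token block size, and the dict-based index are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMemPool:
    """Hypothetical stand-in for one instance's pool; Python objects replace HBM/DRAM."""
    capacity_blocks: int
    block_tokens: int = 4                        # assumed block size, in tokens
    free: list = field(default_factory=list)     # unallocated block ids
    data: dict = field(default_factory=dict)     # block id -> KV payload
    index: dict = field(default_factory=dict)    # token-prefix tuple -> block id

    def __post_init__(self):
        self.free = list(range(self.capacity_blocks))

    def alloc_mem(self, n):
        """Reserve n block ids, or raise if the pool cannot satisfy the request."""
        if n > len(self.free):
            raise MemoryError("pool exhausted")
        blocks, self.free = self.free[:n], self.free[n:]
        return blocks

    def insert(self, tokens, blocks):
        """Index each block under the token prefix that it completes."""
        for i, blk in enumerate(blocks):
            self.index[tuple(tokens[: (i + 1) * self.block_tokens])] = blk

    def match(self, tokens):
        """Return block ids covering the longest indexed prefix of tokens."""
        hit = []
        for end in range(self.block_tokens, len(tokens) + 1, self.block_tokens):
            blk = self.index.get(tuple(tokens[:end]))
            if blk is None:
                break
            hit.append(blk)
        return hit

    def transfer_with_insert(self, dst, tokens, blocks):
        """Copy blocks into dst and index them there as part of the same call."""
        dst_blocks = dst.alloc_mem(len(blocks))
        for src_blk, dst_blk in zip(blocks, dst_blocks):
            dst.data[dst_blk] = self.data[src_blk]
        dst.insert(tokens, dst_blocks)
        return dst_blocks
```

In this sketch a prefill pool would call alloc_mem and insert after computing a prompt's KV, then transfer_with_insert to hand the blocks to a decode pool, where a later match on the same prefix finds them without a separate indexing round-trip.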
Key Findings
Disaggregated inference plus MemPool reduces job completion time compared to colocated baseline.
Adding context caching on top of disaggregation provides further speedups.
Block aggregation (combining many small KV blocks into larger transfers) cuts network overhead compared to naive small-block transfers.
Operator-level cost model predicts prefill execution time more accurately and scales better across parallel configs than a whole-architecture fit.
Prompt-tree-based global scheduling boosts cross-session cache reuse and lowers tail TTFT.
Results
JCT (average, ShareGPT)
JCT (P99, ShareGPT)
JCT (additional benefit from caching, ShareGPT)
TTFT (avg and P99, ShareGPT)
JCT (LooGLE)
JCT (ReAct)
Who Should Care
What To Try In 7 Days
Prototype MemPool-style index for KV cache: track prefix-to-KV mappings and measure TTFT gains on your chat logs.
Run a single-server experiment: split prefill and decode across separate instances and use transfer_with_insert to check whether JCT drops.
Apply block aggregation for KV transfers: combine many small blocks into large transfers and measure network API call reduction and throughput.
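For the block-aggregation experiment, one simple way to estimate the reduction in network calls is to coalesce contiguous block ids into runs and issue one transfer per run. This is a sketch under an assumption: the summary only describes aggregation as "huge-page style", so contiguous-run coalescing and the helper name aggregate_blocks are illustrative stand-ins, not the paper's mechanism.

```python
def aggregate_blocks(block_ids):
    """Merge block ids into (start, length) runs of contiguous ids."""
    runs = []
    for blk in sorted(set(block_ids)):
        if runs and blk == runs[-1][0] + runs[-1][1]:
            # blk directly follows the current run, so extend it by one block
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((blk, 1))
    return runs


blocks = [3, 4, 5, 9, 10, 20]
runs = aggregate_blocks(blocks)    # -> [(3, 3), (9, 2), (20, 1)]
naive_calls = len(blocks)          # one network call per block
aggregated_calls = len(runs)       # one network call per contiguous run
```

Comparing naive_calls with aggregated_calls on real allocation traces gives a quick measure of how fragmented the KV layout is and how much aggregation could help.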
Agent Features
Memory
- elastic memory pool managing GPU HBM and CPU DRAM
- block aggregation (huge-page style) to reduce fragmentation
Planning
- global scheduler with prompt-tree-based locality-aware routing
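Locality-aware routing can be sketched as a token trie that remembers which instances hold each prefix. This is an illustrative simplification: it uses a single shared trie with per-node instance sets, and the names PromptTree, record, match_lengths, and route, as well as the load-based tie-break, are assumptions rather than details from the summary.

```python
class PromptTree:
    """Hypothetical token trie; each node lists the instances caching that prefix."""

    def __init__(self):
        self.root = {"children": {}, "instances": set()}

    def record(self, tokens, instance):
        """Note that instance now holds KV cache for every prefix of tokens."""
        node = self.root
        for tok in tokens:
            node = node["children"].setdefault(
                tok, {"children": {}, "instances": set()}
            )
            node["instances"].add(instance)

    def match_lengths(self, tokens):
        """Return {instance: number of leading tokens cached on that instance}."""
        lengths = {}
        node = self.root
        for depth, tok in enumerate(tokens, start=1):
            node = node["children"].get(tok)
            if node is None:
                break
            for inst in node["instances"]:
                lengths[inst] = depth
        return lengths


def route(tree, tokens, load):
    """Pick the instance with the longest cached prefix; break ties by lower load."""
    lengths = tree.match_lengths(tokens)
    return min(load, key=lambda inst: (-lengths.get(inst, 0), load[inst]))
```

Here load is a mapping from instance name to any comparable load figure, such as queued requests. A request with no cached prefix anywhere falls through to the least-loaded instance.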
Tool Use
- MemPool APIs (alloc_mem, insert, match, transfer, transfer_with_insert)
- NCCL send/recv for GPU transfers (prototype)
Frameworks
- vLLM (adapted)
- NCCL
- radix tree indexing (from SGLang)
Architectures
- prefill-only / decode-only / PD-colocated instances
Collaboration
- global prompt trees coordinate cache locality across instances
Optimization Features
Token Efficiency
- reuse of cached KV to avoid recomputation for shared prefixes
Infra Optimization
- memory pool spanning HBM and DRAM with swap_in/swap_out
- use of NVLink on DGX H800 for fast HBM transfers (prototype)
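The swap_in/swap_out pair can be illustrated with a two-tier pool that spills blocks from a small fast tier to a larger slow tier. This is a sketch only: the summary names swap_in and swap_out but not the eviction policy, so the least-recently-used policy, the class TwoTierPool, and its put/get helpers are assumptions.

```python
from collections import OrderedDict


class TwoTierPool:
    """Hypothetical HBM/DRAM pool with LRU spill; dicts stand in for real memory."""

    def __init__(self, hbm_blocks):
        self.hbm_blocks = hbm_blocks   # capacity of the fast tier, in blocks
        self.hbm = OrderedDict()       # block id -> payload, least recent first
        self.dram = {}                 # block id -> payload

    def swap_out(self, blk):
        """Move one block from the fast tier to the slow tier."""
        self.dram[blk] = self.hbm.pop(blk)

    def swap_in(self, blk):
        """Move one block from the slow tier to the fast tier, evicting if full."""
        payload = self.dram.pop(blk)
        self._make_room()
        self.hbm[blk] = payload

    def _make_room(self):
        # Evict least-recently-used blocks until one fast-tier slot is free.
        while len(self.hbm) >= self.hbm_blocks:
            self.swap_out(next(iter(self.hbm)))

    def put(self, blk, payload):
        """Store a new or updated block in the fast tier."""
        self.dram.pop(blk, None)
        if blk not in self.hbm:
            self._make_room()
        self.hbm[blk] = payload
        self.hbm.move_to_end(blk)

    def get(self, blk):
        """Read a block, swapping it in first if it was spilled."""
        if blk not in self.hbm:
            self.swap_in(blk)
        self.hbm.move_to_end(blk)
        return self.hbm[blk]
```

A real pool would copy tensors between device and host memory at the swap points; the control flow around capacity and recency is the part this sketch is meant to show.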
System Optimization
- block aggregation to reduce small network transfers
- transfer_with_insert to avoid extra network round-trips
- operator-level cost model for routing vs transfer decisions
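The routing-versus-transfer decision can be framed as comparing estimated completion times for a few options. The sketch below is a rough illustration, not the paper's model: the two-term prefill estimate (a linear term plus an attention term), every coefficient, the queue-delay inputs, and the names CostModel and choose are assumptions; the KV size per token assumes Llama2-13B in fp16 (K and V, 40 layers, hidden size 5120, 2 bytes each).

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    # Placeholder coefficients; in practice they would be fitted from profiling.
    linear_s_per_token: float = 2.0e-5           # projection/MLP-style operators
    attn_s_per_token_pair: float = 1.0e-9        # attention, per (query, key) pair
    kv_bytes_per_token: int = 2 * 40 * 5120 * 2  # assumed Llama2-13B, fp16
    link_bytes_per_s: float = 100e9              # assumed interconnect bandwidth
    link_latency_s: float = 1.0e-4               # assumed fixed cost per transfer

    def prefill_s(self, new_tokens, cached_tokens=0):
        """Estimate prefill time for new_tokens on top of cached_tokens of context."""
        linear = self.linear_s_per_token * new_tokens
        # Coarse proxy: every new token attends to the whole context.
        attention = self.attn_s_per_token_pair * new_tokens * (cached_tokens + new_tokens)
        return linear + attention

    def transfer_s(self, tokens):
        """Estimate time to move the KV cache for tokens over the link."""
        return self.link_latency_s + tokens * self.kv_bytes_per_token / self.link_bytes_per_s


def choose(model, prompt_tokens, cached_tokens, holder_queue_s, other_queue_s):
    """Return (option, estimated seconds) for the cheapest of three options."""
    new_tokens = prompt_tokens - cached_tokens
    reuse = model.prefill_s(new_tokens, cached_tokens)
    options = {
        "route_to_holder": holder_queue_s + reuse,
        "transfer_kv": other_queue_s + model.transfer_s(cached_tokens) + reuse,
        "recompute": other_queue_s + model.prefill_s(prompt_tokens),
    }
    best = min(options, key=options.get)
    return best, options[best]
```

With these placeholder numbers, a busy cache holder pushes the decision toward moving the KV cache to an idle instance, and a slow link pushes it toward recomputing the prefix instead.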
Inference Optimization
- disaggregated inference (prefill/decode split)
- context caching across requests (token-based radix index)
- sequence parallel support via transfer API
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Evaluation runs on a single DGX H800 server; cross-machine RDMA/real datacenter behavior not measured.
- Current prototype uses NCCL send/recv and sockets; NCCL lacks native gather/scatter and destination addressing.
- The global scheduler's view can be stale: the prompt-tree index updates only when responses pass through the global scheduler (GS), so TTLs are needed to expire old entries.
- Disaggregation can hurt some workloads (e.g., ReAct) unless combined with caching and careful scheduling.
When Not To Use
- Workloads with little or no prompt prefix sharing (low cached ratio), where caching gives little benefit.
- Environments without fast interconnects or RDMA where transfer cost outweighs reuse benefits.
- Small models where KV cache sizes and transfer overheads do not amortize across requests.
Failure Modes
- A stale global index leads to misrouting, and reuse fails when the target instance has already evicted the blocks locally.
- Excessive network API calls if memory is fragmented, causing higher latency under load.
- Instance failure during transfer causes timeouts and possible memory leaks until the cluster manager cleans up.
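One common mitigation for the stale-index failure mode is to expire scheduler-side entries after a time-to-live and to drop them when an instance reports a local miss. The sketch below assumes this design; the class TTLIndex, its methods, and the injectable clock are hypothetical and not taken from the summary.

```python
import time


class TTLIndex:
    """Hypothetical scheduler-side map from prefix key to instance, with expiry."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock    # injectable so tests can control time
        self.entries = {}     # key -> (instance, time of last refresh)

    def record(self, key, instance):
        """Add or refresh an entry, for example when a response passes through."""
        self.entries[key] = (instance, self.clock())

    def lookup(self, key):
        """Return the instance for key, or None if it is absent or expired."""
        hit = self.entries.get(key)
        if hit is None:
            return None
        instance, stamp = hit
        if self.clock() - stamp > self.ttl_s:
            del self.entries[key]
            return None
        return instance

    def invalidate(self, key):
        """Drop an entry after an instance reports that it evicted the blocks."""
        self.entries.pop(key, None)
```

A shorter TTL reduces misrouting at the cost of forgetting caches that are still live, so the value is worth tuning against observed eviction rates.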
Core Entities
Models
- Llama2-13B
- vLLM (adapted)
Metrics
- Job Completion Time (JCT)
- Time-to-First-Token (TTFT)
- Time-per-Output-Token (TPOT)
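The three metrics can be computed from per-request timestamps. The summary does not spell out the formulas, so this sketch uses commonly used definitions as an assumption: TTFT is first token time minus arrival, JCT is last token time minus arrival, and TPOT is the mean gap between consecutive output tokens. The function name request_metrics is hypothetical.

```python
def request_metrics(arrival_s, token_times_s):
    """Compute TTFT, JCT, and TPOT (seconds) from one request's timestamps."""
    if not token_times_s:
        raise ValueError("request produced no tokens")
    n = len(token_times_s)
    ttft = token_times_s[0] - arrival_s
    jct = token_times_s[-1] - arrival_s
    # Mean inter-token gap; defined as 0.0 when only one token was produced.
    tpot = (token_times_s[-1] - token_times_s[0]) / (n - 1) if n > 1 else 0.0
    return {"ttft": ttft, "jct": jct, "tpot": tpot}
```

Averages and P99 values then follow by aggregating these per-request figures over a trace.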
Datasets
- ShareGPT
- LooGLE
- ReAct (with HotpotQA traces)
Benchmarks
- LooGLE (long-context benchmark)

