Overview
The system is implemented end-to-end and evaluated on an 8×H800 server with Llama2-13B and real workloads. Results show latency gains of up to 42% (P99 JCT on ShareGPT), but all experiments run on a single machine and use NCCL point-to-point send/recv as a prototype network stack.
Citations: 2
Evidence Strength: 0.80
Confidence: 0.90
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 6/6
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 70%
Why It Matters For Business
MemServe lets operators combine caching and disaggregated inference to cut end-to-end latency and tail times for many chat and long-context workloads, lowering hardware cost per request and improving user-perceived responsiveness.
Summary TLDR
MemServe introduces MemPool, an elastic distributed memory layer that manages GPU HBM and CPU DRAM, plus a global scheduler, so that context caching and disaggregated inference can be used together. MemPool exposes simple APIs (alloc, insert/match, transfer_with_insert) so inference engines can store, locate, and move the KV cache (intermediate attention state). Practical optimizations include block aggregation (huge-page style) to cut network calls and an operator-level cost model for routing vs. transfer decisions. On an 8×H800 server with Llama2-13B, MemServe reduces job completion time (JCT) and time-to-first-token (TTFT) by up to roughly 42% and gives consistent gains when caching and disaggregation are combined.
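To make the API surface concrete, here is a minimal in-memory sketch of the three call groups named above (memory, index, transfer). The class `ToyMemPool`, its method signatures, the 4-token block size, and the string block handles are all assumptions made for illustration; this summary does not give the paper's actual signatures, and the data copy itself is elided.

```python
from typing import Dict, List, Tuple

BLOCK_TOKENS = 4  # assumed tokens per KV block, for illustration only


class ToyMemPool:
    """Hypothetical single-process stand-in for one MemPool instance."""

    def __init__(self, name: str, capacity_blocks: int) -> None:
        self.name = name
        self.free = capacity_blocks
        self.next_id = 0
        # Index: block-aligned token prefix -> handle of the block that ends it.
        self.index: Dict[Tuple[int, ...], str] = {}

    def alloc(self, n_blocks: int) -> List[str]:
        """Memory API: reserve n_blocks and return opaque handles."""
        if n_blocks > self.free:
            raise MemoryError(f"{self.name}: out of blocks")
        self.free -= n_blocks
        handles = [f"{self.name}:blk{self.next_id + i}" for i in range(n_blocks)]
        self.next_id += n_blocks
        return handles

    def insert(self, tokens: List[int], handles: List[str]) -> None:
        """Index API: map each full-block prefix of tokens to its handle."""
        for i, handle in enumerate(handles):
            self.index[tuple(tokens[: (i + 1) * BLOCK_TOKENS])] = handle

    def match(self, tokens: List[int]) -> List[str]:
        """Index API: handles for the longest cached block-aligned prefix."""
        hits: List[str] = []
        for end in range(BLOCK_TOKENS, len(tokens) + 1, BLOCK_TOKENS):
            handle = self.index.get(tuple(tokens[:end]))
            if handle is None:
                break
            hits.append(handle)
        return hits

    def transfer_with_insert(
        self, dst: "ToyMemPool", tokens: List[int], n_blocks: int
    ) -> List[str]:
        """Transfer API: place blocks on dst and index them there in one step.
        Only the bookkeeping is modeled; no bytes are moved."""
        dst_handles = dst.alloc(n_blocks)
        dst.insert(tokens, dst_handles)
        return dst_handles


# Usage: a prefill instance caches an 8-token prompt, then hands it to decode.
prefill = ToyMemPool("prefill", capacity_blocks=8)
decode = ToyMemPool("decode", capacity_blocks=8)
prompt = list(range(8))
prefill.insert(prompt, prefill.alloc(2))
prefill.transfer_with_insert(decode, prompt, n_blocks=2)
print(decode.match(prompt + [99, 100]))
```

Fusing the transfer and the index update into one call means the receiving instance can serve later `match` lookups without a separate registration step, which appears to be the motivation behind the combined API.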
Problem Statement
Existing LLM serving systems treat the KV cache as request-local, so context caching (reusing KV across requests) and disaggregated inference (splitting prefill/decode across machines) have conflicting or missing mechanisms for managing and moving KV cache. That prevents combining inter-request and intra-request optimizations and leaves scheduling blind to cross-instance cache locality.
Main Contribution
MemPool: a distributed elastic memory pool that manages GPU HBM and CPU DRAM and exposes memory, index, and transfer APIs for KV cache.
MemServe: a serving system that uses MemPool to combine context caching and disaggregated inference and adds a global scheduler with prompt-tree locality-aware routing.
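As a rough illustration of prompt-tree locality-aware routing, the sketch below keeps one token trie per instance and sends each request to the instance with the longest cached prefix. The names `PromptTree` and `GlobalScheduler`, the per-token trie granularity, and the tie-break on lowest load are assumptions, not details taken from the paper.

```python
from typing import Dict, List


class PromptTree:
    """Token trie recording which prefixes an instance has cached (hypothetical)."""

    def __init__(self) -> None:
        self.children: Dict[int, "PromptTree"] = {}

    def add(self, tokens: List[int]) -> None:
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, PromptTree())

    def longest_prefix(self, tokens: List[int]) -> int:
        """Number of leading tokens already present in the tree."""
        node, depth = self, 0
        for tok in tokens:
            nxt = node.children.get(tok)
            if nxt is None:
                break
            node, depth = nxt, depth + 1
        return depth


class GlobalScheduler:
    """Routes each request to the instance likely to hold the most reusable KV."""

    def __init__(self, instances: List[str]) -> None:
        self.trees = {name: PromptTree() for name in instances}
        self.load = {name: 0 for name in instances}

    def route(self, tokens: List[int]) -> str:
        # Longest cached prefix wins; ties go to the least-loaded instance
        # (assumed policy).
        best = max(
            self.trees,
            key=lambda n: (self.trees[n].longest_prefix(tokens), -self.load[n]),
        )
        self.trees[best].add(tokens)  # the chosen instance will now cache this prompt
        self.load[best] += 1
        return best


sched = GlobalScheduler(["inst0", "inst1"])
for prompt in ([1, 2, 3], [9, 9], [1, 2, 3, 4]):
    print(prompt, "->", sched.route(prompt))
```

A real scheduler would also need to invalidate tree entries when an instance evicts blocks; this sketch never removes entries.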
Key Findings
Disaggregated inference plus MemPool reduces job completion time compared to the PD-colocated baseline (30% average and 42% P99 on ShareGPT).
Adding context caching on top of disaggregation provides further speedups.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| JCT (average, ShareGPT) | 30% improvement vs PD-colocated (disaggregation, 1P2D) | PD-colocated (vLLM) | −30% | ShareGPT | §8.3 End-to-End; ShareGPT paragraph | Figure 8; §8.3 |
| JCT (P99, ShareGPT) | 42% improvement vs PD-colocated (disaggregation) | PD-colocated (vLLM) | −42% | ShareGPT | Abstract; §8.3 | Figure 8; §8.3 |
What To Try In 7 Days
Prototype a MemPool-style index for the KV cache: track prefix-to-KV mappings and measure TTFT gains on your chat logs.
Run a single-server experiment: split prefill/decode on an extra instance and use transfer_with_insert to test reduced JCT.
Apply block aggregation for KV transfers: combine many small blocks into large transfers and measure network API call reduction and throughput.
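For the block-aggregation experiment, one simple way to estimate the reduction in network API calls is to coalesce contiguous block indices into runs and count one transfer per run. The sketch assumes blocks are addressed by integer index and that adjacent indices are adjacent in memory; the function name `coalesce` is hypothetical, and this shows only the call-count effect, not the paper's allocator.

```python
from typing import List, Tuple


def coalesce(block_ids: List[int]) -> List[Tuple[int, int]]:
    """Merge block indices into (start, length) runs of contiguous blocks."""
    runs: List[Tuple[int, int]] = []
    for block in sorted(set(block_ids)):
        if runs and runs[-1][0] + runs[-1][1] == block:
            # Block directly follows the current run: extend it.
            start, length = runs[-1]
            runs[-1] = (start, length + 1)
        else:
            # Gap in the index space: start a new run.
            runs.append((block, 1))
    return runs


scattered = [0, 1, 2, 3, 8, 9, 10, 11, 40]
runs = coalesce(scattered)
print(f"{len(scattered)} per-block calls -> {len(runs)} aggregated calls")
```

The more fragmented the allocation, the fewer blocks each run covers, so the ratio of the two counts is a quick proxy for how much aggregation can help on a given workload.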
Risks & Boundaries
Limitations
Evaluation runs on a single DGX H800 server; cross-machine RDMA/real datacenter behavior not measured.
Current prototype uses NCCL send/recv and sockets; NCCL lacks native gather/scatter and destination addressing.
When Not To Use
Workloads with little or no prompt prefix sharing (low cached ratio), where caching gives little benefit.
Environments without fast interconnects or RDMA where transfer cost outweighs reuse benefits.
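The transfer-versus-reuse trade-off in the last item can be checked with a back-of-envelope comparison: reuse pays off only when moving the cached KV is cheaper than recomputing it. The function below is a minimal sketch under that assumption; every parameter name and every number in the usage lines is hypothetical, and it does not reproduce the paper's operator-level cost model.

```python
def should_transfer(
    cached_tokens: int,
    kv_bytes_per_token: float,
    link_bytes_per_s: float,
    n_calls: int,
    per_call_overhead_s: float,
    prefill_tokens_per_s: float,
) -> bool:
    """Return True if fetching cached KV is estimated to beat recomputing it."""
    # Time to move the cached KV: payload time plus fixed per-call overhead.
    transfer_s = (
        cached_tokens * kv_bytes_per_token / link_bytes_per_s
        + n_calls * per_call_overhead_s
    )
    # Time to rebuild the same KV by running prefill again.
    recompute_s = cached_tokens / prefill_tokens_per_s
    return transfer_s < recompute_s


# Illustrative values only (assumed, not measured).
common = dict(
    cached_tokens=1000,
    kv_bytes_per_token=800_000,
    n_calls=10,
    per_call_overhead_s=1e-5,
    prefill_tokens_per_s=10_000,
)
print("fast link:", should_transfer(link_bytes_per_s=50e9, **common))
print("slow link:", should_transfer(link_bytes_per_s=1e9, **common))
```

With these assumed numbers the fast link favors transfer and the slow link favors recomputation, which is the situation the item above warns about.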
Failure Modes
A global index left stale by local evictions leads to misrouting and failed reuse.
Excessive network API calls if memory is fragmented, causing higher latency under load.

