Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
DualMap can serve more latency-sensitive requests and lower per-request compute cost by combining cache reuse with balanced load, improving throughput and reducing tail latency under real skewed workloads.
Summary TLDR
DualMap is a scheduling layer for distributed LLM serving that maps each request to two candidate instances (two independent hashes) and picks between them using SLO-aware rules. It preserves KV-cache reuse (cache affinity) while using the 'power of two choices' to balance load. Key techniques: adaptive prefix length, SLO-aware routing (prefer cache reuse until TTFT risk), hotspot-aware rebalancing (migrate within the two candidates), and dual-hash-ring scaling for low-disruption elasticity. On real traces, DualMap raises effective request capacity up to 2.25× and cuts P50/P90 TTFT substantially versus baselines.
Problem Statement
In distributed LLM serving, routing for KV-cache reuse (cache affinity) tends to concentrate popular prefixes on a few nodes and creates hotspots. Pure load-based routing scatters prefixes and forces recomputation. Existing single-mapping schedulers trade one objective for the other and cannot guarantee both low time-to-first-token (TTFT) and even load under real, skewed workloads.
Main Contribution
DualMap: a dual-mapping scheduler that assigns two prefix-bound candidate instances per request and picks one at dispatch time.
SLO-aware routing: prefer cache reuse until expected TTFT would exceed the SLO, then switch to load-aware choice.
Hotspot-aware rebalancing: selective, non-recursive batch migration within each request's two candidates to relieve overloads.
Dual-hash-ring scaling: consistent dual-hash ring that limits remapping during add/remove instance operations.
Implementation and evaluation on real traces with vLLM showing practical latency and capacity gains.
Key Findings
DualMap increases effective request capacity up to 2.25× versus state-of-the-art schedulers on evaluated traces.
On the Tool&Agent trace DualMap raised effective request capacity by as much as 125% and goodput by 16.7–48% versus the best baseline.
DualMap reduces P50 TTFT by 55.4%–97.4% and P90 TTFT by 82.3%–97% versus baselines on evaluated settings.
DualMap achieves cache hit rates close to cache-affinity: 62.5% and 96.4% of the theoretical upper bound on Conversation and Tool&Agent.
Scheduling overhead is small: per-request routing ~0.6 ms, KV cache queries ~0.2 ms, rebalancing invoked cost 2.2–2.5 ms.
DualMap scales near-linearly in goodput from 8 to 32 instances on the simulated workload.
Results
Effective Request Capacity
Goodput
P50 TTFT
P90 TTFT
Cache Hit Rate
Scheduler runtime overhead
Who Should Care
What To Try In 7 Days
Measure prefix-sharing in your traces (shared-prefix rate) to estimate cache opportunities.
Prototype dual-hash mapping for prefixes and track cache hit rate vs queue lengths.
Add an SLO threshold: compute a pending-token TTFT threshold and switch to less-loaded candidate when exceeded.
Optimization Features
Token Efficiency
- reduces redundant prefill compute via KV-cache reuse
- lowers pending prefill token counts per instance
Infra Optimization
- small metadata footprint per instance (≈146 KB for 7B)
- scheduling ops independent of cluster size
System Optimization
- improves load balance while keeping cache locality
- limits remapping during scaling to local regions
Inference Optimization
- dual-mapping scheduling (two candidate instances)
- SLO-aware request routing (preserve cache reuse until SLO risk)
- hotspot-aware batch migration within candidate pair
- dual-hash-ring consistent scaling
Reproducibility
Code Urls
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Relies on accurate TTFT estimation; misestimation under heavy decode bottlenecks can hurt decisions (§A.7).
- Evaluations use two Mooncake traces and specific Ascend NPU hardware; results may vary on different workloads or hardware.
- Hotspots can still form; DualMap uses intra-pair migration only, which may not fully solve extreme skew across many hot prefixes.
When Not To Use
- Workloads with almost no prefix sharing (no KV-cache benefits).
- Very small clusters where two-choice gains are minimal.
- Environments where TTFT cannot be estimated or prefill/decode separation is unavailable.
Failure Modes
- Incorrect TTFT estimates cause wrong switches from cache-aware to load-aware routing, increasing recomputation.
- Frequent migrations if thresholds are tuned too aggressively, reducing cache reuse and increasing overhead.
- Extreme and rapidly shifting skew outside the adaptive prefix mechanism's reaction window may still create SLO violations.
Core Entities
Models
- Qwen2.5-7B
- Qwen2.5-14B
Metrics
- Effective Request Capacity
- Goodput
- P50 TTFT
- P90 TTFT
- P50 E2E
- P90 E2E
- Cache Hit Rate
- Load Balance Ratio (CV)
Datasets
- Conversation (Mooncake trace)
- Tool&Agent (Mooncake trace)

