DualMap: dual-hash scheduling that preserves KV-cache reuse while balancing load for LLM serving

February 6, 20268 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Ying Yuan, Pengfei Zuo, Bo Wang, Zhangyu Chen, Zhipeng Tan, Zhou Yu

Links

Abstract / PDF

Why It Matters For Business

DualMap can serve more latency-sensitive requests and lower per-request compute cost by combining cache reuse with balanced load, improving throughput and reducing tail latency under real skewed workloads.

Summary TLDR

DualMap is a scheduling layer for distributed LLM serving that maps each request to two candidate instances (two independent hashes) and picks between them using SLO-aware rules. It preserves KV-cache reuse (cache affinity) while using the 'power of two choices' to balance load. Key techniques: adaptive prefix length, SLO-aware routing (prefer cache reuse until TTFT risk), hotspot-aware rebalancing (migrate within the two candidates), and dual-hash-ring scaling for low-disruption elasticity. On real traces, DualMap raises effective request capacity up to 2.25× and cuts P50/P90 TTFT substantially versus baselines.

Problem Statement

In distributed LLM serving, routing for KV-cache reuse (cache affinity) tends to concentrate popular prefixes on a few nodes and creates hotspots. Pure load-based routing scatters prefixes and forces recomputation. Existing single-mapping schedulers trade one objective for the other and cannot guarantee both low time-to-first-token (TTFT) and even load under real, skewed workloads.

Main Contribution

DualMap: a dual-mapping scheduler that assigns two prefix-bound candidate instances per request and picks one at dispatch time.

SLO-aware routing: prefer cache reuse until expected TTFT would exceed the SLO, then switch to load-aware choice.

Hotspot-aware rebalancing: selective, non-recursive batch migration within each request's two candidates to relieve overloads.

Dual-hash-ring scaling: consistent dual-hash ring that limits remapping during add/remove instance operations.

Implementation and evaluation on real traces with vLLM showing practical latency and capacity gains.

Key Findings

DualMap increases effective request capacity up to 2.25× versus state-of-the-art schedulers on evaluated traces.

Numbersup to 2.25× effective request capacity (abstract, §5)

On the Tool&Agent trace DualMap raised effective request capacity by as much as 125% and goodput by 16.7–48% versus the best baseline.

NumbersEffective capacity +125%; goodput +16.7%–48% (Figures 3b,3d)

DualMap reduces P50 TTFT by 55.4%–97.4% and P90 TTFT by 82.3%–97% versus baselines on evaluated settings.

NumbersP50 ↓55.4%–97.4%; P90 ↓82.3%–97% (Figure 4)

DualMap achieves cache hit rates close to cache-affinity: 62.5% and 96.4% of the theoretical upper bound on Conversation and Tool&Agent.

Numbers62.5% and 96.4% of upper bound (Figures 10a/10b, §A.2.2)

Scheduling overhead is small: per-request routing ~0.6 ms, KV cache queries ~0.2 ms, rebalancing invoked cost 2.2–2.5 ms.

Numbersrouting 0.6 ms; KV queries 0.2 ms; rebalancing 2.2–2.5 ms (A.3.2)

DualMap scales near-linearly in goodput from 8 to 32 instances on the simulated workload.

Numbersnear-linear goodput growth (Figure 13a)

Results

Effective Request Capacity

Valueup to 2.25× vs baselines

Baselinestate-of-the-art schedulers (e.g., Mooncake/Preble)

Goodput

Value+16.7%–48% (Tool&Agent); +14.3%–40% (Conversation)

Baselinebest baseline per trace

P50 TTFT

Valuereduced by 55.4%–97.4%

Baselinebest baseline

P90 TTFT

Valuereduced by 82.3%–97%

Baselinebest baseline

Cache Hit Rate

Value62.5% and 96.4% of theoretical upper bound (Conversation, Tool&Agent)

Baselinetheoretical upper bound / Cache Affinity

Scheduler runtime overhead

Valuerouting ~0.6 ms/request; KV cache query ~0.2 ms; rebalancing 2.2–2.5 ms/invocation

Baselinen/a

Who Should Care

What To Try In 7 Days

Measure prefix-sharing in your traces (shared-prefix rate) to estimate cache opportunities.

Prototype dual-hash mapping for prefixes and track cache hit rate vs queue lengths.

Add an SLO threshold: compute a pending-token TTFT threshold and switch to less-loaded candidate when exceeded.

Optimization Features

Token Efficiency

  • reduces redundant prefill compute via KV-cache reuse
  • lowers pending prefill token counts per instance

Infra Optimization

  • small metadata footprint per instance (≈146 KB for 7B)
  • scheduling ops independent of cluster size

System Optimization

  • improves load balance while keeping cache locality
  • limits remapping during scaling to local regions

Inference Optimization

  • dual-mapping scheduling (two candidate instances)
  • SLO-aware request routing (preserve cache reuse until SLO risk)
  • hotspot-aware batch migration within candidate pair
  • dual-hash-ring consistent scaling

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on accurate TTFT estimation; misestimation under heavy decode bottlenecks can hurt decisions (§A.7).
  • Evaluations use two Mooncake traces and specific Ascend NPU hardware; results may vary on different workloads or hardware.
  • Hotspots can still form; DualMap uses intra-pair migration only, which may not fully solve extreme skew across many hot prefixes.

When Not To Use

  • Workloads with almost no prefix sharing (no KV-cache benefits).
  • Very small clusters where two-choice gains are minimal.
  • Environments where TTFT cannot be estimated or prefill/decode separation is unavailable.

Failure Modes

  • Incorrect TTFT estimates cause wrong switches from cache-aware to load-aware routing, increasing recomputation.
  • Frequent migrations if thresholds are tuned too aggressively, reducing cache reuse and increasing overhead.
  • Extreme and rapidly shifting skew outside the adaptive prefix mechanism's reaction window may still create SLO violations.

Core Entities

Models

  • Qwen2.5-7B
  • Qwen2.5-14B

Metrics

  • Effective Request Capacity
  • Goodput
  • P50 TTFT
  • P90 TTFT
  • P50 E2E
  • P90 E2E
  • Cache Hit Rate
  • Load Balance Ratio (CV)

Datasets

  • Conversation (Mooncake trace)
  • Tool&Agent (Mooncake trace)