OrbitFlow adaptively reconfigures per-request KV cache placements to meet token-level latency SLOs for long-context LLM serving

January 5, 2026 · 7 min

Overview

Production Readiness

0.8

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

0

Authors

Xinyue Ma, Heelim Hong, Taegeon Um, Jongseop Lee, Seoyeong Choy, Woo-Yeon Lee, Myeongjae Jeon

Links

Abstract / PDF

Why It Matters For Business

OrbitFlow reduces token-level latency violations and raises throughput for long-context LLM services, improving user-perceived responsiveness and allowing more requests per GPU under real workloads.

Summary TLDR

Long-context LLMs make KV caches grow unpredictably, and the resulting CPU–GPU transfers spike per-token latency and break SLOs. OrbitFlow uses a lightweight ILP solver to pick per-request, per-layer KV placements, continuously reoptimizes plans during decoding, and adds two runtime mechanisms, Token-Deposit buffering and Pause-Resume, to mask violations and free memory. On ShareGPT-derived traces and LLaMA3 models, OrbitFlow improves TPOT and TBT SLO attainment by 62% and 66%, respectively, cuts P95 latency by 38%, and achieves up to 3.3× throughput versus existing offloading systems. Code and artifacts are available.

Problem Statement

Serving long-context LLMs produces KV caches that keep growing during decoding and fluctuate with request length and batching. Static, layer-uniform offloading cannot adapt to this token- and batch-level drift, so it causes excessive CPU-to-GPU transfers, stalls, and frequent token-level SLO violations for interactive services.
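To make the memory pressure concrete, the following is a back-of-the-envelope sketch of per-request KV cache size. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) is an assumption chosen to resemble an 8B-class GQA model; it is not taken from the OrbitFlow paper, and the function name is illustrative.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache for one request: keys and values for every layer."""
    per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_elem  # K and V
    return num_tokens * num_layers * per_token_per_layer

# Assumed 8B-class GQA shape (illustrative, not from the paper).
ASSUMED_SHAPE = dict(num_layers=32, num_kv_heads=8, head_dim=128)

size_32k = kv_cache_bytes(32_768, **ASSUMED_SHAPE)
size_gib = size_32k / 2**30
print(f"KV cache for a 32K-token request: {size_gib:.1f} GiB")  # 4.0 GiB
```

Under these assumptions a single 32K-token request holds about 4 GiB of KV cache, so a handful of concurrent long requests can exhaust GPU memory and force offloading.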

Main Contribution

OrbitFlow: a runtime system that chooses per-request, per-layer KV placements to minimize SLO violations under GPU memory limits.

An ILP-based Placement Planner that prunes the search space to distance-driven (evenly spaced) placements and runs one step ahead to hide solver cost.

Two runtime mechanisms: Token-Deposit (buffer tokens and release at SLO rate) and Pause-Resume (defer heavy requests) to preserve SLOs.

A practical implementation on vLLM with extensions for tensor-parallel multi-GPU serving and public artifacts on GitHub.
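The Token-Deposit idea listed above (buffer tokens, release at the SLO rate) can be illustrated with a small pacing model. This is a minimal sketch under the assumption that each token is shown to the user no earlier than one SLO interval after the previous one; the function name and the example timings are hypothetical and not drawn from the OrbitFlow code.

```python
def paced_release_times(gen_times_ms, slo_interval_ms):
    """Release each token at the later of its generation time and
    one SLO interval after the previous release."""
    release = []
    for t in gen_times_ms:
        earliest = release[-1] + slo_interval_ms if release else t
        release.append(max(t, earliest))
    return release

def max_gap(times_ms):
    """Largest time between consecutive tokens."""
    return max(b - a for a, b in zip(times_ms, times_ms[1:]))

# Four fast tokens, then a 240 ms stall (for example, a slow KV transfer).
generated = [0, 20, 40, 60, 300]
released = paced_release_times(generated, slo_interval_ms=100)

print(max_gap(generated))  # 240: the raw stream violates a 100 ms TBT SLO
print(released)            # [0, 100, 200, 300, 400]
print(max_gap(released))   # 100: the paced stream stays within the SLO
```

Tokens generated ahead of schedule act as a deposit that is spent while the stalled token is still being computed, which is why the user-visible gaps stay within the SLO in this example.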

Key Findings

Solver-driven, per-request placement substantially improves SLO attainment.

Numbers: TPOT +62% and TBT +66% SLO attainment (evaluated traces)

Tail latency and throughput both improve under OrbitFlow.

Numbers: P95 latency −38%; throughput up to 3.3× (vs. best baselines)

Runtime overhead of OrbitFlow is small.

Numbers: Placement Planner adds <1% of end-to-end time in dynamic traces

Token-Deposit and Pause-Resume materially contribute to SLO gains.

Numbers: In the ablation, adding Token-Deposit raises final SLO attainment from 71.0% to 85.6%

Results

TPOT SLO attainment

Value: 62% higher (on evaluated traces)

Baseline: existing offloading methods

TBT SLO attainment

Value: 66% higher (on evaluated traces)

Baseline: existing offloading methods

P95 latency

Value: 38% reduction

Baseline: best competing offloading methods

Throughput

Value: Up to 3.3× higher

Baseline: DeepSpeed-Inference and other baselines

Solver runtime overhead

Value: <1% of end-to-end time (in dynamic traces)
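For readers reproducing the attainment numbers above on their own traces, here is a minimal sketch of how TBT and TPOT SLO attainment could be computed from per-token timestamps. The definitions are assumptions: TBT attainment is taken as the fraction of inter-token gaps within the SLO, and TPOT attainment as the fraction of requests whose mean decode time per token is within the SLO. The paper may define them differently, and the sample data is invented.

```python
def tbt_gaps(token_times_ms):
    """Time between consecutive tokens of one request."""
    return [b - a for a, b in zip(token_times_ms, token_times_ms[1:])]

def tbt_attainment(requests, slo_ms):
    """Fraction of all inter-token gaps that meet the SLO."""
    gaps = [g for times in requests for g in tbt_gaps(times)]
    return sum(g <= slo_ms for g in gaps) / len(gaps)

def tpot_attainment(requests, slo_ms):
    """Fraction of requests whose average time per output token meets the SLO.
    Each request needs at least two token timestamps."""
    ok = 0
    for times in requests:
        tpot = (times[-1] - times[0]) / (len(times) - 1)
        ok += tpot <= slo_ms
    return ok / len(requests)

requests = [
    [0, 50, 100, 150, 200],  # steady decoding
    [0, 20, 40, 60, 300],    # one 240 ms stall
]
print(tbt_attainment(requests, slo_ms=100))   # 0.875
print(tpot_attainment(requests, slo_ms=100))  # 1.0
```

The example shows why both metrics are reported: the stalled request still averages 75 ms per token and passes TPOT, while TBT exposes the single long gap.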

Who Should Care

What To Try In 7 Days

Run OrbitFlow (repo available) on a staging vLLM instance with your longest prompts to compare TBT/TPOT vs your current offloading.

Enable Token-Deposit buffering to smooth output bursts and measure perceived latency improvements.

Experiment with Pause-Resume: identify and defer heavy KV requests to raise overall throughput and lower tail latency.
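One way to prototype the Pause-Resume experiment above before touching a serving stack is a small offline policy over a snapshot of active requests. The sketch below uses a greedy rule (pause the requests with the largest KV footprint until the rest fit a GPU budget); this rule, the function name, and the sizes are assumptions for illustration and are not OrbitFlow's actual selection logic.

```python
def choose_paused(kv_bytes_by_request, gpu_budget_bytes):
    """Greedy sketch: pause largest-KV requests until the remainder fits.

    Returns (paused_ids, active_ids)."""
    active = dict(kv_bytes_by_request)
    paused = []
    # Visit requests from largest to smallest KV footprint.
    for req_id, _size in sorted(active.items(), key=lambda kv: kv[1], reverse=True):
        if sum(active.values()) <= gpu_budget_bytes:
            break
        paused.append(req_id)
        del active[req_id]
    return paused, sorted(active)

# Hypothetical snapshot: KV footprint per request, in GiB for readability.
snapshot = {"a": 6, "b": 3, "c": 2}
paused, active = choose_paused(snapshot, gpu_budget_bytes=6)
print(paused, active)  # ['a'] ['b', 'c']
```

Logging which requests such a policy would pause, and for how long, gives an early estimate of the completion-time cost paid by deferred requests.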

Optimization Features

Infra Optimization

  • Multi-GPU tensor-parallel extension with broadcasted placements
  • Solver run on CPU one step ahead to hide latency
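The one-step-ahead pattern in the list above can be sketched with a background worker that plans step N+1 while step N decodes. Both `plan_placement` and `decode_step` are hypothetical placeholders standing in for the solver and the model forward pass; only the overlap structure is the point.

```python
from concurrent.futures import ThreadPoolExecutor

def plan_placement(step):
    """Placeholder for the CPU-side planner (hypothetical)."""
    return {"planned_for_step": step}

def decode_step(step, plan):
    """Placeholder for one GPU decode step using a given plan (hypothetical)."""
    return f"token{step}"

def run(num_steps):
    log = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        plan = plan_placement(0)  # only the first plan is computed synchronously
        for step in range(num_steps):
            # Start planning the next step, then decode the current one.
            next_plan = pool.submit(plan_placement, step + 1)
            token = decode_step(step, plan)
            log.append((token, plan["planned_for_step"]))
            # Blocks only if planning took longer than the decode step.
            plan = next_plan.result()
    return log

print(run(3))  # [('token0', 0), ('token1', 1), ('token2', 2)]
```

The `result()` call is where solver latency would become visible: if planning outlasts the decode step, the loop waits, which corresponds to the stall described under Failure Modes.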

System Optimization

  • GPU memory packing via fine-grained placements
  • Asynchronous prefetching and bandwidth-sharing model

Inference Optimization

  • Per-request, per-layer KV offloading
  • ILP-based placement planning
  • Distance-driven search-space pruning
  • Token-Deposit buffering (token-level smoothing)
  • Pause-Resume request deferral
  • Adaptive reconfiguration based on runtime profiling
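To illustrate what a distance-driven placement looks like, the sketch below offloads every d-th layer and scans for the largest distance (fewest offloaded layers) that fits a GPU budget. It deliberately replaces the ILP with a brute-force scan over one request, and the function names and sizes are assumptions; it shows the shape of the pruned search space, not OrbitFlow's planner.

```python
def offloaded_layers(num_layers, distance):
    """Indices of layers whose KV lives on the CPU when every
    `distance`-th layer is offloaded (evenly spaced)."""
    return list(range(distance - 1, num_layers, distance))

def pick_distance(num_layers, per_layer_kv_bytes, gpu_budget_bytes):
    """Largest distance whose GPU-resident KV fits the budget.

    distance == num_layers + 1 means nothing is offloaded;
    distance == 1 means every layer is offloaded."""
    for distance in range(num_layers + 1, 0, -1):
        on_gpu = num_layers - len(offloaded_layers(num_layers, distance))
        if on_gpu * per_layer_kv_bytes <= gpu_budget_bytes:
            return distance
    raise ValueError("budget is negative; no placement fits")

# Hypothetical request: 8 layers, 1 unit of KV per layer, room for 6 units.
d = pick_distance(num_layers=8, per_layer_kv_bytes=1, gpu_budget_bytes=6)
print(d, offloaded_layers(8, d))  # 4 [3, 7]
```

Restricting candidates to a single distance per request shrinks the search from all subsets of layers to at most one option per distance, which is the kind of pruning that keeps a solver cheap enough to run every step.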

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Pause-Resume defers some requests, increasing their end-to-end completion time by design.
  • Solver restricts placements to evenly spaced (distance-driven) patterns and can miss a small fraction of strictly better layouts.
  • Evaluation uses ShareGPT-derived synthetic traces; real production traces could reveal different batch dynamics.
  • Solver can occasionally exceed one decode step in very volatile workloads, causing visible overhead in rare cases.

When Not To Use

  • If GPU memory is abundant and offloading is unnecessary, static methods are simpler.
  • If strict per-request fairness prohibits pausing or deferring any requests.
  • When workloads are always short-context so KV growth is negligible.

Failure Modes

  • Solver latency occasionally exceeds the step window and cannot be fully hidden, creating stalls.
  • Misprofiled bandwidth or compute times lead to suboptimal placements and extra transfers.
  • Pause-Resume choices may harm user experience for paused requests if not tuned.
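The misprofiled-bandwidth failure mode above can be made concrete with a simple overlap model: a prefetched layer stalls decoding only when its transfer takes longer than the compute it overlaps with. The model, the function name, and all numbers below are illustrative assumptions rather than measurements or OrbitFlow's bandwidth-sharing model.

```python
def layer_stall_ms(layer_kv_bytes, bandwidth_bytes_per_ms, compute_ms):
    """Stall for one offloaded layer when its transfer overlaps with compute."""
    transfer_ms = layer_kv_bytes / bandwidth_bytes_per_ms
    return max(0.0, transfer_ms - compute_ms)

MIB = 2**20
layer_kv = 128 * MIB  # hypothetical per-layer KV size for a long request
compute = 5.0         # hypothetical overlapping compute time, in ms

# A plan built on a profiled 32 MiB/ms link predicts no stall...
print(layer_stall_ms(layer_kv, 32 * MIB, compute))  # 0.0
# ...but if the link actually delivers 16 MiB/ms, each such layer stalls.
print(layer_stall_ms(layer_kv, 16 * MIB, compute))  # 3.0
```

Because the stall repeats for every offloaded layer at every decode step, a modest profiling error can translate into a sustained TBT violation, so re-profiling bandwidth under realistic contention is worthwhile.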

Core Entities

Models

  • LLaMA3-8B
  • LLaMA3-70B

Metrics

  • TPOT
  • TBT
  • TTFT
  • P95
  • P99
  • Throughput
  • E2E latency

Datasets

  • ShareGPT-derived synthetic traces

Context Entities

Models

  • Grouped-Query Attention (GQA) models referenced (LLaMA3 family)

Metrics

  • TPOT and TBT SLO attainment used for evaluation

Datasets

  • ShareGPT (sampling used to synthesize traces)