OrbitFlow adaptively reconfigures per-request KV cache placements to meet token-level latency SLOs for long-context LLM serving

January 5, 2026 · 7 min

Overview

Decision Snapshot: Ready For Pilot

OrbitFlow builds on existing offloading ideas but adds solver-driven per-request placements and practical runtime policies; experiments on LLaMA3 and multi-GPU setups show consistent gains with low overhead, making it ready for staged deployment in services facing long-context workloads.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 80%

Novelty: 70%

Authors

Xinyue Ma, Heelim Hong, Taegeon Um, Jongseop Lee, Seoyeong Choy, Woo-Yeon Lee, Myeongjae Jeon

Links

Abstract / PDF / Code / Data

Why It Matters For Business

OrbitFlow reduces token-level latency violations and raises throughput for long-context LLM services, improving user-perceived responsiveness and allowing more requests per GPU under real workloads.

Summary TLDR

Long-context LLMs make KV caches grow unpredictably, which forces CPU–GPU transfers that spike per-token latency and break SLOs. OrbitFlow uses a lightweight ILP solver to pick per-request, per-layer KV placements and continuously reoptimizes those plans during decoding. It also adds two runtime mechanisms, Token-Deposit buffering and Pause-Resume, to mask violations and free memory. On ShareGPT-derived traces and LLaMA3 models, OrbitFlow improves TPOT and TBT SLO attainment by 62% and 66%, respectively, cuts P95 latency by 38%, and achieves up to 3.3× throughput versus existing offloading systems. Code and artifacts are available.

Problem Statement

Serving long-context LLMs forces growing KV caches that fluctuate with request length and batching. Static, layer-uniform offloading cannot adapt to token- and batch-level drift, causing excessive CPU-to-GPU transfers, stalls, and frequent token-level SLO violations for interactive services.
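
To make the growth concrete, here is a back-of-the-envelope sketch of per-request KV cache size. The function name and the LLaMA3-8B shape values (32 layers, 8 KV heads under GQA, head dimension 128, 2-byte cache entries) are assumptions taken from the public model configuration, not figures from the OrbitFlow paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Bytes of KV cache for one request: keys plus values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes


# Assumed LLaMA3-8B shape with fp16/bf16 cache entries.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
at_32k = kv_cache_bytes(32, 8, 128, seq_len=32 * 1024)
print(per_token // 1024, "KiB per token")
print(at_32k / 2**30, "GiB at 32K tokens")
```

Under these assumptions each generated token adds 128 KiB, so a handful of concurrent long-context requests can exceed the GPU memory left after model weights, which is what pushes serving systems toward offloading.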

Main Contribution

OrbitFlow: a runtime system that chooses per-request, per-layer KV placements to minimize SLO violations under GPU memory limits.

An ILP-based Placement Planner that prunes the search space to distance-driven placements and runs one step ahead to hide solver cost.
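
The idea of restricting each request to a small set of evenly spaced placements can be sketched as follows. This is an illustration only: it swaps the ILP for exhaustive search over a toy problem, assumes that "distance d" means "offload every d-th layer", and assumes the objective is to minimize offloaded bytes as a proxy for transfer time. `Request`, `distance_placement`, and `plan` are hypothetical names, not OrbitFlow's API.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Request:
    req_id: str
    kv_bytes_per_layer: int  # KV cache size of one layer for this request


def distance_placement(num_layers, distance):
    """Per-layer flags; True means that layer's KV is offloaded to CPU.

    distance == 0 keeps every layer on the GPU. distance d >= 1 offloads
    every d-th layer, so a smaller d offloads more layers (assumed meaning).
    """
    if distance == 0:
        return [False] * num_layers
    return [(layer + 1) % distance == 0 for layer in range(num_layers)]


def plan(requests, num_layers, gpu_budget_bytes, distances=(0, 8, 4, 2, 1)):
    """Pick one distance per request by brute force (stand-in for the ILP).

    Keeps the choice with the fewest offloaded bytes whose GPU-resident KV
    fits in the budget. Returns {req_id: distance}, or None if nothing fits.
    """
    best, best_cost = None, None
    for choice in product(distances, repeat=len(requests)):
        resident = offloaded = 0
        for req, d in zip(requests, choice):
            n_off = sum(distance_placement(num_layers, d))
            offloaded += n_off * req.kv_bytes_per_layer
            resident += (num_layers - n_off) * req.kv_bytes_per_layer
        fits = resident <= gpu_budget_bytes
        if fits and (best_cost is None or offloaded < best_cost):
            best = {r.req_id: d for r, d in zip(requests, choice)}
            best_cost = offloaded
    return best


reqs = [Request("a", 10), Request("b", 30)]
print(plan(reqs, num_layers=8, gpu_budget_bytes=260))
```

Because each request has only a few candidate distances instead of one binary choice per layer, the search space stays small, which is the property that lets a real solver finish within a decoding step.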

Key Findings

Solver-driven, per-request placement substantially improves SLO attainment.

Numbers: TPOT +62% and TBT +66% SLO attainment (evaluated traces)

Practical Use: Replace uniform layer-offload policies with adaptive per-request placement to cut token-level SLO violations in long-context workloads.

Evidence Ref: Abstract; Sec. 5.2; Fig. 8

Tail latency and throughput both improve under OrbitFlow.

Numbers: P95 latency −38%; throughput up to 3.3× (vs. best baselines)

Practical Use: Deploying OrbitFlow can both reduce tail latency and increase completed requests per minute, improving UX and server capacity.

Evidence Ref: Abstract; Sec. 5.2; Fig. 8–9

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| TPOT SLO attainment | 62% higher (on evaluated traces) | existing offloading methods | - | ShareGPT-derived traces, default config | Abstract; Sec. 5.2; Fig. 8 | Sec. 5.2 |
| TBT SLO attainment | 66% higher (on evaluated traces) | existing offloading methods | - | ShareGPT-derived traces, default config | Abstract; Sec. 5.2; Fig. 8 | Sec. 5.2 |
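
For readers who want to reproduce these metrics on their own traces, the sketch below computes TPOT and TBT SLO attainment from per-token timestamps. The definitions are assumptions (the paper may count attainment differently): a request meets the TPOT SLO when its mean inter-token gap is within the target, and TBT attainment is the share of individual gaps within the target. Function names are hypothetical.

```python
def tbt_gaps(token_times):
    """Gaps between consecutive output tokens, in the input's time unit."""
    return [b - a for a, b in zip(token_times, token_times[1:])]


def tpot(token_times):
    """Mean time per output token after the first one."""
    gaps = tbt_gaps(token_times)
    return sum(gaps) / len(gaps) if gaps else 0.0


def slo_attainment(traces, tpot_slo, tbt_slo):
    """Return (TPOT attainment, TBT attainment) as fractions in [0, 1].

    traces: list of per-request token timestamp lists.
    """
    tpot_ok = sum(1 for t in traces if tpot(t) <= tpot_slo)
    all_gaps = [g for t in traces for g in tbt_gaps(t)]
    tbt_ok = sum(1 for g in all_gaps if g <= tbt_slo)
    tpot_att = tpot_ok / len(traces) if traces else 1.0
    tbt_att = tbt_ok / len(all_gaps) if all_gaps else 1.0
    return tpot_att, tbt_att


# Timestamps in milliseconds; the second request stalls once for 200 ms.
traces = [[0, 50, 100, 150], [0, 50, 250, 300]]
tpot_att, tbt_att = slo_attainment(traces, tpot_slo=80, tbt_slo=100)
print(tpot_att, tbt_att)
```

A single stall can leave a request's TPOT acceptable while still violating TBT, which is why the two metrics are reported separately.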

What To Try In 7 Days

Run OrbitFlow (repo available) on a staging vLLM instance with your longest prompts to compare TBT/TPOT vs your current offloading.

Enable Token-Deposit buffering to smooth output bursts and measure perceived latency improvements.

Experiment with Pause-Resume: identify and defer heavy KV requests to raise overall throughput and lower tail latency.
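
Before enabling Token-Deposit buffering, it can help to estimate its effect offline. The sketch below is an assumption about how such smoothing behaves, not OrbitFlow's implementation: tokens generated faster than a target pace are held back ("deposited") and released at a fixed interval, so a later stall in generation is hidden from the client. `paced_release` is a hypothetical helper.

```python
def paced_release(gen_times, interval):
    """Return the time each token is shown to the client.

    A token is released no earlier than it was generated and no sooner
    than `interval` after the previous release.
    """
    release = []
    for t in gen_times:
        earliest = release[-1] + interval if release else t
        release.append(max(t, earliest))
    return release


# Milliseconds: four fast tokens, then a 110 ms stall before the fifth.
gen = [0, 10, 20, 30, 140]
shown = paced_release(gen, interval=40)
raw_gaps = [b - a for a, b in zip(gen, gen[1:])]
shown_gaps = [b - a for a, b in zip(shown, shown[1:])]
print(shown, max(raw_gaps), max(shown_gaps))
```

In this toy trace the raw stream has one 110 ms gap, while the paced stream never exceeds 40 ms. The cost is that early tokens are shown later than they were generated.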

Optimization Features

Infra Optimization

Multi-GPU tensor-parallel extension with broadcasted placements

Solver run on CPU one step ahead to hide latency

System Optimization

GPU memory packing via fine-grained placements

Asynchronous prefetching and bandwidth-sharing model

Inference Optimization

Per-request, per-layer KV offloading

ILP-based placement planning

Distance-driven search-space pruning

Token-Deposit buffering (token-level smoothing)

Pause-Resume request deferral

Adaptive reconfiguration based on runtime profiling

Risks & Boundaries

Limitations

Pause-Resume defers some requests, increasing their end-to-end completion time by design.

Solver restricts placements to evenly spaced (distance-driven) patterns and can miss a small fraction of strictly better layouts.
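
The deferral trade-off behind Pause-Resume can be illustrated with a minimal sketch. The policy here is an assumption chosen for simplicity (pause the requests with the largest KV footprint until the rest fit in GPU memory); OrbitFlow's actual selection rule may differ, and `select_paused` is a hypothetical name.

```python
def select_paused(kv_bytes, gpu_budget_bytes):
    """Split requests into (running, paused) lists of request ids.

    kv_bytes maps request id to its current KV cache size in bytes.
    The heaviest requests are paused first until the remainder fits.
    """
    running = dict(kv_bytes)
    paused = []
    for req_id in sorted(kv_bytes, key=kv_bytes.get, reverse=True):
        if sum(running.values()) <= gpu_budget_bytes:
            break
        del running[req_id]
        paused.append(req_id)
    return sorted(running), paused


running, paused = select_paused({"a": 10, "b": 70, "c": 30}, gpu_budget_bytes=50)
print(running, paused)
```

Pausing the single heavy request lets the two lighter ones proceed without offloading, at the price of a longer completion time for the paused one.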

When Not To Use

If GPU memory is abundant and offloading is unnecessary, static methods are simpler.

If strict per-request fairness prohibits pausing or deferring any requests.

Failure Modes

Solver latency occasionally exceeds the step window and cannot be fully hidden, creating stalls.

Misprofiled bandwidth or compute times lead to suboptimal placements and extra transfers.
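
One way to bound the first failure mode is to treat the solver result as optional: if the plan for the next step is not ready in time, keep using the previous plan instead of stalling. The sketch below shows that pattern with Python's standard `concurrent.futures`; the class, its methods, and `slow_solver` are hypothetical and are not taken from the OrbitFlow code base.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class OneStepAheadPlanner:
    """Runs a solver in the background and falls back to the last plan."""

    def __init__(self, solve, initial_plan):
        self._solve = solve
        self._plan = initial_plan
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = None

    def start_next(self, state):
        """Start solving for the next step while the current step runs."""
        if self._pending is None:
            self._pending = self._pool.submit(self._solve, state)

    def current_plan(self, wait_s=0.0):
        """Adopt the new plan if it arrives within wait_s, else keep the old one."""
        if self._pending is not None:
            try:
                self._plan = self._pending.result(timeout=wait_s)
                self._pending = None
            except FutureTimeout:
                pass  # solver overran the step window; reuse the stale plan
        return self._plan

    def close(self):
        self._pool.shutdown(wait=True)


def slow_solver(state):
    """Hypothetical stand-in for an ILP solve that takes 200 ms."""
    time.sleep(0.2)
    return f"plan-{state}"


planner = OneStepAheadPlanner(slow_solver, initial_plan="plan-0")
planner.start_next(1)
stale = planner.current_plan(wait_s=0.0)  # solver still running
fresh = planner.current_plan(wait_s=2.0)  # solver has finished by now
planner.close()
print(stale, fresh)
```

A stale plan may be suboptimal for the new batch, so this trades placement quality for a bounded step time. Whether that trade is acceptable depends on how quickly the workload drifts.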

Core Entities

Models

LLaMA3-8B, LLaMA3-70B

Metrics

TPOT, TBT, TTFT, P95, P99, Throughput, E2E latency

Datasets

ShareGPT-derived synthetic traces

Context Entities

Models

Grouped-Query Attention (GQA) models referenced (LLaMA3 family)

Metrics

TPOT and TBT SLO attainment used for evaluation

Datasets

ShareGPT (sampling used to synthesize traces)