Overview
Production Readiness
0.8
Novelty Score
0.7
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
OrbitFlow reduces token-level latency violations and raises throughput for long-context LLM services, improving user-perceived responsiveness and allowing more requests per GPU under real workloads.
Summary TLDR
Long-context LLMs make KV caches grow unpredictably, and the resulting CPU–GPU transfers spike per-token latency and break SLOs. OrbitFlow uses a lightweight integer linear programming (ILP) solver to pick per-request, per-layer KV placements, continuously reoptimizes plans during decoding, and adds two runtime mechanisms, Token-Deposit buffering and Pause-Resume, to mask violations and free memory. On ShareGPT-derived traces and LLaMA3 models, OrbitFlow improves TPOT and TBT SLO attainment by 62% and 66%, respectively, cuts P95 latency by 38%, and achieves up to 3.3× throughput versus existing offloading systems. Code and artifacts are available.
Problem Statement
Serving long-context LLMs produces KV caches that grow during decoding and fluctuate with request length and batching. Static, layer-uniform offloading cannot adapt to this token- and batch-level drift, so it causes excessive CPU-to-GPU transfers, stalls, and frequent token-level SLO violations for interactive services.
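To make "token-level SLO violation" concrete, here is a minimal sketch of how the two metrics could be computed from per-token emission timestamps. It assumes the common definitions (TPOT as the average time per output token after the first, TBT as the gap between consecutive tokens); the function names and the millisecond traces are hypothetical and not taken from OrbitFlow.

```python
from typing import List


def tbt_gaps(token_times_ms: List[int]) -> List[int]:
    """Gaps between consecutive output tokens (time between tokens, TBT)."""
    return [b - a for a, b in zip(token_times_ms, token_times_ms[1:])]


def tpot(token_times_ms: List[int]) -> float:
    """Average time per output token after the first one (TPOT)."""
    if len(token_times_ms) < 2:
        return 0.0
    span = token_times_ms[-1] - token_times_ms[0]
    return span / (len(token_times_ms) - 1)


def tbt_violations(token_times_ms: List[int], slo_ms: int) -> int:
    """Number of inter-token gaps that exceed the TBT SLO."""
    return sum(1 for gap in tbt_gaps(token_times_ms) if gap > slo_ms)


# Two hypothetical requests with identical TPOT but different smoothness.
smooth = [0, 50, 100, 150, 200]
spiky = [0, 20, 40, 60, 200]  # one long stall, e.g. a late KV transfer

print(tpot(smooth), tbt_violations(smooth, slo_ms=100))
print(tpot(spiky), tbt_violations(spiky, slo_ms=100))
```

Both traces have the same average pace, but only the second one contains a gap above the TBT threshold, which is why a per-token metric is tracked alongside the average.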
Main Contribution
OrbitFlow: a runtime system that chooses per-request, per-layer KV placements to minimize SLO violations under GPU memory limits.
An ILP-based Placement Planner that prunes the search space to distance-driven (evenly spaced) placements and runs one decode step ahead to hide solver cost.
Two runtime mechanisms: Token-Deposit (buffer tokens and release at SLO rate) and Pause-Resume (defer heavy requests) to preserve SLOs.
A practical implementation on vLLM with extensions for tensor-parallel multi-GPU serving and public artifacts on GitHub.
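The placement idea in the list above can be illustrated with a small sketch. It restricts each request to evenly spaced offloading patterns and picks the combination that fits a GPU budget with the least data offloaded. This is not OrbitFlow's planner: exhaustive search stands in for the ILP solver, and the cost model (offloaded megabytes as a proxy for transfer cost), the candidate distances, and all names are assumptions made for illustration.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple


def offloaded_layers(num_layers: int, distance: Optional[int]) -> List[int]:
    """Evenly spaced layer indices kept on CPU; None means nothing is offloaded."""
    if distance is None:
        return []
    return list(range(distance - 1, num_layers, distance))


def plan(
    kv_mb_per_layer: Sequence[int],
    num_layers: int,
    gpu_budget_mb: int,
    distances: Sequence[Optional[int]] = (None, 4, 2, 1),
) -> Tuple[Optional[tuple], Optional[int]]:
    """Pick one distance per request, minimizing offloaded MB within the budget.

    Exhaustive search over the pruned space; a real planner would hand the
    same decision variables to an ILP solver instead.
    """
    best_choice, best_cost = None, None
    for choice in product(distances, repeat=len(kv_mb_per_layer)):
        gpu_used = 0
        offloaded = 0
        for size, distance in zip(kv_mb_per_layer, choice):
            n_off = len(offloaded_layers(num_layers, distance))
            gpu_used += size * (num_layers - n_off)
            offloaded += size * n_off
        if gpu_used > gpu_budget_mb:
            continue  # does not fit in GPU memory
        if best_cost is None or offloaded < best_cost:
            best_choice, best_cost = choice, offloaded
    return best_choice, best_cost


# Two hypothetical requests: a light one (10 MB/layer) and a heavy one (30 MB/layer).
print(plan([10, 30], num_layers=8, gpu_budget_mb=200))
```

In this toy case the search keeps the light request fully on the GPU and offloads every second layer of the heavy one, a per-request decision that a layer-uniform policy cannot express.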
Key Findings
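Solver-driven, per-request placement substantially improves SLO attainment: with OrbitFlow, TPOT and TBT SLO attainment rise by 62% and 66%, respectively.
Tail latency and throughput both improve under OrbitFlow: P95 latency drops by 38% and throughput reaches up to 3.3× that of existing offloading systems.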
Runtime overhead of OrbitFlow is small.
Token-Deposit and Pause-Resume materially contribute to SLO gains.
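Token-Deposit is described as buffering tokens and releasing them at the SLO rate. A minimal sketch of that pacing rule follows, assuming each token is released no earlier than one SLO interval after the previous release; the function name and the trace are hypothetical, and the real mechanism may differ in detail.

```python
from typing import List


def paced_release(gen_times_ms: List[int], interval_ms: int) -> List[int]:
    """Release each token at max(generation time, previous release + interval)."""
    released: List[int] = []
    for t in gen_times_ms:
        if not released:
            released.append(t)
        else:
            released.append(max(t, released[-1] + interval_ms))
    return released


# Hypothetical trace: four fast tokens, then a 170 ms stall before the fifth.
generated = [0, 10, 20, 30, 200]
print(paced_release(generated, interval_ms=50))
```

The early burst is held back and spent during the stall, so the user-visible gaps in this trace are all 50 ms even though the raw generation gap reaches 170 ms.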
Results
TPOT SLO attainment
TBT SLO attainment
P95 latency
Throughput
Solver runtime overhead
Who Should Care
What To Try In 7 Days
Run OrbitFlow (repo available) on a staging vLLM instance with your longest prompts and compare TBT and TPOT against your current offloading setup.
Enable Token-Deposit buffering to smooth output bursts and measure perceived latency improvements.
Experiment with Pause-Resume: identify and defer heavy KV requests to raise overall throughput and lower tail latency.
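For the Pause-Resume experiment, a simple starting point is a greedy rule that defers the heaviest requests until the remaining KV footprint fits. The sketch below is an assumed policy for experimentation, not OrbitFlow's actual selection logic; the request IDs, sizes, and function name are hypothetical.

```python
from typing import Dict, List


def select_paused(kv_mb: Dict[str, int], budget_mb: int) -> List[str]:
    """Greedily pause the largest-KV requests until the rest fit in the budget."""
    paused: List[str] = []
    total = sum(kv_mb.values())
    by_size = sorted(kv_mb.items(), key=lambda item: item[1], reverse=True)
    for request_id, size in by_size:
        if total <= budget_mb:
            break
        paused.append(request_id)
        total -= size
    return paused


# Hypothetical batch: request "a" dominates the KV footprint.
active = {"a": 40, "b": 10, "c": 25}
print(select_paused(active, budget_mb=40))
```

Paused requests pay with a longer end-to-end completion time, so it is worth tracking how long each one stays deferred when tuning the budget.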
Optimization Features
Infra Optimization
- Multi-GPU tensor-parallel extension with broadcasted placements
- Solver run on CPU one step ahead to hide latency
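Running the solver one step ahead can be sketched with a background worker: while step t is decoding, the plan for step t+1 is computed on the CPU. This is an illustrative pattern using Python's standard library, not OrbitFlow's code; `solve` and `decode` are hypothetical callables supplied by the caller.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List


def run_steps(
    num_steps: int,
    solve: Callable[[int], Any],
    decode: Callable[[int, Any], None],
) -> List[Any]:
    """Decode num_steps steps, planning each next step in the background."""
    used_plans: List[Any] = []
    plan = solve(0)  # the first plan has to be computed up front
    with ThreadPoolExecutor(max_workers=1) as pool:
        for step in range(num_steps):
            next_plan = pool.submit(solve, step + 1)  # overlaps with decode
            decode(step, plan)
            used_plans.append(plan)
            # Blocks only if the solver is slower than the decode step;
            # that wait is where solver overhead would become visible.
            plan = next_plan.result()
    return used_plans


# Stand-in solver and decode step for demonstration.
plans = run_steps(3, solve=lambda s: f"plan-{s}", decode=lambda s, p: None)
print(plans)
```

The overlap only helps when the decode step spends its time outside the Python interpreter (for example in GPU kernels), which is the situation this pattern assumes.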
System Optimization
- GPU memory packing via fine-grained placements
- Asynchronous prefetching and bandwidth-sharing model
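One way to reason about asynchronous prefetching under shared bandwidth is a back-of-the-envelope stall estimate: a prefetch is hidden if it finishes within the compute time it overlaps with. The model below is an assumption for illustration (equal bandwidth split across concurrent transfers, hypothetical parameter names and units), not the model used by OrbitFlow.

```python
def prefetch_stall_ms(
    kv_mb: float,
    link_mb_per_ms: float,
    concurrent_transfers: int,
    compute_ms: float,
) -> float:
    """Estimated stall when a KV prefetch cannot be fully hidden by compute.

    Assumes the CPU-GPU link bandwidth is split evenly across transfers.
    """
    per_transfer_bw = link_mb_per_ms / concurrent_transfers
    transfer_ms = kv_mb / per_transfer_bw
    return max(0.0, transfer_ms - compute_ms)


# 64 MB over a 32 MB/ms link shared by two transfers takes 4 ms.
print(prefetch_stall_ms(64, 32, 2, compute_ms=3.0))  # exposed stall
print(prefetch_stall_ms(64, 32, 2, compute_ms=5.0))  # fully hidden
```

An estimate of this kind is only as good as the profiled bandwidth and compute times that feed it, which is why misprofiling shows up as extra transfers and stalls.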
Inference Optimization
- Per-request, per-layer KV offloading
- ILP-based placement planning
- Distance-driven search-space pruning
- Token-Deposit buffering (token-level smoothing)
- Pause-Resume request deferral
- Adaptive reconfiguration based on runtime profiling
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Pause-Resume defers some requests, increasing their end-to-end completion time by design.
- Solver restricts placements to evenly spaced (distance-driven) patterns and can miss a small fraction of strictly better layouts.
- Evaluation uses ShareGPT-derived synthetic traces; real production traces could reveal different batch dynamics.
- Solver can occasionally exceed one decode step in very volatile workloads, causing visible overhead in rare cases.
When Not To Use
- If GPU memory is abundant and offloading is unnecessary, static methods are simpler.
- If strict per-request fairness prohibits pausing or deferring any requests.
- When workloads are always short-context so KV growth is negligible.
Failure Modes
- Solver latency occasionally exceeds the step window and cannot be fully hidden, creating stalls.
- Misprofiled bandwidth or compute times lead to suboptimal placements and extra transfers.
- Pause-Resume choices may harm user experience for paused requests if not tuned.
Core Entities
Models
- LLaMA3-8B
- LLaMA3-70B
Metrics
- TPOT
- TBT
- TTFT
- P95
- P99
- Throughput
- E2E latency
Datasets
- ShareGPT-derived synthetic traces
Context Entities
Models
- Grouped-Query Attention (GQA) models (the LLaMA3 family)
Metrics
- TPOT and TBT SLO attainment used for evaluation
Datasets
- ShareGPT (sampled to synthesize the evaluation traces)

