Overview
OrbitFlow builds on existing offloading ideas but adds solver-driven per-request placements and practical runtime policies; experiments on LLaMA3 and multi-GPU setups show consistent gains with low overhead, making it ready for staged deployment in services facing long-context workloads.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 0/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 80%
Novelty: 70%
Why It Matters For Business
OrbitFlow reduces token-level latency violations and raises throughput for long-context LLM services, improving user-perceived responsiveness and allowing more requests per GPU under real workloads.
Who Should Care
Summary TLDR
Long-context LLMs make KV caches grow unpredictably and cause CPU–GPU transfers that spike per-token latency and break SLOs. OrbitFlow uses a lightweight ILP solver to pick per-request, per-layer KV placements, continuously reoptimizes plans during decoding, and adds two runtime mechanisms, Token-Deposit buffering and Pause-Resume, to mask violations and free memory. On ShareGPT-derived traces and LLaMA3 models, OrbitFlow improves TPOT (time per output token) and TBT (time between tokens) SLO attainment by 62% and 66%, respectively, cuts P95 latency by 38%, and achieves up to 3.3× throughput versus existing offloading systems. Code and artifacts are available.
Problem Statement
Serving long-context LLMs forces growing KV caches that fluctuate with request length and batching. Static, layer-uniform offloading cannot adapt to token- and batch-level drift, causing excessive CPU-to-GPU transfers, stalls, and frequent token-level SLO violations for interactive services.
Main Contribution
OrbitFlow: a runtime system that chooses per-request, per-layer KV placements to minimize SLO violations under GPU memory limits.
An ILP-based Placement Planner that prunes the search space to distance-driven (evenly spaced) placements and runs one step ahead to hide solver cost.
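To make the idea of distance-driven placement concrete, here is a minimal sketch, not the paper's planner. It assumes that a "distance" d means every d-th layer of a request keeps its KV cache on the CPU, and it replaces the ILP solver with a brute-force search over a small set of distances. The function names, the distance set, and the cost proxy (offloaded KV units standing in for transfer latency) are all hypothetical.

```python
from itertools import product


def offloaded_layers(num_layers: int, distance: int) -> int:
    """Layers kept on CPU when every `distance`-th layer is offloaded.

    distance == 0 means the request is fully GPU-resident (assumed convention).
    """
    return 0 if distance == 0 else num_layers // distance


def plan(kv_per_layer, num_layers, gpu_budget, distances=(0, 1, 2, 4, 8)):
    """Pick one distance per request by exhaustive search.

    kv_per_layer: per-layer KV size of each request (arbitrary units).
    Returns (offloaded_units, distances_per_request), or None if no
    combination fits within gpu_budget. Offloaded units act as a crude
    stand-in for CPU-to-GPU transfer cost.
    """
    best = None
    for choice in product(distances, repeat=len(kv_per_layer)):
        gpu_use = 0
        transfer = 0
        for size, d in zip(kv_per_layer, choice):
            off = offloaded_layers(num_layers, d)
            gpu_use += size * (num_layers - off)
            transfer += size * off
        if gpu_use <= gpu_budget and (best is None or transfer < best[0]):
            best = (transfer, choice)
    return best


if __name__ == "__main__":
    # Two requests, 8 layers each, GPU budget of 24 units.
    print(plan([1, 3], num_layers=8, gpu_budget=24))
```

Restricting each request to a single distance keeps the search space at |distances|^requests instead of 2^(layers × requests), which is the kind of pruning that makes a per-step solver plausible; a real planner would use a profiled latency model rather than this proxy.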
Key Findings
Solver-driven, per-request placement substantially improves SLO attainment.
Tail latency and throughput both improve under OrbitFlow.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| TPOT SLO attainment | 62% higher (on evaluated traces) | existing offloading methods | — | ShareGPT-derived traces, default config | Abstract; Sec.5.2; Fig.8 | Sec.5.2 |
| TBT SLO attainment | 66% higher (on evaluated traces) | existing offloading methods | — | ShareGPT-derived traces, default config | Abstract; Sec.5.2; Fig.8 | Sec.5.2 |
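The attainment metrics in the table can be computed from per-token emission timestamps. The sketch below assumes common definitions that the summary itself does not spell out: TPOT is the mean time per output token after the first, TPOT attainment is the share of requests whose TPOT meets the SLO, and TBT attainment is the share of individual inter-token gaps that meet it. The function names and the use of a single threshold for both metrics are illustrative choices.

```python
def tpot(timestamps):
    """Mean time per output token after the first token."""
    if len(timestamps) < 2:
        return 0.0
    return (timestamps[-1] - timestamps[0]) / (len(timestamps) - 1)


def tbt_gaps(timestamps):
    """Individual gaps between consecutive output tokens."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]


def slo_attainment(requests, slo):
    """Return (tpot_attainment, tbt_attainment) as fractions in [0, 1].

    requests: list of per-request token timestamp lists (same time unit as slo).
    """
    tpot_ok = sum(tpot(ts) <= slo for ts in requests)
    gaps = [g for ts in requests for g in tbt_gaps(ts)]
    tbt_ok = sum(g <= slo for g in gaps)
    tbt_frac = tbt_ok / len(gaps) if gaps else 1.0
    return tpot_ok / len(requests), tbt_frac


if __name__ == "__main__":
    # Timestamps in milliseconds; the second request stalls for 400 ms once.
    traces = [[0, 100, 200, 300], [0, 100, 500, 600]]
    print(slo_attainment(traces, slo=150))
```

In this example the single 400 ms stall fails one of six gaps but pushes the whole second request over its TPOT target, which shows why the two metrics can move differently.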
What To Try In 7 Days
Run OrbitFlow (repo available) on a staging vLLM instance with your longest prompts to compare TBT/TPOT vs your current offloading.
Enable Token-Deposit buffering to smooth output bursts and measure perceived latency improvements.
Experiment with Pause-Resume: identify and defer heavy KV requests to raise overall throughput and lower tail latency.
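To see what Token-Deposit buffering could do to perceived latency before running a full experiment, a small simulation helps. The sketch below is one plausible reading of the mechanism, not the OrbitFlow implementation: tokens generated faster than a target pace are held back and released at that pace, so a later generation stall is absorbed by the banked tokens. The function names and the fixed `pace` parameter are assumptions.

```python
def paced_delivery(gen_times, pace):
    """Delivery time of each token when output is released no faster than `pace`.

    A token is delivered at the later of its generation time and the
    previous delivery time plus `pace`.
    """
    delivered = []
    for t in gen_times:
        earliest = delivered[-1] + pace if delivered else t
        delivered.append(max(t, earliest))
    return delivered


def max_gap(times):
    """Largest gap between consecutive tokens (0 if fewer than two tokens)."""
    return max((b - a for a, b in zip(times, times[1:])), default=0)


if __name__ == "__main__":
    # Four fast tokens, then a 170 ms stall (times in milliseconds).
    generated = [0, 10, 20, 30, 200, 210]
    delivered = paced_delivery(generated, pace=50)
    print(delivered)                              # [0, 50, 100, 150, 200, 250]
    print(max_gap(generated), max_gap(delivered)) # 170 50
```

The trade-off is visible in the output: the worst user-visible gap drops from 170 ms to 50 ms, while the last token arrives 40 ms later than it was generated.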
Optimization Features
Infra Optimization
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Pause-Resume defers some requests, increasing their end-to-end completion time by design.
Solver restricts placements to evenly spaced (distance-driven) patterns and can miss a small fraction of strictly better layouts.
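The Pause-Resume limitation above can be illustrated with a small selection routine. This is a hypothetical sketch: it assumes pausing is triggered when the combined KV footprint exceeds a GPU budget and that the heaviest requests are deferred first, which matches the "defer heavy KV requests" advice but is not confirmed as the system's actual policy. The function name and the budget trigger are assumptions.

```python
def select_paused(kv_sizes, gpu_budget):
    """Pause the heaviest requests until the remaining ones fit the budget.

    kv_sizes: dict mapping request id -> KV footprint (arbitrary units).
    Returns (active_ids, paused_ids); paused requests would be resumed later,
    which is what lengthens their end-to-end completion time.
    """
    active = dict(kv_sizes)
    paused = []
    for rid in sorted(kv_sizes, key=kv_sizes.get, reverse=True):
        if sum(active.values()) <= gpu_budget:
            break
        paused.append(rid)
        del active[rid]
    return sorted(active), paused


if __name__ == "__main__":
    print(select_paused({"a": 10, "b": 40, "c": 25}, gpu_budget=40))
    # (['a', 'c'], ['b'])
```

Deferring the single heaviest request lets the other two proceed, which is the throughput benefit; request "b" pays for it in completion time.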
When Not To Use
If GPU memory is abundant and offloading is unnecessary, static methods are simpler.
If strict per-request fairness prohibits pausing or deferring any request, Pause-Resume is not an option.
Failure Modes
Solver latency occasionally exceeds the step window and cannot be fully hidden, creating stalls.
Misprofiled bandwidth or compute times lead to suboptimal placements and extra transfers.

