OrbitFlow adaptively reconfigures per-request KV cache placements to meet token-level latency SLOs for long-context LLM serving

Overview

Decision Snapshot: Ready For Pilot

OrbitFlow builds on existing offloading ideas but adds solver-driven per-request placements and practical runtime policies; experiments on LLaMA3 and multi-GPU setups show consistent gains with low overhead, making it ready for staged deployment in services facing long-context workloads.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 80%

Novelty: 70%

Authors

Xinyue Ma, Heelim Hong, Taegeon Um, Jongseop Lee, Seoyeong Choy, Woo-Yeon Lee, Myeongjae Jeon

Why It Matters For Business

OrbitFlow reduces token-level latency violations and raises throughput for long-context LLM services, improving user-perceived responsiveness and allowing more requests per GPU under real workloads.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager

Summary TLDR

Long-context LLMs make KV caches grow unpredictably and cause CPU–GPU transfers that spike per-token latency and break SLOs. OrbitFlow uses a lightweight ILP solver to pick per-request, per-layer KV placements, continuously reoptimizes plans during decoding, and adds two runtime mechanisms, Token-Deposit buffering and Pause-Resume, to mask violations and free memory. On ShareGPT-derived traces and LLaMA3 models, OrbitFlow improves TPOT and TBT SLO attainment by 62% and 66%, respectively, cuts P95 latency by 38%, and achieves up to 3.3× the throughput of existing offloading systems. Code and artifacts are available.

Problem Statement

Serving long-context LLMs forces growing KV caches that fluctuate with request length and batching. Static, layer-uniform offloading cannot adapt to token- and batch-level drift, causing excessive CPU-to-GPU transfers, stalls, and frequent token-level SLO violations for interactive services.

Main Contribution

OrbitFlow: runtime system that chooses per-request, per-layer KV placements to minimize SLO violations under GPU memory limits.

ILP-based Placement Planner that prunes search space to distance-driven placements and runs one step ahead to hide solver cost.
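To make the pruned search space concrete, here is a minimal sketch of choosing one evenly spaced offload pattern per request under a GPU memory budget. It rests on several assumptions that are not taken from the paper: a "distance" d is read as "every d-th layer's KV lives on the CPU" (0 means nothing is offloaded), the cost to minimize is simply the volume of offloaded KV, and exhaustive search stands in for the ILP solver. The names `offloaded_layers` and `plan` are hypothetical.

```python
from itertools import product


def offloaded_layers(num_layers, distance):
    """Evenly spaced pattern: every `distance`-th layer is offloaded.

    distance == 0 means no layer is offloaded (assumed convention).
    """
    if distance == 0:
        return []
    return list(range(distance - 1, num_layers, distance))


def plan(requests, num_layers, gpu_budget, distances=(0, 4, 2, 1)):
    """Pick one distance per request by brute force.

    requests: dict of request name -> per-layer KV size (arbitrary units).
    Returns (assignment, offloaded_volume) for the feasible assignment
    with the least offloaded KV, or None if nothing fits the budget.
    """
    names = sorted(requests)
    best = None
    for combo in product(distances, repeat=len(names)):
        gpu_used = 0
        offloaded = 0
        for name, d in zip(names, combo):
            n_off = len(offloaded_layers(num_layers, d))
            gpu_used += (num_layers - n_off) * requests[name]
            offloaded += n_off * requests[name]
        if gpu_used <= gpu_budget and (best is None or offloaded < best[1]):
            best = (dict(zip(names, combo)), offloaded)
    return best


# Two requests on an 8-layer model; keeping everything resident needs 32 units.
result = plan({"a": 1, "b": 3}, num_layers=8, gpu_budget=24)
```

Because each request only chooses among a handful of distances, the search space stays small compared with choosing a placement for every layer independently, which is the point of the pruning. A real planner would replace the offloaded-volume objective with a latency model.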

Key Findings

Solver-driven, per-request placement substantially improves SLO attainment.

Numbers: TPOT +62% and TBT +66% SLO attainment (evaluated traces)

Practical Use: Replace uniform layer-offload policies with adaptive per-request placement to cut token-level SLO violations in long-context workloads.

Evidence Ref: Abstract; Sec.5.2; Fig.8

Tail latency and throughput both improve under OrbitFlow.

Numbers: P95 latency −38%; throughput up to 3.3× (vs. best baselines)

Practical Use: Deploying OrbitFlow can both reduce tail latency and increase completed requests per minute, improving UX and server capacity.

Evidence Ref: Abstract; Sec.5.2; Fig.8–9

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
TPOT SLO attainment	62% higher (on evaluated traces)	existing offloading methods	—	ShareGPT-derived traces, default config	Abstract; Sec.5.2; Fig.8	Sec.5.2
TBT SLO attainment	66% higher (on evaluated traces)	existing offloading methods	—	ShareGPT-derived traces, default config	Abstract; Sec.5.2; Fig.8	Sec.5.2
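For readers who want to measure the same two metrics on their own traces, the sketch below shows one plausible way to compute attainment from per-token timestamps. The definitions are assumptions, since this summary does not spell them out: TPOT attainment is taken as the fraction of requests whose mean inter-token time meets the SLO, and TBT attainment as the fraction of individual inter-token gaps that meet it. The paper's exact definitions may differ, and the function names are hypothetical.

```python
def token_gaps(timestamps):
    """Gaps between consecutive output-token timestamps of one request."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def tpot_attainment(requests, slo):
    """Fraction of requests whose mean gap is within the SLO.

    Assumes every request has at least two output tokens.
    """
    met = 0
    for timestamps in requests:
        gaps = token_gaps(timestamps)
        if sum(gaps) / len(gaps) <= slo:
            met += 1
    return met / len(requests)


def tbt_attainment(requests, slo):
    """Fraction of all individual gaps that are within the SLO."""
    gaps = [g for timestamps in requests for g in token_gaps(timestamps)]
    return sum(1 for g in gaps if g <= slo) / len(gaps)


# Timestamps in milliseconds; the second request has one long stall.
trace = [[0, 50, 100, 150], [0, 40, 300, 340]]
tpot = tpot_attainment(trace, slo=100)
tbt = tbt_attainment(trace, slo=100)
```

The two metrics can diverge: a single long stall fails only one gap for TBT, but it can pull a whole request over the TPOT threshold.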

What To Try In 7 Days

Run OrbitFlow (repo available) on a staging vLLM instance with your longest prompts to compare TBT/TPOT vs your current offloading.

Enable Token-Deposit buffering to smooth output bursts and measure perceived latency improvements.

Experiment with Pause-Resume: identify and defer heavy KV requests to raise overall throughput and lower tail latency.
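The Pause-Resume experiment above can be prototyped with a simple selection rule before touching the serving stack. This is a minimal sketch that assumes a "heaviest KV first" policy and a single scalar memory budget; how OrbitFlow actually chooses which requests to pause and when to resume them is not described in this summary, and `select_paused` is a hypothetical helper.

```python
def select_paused(kv_sizes, budget):
    """Pause the heaviest requests until the remaining KV demand fits.

    kv_sizes: dict of request name -> current KV footprint.
    Returns (paused names in pause order, sorted names left active).
    """
    active = dict(kv_sizes)
    paused = []
    # Visit requests from largest to smallest KV footprint.
    for name in sorted(kv_sizes, key=kv_sizes.get, reverse=True):
        if sum(active.values()) <= budget:
            break
        paused.append(name)
        del active[name]
    return paused, sorted(active)


paused, active = select_paused({"r1": 10, "r2": 40, "r3": 25}, budget=40)
```

Paused requests would be resumed once enough memory frees up. Logging how long each one waits is a useful check against the end-to-end latency cost noted under Limitations.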

Optimization Features

Infra Optimization

Multi-GPU tensor-parallel extension with broadcasted placements

Solver run on CPU one step ahead to hide latency

System Optimization

GPU memory packing via fine-grained placements

Asynchronous prefetching and bandwidth-sharing model

Inference Optimization

Per-request, per-layer KV offloading

ILP-based placement planning

Distance-driven search-space pruning

Token-Deposit buffering (token-level smoothing)

Pause-Resume request deferral

Adaptive reconfiguration based on runtime profiling
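The token-level smoothing idea behind Token-Deposit buffering can be illustrated with a generic pacing model. This is a sketch of output pacing in general, not OrbitFlow's actual mechanism: tokens generated faster than a target interval are held in a buffer and released at that interval, so the buffered surplus can cover a later stall. The function name and the fixed `interval` parameter are assumptions.

```python
def paced_release(gen_times, interval):
    """Compute when each token is shown to the user.

    A token is released no earlier than it was generated and no sooner
    than `interval` after the previously released token.
    """
    released = []
    for t in gen_times:
        earliest = released[-1] + interval if released else t
        released.append(max(t, earliest))
    return released


# Milliseconds: four fast tokens, then a 170 ms stall before the fifth.
generated = [0, 10, 20, 30, 200, 210]
shown = paced_release(generated, interval=50)
# shown == [0, 50, 100, 150, 200, 250]
```

In this example the raw stream has a 170 ms gap, while the paced stream shows a steady 50 ms between tokens, because the early tokens were deposited rather than emitted at once. The trade-off is that early tokens are deliberately delayed.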

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/omnia-postech/OrbitFlow

Data URLs

https://github.com/omnia-postech/OrbitFlow (traces and scripts based on ShareGPT-derived synthetic workloads)

Risks & Boundaries

Limitations

Pause-Resume defers some requests, increasing their end-to-end completion time by design.

Solver restricts placements to evenly spaced (distance-driven) patterns and can miss a small fraction of strictly better layouts.

When Not To Use

If GPU memory is abundant and offloading is unnecessary, static methods are simpler.

If strict per-request fairness prohibits pausing or deferring any requests.

Failure Modes

Solver latency occasionally exceeds the step window and cannot be fully hidden, creating stalls.

Misprofiled bandwidth or compute times lead to suboptimal placements and extra transfers.

Core Entities

Models

LLaMA3-8B, LLaMA3-70B

Metrics

TPOT, TBT, TTFT, P95, P99, Throughput, E2E latency

Datasets

ShareGPT-derived synthetic traces

Context Entities

Models

Grouped-Query Attention (GQA) models referenced (LLaMA3 family)

Metrics

TPOT and TBT SLO attainment used for evaluation

Datasets

ShareGPT (sampling used to synthesize traces)
