Overview
The algorithm is polynomial-time and empirically effective on realistic traces, but relies on single-worker assumptions and output-length predictions; deployment needs integration with your inference stack and prediction service.
Citations1
Evidence Strength0.70
Confidence0.90
Risk Signals10
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 5/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
Better online scheduling of token decoding can cut average latency and GPU-hours: simulations show ~31% lower latency versus a memory-aware baseline, implying fewer GPUs and lower energy costs.
Who Should Care
Summary TLDR
This paper models LLM inference with growing KV-cache memory and proposes MC-SF, a fast online batching/scheduling rule that prioritizes in-progress requests and packs short predicted outputs while checking future memory peaks. Theory: no deterministic online alg has constant competitive ratio in adversarial arrivals, but MC-SF is O(1)-competitive under structured cases (identical prompt sizes and bounded prediction error). Empirics: on synthetic traces MC-SF has average latency ratio 1.005 (all-at-once) and 1.047 (online arrivals) vs hindsight-optimal; on 10k real conversations simulated for Llama2-70B, MC-SF averaged 32.11ms vs 46.47ms for a memory-aware benchmark (~31% lower). Caveats: 1)
Problem Statement
LLM decoding builds a KV cache that grows per generated token, causing GPU memory to rise during processing. That growth limits how many prompts/tokens you can batch. Schedulers must decide online (milliseconds) which requests to start so total latency stays low and memory never overflows. Existing heuristics lack theory and can either underutilize memory or cause costly cache clears.
Main Contribution
A compact mathematical model of online batching and scheduling that explicitly tracks per-request KV-cache growth over time.
A hindsight-optimal integer-program benchmark that computes minimum total end-to-end latency under memory constraints.
Key Findings
Deterministic online algorithms can perform arbitrarily worse in the worst-case.
MC-SF is nearly optimal when all requests are known at start.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Latency ratio vs hindsight-optimal (all-at-once arrivals) | avg 1.005; best 1.000; worst 1.074 | Hindsight optimal | ≈0.5% above optimal on average | Synthetic Arrival Model 1 (200 trials) | MC-SF nearly matches offline optimum when full request list is known | Section 5.1 |
| Latency ratio vs hindsight-optimal (online stochastic arrivals) | avg 1.047; best 1.000; worst 1.227 | Hindsight optimal | ≈4.7% above optimal on average | Synthetic Arrival Model 2 (200 trials) | Small gap from information asymmetry under online arrivals | Section 5.1 |
What To Try In 7 Days
Run Vidur or a small simulator with your trace and implement MC-SF batching logic to compare latency to your current scheduler.
Add a cheap output-length predictor and reserve a small protection margin (α≈0.1–0.25) to avoid cache overflows.
Tune α and clearing probability β on a holdout trace; the paper suggests α in [0.10,0.25] and β in [0.05,0.25] as good starts.
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Single-worker model: multi-GPU/heterogeneous clusters are not covered.
Algorithm assumes availability of upper-bound output-length predictions or a protection margin.
When Not To Use
You have no reliable output-length predictions and cannot reserve protection memory.
A multi-GPU, heterogeneous deployment requires cross-worker matching not handled here.
Failure Modes
Underestimated output lengths lead to KV-cache overflow, causing clears and expensive recomputation.
Poorly tuned protection α causes either frequent evictions (α too small) or underutilization (α too large).

