Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
1
Why It Matters For Business
Better online scheduling of token decoding can cut average latency and GPU-hours: simulations show ~31% lower latency versus a memory-aware baseline, implying fewer GPUs and lower energy costs.
Summary TLDR
This paper models LLM inference with growing KV-cache memory and proposes MC-SF, a fast online batching/scheduling rule that prioritizes in-progress requests and packs short predicted outputs while checking future memory peaks. Theory: no deterministic online alg has constant competitive ratio in adversarial arrivals, but MC-SF is O(1)-competitive under structured cases (identical prompt sizes and bounded prediction error). Empirics: on synthetic traces MC-SF has average latency ratio 1.005 (all-at-once) and 1.047 (online arrivals) vs hindsight-optimal; on 10k real conversations simulated for Llama2-70B, MC-SF averaged 32.11ms vs 46.47ms for a memory-aware benchmark (~31% lower). Caveats: 1)
Problem Statement
LLM decoding builds a KV cache that grows per generated token, causing GPU memory to rise during processing. That growth limits how many prompts/tokens you can batch. Schedulers must decide online (milliseconds) which requests to start so total latency stays low and memory never overflows. Existing heuristics lack theory and can either underutilize memory or cause costly cache clears.
Main Contribution
A compact mathematical model of online batching and scheduling that explicitly tracks per-request KV-cache growth over time.
A hindsight-optimal integer-program benchmark that computes minimum total end-to-end latency under memory constraints.
A negative result: any deterministic online algorithm can have competitive ratio Ω(√n) under adversarial arrivals.
A practical polynomial-time online algorithm, Memory-Constrained Shortest-First (MC-SF), with O(M^2) per-round work and feasibility checks over future peak times.
Proof that MC-SF is O(1)-competitive when all prompts arrive together with identical input sizes and predictions ˜o_i within constant factor of true lengths.
Empirical evaluation: near-optimal performance on synthetic instances and substantial latency reduction on a 10k-subset of real conversational traces simulated for Llama2-70B.
Key Findings
Deterministic online algorithms can perform arbitrarily worse in the worst-case.
MC-SF is nearly optimal when all requests are known at start.
MC-SF reduces average latency vs memory-aware baselines on real conversational traces.
Results
Latency ratio vs hindsight-optimal (all-at-once arrivals)
Latency ratio vs hindsight-optimal (online stochastic arrivals)
Average end-to-end latency (real-trace simulation)
Latency growth slope under high demand
Robustness to prediction noise (with α=0.1 protection)
Who Should Care
What To Try In 7 Days
Run Vidur or a small simulator with your trace and implement MC-SF batching logic to compare latency to your current scheduler.
Add a cheap output-length predictor and reserve a small protection margin (α≈0.1–0.25) to avoid cache overflows.
Tune α and clearing probability β on a holdout trace; the paper suggests α in [0.10,0.25] and β in [0.05,0.25] as good starts.
Optimization Features
Token Efficiency
- prioritize in-progress requests to reduce tail latency
- pack more short responses per batch
Infra Optimization
- single-worker GPU memory modeling (M)
- O(M^2) per-round scheduler complexity
System Optimization
- reserve protection margin to avoid overflow
- non-preemptive scheduling with per-token memory bookkeeping
Inference Optimization
- memory-constrained batching policy
- shortest-predicted-output-first packing
- future-peak memory feasibility checks
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Single-worker model: multi-GPU/heterogeneous clusters are not covered.
- Algorithm assumes availability of upper-bound output-length predictions or a protection margin.
- Hindsight benchmark (IP) is computationally infeasible at large scale.
- Simulations use a model of Llama2-70B via Vidur; real deployment interactions (network, multi-tenant GPUs) may change results.
When Not To Use
- You have no reliable output-length predictions and cannot reserve protection memory.
- A multi-GPU, heterogeneous deployment requires cross-worker matching not handled here.
- Workloads are highly adversarial and you need worst-case deterministic guarantees (Ω(√n) lower bound applies).
Failure Modes
- Underestimated output lengths lead to KV-cache overflow, causing clears and expensive recomputation.
- Poorly tuned protection α causes either frequent evictions (α too small) or underutilization (α too large).
- Adversarial arrival patterns can make deterministic policies degrade with √n factor.
Core Entities
Models
- Llama2-70B (simulated)
Metrics
- Total end-to-end latency (TEL)
- Average end-to-end latency
- Competitive ratio (algorithm vs hindsight-optimal)
- Per-second throughput
Datasets
- LMSYS-Chat-1M (Zheng et al. 2023) subset of 10,000 conversations
Benchmarks
- Hindsight optimal (Integer Program)
- MC-Benchmark (memory-feasibility baseline)
- α-protection / α-protection β-clearing heuristics
- vLLM/FCFS style baseline

