MC-SF: memory-aware online batching that cuts LLM latency under KV-cache limits

February 10, 20258 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

1

Authors

Patrick Jaillet, Jiashuo Jiang, Konstantina Mellou, Marco Molinaro, Chara Podimata, Zijie Zhou

Links

Abstract / PDF

Why It Matters For Business

Better online scheduling of token decoding can cut average latency and GPU-hours: simulations show ~31% lower latency versus a memory-aware baseline, implying fewer GPUs and lower energy costs.

Summary TLDR

This paper models LLM inference with growing KV-cache memory and proposes MC-SF, a fast online batching/scheduling rule that prioritizes in-progress requests and packs short predicted outputs while checking future memory peaks. Theory: no deterministic online alg has constant competitive ratio in adversarial arrivals, but MC-SF is O(1)-competitive under structured cases (identical prompt sizes and bounded prediction error). Empirics: on synthetic traces MC-SF has average latency ratio 1.005 (all-at-once) and 1.047 (online arrivals) vs hindsight-optimal; on 10k real conversations simulated for Llama2-70B, MC-SF averaged 32.11ms vs 46.47ms for a memory-aware benchmark (~31% lower). Caveats: 1)

Problem Statement

LLM decoding builds a KV cache that grows per generated token, causing GPU memory to rise during processing. That growth limits how many prompts/tokens you can batch. Schedulers must decide online (milliseconds) which requests to start so total latency stays low and memory never overflows. Existing heuristics lack theory and can either underutilize memory or cause costly cache clears.

Main Contribution

A compact mathematical model of online batching and scheduling that explicitly tracks per-request KV-cache growth over time.

A hindsight-optimal integer-program benchmark that computes minimum total end-to-end latency under memory constraints.

A negative result: any deterministic online algorithm can have competitive ratio Ω(√n) under adversarial arrivals.

A practical polynomial-time online algorithm, Memory-Constrained Shortest-First (MC-SF), with O(M^2) per-round work and feasibility checks over future peak times.

Proof that MC-SF is O(1)-competitive when all prompts arrive together with identical input sizes and predictions ˜o_i within constant factor of true lengths.

Empirical evaluation: near-optimal performance on synthetic instances and substantial latency reduction on a 10k-subset of real conversational traces simulated for Llama2-70B.

Key Findings

Deterministic online algorithms can perform arbitrarily worse in the worst-case.

Numberscompetitive ratio ≥ Ω(√n) (Theorem 4.1)

MC-SF is nearly optimal when all requests are known at start.

Numbersaverage latency ratio 1.005; exact optimal in 114/200 trials

MC-SF reduces average latency vs memory-aware baselines on real conversational traces.

Numbersavg latency 32.112 vs 46.472 (MC-Benchmark) -> ~31% lower

Results

Latency ratio vs hindsight-optimal (all-at-once arrivals)

Valueavg 1.005; best 1.000; worst 1.074

BaselineHindsight optimal

Latency ratio vs hindsight-optimal (online stochastic arrivals)

Valueavg 1.047; best 1.000; worst 1.227

BaselineHindsight optimal

Average end-to-end latency (real-trace simulation)

ValueMC-SF 32.112 vs MC-Benchmark 46.472 (units in paper table)

BaselineMC-Benchmark

Latency growth slope under high demand

ValueMC-SF slope ≈ 1/6 vs best baseline ≈ 1/2

Baselinebest-performing benchmarks

Robustness to prediction noise (with α=0.1 protection)

ValueMC-SF remains below FCFS baseline even at ϵ=0.8 prediction error

BaselineFCFS / MC-Benchmark

Who Should Care

What To Try In 7 Days

Run Vidur or a small simulator with your trace and implement MC-SF batching logic to compare latency to your current scheduler.

Add a cheap output-length predictor and reserve a small protection margin (α≈0.1–0.25) to avoid cache overflows.

Tune α and clearing probability β on a holdout trace; the paper suggests α in [0.10,0.25] and β in [0.05,0.25] as good starts.

Optimization Features

Token Efficiency

  • prioritize in-progress requests to reduce tail latency
  • pack more short responses per batch

Infra Optimization

  • single-worker GPU memory modeling (M)
  • O(M^2) per-round scheduler complexity

System Optimization

  • reserve protection margin to avoid overflow
  • non-preemptive scheduling with per-token memory bookkeeping

Inference Optimization

  • memory-constrained batching policy
  • shortest-predicted-output-first packing
  • future-peak memory feasibility checks

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Single-worker model: multi-GPU/heterogeneous clusters are not covered.
  • Algorithm assumes availability of upper-bound output-length predictions or a protection margin.
  • Hindsight benchmark (IP) is computationally infeasible at large scale.
  • Simulations use a model of Llama2-70B via Vidur; real deployment interactions (network, multi-tenant GPUs) may change results.

When Not To Use

  • You have no reliable output-length predictions and cannot reserve protection memory.
  • A multi-GPU, heterogeneous deployment requires cross-worker matching not handled here.
  • Workloads are highly adversarial and you need worst-case deterministic guarantees (Ω(√n) lower bound applies).

Failure Modes

  • Underestimated output lengths lead to KV-cache overflow, causing clears and expensive recomputation.
  • Poorly tuned protection α causes either frequent evictions (α too small) or underutilization (α too large).
  • Adversarial arrival patterns can make deterministic policies degrade with √n factor.

Core Entities

Models

  • Llama2-70B (simulated)

Metrics

  • Total end-to-end latency (TEL)
  • Average end-to-end latency
  • Competitive ratio (algorithm vs hindsight-optimal)
  • Per-second throughput

Datasets

  • LMSYS-Chat-1M (Zheng et al. 2023) subset of 10,000 conversations

Benchmarks

  • Hindsight optimal (Integer Program)
  • MC-Benchmark (memory-feasibility baseline)
  • α-protection / α-protection β-clearing heuristics
  • vLLM/FCFS style baseline