MC-SF: memory-aware online batching that cuts LLM latency under KV-cache limits

February 10, 20258 min

Overview

Decision SnapshotNeeds Validation

The algorithm is polynomial-time and empirically effective on realistic traces, but relies on single-worker assumptions and output-length predictions; deployment needs integration with your inference stack and prediction service.

Citations1

Evidence Strength0.70

Confidence0.90

Risk Signals10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Patrick Jaillet, Jiashuo Jiang, Konstantina Mellou, Marco Molinaro, Chara Podimata, Zijie Zhou

Links

Abstract / PDF / Data

Why It Matters For Business

Better online scheduling of token decoding can cut average latency and GPU-hours: simulations show ~31% lower latency versus a memory-aware baseline, implying fewer GPUs and lower energy costs.

Who Should Care

Summary TLDR

This paper models LLM inference with growing KV-cache memory and proposes MC-SF, a fast online batching/scheduling rule that prioritizes in-progress requests and packs short predicted outputs while checking future memory peaks. Theory: no deterministic online alg has constant competitive ratio in adversarial arrivals, but MC-SF is O(1)-competitive under structured cases (identical prompt sizes and bounded prediction error). Empirics: on synthetic traces MC-SF has average latency ratio 1.005 (all-at-once) and 1.047 (online arrivals) vs hindsight-optimal; on 10k real conversations simulated for Llama2-70B, MC-SF averaged 32.11ms vs 46.47ms for a memory-aware benchmark (~31% lower). Caveats: 1)

Problem Statement

LLM decoding builds a KV cache that grows per generated token, causing GPU memory to rise during processing. That growth limits how many prompts/tokens you can batch. Schedulers must decide online (milliseconds) which requests to start so total latency stays low and memory never overflows. Existing heuristics lack theory and can either underutilize memory or cause costly cache clears.

Main Contribution

A compact mathematical model of online batching and scheduling that explicitly tracks per-request KV-cache growth over time.

A hindsight-optimal integer-program benchmark that computes minimum total end-to-end latency under memory constraints.

Key Findings

Deterministic online algorithms can perform arbitrarily worse in the worst-case.

Numberscompetitive ratio ≥ Ω(√n) (Theorem 4.1)

Practical UseDon’t expect a single deterministic scheduler to be robust to worst-case adversarial arrival patterns; design must leverage structure or predictions.

Evidence RefSection 4 (Theorem 4.1)

MC-SF is nearly optimal when all requests are known at start.

Numbersaverage latency ratio 1.005; exact optimal in 114/200 trials

Practical UseIf you can batch many arrivals ahead of time or accurately predict arrival sets, MC-SF gives almost offline-optimal latency.

Evidence RefSection 5.1 (Arrival Model 1)

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Latency ratio vs hindsight-optimal (all-at-once arrivals)avg 1.005; best 1.000; worst 1.074Hindsight optimal≈0.5% above optimal on averageSynthetic Arrival Model 1 (200 trials)MC-SF nearly matches offline optimum when full request list is knownSection 5.1
Latency ratio vs hindsight-optimal (online stochastic arrivals)avg 1.047; best 1.000; worst 1.227Hindsight optimal≈4.7% above optimal on averageSynthetic Arrival Model 2 (200 trials)Small gap from information asymmetry under online arrivalsSection 5.1

What To Try In 7 Days

Run Vidur or a small simulator with your trace and implement MC-SF batching logic to compare latency to your current scheduler.

Add a cheap output-length predictor and reserve a small protection margin (α≈0.1–0.25) to avoid cache overflows.

Tune α and clearing probability β on a holdout trace; the paper suggests α in [0.10,0.25] and β in [0.05,0.25] as good starts.

Optimization Features

Token Efficiency
prioritize in-progress requests to reduce tail latencypack more short responses per batch
Infra Optimization
single-worker GPU memory modeling (M)O(M^2) per-round scheduler complexity
System Optimization
reserve protection margin to avoid overflownon-preemptive scheduling with per-token memory bookkeeping
Inference Optimization
memory-constrained batching policyshortest-predicted-output-first packingfuture-peak memory feasibility checks

Reproducibility

Code AvailableNo
Data AvailableYes
Open Source StatusPartial
LicenseUnknown

Risks & Boundaries

Limitations

Single-worker model: multi-GPU/heterogeneous clusters are not covered.

Algorithm assumes availability of upper-bound output-length predictions or a protection margin.

When Not To Use

You have no reliable output-length predictions and cannot reserve protection memory.

A multi-GPU, heterogeneous deployment requires cross-worker matching not handled here.

Failure Modes

Underestimated output lengths lead to KV-cache overflow, causing clears and expensive recomputation.

Poorly tuned protection α causes either frequent evictions (α too small) or underutilization (α too large).

Core Entities

Models

Llama2-70B (simulated)

Metrics

Total end-to-end latency (TEL)Average end-to-end latencyCompetitive ratio (algorithm vs hindsight-optimal)Per-second throughput

Datasets

LMSYS-Chat-1M (Zheng et al. 2023) subset of 10,000 conversations

Benchmarks

Hindsight optimal (Integer Program)MC-Benchmark (memory-feasibility baseline)α-protection / α-protection β-clearing heuristicsvLLM/FCFS style baseline