MC-SF: memory-aware online batching that cuts LLM latency under KV-cache limits

Overview

Decision SnapshotNeeds Validation

The algorithm is polynomial-time and empirically effective on realistic traces, but relies on single-worker assumptions and output-length predictions; deployment needs integration with your inference stack and prediction service.

Citations1

Evidence Strength0.70

Confidence0.90

Risk Signals10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Patrick Jaillet, Jiashuo Jiang, Konstantina Mellou, Marco Molinaro, Chara Podimata, Zijie Zhou

Links

Abstract / PDF / Data

Why It Matters For Business

Better online scheduling of token decoding can cut average latency and GPU-hours: simulations show ~31% lower latency versus a memory-aware baseline, implying fewer GPUs and lower energy costs.

Who Should Care

CTO Engineering Lead ML Engineer Product Manager Data Scientist

Summary TLDR

This paper models LLM inference with growing KV-cache memory and proposes MC-SF, a fast online batching/scheduling rule that prioritizes in-progress requests and packs short predicted outputs while checking future memory peaks. Theory: no deterministic online alg has constant competitive ratio in adversarial arrivals, but MC-SF is O(1)-competitive under structured cases (identical prompt sizes and bounded prediction error). Empirics: on synthetic traces MC-SF has average latency ratio 1.005 (all-at-once) and 1.047 (online arrivals) vs hindsight-optimal; on 10k real conversations simulated for Llama2-70B, MC-SF averaged 32.11ms vs 46.47ms for a memory-aware benchmark (~31% lower). Caveats: 1)

Problem Statement

LLM decoding builds a KV cache that grows per generated token, causing GPU memory to rise during processing. That growth limits how many prompts/tokens you can batch. Schedulers must decide online (milliseconds) which requests to start so total latency stays low and memory never overflows. Existing heuristics lack theory and can either underutilize memory or cause costly cache clears.

Main Contribution

A compact mathematical model of online batching and scheduling that explicitly tracks per-request KV-cache growth over time.

A hindsight-optimal integer-program benchmark that computes minimum total end-to-end latency under memory constraints.

Key Findings

Deterministic online algorithms can perform arbitrarily worse in the worst-case.

Numberscompetitive ratio ≥ Ω(√n) (Theorem 4.1)

Practical UseDon’t expect a single deterministic scheduler to be robust to worst-case adversarial arrival patterns; design must leverage structure or predictions.

Evidence RefSection 4 (Theorem 4.1)

MC-SF is nearly optimal when all requests are known at start.

Numbersaverage latency ratio 1.005; exact optimal in 114/200 trials

Practical UseIf you can batch many arrivals ahead of time or accurately predict arrival sets, MC-SF gives almost offline-optimal latency.

Evidence RefSection 5.1 (Arrival Model 1)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Latency ratio vs hindsight-optimal (all-at-once arrivals)	avg 1.005; best 1.000; worst 1.074	Hindsight optimal	≈0.5% above optimal on average	Synthetic Arrival Model 1 (200 trials)	MC-SF nearly matches offline optimum when full request list is known	Section 5.1
Latency ratio vs hindsight-optimal (online stochastic arrivals)	avg 1.047; best 1.000; worst 1.227	Hindsight optimal	≈4.7% above optimal on average	Synthetic Arrival Model 2 (200 trials)	Small gap from information asymmetry under online arrivals	Section 5.1

What To Try In 7 Days

Run Vidur or a small simulator with your trace and implement MC-SF batching logic to compare latency to your current scheduler.

Add a cheap output-length predictor and reserve a small protection margin (α≈0.1–0.25) to avoid cache overflows.

Tune α and clearing probability β on a holdout trace; the paper suggests α in [0.10,0.25] and β in [0.05,0.25] as good starts.

Optimization Features

Token Efficiency

prioritize in-progress requests to reduce tail latencypack more short responses per batch

Infra Optimization

single-worker GPU memory modeling (M)O(M^2) per-round scheduler complexity

System Optimization

reserve protection margin to avoid overflownon-preemptive scheduling with per-token memory bookkeeping

Inference Optimization

memory-constrained batching policyshortest-predicted-output-first packingfuture-peak memory feasibility checks

Reproducibility

Code AvailableNo

Data AvailableYes

Open Source StatusPartial

LicenseUnknown

Data URLs

https://huggingface.co/datasets/lmsys/lmsys-chat-1m

Risks & Boundaries

Limitations

Single-worker model: multi-GPU/heterogeneous clusters are not covered.

Algorithm assumes availability of upper-bound output-length predictions or a protection margin.

When Not To Use

You have no reliable output-length predictions and cannot reserve protection memory.

A multi-GPU, heterogeneous deployment requires cross-worker matching not handled here.

Failure Modes

Underestimated output lengths lead to KV-cache overflow, causing clears and expensive recomputation.

Poorly tuned protection α causes either frequent evictions (α too small) or underutilization (α too large).

Core Entities

Models

Llama2-70B (simulated)

Metrics

Total end-to-end latency (TEL)Average end-to-end latencyCompetitive ratio (algorithm vs hindsight-optimal)Per-second throughput

Datasets

LMSYS-Chat-1M (Zheng et al. 2023) subset of 10,000 conversations

Benchmarks

Hindsight optimal (Integer Program)MC-Benchmark (memory-feasibility baseline)α-protection / α-protection β-clearing heuristicsvLLM/FCFS style baseline

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Deterministic online algorithms can perform arbitrarily worse in the worst-case.

MC-SF is nearly optimal when all requests are known at start.

Results

What To Try In 7 Days

Optimization Features

Reproducibility

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

Skip 25–30% of expensive FFN blocks to speed decoding while keeping knowledge accuracy

Key finding

KV-CoRE: an SVD-based tool and benchmark that measures how compressible LLM KV-caches are, per layer and per dataset.

Key finding

Share the common KV cache across LoRA-adapted agents and keep tiny low-rank adapters to cut memory and speed up multi-agent inference.

Key finding

KV-cache compression breaks attention routing: reachability, a 90% safety cliff, and two failure modes

Key finding

Use per-token unstructured pruning + a bitmap sparse kernel to cut KV cache to ~45% size and speed decoding up to 2.23×

Key finding