Harvest millisecond GPU idle cycles by slicing work into tokens, layers, and tiny KV checkpoints.

October 2, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

Evaluated on real hardware (H100 DGX) with three large models and both synthetic and real workloads. The implementation is built on vLLM and shows consistent improvements. No public code is linked, which limits reproducibility.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Yifan Qiao, Shu Anzai, Shan Yu, Haoran Ma, Shuo Yang, Yang Wang, Miryung Kim, Yongji Wu, Yang Zhou, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica, Harry Xu

Links

Abstract / PDF

Why It Matters For Business

You can run batch jobs (benchmarks, analytics) on the same expensive GPUs used for live LLM inference without degrading customer-facing latency. That turns idle capacity into usable throughput and reduces waste from overprovisioning.

Summary TLDR

ConServe is a serving system that co-runs latency-sensitive online LLM requests with latency-tolerant offline batches. It manages work at token, layer, and per-token KV-cache granularity to reclaim millisecond GPU idle cycles. On H100 hardware with Llama-3.1 and Qwen, ConServe keeps online P99 latency near an online-only baseline while increasing offline throughput by roughly 2–3× on evaluated workloads.

Problem Statement

GPU clusters for LLM serving are often idle because online traffic is bursty. Existing co-serving and preemption approaches operate at coarse granularity (per request or per iteration), so they either harm online tail latency or waste offline throughput. The paper asks: can those idle cycles be harvested without violating strict online SLOs?

Main Contribution

SLO-aware token-level scheduler that uses a profiler-based latency model to decide how many offline tokens to add without breaking online TBT/TTFT SLOs.

Layer-wise (sub-iteration) preemption implemented with cheap safepoints between transformer layers to preempt offline work within milliseconds.
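The layer-wise preemption idea can be illustrated with a small sketch. This is a toy model under stated assumptions, not the paper's implementation: the layers are plain Python callables, the preemption flag is a `threading.Event`, and `run_with_safepoints` is a hypothetical helper name. It shows only the control flow: check a cheap flag between layers, stop if it is set, and remember which layer to resume from.

```python
import threading
from typing import Callable, List, Tuple

Layer = Callable[[float], float]  # stand-in for one transformer layer


def run_with_safepoints(
    layers: List[Layer],
    x: float,
    preempt: threading.Event,
    start_layer: int = 0,
) -> Tuple[float, int]:
    """Run layers[start_layer:], stopping early if `preempt` is set.

    Returns (activation, next_layer). next_layer == len(layers) means the
    iteration finished; a smaller value is the layer to resume from.
    """
    for i in range(start_layer, len(layers)):
        # Safepoint: one cheap flag read before each layer.
        if preempt.is_set():
            return x, i
        x = layers[i](x)
    return x, len(layers)


if __name__ == "__main__":
    flag = threading.Event()

    def layer_during_which_request_arrives(v: float) -> float:
        flag.set()  # simulate an online request arriving mid-iteration
        return v + 1

    add_one = lambda v: v + 1
    layers = [add_one, layer_during_which_request_arrives, add_one, add_one]

    x, nxt = run_with_safepoints(layers, 0.0, flag)
    print("preempted before layer", nxt)

    flag.clear()  # online work is done; resume the offline iteration
    x, nxt = run_with_safepoints(layers, x, flag, start_layer=nxt)
    print("finished:", nxt == len(layers))
```

In this sketch the worst-case preemption delay is the run time of a single layer rather than a whole iteration, which is the property the summary attributes to safepoints.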

Key Findings

ConServe reduces online tail latency while co-serving, relative to existing co-serving baselines.

Numbers: P99 online latency reduced by 2.9× on average, as reported in the paper

Practical Use: If you adopt ConServe, expect substantially lower P99 latency than existing co-serving baselines when mixing offline batches with live traffic; run the profiler first to tune SLO slack.

Evidence Ref: Abstract; §6.2

ConServe raises offline throughput when co-serving.

Numbers: Offline throughput improved by 2.2× (paper average), and by up to 3.0× versus a strong preemptive baseline in some tests

Practical Use: You can process more background batch work on the same GPUs without hurting the online experience; ConServe converts idle capacity into useful offline work.

Evidence Ref: Abstract; §6.2

Results

Metric: P99 online TTFT (time to first token)
Value: reduced by 2.9× on average compared to state-of-the-art co-serving baselines
Baseline: state-of-the-art preemptive baselines (Sarathi-P, DistServe-P)
Delta: 2.9× reduction (avg)
Split / Dataset: evaluated workloads (synthetic + real traces)
Evidence: Abstract; §6.2
Evidence Ref: Abstract; §6.2

Metric: P99 online TBT (time between tokens)
Value: reduced by about 2.7× on average versus baselines
Baseline: state-of-the-art preemptive baselines
Delta: ≈2.7× reduction (avg)
Split / Dataset: synthetic workloads (§6.2)
Evidence: §6.2 (reported 2.72× in-text)
Evidence Ref: §6.2

What To Try In 7 Days

Run the one-time offline profiler on your model/hardware to collect the P vs context grid (paper says ~20 minutes for large models).

Enable token-level admission control: limit offline tokens per iteration using the profiler's can_schedule check.

Instrument safepoints between transformer layers and add incremental token-level KV checkpoints; test preemption latency on a small cluster slice.
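The first two steps above (profile once, then gate offline admission) can be sketched as follows. This is a minimal illustration under several assumptions: the profile grid is keyed by (tokens in the batch, context tokens), the latency numbers are invented, latency is assumed to grow with both inputs, and the `can_schedule` signature is a guess since the summary only names the check. A table lookup that rounds up to the next profiled cell stands in for the paper's fitted latency model.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple

# Hypothetical profiler output: (batch tokens, context tokens) -> latency in ms.
# These numbers are made up for illustration; they are not measurements.
PROFILE: Dict[Tuple[int, int], float] = {
    (256, 4096): 12.0, (256, 16384): 15.0,
    (512, 4096): 18.0, (512, 16384): 22.0,
    (1024, 4096): 30.0, (1024, 16384): 36.0,
}
TOKEN_GRID: List[int] = sorted({t for t, _ in PROFILE})
CONTEXT_GRID: List[int] = sorted({c for _, c in PROFILE})


def _round_up(grid: List[int], value: int) -> int:
    """Smallest grid point >= value; raises if value is past the profiled range."""
    i = bisect_left(grid, value)
    if i == len(grid):
        raise ValueError(f"{value} exceeds profiled range (max {grid[-1]})")
    return grid[i]


def predict_ms(batch_tokens: int, context_tokens: int) -> float:
    """Conservative estimate: use the next-larger profiled grid cell."""
    key = (_round_up(TOKEN_GRID, batch_tokens),
           _round_up(CONTEXT_GRID, context_tokens))
    return PROFILE[key]


def can_schedule(online_tokens: int, offline_tokens: int,
                 context_tokens: int, slo_ms: float) -> bool:
    """True if the combined batch is predicted to finish within the SLO."""
    try:
        total = online_tokens + offline_tokens
        return predict_ms(total, context_tokens) <= slo_ms
    except ValueError:
        return False  # outside the profiled range: refuse rather than guess


def max_offline_tokens(online_tokens: int, context_tokens: int,
                       slo_ms: float, step: int = 64) -> int:
    """Largest multiple of `step` offline tokens that still passes can_schedule."""
    admitted = 0
    while can_schedule(online_tokens, admitted + step, context_tokens, slo_ms):
        admitted += step
    return admitted


if __name__ == "__main__":
    print(max_offline_tokens(online_tokens=200, context_tokens=8000, slo_ms=25.0))
```

Rounding up to the next grid cell errs on the side of admitting fewer offline tokens, which seems the safer default when the online SLO is the hard constraint.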

Optimization Features

Token Efficiency: token budgeting per iteration; dynamic admission of offline tokens
Infra Optimization: overlap device-host KV transfer with compute; NVLink/PCIe-aware checkpoint bandwidth utilization
System Optimization: profiler-based latency model (P, C polynomial); safepoints with low-cost host-memory flag read; separate CUDA stream for KV transfers
Inference Optimization: SLO-aware token-level scheduling; layer-wise (sub-iteration) preemption; incremental token-level KV checkpointing; background KV prefetching
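Incremental token-level KV checkpointing, listed above, can be pictured with a toy model. The class below is a hypothetical sketch, not the paper's data structure: Python lists stand in for device and host memory, and each KV entry is an opaque `bytes` value. The point it illustrates is that if the already-generated prefix is copied to the host in the background, only the unsaved tail has to be flushed at preemption time.

```python
from typing import List


class IncrementalKVCheckpoint:
    """Toy model: per-token KV entries on the 'device', mirrored to the 'host'."""

    def __init__(self) -> None:
        self.device_kv: List[bytes] = []  # one entry per processed token
        self.host_kv: List[bytes] = []    # checkpointed prefix of device_kv

    def append_token(self, kv_entry: bytes) -> None:
        self.device_kv.append(kv_entry)

    def pending(self) -> int:
        """Number of device entries not yet copied to the host."""
        return max(0, len(self.device_kv) - len(self.host_kv))

    def checkpoint(self, max_tokens: int) -> int:
        """Copy up to max_tokens unsaved entries to the host; return the count."""
        start = len(self.host_kv)
        chunk = self.device_kv[start:start + max_tokens]
        self.host_kv.extend(chunk)
        return len(chunk)

    def preempt(self) -> int:
        """Flush the unsaved tail, then free the device copy.

        Returns how many entries had to be copied at preemption time.
        """
        copied = self.checkpoint(self.pending())
        self.device_kv = []
        return copied

    def resume(self) -> None:
        """Restore the device copy from the host checkpoint."""
        self.device_kv = list(self.host_kv)


if __name__ == "__main__":
    ck = IncrementalKVCheckpoint()
    for i in range(10):
        ck.append_token(bytes([i]))
    ck.checkpoint(4)   # background copy while compute continues
    ck.checkpoint(4)
    print("copied at preemption:", ck.preempt())
    ck.resume()
    print("tokens restored:", len(ck.device_kv))
```

In a real system the background copy would run on its own transfer stream so that it overlaps with compute; the sketch leaves that out and keeps only the bookkeeping.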

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Requires instrumenting model layers and modifying the serving engine (paper added ~9k LOC to vLLM).

Needs host memory headroom to checkpoint KV caches efficiently; very tight host memory reduces benefit.
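To judge how much host memory headroom the checkpointing limitation implies, a rough sizing calculation may help. The sketch below uses the standard per-token KV-cache size for a transformer with grouped-query attention (keys plus values, per layer, per KV head). The Llama-3.1 8B figures (32 layers, 8 KV heads, head dimension 128, 2-byte values) come from the publicly released model configuration, not from this summary, and the 100,000-token figure is an arbitrary example.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: keys and values for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def host_bytes_needed(checkpointed_tokens: int, bytes_per_token: int) -> int:
    """Host memory needed to hold KV checkpoints for the given number of tokens."""
    return checkpointed_tokens * bytes_per_token


if __name__ == "__main__":
    # Assumed configuration for Llama-3.1 8B in 16-bit precision.
    per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
    total = host_bytes_needed(100_000, per_token)
    print(f"{per_token / 1024:.0f} KiB per token")
    print(f"{total / 2**30:.1f} GiB for 100,000 checkpointed tokens")
```

Under these assumptions each token costs 128 KiB, so checkpointing on the order of 100,000 offline tokens needs roughly 12 GiB of host memory.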

When Not To Use

If you already run offline work on separate, idle clusters and prefer strict separation of workloads.

When host memory is extremely limited and swapping bandwidth is low (tight single-server setups).

Failure Modes

Profiler misprediction admits too many offline tokens, causing SLO violations.

Token-level checkpointing I/O could become a bottleneck on slow PCIe or overloaded host I/O.
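Both failure modes can be screened with simple arithmetic before deployment. The helpers below are generic guards, not mechanisms described in the paper: `admit_with_margin` pads the predicted latency by a safety margin to absorb profiler error, and `checkpoint_link_utilization` estimates what fraction of the host link the checkpoint traffic would consume. The function names, the 10% margin, and the example bandwidth and throughput figures are all assumptions for illustration.

```python
def admit_with_margin(predicted_ms: float, slo_ms: float,
                      margin: float = 0.10) -> bool:
    """Admit offline work only if the padded prediction still meets the SLO."""
    return predicted_ms * (1.0 + margin) <= slo_ms


def checkpoint_link_utilization(offline_tokens_per_s: float,
                                bytes_per_token: int,
                                link_bytes_per_s: float) -> float:
    """Fraction of device-to-host bandwidth used by per-token KV checkpoints."""
    return offline_tokens_per_s * bytes_per_token / link_bytes_per_s


if __name__ == "__main__":
    # Hypothetical inputs: 24 ms predicted against a 25 ms SLO.
    print(admit_with_margin(predicted_ms=24.0, slo_ms=25.0))

    # Hypothetical inputs: 5,000 offline tokens/s, 128 KiB of KV per token,
    # and an assumed effective link bandwidth of 25 GB/s.
    util = checkpoint_link_utilization(5_000, 131_072, 25e9)
    print(f"link utilization: {util:.1%}")
```

A utilization close to 1.0 would suggest the checkpoint stream could become the bottleneck the second failure mode describes, while a prediction that only passes without the margin is a candidate for the first.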

Core Entities

Models

Llama-3.1 8B, Llama-3.1 70B, Qwen-2.5 14B

Metrics

P99 TTFT (time to first token), P99 TBT (time between tokens), offline throughput (tokens/s)

Datasets

BurstGPT (online trace), DuReader, MultiNews, VCSUM