Harvest millisecond GPU idle cycles by slicing work into tokens, layers, and tiny KV checkpoints.

October 2, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

1

Authors

Yifan Qiao, Shu Anzai, Shan Yu, Haoran Ma, Shuo Yang, Yang Wang, Miryung Kim, Yongji Wu, Yang Zhou, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica, Harry Xu

Why It Matters For Business

You can run batch jobs (benchmarks, analytics) on the same expensive GPUs used for live LLM inference without degrading customer-facing latency. That turns idle capacity into usable throughput and reduces waste from overprovisioning.

Summary TLDR

ConServe is a serving system that co-runs latency-sensitive online LLM requests with latency-tolerant offline batches. It manages work at token, layer, and per-token KV-cache granularity to reclaim millisecond GPU idle cycles. On H100 hardware with Llama-3.1 and Qwen, ConServe keeps online P99 latency near an online-only baseline while increasing offline throughput by roughly 2–3× on evaluated workloads.

Problem Statement

GPU clusters for LLM serving often sit idle because online traffic is bursty. Existing co-serving or preemption approaches operate at coarse granularity (per-request or per-iteration) and either harm online tail latency or waste offline throughput. The paper asks: can we harvest those idle cycles without violating strict online SLOs?

Main Contribution

SLO-aware token-level scheduler that uses a profiler-based latency model to decide how many offline tokens to add without breaking online TBT/TTFT SLOs.

Layer-wise (sub-iteration) preemption implemented with cheap safepoints between transformer layers to preempt offline work within milliseconds.

Incremental token-level KV cache checkpointing and background prefetch to preempt and resume offline work at near-zero cost.
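
The layer-wise preemption idea can be illustrated with a minimal sketch. It is not the paper's implementation: `run_with_safepoints`, the `threading.Event` flag, and the plain Python callables standing in for transformer layers are all assumptions made for illustration. The sketch only shows the control flow, which is to check a cheap flag between layers and stop offline work at the next layer boundary.

```python
import threading
from typing import Callable, List, Optional, Tuple

Layer = Callable[[float], float]  # stand-in for a transformer layer (assumption)


def run_with_safepoints(
    layers: List[Layer],
    x: float,
    preempt_flag: threading.Event,
) -> Tuple[float, Optional[int]]:
    """Run layers in order, checking a preemption flag before each one.

    Returns (value, next_layer). next_layer is None when every layer ran,
    otherwise it is the index of the first layer that did not run.
    """
    for i, layer in enumerate(layers):
        # Safepoint: a cheap flag read at the layer boundary.
        if preempt_flag.is_set():
            return x, i
        x = layer(x)
    return x, None


# Usage: an online request arriving mid-iteration sets the flag,
# and the offline forward pass stops at the next layer boundary.
flag = threading.Event()
offline_layers: List[Layer] = [lambda v: v + 1.0 for _ in range(4)]
value, next_layer = run_with_safepoints(offline_layers, 0.0, flag)
```

Whether the partial activations are kept for a later resume or discarded is a design choice this sketch leaves to the caller.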

Key Findings

ConServe reduces online tail latency while co-serving.

Numbers: P99 online latency reduced by up to 2.9× (average reported in the paper)

ConServe raises offline throughput when co-serving.

Numbers: Offline throughput improved by 2.2× (paper average) and up to 3.0× vs a strong preemptive baseline in some tests

Token- and layer-granular control makes preemption fast.

Numbers: Preemption reacts within ~13ms; safepoint cost ≈21µs (single-GPU) or 167µs (4-GPU TP)

Incremental KV checkpointing keeps preemption cost low.

Numbers: Transferring 2048 tokens takes ≈10ms; per-token transfer capacity far exceeds generation throughput (349K tokens/s, calculated)
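
As a quick sanity check on the transfer figure, the arithmetic below converts 2048 tokens in roughly 10 ms into an effective rate. The helper name is hypothetical, and the result is only what the 10 ms measurement implies; the 349K tokens/s figure is the paper's own calculation and is not reproduced here.

```python
def implied_tokens_per_second(tokens: int, transfer_ms: float) -> float:
    """Effective rate implied by moving `tokens` in `transfer_ms` milliseconds."""
    return tokens * 1000 / transfer_ms


rate = implied_tokens_per_second(2048, 10)  # 2048 tokens in about 10 ms
print(f"{rate:,.0f} tokens/s")  # prints 204,800 tokens/s
```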

Results

P99 online TTFT (tail time to first token)

Value: reduced by up to 2.9× (the paper's reported average) compared to state-of-the-art co-serving baselines

Baseline: state-of-the-art preemptive baselines (Sarathi-P, DistServe-P)

P99 online TBT (time between tokens)

Value: reduced by up to ~2.7× (reported averages vs baselines)

Baseline: state-of-the-art preemptive baselines

Offline throughput (tokens/s)

Value: improved by 2.2× on average (paper summary); in some comparisons 3.0× vs Sarathi-P

Baseline: Sarathi-Preemptive and other preemptive baselines

Offline throughput vs optimal (Non-Preemptive)

Value: ConServe achieves ~78–93% of the optimal offline throughput depending on SLO slack and model

Baseline: Non-Preemptive (throughput-optimal) baseline

Who Should Care

What To Try In 7 Days

Run the one-time offline profiler on your model/hardware to collect the P vs context grid (paper says ~20 minutes for large models).

Enable token-level admission control: limit offline tokens per iteration using the profiler's can_schedule check.

Instrument safepoints between transformer layers and add incremental token-level KV checkpoints; test preemption latency on a small cluster slice.
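
The admission-control step can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the text names a `can_schedule` check, but the signature used here, the `LatencyModel` class, its linear form, and every coefficient are hypothetical. A real deployment would fit the model from the offline profiler's measurements. The sketch also ignores the extra context contributed by the offline tokens themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyModel:
    """Hypothetical linear latency model; coefficients would come from profiling."""
    base_ms: float
    per_token_ms: float
    per_context_token_ms: float

    def predict_ms(self, batch_tokens: int, context_tokens: int) -> float:
        return (self.base_ms
                + self.per_token_ms * batch_tokens
                + self.per_context_token_ms * context_tokens)


def can_schedule(model: LatencyModel, online_tokens: int, offline_tokens: int,
                 context_tokens: int, slo_ms: float) -> bool:
    """True if the predicted iteration latency stays within the online SLO."""
    total = online_tokens + offline_tokens
    return model.predict_ms(total, context_tokens) <= slo_ms


def max_offline_tokens(model: LatencyModel, online_tokens: int,
                       context_tokens: int, slo_ms: float, cap: int) -> int:
    """Largest offline token count in [0, cap] that passes can_schedule.

    Binary search; assumes predicted latency never decreases as tokens are added.
    """
    if not can_schedule(model, online_tokens, 0, context_tokens, slo_ms):
        return 0  # online work alone already exceeds the SLO
    lo, hi = 0, cap
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if can_schedule(model, online_tokens, mid, context_tokens, slo_ms):
            lo = mid
        else:
            hi = mid - 1
    return lo


# Usage with made-up coefficients: how many offline tokens fit this iteration?
model = LatencyModel(base_ms=4.0, per_token_ms=0.03125, per_context_token_ms=0.0002)
budget = max_offline_tokens(model, online_tokens=64, context_tokens=8192,
                            slo_ms=20.0, cap=2048)
```

Recomputing the budget every iteration lets the offline share shrink automatically when an online burst arrives.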

Optimization Features

Token Efficiency

  • token budgeting per iteration
  • dynamic admission of offline tokens

Infra Optimization

  • overlap device-host KV transfer with compute
  • NVLink/PCIe-aware checkpoint bandwidth utilization

System Optimization

  • profiler-based latency model (P, C polynomial)
  • safepoints with low-cost host-memory flag read
  • separate CUDA stream for KV transfers

Inference Optimization

  • SLO-aware token-level scheduling
  • layer-wise (sub-iteration) preemption
  • incremental token-level KV checkpointing
  • background KV prefetching
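
Incremental token-level KV checkpointing can be pictured with the toy model below. It is a sketch under assumptions, not ConServe's implementation: Python lists stand in for GPU and host memory, and the class and method names are hypothetical. The point it illustrates is that copying a few new tokens in the background each iteration leaves only a small delta to flush when preemption actually happens.

```python
from typing import Any, List


class IncrementalKVCheckpoint:
    """Toy model: device_kv is the live cache, host_kv its checkpointed prefix."""

    def __init__(self) -> None:
        self.device_kv: List[Any] = []  # per-token KV entries on the "GPU"
        self.host_kv: List[Any] = []    # entries already copied to "host memory"

    def append(self, kv_entry: Any) -> None:
        """Record the KV entry produced for one newly generated token."""
        self.device_kv.append(kv_entry)

    def pending(self) -> int:
        """Number of tokens not yet checkpointed."""
        return len(self.device_kv) - len(self.host_kv)

    def checkpoint_step(self, max_tokens: int) -> int:
        """Copy up to max_tokens un-checkpointed entries; return how many were copied."""
        start = len(self.host_kv)
        chunk = self.device_kv[start:start + max_tokens]
        self.host_kv.extend(chunk)
        return len(chunk)

    def preempt(self) -> int:
        """Flush the remaining delta, free device memory, return tokens flushed."""
        flushed = self.checkpoint_step(self.pending())
        self.device_kv.clear()
        return flushed

    def resume(self) -> None:
        """Restore the device cache from the host copy (prefetch stand-in)."""
        self.device_kv = list(self.host_kv)


# Usage: checkpoint one token per iteration, then preempt with a small delta.
ckpt = IncrementalKVCheckpoint()
for token_id in range(10):
    ckpt.append(token_id)
    ckpt.checkpoint_step(max_tokens=1)
ckpt.append(10)
flushed_at_preemption = ckpt.preempt()
ckpt.resume()
```

In a real engine the background copy would run on a separate CUDA stream so it overlaps with compute, as the System Optimization list notes.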

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Requires instrumenting model layers and modifying the serving engine (paper added ~9k LOC to vLLM).
  • Needs host memory headroom to checkpoint KV caches efficiently; very tight host memory reduces benefit.
  • Profiler is hardware- and model-specific; a miscalibrated latency model can under- or over-allocate offline tokens.
  • Gains shrink for extreme request shapes: very long inputs (>16K) or tiny outputs (<128 tokens) leave little spare capacity.

When Not To Use

  • If you already run offline work on separate, idle clusters and prefer strict separation of workloads.
  • When host memory is extremely limited and swapping bandwidth is low (tight single-server setups).
  • For models or deployments where modifying the serving stack or model graphs is disallowed.

Failure Modes

  • Profiler misprediction admits too many offline tokens, causing SLO violations.
  • Token-level checkpointing I/O could become a bottleneck on slow PCIe or overloaded host I/O.
  • Tensor-parallel synchronization issues if safepoints are not correctly broadcast across workers.
  • Background prefetching may not keep up under heavy eviction patterns, forcing recomputation.

Core Entities

Models

  • Llama-3.1 8B
  • Llama-3.1 70B
  • Qwen-2.5 14B

Metrics

  • P99 TTFT (time to first token)
  • P99 TBT (time between tokens)
  • offline throughput (tokens/s)

Datasets

  • BurstGPT (online trace)
  • DuReader
  • MultiNews
  • VCSUM