Harvest millisecond GPU idle cycles by slicing work into tokens, layers, and tiny KV checkpoints.

Overview

Decision Snapshot: Needs Validation

Evaluated on real hardware (H100 DGX) with three large models on synthetic and real workloads. The implementation builds on vLLM and shows consistent improvements. No public code is linked, which limits reproducibility.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Yifan Qiao, Shu Anzai, Shan Yu, Haoran Ma, Shuo Yang, Yang Wang, Miryung Kim, Yongji Wu, Yang Zhou, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica, Harry Xu

Why It Matters For Business

You can run batch jobs (benchmarks, analytics) on the same expensive GPUs used for live LLM inference without degrading customer-facing latency. That turns idle capacity into usable throughput and reduces waste from overprovisioning.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager, Founder

Summary TLDR

ConServe is a serving system that co-runs latency-sensitive online LLM requests with latency-tolerant offline batches. It manages work at token, layer, and per-token KV-cache granularity to reclaim millisecond GPU idle cycles. On H100 hardware with Llama-3.1 and Qwen, ConServe keeps online P99 latency near an online-only baseline while increasing offline throughput by roughly 2–3× on evaluated workloads.

Problem Statement

GPU clusters for LLM serving often sit idle because online traffic is bursty. Existing co-serving or preemption approaches operate at coarse granularity (per-request or per-iteration) and either harm online tail latency or waste offline throughput. The paper asks: can we harvest those idle cycles without violating strict online SLOs?

Main Contribution

SLO-aware token-level scheduler that uses a profiler-based latency model to decide how many offline tokens to add without breaking the online time-between-tokens (TBT) and time-to-first-token (TTFT) SLOs.

Layer-wise (sub-iteration) preemption implemented with cheap safepoints between transformer layers to preempt offline work within milliseconds.
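
The layer-wise preemption idea can be illustrated with a minimal sketch, assuming a cooperative flag that a scheduler sets when online work arrives. The names `run_with_safepoints` and `preempt_flag` are hypothetical, plain Python callables stand in for transformer layers, and `threading.Event` stands in for the cheap flag read. Whether the real system resumes or discards partially executed offline work is not stated here, so the sketch simply returns the layer index where it stopped.

```python
import threading


def run_with_safepoints(layers, hidden, preempt_flag, start_layer=0):
    """Run layers[start_layer:] in order, checking a flag before each layer.

    Returns (hidden, next_layer). next_layer == len(layers) means the
    forward pass finished; anything smaller means it stopped at a safepoint.
    """
    for i in range(start_layer, len(layers)):
        # Safepoint: one cheap flag read between layers.
        if preempt_flag.is_set():
            return hidden, i
        hidden = layers[i](hidden)
    return hidden, len(layers)


# Toy demo (hypothetical layers): each "layer" adds 1 to an integer state.
flag = threading.Event()


def add_one(x):
    return x + 1


def add_one_then_request_preempt(x):
    flag.set()  # simulate an online request arriving mid-iteration
    return x + 1


layers = [add_one, add_one_then_request_preempt, add_one, add_one]
state, next_layer = run_with_safepoints(layers, 0, flag)
print(state, next_layer)  # stopped before layer 2

flag.clear()  # online burst handled; resume the offline work
state, next_layer = run_with_safepoints(layers, state, flag, start_layer=next_layer)
print(state, next_layer)  # finished all 4 layers
```

The worst-case wait before offline work yields is one layer's execution time rather than a full iteration, which is the property the safepoints are meant to provide.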

Key Findings

ConServe reduces online tail latency while co-serving.

Numbers: P99 online latency reduced by 2.9× on average (as reported in the paper)

Practical Use: When mixing offline batches with live traffic, expect substantially lower P99 latency than with the preemptive baselines; run the paper's one-time profiling step on your model and hardware to tune SLO slack.

Evidence Ref: Abstract; §6.2

ConServe raises offline throughput when co-serving.

Numbers: Offline throughput improved by 2.2× on average (as reported in the paper) and by up to 3.0× versus a strong preemptive baseline in some tests

Practical Use: You can process more background batch work on the same GPUs without hurting the online experience, converting idle capacity into useful offline work.

Evidence Ref: Abstract; §6.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
P99 online TTFT (time to first token)	2.9× lower on average than the baselines	state-of-the-art preemptive baselines (Sarathi-P, DistServe-P)	2.9× reduction (avg)	evaluated workloads (synthetic + real traces)	Abstract; §6.2	Abstract; §6.2
P99 online TBT (time between tokens)	≈2.7× lower on average than the baselines	state-of-the-art preemptive baselines	≈2.7× reduction (avg)	synthetic workloads (§6.2)	§6.2 (reported as 2.72× in text)	§6.2

What To Try In 7 Days

Run the one-time offline profiler on your model/hardware to collect the P vs context grid (paper says ~20 minutes for large models).

Enable token-level admission control: limit offline tokens per iteration using the profiler's can_schedule check.

Instrument safepoints between transformer layers and add incremental token-level KV checkpoints; test preemption latency on a small cluster slice.
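
The admission-control step above can be sketched as follows. Only the name `can_schedule` comes from the text; its signature, the `LatencyModel` class, the linear cost form, and every coefficient are assumptions made up for illustration, not the paper's fitted model. Latencies are kept in integer microseconds so the arithmetic is exact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyModel:
    """Hypothetical linear latency model; all costs in microseconds."""
    base_us: int
    us_per_new_token: int
    us_per_context_token: int

    def predict_us(self, new_tokens, context_tokens):
        return (self.base_us
                + self.us_per_new_token * new_tokens
                + self.us_per_context_token * context_tokens)


def can_schedule(model, online_new, offline_new, context_tokens, slo_us):
    """True if the predicted iteration latency stays within the online SLO."""
    predicted = model.predict_us(online_new + offline_new, context_tokens)
    return predicted <= slo_us


def max_offline_tokens(model, online_new, context_tokens, slo_us, limit):
    """Largest offline token count in [0, limit] that passes can_schedule.

    Binary search is valid because predicted latency grows with token count.
    """
    if not can_schedule(model, online_new, 0, context_tokens, slo_us):
        return 0  # the online work alone already exceeds the SLO
    lo, hi = 0, limit
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if can_schedule(model, online_new, mid, context_tokens, slo_us):
            lo = mid
        else:
            hi = mid - 1
    return lo


# Made-up coefficients: 5 ms base, 20 us per new token, 1 us per context token.
model = LatencyModel(base_us=5000, us_per_new_token=20, us_per_context_token=1)
budget = max_offline_tokens(model, online_new=64, context_tokens=1000,
                            slo_us=20000, limit=2048)
print(budget)  # 636 with these made-up numbers
```

In practice the coefficients would come from the one-time profiling run, and the budget would be recomputed every iteration as the online batch changes.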

Optimization Features

Token Efficiency

token budgeting per iteration; dynamic admission of offline tokens

Infra Optimization

overlap device-host KV transfer with compute; NVLink/PCIe-aware checkpoint bandwidth utilization

System Optimization

profiler-based latency model (P, C polynomial); safepoints with low-cost host-memory flag read; separate CUDA stream for KV transfers

Inference Optimization

SLO-aware token-level scheduling; layer-wise (sub-iteration) preemption; incremental token-level KV checkpointing; background KV prefetching
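
Incremental token-level KV checkpointing and prefetching can be pictured with a small simulation. This is a sketch under stated assumptions: the class `IncrementalKVCheckpoint` and its methods are hypothetical, Python lists stand in for GPU and host buffers, and no real device transfer or CUDA stream is involved. It shows only the bookkeeping: entries already copied in the background do not need to be copied again at preemption time.

```python
class IncrementalKVCheckpoint:
    """Toy bookkeeping for per-token KV checkpoints (hypothetical API)."""

    def __init__(self):
        self.device_kv = []   # stand-in for per-token KV entries on the GPU
        self.host_kv = []     # stand-in for the host-memory checkpoint
        self.resident = True  # whether the KV cache is currently on the GPU

    def append_token(self, kv_entry):
        self.device_kv.append(kv_entry)

    def pending(self):
        """Number of tokens on the device that are not yet checkpointed."""
        if not self.resident:
            return 0
        return len(self.device_kv) - len(self.host_kv)

    def checkpoint(self, max_tokens):
        """Copy up to max_tokens unsaved entries to host; return the count."""
        start = len(self.host_kv)
        chunk = self.device_kv[start:start + max_tokens]
        self.host_kv.extend(chunk)
        return len(chunk)

    def preempt(self):
        """Flush the unsaved tail, then release the device copy."""
        copied = self.checkpoint(self.pending())
        self.device_kv = []
        self.resident = False
        return copied

    def restore(self):
        """Prefetch the checkpoint back onto the device."""
        self.device_kv = list(self.host_kv)
        self.resident = True


ckpt = IncrementalKVCheckpoint()
for token_id in range(10):
    ckpt.append_token(token_id)

ckpt.checkpoint(4)          # background copy of the first 4 tokens
ckpt.checkpoint(4)          # background copy of the next 4 tokens
print(ckpt.pending())       # 2 tokens still unsaved
print(ckpt.preempt())       # only those 2 are copied at preemption time
ckpt.restore()
print(len(ckpt.device_kv))  # all 10 entries are back
```

The design point is that preemption cost scales with the unsaved tail, not with the full sequence length.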

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Requires instrumenting model layers and modifying the serving engine (paper added ~9k LOC to vLLM).

Needs host memory headroom to checkpoint KV caches efficiently; very tight host memory reduces benefit.
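
To get a feel for the host-memory headroom involved, a back-of-the-envelope estimate of KV-cache size helps. The sketch below uses the standard per-token formula (keys plus values, across all layers and KV heads). The layer counts, KV-head counts, and head dimension are taken from the public Llama-3.1 model configurations, and the 16-bit element size is an assumption; none of these figures come from the paper summary itself.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies: K and V for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Assumed configs (public model cards): 8 KV heads, head_dim 128, 16-bit values.
llama_8b = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
llama_70b = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

print(llama_8b // 1024)    # 128 KiB per token
print(llama_70b // 1024)   # 320 KiB per token

# Host memory needed to checkpoint one 4096-token sequence.
seq_len = 4096
print(llama_8b * seq_len // 2**20)   # 512 MiB
print(llama_70b * seq_len // 2**20)  # 1280 MiB
```

Multiplying by the number of offline sequences that may be preempted at once gives a rough lower bound on the host memory to reserve.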

When Not To Use

If you already run offline work on separate, idle clusters and prefer strict separation of workloads.

When host memory is extremely limited and swapping bandwidth is low (tight single-server setups).

Failure Modes

Profiler misprediction admits too many offline tokens, causing SLO violations.

Token-level checkpointing I/O could become a bottleneck on slow PCIe or overloaded host I/O.

Core Entities

Models

Llama-3.1 8B, Llama-3.1 70B, Qwen-2.5 14B

Metrics

P99 TTFT (time to first token), P99 TBT (time between tokens), offline throughput (tokens/s)
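
For readers who want to compute these metrics on their own traces, here is a minimal sketch. The function names are hypothetical, timestamps are integer milliseconds, and the nearest-rank percentile definition is an assumption, since the percentile method used in the paper is not stated here.

```python
def nearest_rank_percentile(values, p):
    """Nearest-rank percentile for an integer p in [1, 100]."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # ceil(p * n / 100)
    return ordered[rank - 1]


def ttft(arrival_ms, token_times_ms):
    """Time to first token: first emission time minus arrival time."""
    return token_times_ms[0] - arrival_ms


def tbt_gaps(token_times_ms):
    """Time between tokens: gaps between consecutive emissions."""
    return [b - a for a, b in zip(token_times_ms, token_times_ms[1:])]


def throughput_tokens_per_s(num_tokens, elapsed_ms):
    """Offline throughput in tokens per second."""
    return num_tokens * 1000 / elapsed_ms


# Made-up trace for one request that arrived at t = 0 ms.
token_times = [120, 150, 185, 210]
print(ttft(0, token_times))                              # 120
print(tbt_gaps(token_times))                             # [30, 35, 25]
print(nearest_rank_percentile(tbt_gaps(token_times), 99))  # 35
print(throughput_tokens_per_s(5000, 2000))               # 2500.0
```

For P99 TTFT, collect one `ttft` value per request across the trace and pass the list to `nearest_rank_percentile`; for P99 TBT, pool the gaps from all requests.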

Datasets

BurstGPT (online trace), DuReader, MultiNews, VCSUM
