Overview
ConServe was evaluated on real hardware (an H100 DGX) with three large models, using both synthetic and real workloads. The implementation is built on vLLM and shows consistent improvements. No public code is linked, which reduces reproducibility.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
You can run batch jobs (benchmarks, analytics) on the same expensive GPUs used for live LLM inference without degrading customer-facing latency. That turns idle capacity into usable throughput and reduces waste from overprovisioning.
Summary TLDR
ConServe is a serving system that co-runs latency-sensitive online LLM requests with latency-tolerant offline batches. It manages work at token, layer, and per-token KV-cache granularity to reclaim millisecond GPU idle cycles. On H100 hardware with Llama-3.1 and Qwen, ConServe keeps online P99 latency near an online-only baseline while increasing offline throughput by roughly 2–3× on evaluated workloads.
Problem Statement
GPU clusters for LLM serving often sit idle because online traffic is bursty. Existing co-serving or preemption approaches operate at coarse granularity (per-request or per-iteration) and either harm online tail latency or waste offline throughput. The paper asks: can those idle cycles be harvested without violating strict online SLOs?
Main Contribution
An SLO-aware token-level scheduler that uses a profiler-based latency model to decide how many offline tokens to add to each batch without breaking online time-between-tokens (TBT) and time-to-first-token (TTFT) SLOs.
Layer-wise (sub-iteration) preemption, implemented with cheap safepoints between transformer layers, so that offline work can be preempted within milliseconds.
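The safepoint idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `Suspended` and `run_offline_layers` are hypothetical, layers are modeled as plain functions rather than GPU kernels, and the choice to resume from the saved layer (rather than rerun the iteration) is an assumption made for illustration.

```python
import threading
from dataclasses import dataclass
from typing import Callable, List, Union

Hidden = List[float]
Layer = Callable[[Hidden], Hidden]


@dataclass
class Suspended:
    """Progress saved when offline work yields at a safepoint (hypothetical type)."""
    next_layer: int   # index of the first layer that has not run yet
    hidden: Hidden    # activations produced by the layers that did run


def run_offline_layers(layers: List[Layer], hidden: Hidden,
                       preempt: threading.Event,
                       start: int = 0) -> Union[Hidden, Suspended]:
    """Run layers[start:], checking the preempt flag before each layer."""
    for i in range(start, len(layers)):
        if preempt.is_set():          # safepoint: one cheap flag read per layer
            return Suspended(next_layer=i, hidden=hidden)
        hidden = layers[i](hidden)
    return hidden
```

Because the flag is checked once per layer, the worst-case delay before offline work yields is roughly the time of a single layer instead of a full iteration. The caller is expected to clear the flag before resuming with `start=suspended.next_layer`.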
Key Findings
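ConServe reduces online tail latency while co-serving: P99 TTFT is roughly 2.9× lower and P99 TBT roughly 2.7× lower than state-of-the-art preemptive co-serving baselines (Sarathi-P, DistServe-P).
ConServe raises offline throughput when co-serving, by roughly 2–3× on the evaluated workloads, while keeping online P99 latency near an online-only baseline.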
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| P99 online TTFT (time to first token) | ≈2.9× lower (reported as an average) | State-of-the-art preemptive co-serving baselines (Sarathi-P, DistServe-P) | 2.9× reduction (avg) | Evaluated workloads (synthetic + real traces) | Abstract; §6.2 | Abstract; §6.2 |
| P99 online TBT (time between tokens) | ≈2.7× lower (2.72× reported in text, average) | State-of-the-art preemptive baselines | ≈2.7× reduction (avg) | Synthetic workloads (§6.2) | §6.2 | §6.2 |
What To Try In 7 Days
Run the one-time offline profiler on your model/hardware to collect the P vs context grid (paper says ~20 minutes for large models).
Enable token-level admission control: limit offline tokens per iteration using the profiler's can_schedule check.
Instrument safepoints between transformer layers and add incremental token-level KV checkpoints; test preemption latency on a small cluster slice.
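The profiler and admission steps above can be prototyped before touching the serving engine. The sketch below is an assumption-heavy illustration: only the name `can_schedule` comes from the summary, while its signature, the `LatencyProfile` class, the `margin` safety factor, and all latency numbers in `demo_profile` are made up. Lookups round up to the next profiled grid point so that predictions err on the slow (conservative) side.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple


class LatencyProfile:
    """Iteration latency (ms) profiled on a grid of (batch tokens, context length)."""

    def __init__(self, token_grid: List[int], context_grid: List[int],
                 latency_ms: Dict[Tuple[int, int], float]):
        self.token_grid = sorted(token_grid)
        self.context_grid = sorted(context_grid)
        self.latency_ms = latency_ms

    @staticmethod
    def _round_up(grid: List[int], value: int) -> int:
        i = bisect_left(grid, value)          # first grid point >= value
        if i == len(grid):
            raise ValueError("value lies outside the profiled range")
        return grid[i]

    def predict(self, batch_tokens: int, context_len: int) -> float:
        t = self._round_up(self.token_grid, batch_tokens)
        c = self._round_up(self.context_grid, context_len)
        return self.latency_ms[(t, c)]

    def can_schedule(self, online_tokens: int, offline_tokens: int,
                     context_len: int, slo_ms: float, margin: float = 0.9) -> bool:
        """True if the predicted latency stays within margin * SLO."""
        try:
            predicted = self.predict(online_tokens + offline_tokens, context_len)
        except ValueError:
            return False                      # never admit beyond what was profiled
        return predicted <= margin * slo_ms


def max_offline_tokens(profile: LatencyProfile, online_tokens: int,
                       context_len: int, slo_ms: float,
                       step: int = 64, cap: int = 4096) -> int:
    """Grow the offline token budget in `step` increments while the check passes."""
    admitted = 0
    while admitted + step <= cap and profile.can_schedule(
            online_tokens, admitted + step, context_len, slo_ms):
        admitted += step
    return admitted


# Made-up profile: latency grows linearly with batch tokens and context length.
TOKENS = [128, 256, 512, 1024]
CONTEXTS = [1024, 4096]
demo_profile = LatencyProfile(
    TOKENS, CONTEXTS,
    {(t, c): 5.0 + 0.02 * t + 0.001 * c for t in TOKENS for c in CONTEXTS},
)
offline_budget = max_offline_tokens(demo_profile, online_tokens=64,
                                    context_len=1000, slo_ms=20.0)
```

A real deployment would replace `demo_profile` with measurements from the one-time profiling run and would likely need a finer grid or interpolation; the linear budget search is chosen here for readability, not speed.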
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Requires instrumenting model layers and modifying the serving engine (paper added ~9k LOC to vLLM).
Needs host memory headroom to checkpoint KV caches efficiently; very tight host memory reduces benefit.
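To judge whether host memory headroom is adequate, a back-of-the-envelope estimate of KV-cache size helps. The sketch below uses the standard per-token KV-cache formula for a transformer with grouped-query attention. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) and the 500,000-token checkpoint volume are assumed example values, not figures from the paper.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector plus one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def host_bytes_needed(tokens_to_checkpoint: int, bytes_per_token: int) -> int:
    """Host memory required to hold checkpoints for the given number of tokens."""
    return tokens_to_checkpoint * bytes_per_token


# Assumed example shape: 32 layers, 8 KV heads, head_dim 128, fp16/bf16 values.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
host_gib = host_bytes_needed(500_000, per_token) / 2**30
print(f"{per_token} bytes/token, {host_gib:.1f} GiB for 500,000 tokens")
```

Under these assumptions each token costs 128 KiB, so checkpointing half a million offline tokens would need about 61 GiB of host memory; larger models or longer offline contexts scale this linearly.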
When Not To Use
If you already run offline work on separate, idle clusters and prefer strict separation of workloads.
When host memory is extremely limited and swapping bandwidth is low (tight single-server setups).
Failure Modes
Profiler misprediction admits too many offline tokens, causing SLO violations.
Token-level checkpointing I/O could become a bottleneck on slow PCIe or overloaded host I/O.
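The checkpointing I/O risk can be sized with simple arithmetic: transfer time is bytes moved divided by effective bandwidth. The sketch below is illustrative only. The function names, the 128 KiB-per-token figure, both bandwidth values, and the 5 ms per-iteration budget are assumptions, not measurements from the paper.

```python
def checkpoint_ms(new_tokens: int, bytes_per_token: int,
                  bandwidth_bytes_per_s: float) -> float:
    """Milliseconds needed to copy the KV entries of `new_tokens` to host memory."""
    return 1000.0 * new_tokens * bytes_per_token / bandwidth_bytes_per_s


def fits_budget(new_tokens: int, bytes_per_token: int,
                bandwidth_bytes_per_s: float, budget_ms: float) -> bool:
    """True if the incremental checkpoint completes within the time budget."""
    return checkpoint_ms(new_tokens, bytes_per_token,
                         bandwidth_bytes_per_s) <= budget_ms


BYTES_PER_TOKEN = 131_072      # assumed: 128 KiB of KV cache per token
FAST_LINK = 25e9               # assumed effective host link bandwidth, bytes/s
SLOW_LINK = 4e9                # assumed contended or slow link, bytes/s

fast_ms = checkpoint_ms(256, BYTES_PER_TOKEN, FAST_LINK)
slow_ms = checkpoint_ms(256, BYTES_PER_TOKEN, SLOW_LINK)
```

With these assumed numbers, checkpointing 256 new tokens takes about 1.3 ms on the fast link but about 8.4 ms on the slow one, which would exceed a 5 ms iteration budget. Measuring the real effective bandwidth under load is the input that matters most.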

