Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
1
Why It Matters For Business
You can run batch jobs (benchmarks, analytics) on the same expensive GPUs used for live LLM inference without degrading customer-facing latency. That turns idle capacity into usable throughput and reduces waste from overprovisioning.
Summary TLDR
ConServe is a serving system that co-runs latency-sensitive online LLM requests with latency-tolerant offline batches. It manages work at token, layer, and per-token KV-cache granularity to reclaim millisecond GPU idle cycles. On H100 hardware with Llama-3.1 and Qwen, ConServe keeps online P99 latency near an online-only baseline while increasing offline throughput by roughly 2–3× on evaluated workloads.
Problem Statement
GPU clusters for LLM serving are often idle because online traffic is bursty. Existing co-serving and preemption approaches operate at coarse granularity (per-request or per-iteration), so they either harm online tail latency or waste offline throughput. The paper asks: can those idle cycles be harvested without violating strict online SLOs?
Main Contribution
SLO-aware token-level scheduler that uses a profiler-based latency model to decide how many offline tokens to add without breaking online time-between-tokens (TBT) and time-to-first-token (TTFT) SLOs.
Layer-wise (sub-iteration) preemption implemented with cheap safepoints between transformer layers to preempt offline work within milliseconds.
Incremental token-level KV cache checkpointing and background prefetch to preempt and resume offline work at near-zero cost.
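The layer-wise preemption idea can be illustrated with a minimal sketch. It assumes a model is a list of callables and uses a `threading.Event` as a stand-in for the preemption flag; the function name `run_layers` and the choice to return the index of the next layer (so the caller can resume or discard the partial pass) are hypothetical and not taken from the paper.

```python
import threading
from typing import Callable, List, Tuple


def run_layers(
    layers: List[Callable],
    hidden,
    preempt_flag: threading.Event,
    start_layer: int = 0,
) -> Tuple[object, int]:
    """Run layers in order, checking a flag at a safepoint before each layer.

    Returns (hidden, next_layer). next_layer == len(layers) means the pass
    finished; a smaller value means the pass stopped at that safepoint.
    """
    for i in range(start_layer, len(layers)):
        # Safepoint: a cheap flag read between layers (illustrative stand-in).
        if preempt_flag.is_set():
            return hidden, i
        hidden = layers[i](hidden)
    return hidden, len(layers)
```

A caller would set the flag when an online request arrives, let the offline pass stop at the next layer boundary, and later clear the flag and call `run_layers` again with `start_layer` set to the returned index. Because the check happens once per layer, the worst-case wait in this sketch is the duration of a single layer rather than a whole iteration.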
Key Findings
ConServe keeps online P99 latency close to an online-only baseline while co-serving.
ConServe raises offline throughput by roughly 2–3× on the evaluated workloads when co-serving.
Token- and layer-granular control makes preemption fast.
Incremental KV checkpointing keeps preemption cost low.
Results
P99 online TTFT (tail time to first token)
P99 online TBT (time between tokens)
Offline throughput (tokens/s)
Offline throughput vs optimal (Non-Preemptive)
Who Should Care
What To Try In 7 Days
Run the one-time offline profiler on your model and hardware to collect the P vs. context grid (the paper reports about 20 minutes for large models).
Enable token-level admission control: limit offline tokens per iteration using the profiler's can_schedule check.
Instrument safepoints between transformer layers and add incremental token-level KV checkpoints; test preemption latency on a small cluster slice.
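The token-level admission step above can be sketched as follows. This is an illustration only: the bilinear latency formula, the coefficient names, and the helper `max_offline_tokens` are assumptions made for the example, and `can_schedule` here is a hypothetical reimplementation of the check named above, not the paper's code. Real coefficients would come from the profiler run on your own model and hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyModel:
    """Assumed fitted model (illustrative form, not the paper's):
    latency_ms = a + b*tokens + c*context + d*tokens*context
    """
    a: float  # fixed per-iteration overhead (ms)
    b: float  # cost per scheduled token (ms)
    c: float  # cost per token of attended context (ms)
    d: float  # interaction term (ms)

    def predict_ms(self, tokens: int, context: int) -> float:
        return self.a + self.b * tokens + self.c * context + self.d * tokens * context


def can_schedule(model: LatencyModel, online_tokens: int, offline_tokens: int,
                 context: int, slo_ms: float) -> bool:
    """True if the predicted iteration latency stays within the online SLO."""
    total = online_tokens + offline_tokens
    return model.predict_ms(total, context) <= slo_ms


def max_offline_tokens(model: LatencyModel, online_tokens: int, context: int,
                       slo_ms: float, cap: int = 4096) -> int:
    """Largest offline token count (up to cap) that still passes can_schedule.

    Binary search is valid only if predicted latency is non-decreasing in the
    token count, i.e. b and d are non-negative.
    """
    if not can_schedule(model, online_tokens, 0, context, slo_ms):
        return 0  # online work alone already exceeds the budget
    lo, hi = 0, cap
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if can_schedule(model, online_tokens, mid, context, slo_ms):
            lo = mid
        else:
            hi = mid - 1
    return lo
```

In a deployment you would likely subtract a safety margin from `slo_ms` before calling `max_offline_tokens`, since a miscalibrated model can admit too many offline tokens.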
Optimization Features
Token Efficiency
- token budgeting per iteration
- dynamic admission of offline tokens
Infra Optimization
- overlap device-host KV transfer with compute
- NVLink/PCIe-aware checkpoint bandwidth utilization
System Optimization
- profiler-based latency model (P, C polynomial)
- safepoints with low-cost host-memory flag read
- separate CUDA stream for KV transfers
Inference Optimization
- SLO-aware token-level scheduling
- layer-wise (sub-iteration) preemption
- incremental token-level KV checkpointing
- background KV prefetching
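To make incremental token-level KV checkpointing concrete, here is a minimal sketch that uses plain Python lists as stand-ins for GPU-resident and host-resident KV entries. The class name, the method names, and the policy of flushing the unsaved tail at preemption time are all assumptions for illustration; the paper's actual mechanism may differ, for example in how it overlaps copies with compute.

```python
from typing import Any, List


class IncrementalKVCheckpoint:
    """Toy model of checkpointing per-token KV entries to host memory."""

    def __init__(self) -> None:
        self.device_kv: List[Any] = []  # stand-in for KV entries on the GPU
        self.host_kv: List[Any] = []    # stand-in for the host-memory copy

    def append_token(self, kv: Any) -> None:
        """Record the KV entry produced for one newly generated token."""
        self.device_kv.append(kv)

    def pending(self) -> int:
        """Number of device entries not yet copied to the host."""
        return max(0, len(self.device_kv) - len(self.host_kv))

    def checkpoint_step(self, max_tokens: int) -> int:
        """Copy up to max_tokens unsaved entries to the host; return the count."""
        start = len(self.host_kv)
        chunk = self.device_kv[start:start + max(0, max_tokens)]
        self.host_kv.extend(chunk)
        return len(chunk)

    def preempt(self) -> int:
        """Flush only the unsaved tail, then release device memory."""
        copied = self.checkpoint_step(self.pending())
        self.device_kv = []
        return copied

    def resume(self) -> None:
        """Restore device state from the host checkpoint."""
        self.device_kv = list(self.host_kv)
```

The point of the sketch is that if `checkpoint_step` runs regularly in the background, the work left for `preempt` is bounded by the few tokens generated since the last step rather than by the full context length.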
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Requires instrumenting model layers and modifying the serving engine (paper added ~9k LOC to vLLM).
- Needs host memory headroom to checkpoint KV caches efficiently; very tight host memory reduces benefit.
- Profiler is hardware- and model-specific; miscalibrated models can under/over-allocate offline tokens.
- Gains shrink for extreme request shapes: very long inputs (>16K tokens) or tiny outputs (<128 tokens) leave little spare capacity.
When Not To Use
- If you already run offline work on separate, idle clusters and prefer strict separation of workloads.
- When host memory is extremely limited and swapping bandwidth is low (tight single-server setups).
- For models or deployments where modifying the serving stack or model graphs is disallowed.
Failure Modes
- Profiler misprediction admits too many offline tokens, causing SLO violations.
- Token-level checkpointing I/O could become a bottleneck on slow PCIe or overloaded host I/O.
- Tensor-parallel synchronization issues if safepoints are not correctly broadcast across workers.
- Background prefetching may not keep up under heavy eviction patterns, forcing recomputation.
Core Entities
Models
- Llama-3.1 8B
- Llama-3.1 70B
- Qwen-2.5 14B
Metrics
- P99 TTFT (time to first token)
- P99 TBT (time between tokens)
- offline throughput (tokens/s)
Datasets
- BurstGPT (online trace)
- DuReader
- MultiNews
- VCSUM

