Overview
Production Readiness
0.75
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
1
Why It Matters For Business
DéjàVu cuts wasted GPU time, reduces GPU memory needs, and shortens recovery after node failures, so you can serve larger LLMs more cheaply and more reliably in pipeline-parallel clusters.
Summary TLDR
DéjàVu provides a KV-cache streaming library (DéjàVuLib) plus a serving design that: (1) splits prompt processing from token generation to eliminate pipeline bubbles, (2) swaps KV caches per microbatch between GPU and CPU to save GPU memory and let you use bigger batches, and (3) replicates KV-cache state to enable fast recovery on failures. On pipeline-parallel workloads the system reports up to 2× throughput vs FasterTransformer, up to 1.8× throughput from microbatch swapping, streaming overheads within ~2%, and substantial reduction in failure recovery work.
Problem Statement
Serving autoregressive LLMs at scale wastes GPU capacity and stalls on failures for three reasons: prompt processing takes much longer than per-token generation, which creates pipeline bubbles; frameworks over-allocate KV-cache memory for all in-flight microbatches; and losing KV-cache state forces requests to restart from scratch.
Main Contribution
DéjàVuLib: a low-overhead, modular KV-cache streaming library with primitives for stream in/out, scatter/gather, and flush/fetch
Prompt-token disaggregation: dedicate separate machines to prompt processing and to token generation, with a resource planner that partitions machines between the two
Microbatch-level KV-cache swapping: move inactive microbatch cache to CPU to free GPU memory and enable larger batches
KV-cache replication and fast recovery protocol to resume from the last replicated token on failures
Evaluation across GPT2/OPT/BLOOM variants showing throughput, memory, and failure-recovery improvements
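The memory effect of microbatch-level swapping can be estimated with simple arithmetic. The sketch below is not DéjàVu code: the function names, the 24 GiB budget, and the assumption that two microbatches stay resident on the GPU (one active, one being prefetched) are illustrative. It uses the standard KV-cache size for multi-head attention: 2 × layers × tokens × hidden size × bytes per element.

```python
GIB = 1024 ** 3


def kv_bytes_per_request(layers_on_stage: int, seq_len: int, hidden_size: int,
                         bytes_per_elem: int = 2) -> int:
    """KV-cache bytes one request needs on one pipeline stage (keys + values)."""
    return 2 * layers_on_stage * seq_len * hidden_size * bytes_per_elem


def max_batch_size(kv_budget_bytes: int, resident_microbatches: int,
                   layers_on_stage: int, seq_len: int, hidden_size: int,
                   bytes_per_elem: int = 2) -> int:
    """Largest per-microbatch batch size whose resident KV caches fit the budget."""
    per_request = kv_bytes_per_request(layers_on_stage, seq_len, hidden_size,
                                       bytes_per_elem)
    return kv_budget_bytes // (resident_microbatches * per_request)


# Hypothetical stage: 12 layers, 1024-token sequences, hidden size 4096, fp16.
# Without swapping, all 4 in-flight microbatches keep their KV cache on the GPU.
no_swap = max_batch_size(24 * GIB, resident_microbatches=4,
                         layers_on_stage=12, seq_len=1024, hidden_size=4096)
# With swapping (assumed), only 2 microbatches are resident at any time.
with_swap = max_batch_size(24 * GIB, resident_microbatches=2,
                           layers_on_stage=12, seq_len=1024, hidden_size=4096)
print(no_swap, with_swap)  # 32 64
```

In this toy configuration, halving the number of resident microbatches doubles the batch size that fits. Real gains depend on whether the CPU↔GPU transfers can be hidden behind compute.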
Key Findings
Disaggregation increases throughput versus a pipeline-parallel baseline
Microbatch swapping lets you use bigger batch sizes and raise throughput
KV-cache streaming is implemented with very low runtime overhead
Buffered copy optimization dramatically reduces small-copy overheads
Fault-tolerance reduces wasted recomputation after failures
Results
Throughput vs FasterTransformer (pipeline-parallel): up to 2×
Throughput improvement from microbatch swapping: up to 1.8×
KV streaming runtime overhead: within ~2%
Buffered copy optimization
Failure-induced latency multiplier
Who Should Care
What To Try In 7 Days
Run DéjàVuLib microbenchmarks on a representative model to measure streaming overhead on your PCIe/network
Simulate prompt vs token timings and test disaggregation to see if throughput rises for your workloads
Enable microbatch swapping to see if you can double batch size without extra GPUs and measure throughput change
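Before running these experiments, a back-of-the-envelope split between prompt and token machines can indicate whether disaggregation is worth testing. The sketch below is a hypothetical planner, not the one described in the paper: `plan_split` and its parameters are invented for illustration, and it assumes that each side's throughput scales linearly with machine count and that the slower side bounds end-to-end throughput.

```python
def plan_split(total_machines: int, prompt_time_s: float,
               token_time_s: float, tokens_per_request: int):
    """Return (prompt_machines, token_machines, estimated_requests_per_s).

    prompt_time_s: time for one machine to process one request's prompt.
    token_time_s:  time for one machine to generate one token.
    """
    if total_machines < 2:
        raise ValueError("need at least one machine on each side")
    best = None
    for prompt_machines in range(1, total_machines):
        token_machines = total_machines - prompt_machines
        # Assumed linear scaling: requests/s each side can sustain.
        prompt_rate = prompt_machines / prompt_time_s
        token_rate = token_machines / (token_time_s * tokens_per_request)
        rate = min(prompt_rate, token_rate)  # the slower side is the bottleneck
        if best is None or rate > best[2]:
            best = (prompt_machines, token_machines, rate)
    return best


# Illustrative timings only; measure your own workload before relying on them.
print(plan_split(8, prompt_time_s=0.5, token_time_s=0.0625, tokens_per_request=24))
```

With these made-up timings the estimate favors 2 prompt machines and 6 token machines. Substituting measured prompt and per-token times shows how sensitive the split is to the workload's prompt-to-generation ratio.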
Agent Features
Memory
- microbatch-level KV-cache swapping
- KV-cache replication for fault tolerance
Planning
- resource allocation planner for prompt vs token pipelines
Tool Use
- DéjàVuLib primitives: stream out/in, scatter/gather, flush/fetch
Frameworks
- FasterTransformer integration
Architectures
- pipeline parallel
- tensor model parallel
Collaboration
- controller-worker coordination with heartbeats
Optimization Features
Token Efficiency
- supports larger batch sizes by freeing GPU KV memory
Infra Optimization
- CPU↔GPU swapping over PCIe
- support for local SSD and remote CPU streaming
- NCCL/OpenMPI/Boost copies for remote transfers
System Optimization
- resource allocation planner to balance prompt and token pipelines
- KV-cache replication and fast recovery protocol
Inference Optimization
- prompt-token disaggregation
- microbatch KV-cache swapping
- layer-by-layer prompt streaming
- buffered aggregation of small KV updates
- parallelized token streaming with computation
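The "buffered aggregation of small KV updates" item can be illustrated with a staging buffer that turns many small writes into a few large copies. This is a minimal sketch under stated assumptions, not DéjàVuLib's API: `BufferedStreamer` and its `sink` callback are hypothetical names, and the sink stands in for whatever bulk copy (GPU to CPU, or to a remote host) the real system issues.

```python
class BufferedStreamer:
    """Accumulate small chunks and emit them as fewer, larger copies."""

    def __init__(self, capacity_bytes: int, sink):
        self.capacity = capacity_bytes
        self.sink = sink          # called once per bulk copy with a bytes object
        self.buf = bytearray()

    def write(self, chunk: bytes) -> None:
        # Flush first if this chunk would overflow a non-empty buffer.
        if self.buf and len(self.buf) + len(chunk) > self.capacity:
            self.flush()
        self.buf += chunk
        # Flush as soon as the buffer is full (also covers oversized chunks).
        if len(self.buf) >= self.capacity:
            self.flush()

    def flush(self) -> None:
        if self.buf:
            self.sink(bytes(self.buf))
            self.buf.clear()


copies = []
streamer = BufferedStreamer(capacity_bytes=64, sink=copies.append)
for _ in range(10):               # ten small 16-byte KV updates
    streamer.write(b"\x00" * 16)
streamer.flush()                  # push out the partially filled tail
print(len(copies))  # 3
```

Ten small updates become three bulk copies here. The buffer size trades copy count against how long an update waits before it leaves the GPU.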
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Designed for pipeline-parallel, multi-node deployments; single-GPU cases may not benefit
- Benefits shrink once the prompt KV-cache streaming overhead factor m reaches 2 or more, as can happen on high-latency links or slow PCIe
- Evaluation runs on specific GPUs and 40/32 Gbps links; results may vary on other infra
- Paper does not link to production code for immediate reuse
When Not To Use
- Models that fit entirely on a single GPU with no pipeline parallelism
- Workloads where prompts are tiny and per-token time dominates
- Environments with very slow CPU↔GPU links where streaming overhead exceeds gains
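Whether a link is too slow can be roughly checked by comparing KV-cache transfer time with prompt compute time. The helper below is an illustrative estimate only: the function names are hypothetical, it ignores link latency and protocol overhead, and it treats the transfer as not overlapped with compute, whereas the system streams layer by layer and in parallel with computation.

```python
def transfer_time_s(num_bytes: int, link_gbps: float) -> float:
    """Seconds to move num_bytes over a link of link_gbps gigabits per second."""
    return num_bytes * 8 / (link_gbps * 1e9)


def streaming_to_compute_ratio(num_bytes: int, link_gbps: float,
                               prompt_compute_s: float) -> float:
    """Ratio of (non-overlapped) KV transfer time to prompt compute time."""
    return transfer_time_s(num_bytes, link_gbps) / prompt_compute_s


# Made-up numbers: 5 GB of prompt KV cache, a 40 Gbps link, 2 s of prompt compute.
print(transfer_time_s(5_000_000_000, 40))                   # 1.0
print(streaming_to_compute_ratio(5_000_000_000, 40, 2.0))   # 0.5
```

A ratio well below 1 suggests the transfer can plausibly be hidden behind compute; a ratio near or above 1 is a signal to measure before committing to streaming.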
Failure Modes
- High PCIe or network latency can make streaming overheads dominate and negate disaggregation gains
- Replication adds network traffic; when failures are rare, this overhead can slightly reduce throughput without a compensating recovery benefit
- Complexity of orchestration (controller, heartbeats, background threads) increases operational surface
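The controller-side bookkeeping behind heartbeats and resume-from-replica can be sketched in a few lines. This is a simplified, hypothetical model rather than the paper's protocol: the `Controller` class, its method names, and the timeout value are assumptions, and time is passed in explicitly so the example stays deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    timeout_s: float
    last_seen: dict = field(default_factory=dict)        # worker id -> last heartbeat time
    replicated_upto: dict = field(default_factory=dict)  # request id -> tokens replicated

    def heartbeat(self, worker: str, now: float) -> None:
        self.last_seen[worker] = now

    def failed_workers(self, now: float) -> list:
        """Workers whose last heartbeat is older than the timeout."""
        return sorted(w for w, t in self.last_seen.items()
                      if now - t > self.timeout_s)

    def ack_replication(self, request_id: str, tokens: int) -> None:
        """Record that the KV cache for the first `tokens` tokens is replicated."""
        prev = self.replicated_upto.get(request_id, 0)
        self.replicated_upto[request_id] = max(prev, tokens)

    def tokens_to_recompute(self, request_id: str, tokens_generated: int) -> int:
        """Tokens lost on failure: everything after the last replicated token."""
        return max(0, tokens_generated - self.replicated_upto.get(request_id, 0))


ctl = Controller(timeout_s=3.0)
ctl.heartbeat("w0", now=0.0)
ctl.heartbeat("w1", now=0.0)
ctl.heartbeat("w0", now=4.0)          # w1 has gone silent
ctl.ack_replication("req-1", tokens=40)

print(ctl.failed_workers(now=5.0))              # ['w1']
print(ctl.tokens_to_recompute("req-1", 43))     # 3
print(ctl.tokens_to_recompute("req-2", 43))     # 43 (nothing replicated)
```

The last two lines show the trade-off in the list above: with replication, a failure costs only the tokens generated since the last acknowledged replica, while an unreplicated request restarts from scratch.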
Core Entities
Models
- GPT2-1.5B
- OPT-13B
- OPT-30B
- OPT-66B
- BLOOM-176B
Metrics
- Throughput (req/sec)
- Normalized latency (seconds/token)
- Makespan
- GPU memory usage
- Streaming slowdown (%)
- Recovery latency multiplier (×)
Datasets
- LMSys-chat (LMSys dataset of generated token counts)
Benchmarks
- Microbenchmarks (single-batch KV streaming)
- End-to-end pipeline-parallel throughput and latency traces
- Simulator-based makespan and cost for traces
Context Entities
Models
- Megatron-style tensor-model parallel layers (context for pipeline+tensor mix)

