Overview
Evidence shows consistent throughput and failure-recovery wins on multiple large models and practical GPU setups, but results depend on PCIe/network speed and pipeline settings.
Citations: 1
Evidence Strength: 0.85
Confidence: 0.90
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 70%
Production readiness: 75%
Novelty: 60%
Why It Matters For Business
DéjàVu cuts wasted GPU time, reduces memory needs, and shortens recovery after node failures—so you can serve larger LLMs cheaper and more reliably in pipeline-parallel clusters.
Who Should Care
Summary TLDR
DéjàVu provides a KV-cache streaming library (DéjàVuLib) plus a serving design that: (1) splits prompt processing from token generation to eliminate pipeline bubbles, (2) swaps KV caches per microbatch between GPU and CPU to save GPU memory and let you use bigger batches, and (3) replicates KV-cache state to enable fast recovery on failures. On pipeline-parallel workloads the system reports up to 2× throughput vs FasterTransformer, up to 1.8× throughput from microbatch swapping, streaming overheads within ~2%, and substantial reduction in failure recovery work.
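To make the recovery claim concrete, here is a back-of-the-envelope sketch of how much work must be redone after a failure, with and without a replicated KV cache. It is a toy model, not DéjàVu's recovery protocol: the function name, its parameters, and the timing values are all illustrative assumptions.

```python
def recovery_work_seconds(prompt_time, token_time, tokens_done, tokens_replicated=None):
    """Estimate compute that must be redone for one request after a failure.

    tokens_replicated is the number of generated tokens whose KV-cache state
    survives in a replica (hypothetical parameter); None means no replica exists.
    """
    if tokens_replicated is None:
        # No surviving KV cache: rerun the prompt and every generated token.
        return prompt_time + tokens_done * token_time
    # Replica available: only regenerate tokens produced after the last replicated step.
    return (tokens_done - tokens_replicated) * token_time


# Illustrative numbers only: 2 s prompt, 50 ms per token, failure after 400 tokens.
from_scratch = recovery_work_seconds(2.0, 0.05, 400)
with_replica = recovery_work_seconds(2.0, 0.05, 400, tokens_replicated=398)
print(f"restart from scratch: {from_scratch:.2f} s, with replica: {with_replica:.2f} s")
```

The gap widens with the number of tokens already generated, which is why losing KV-cache state late in a long generation is the expensive case.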
Problem Statement
Serving autoregressive LLMs at scale wastes GPUs and stalls on failures because prompt processing is much slower than per-token generation, frameworks over-allocate KV-cache memory for all in-flight microbatches, and loss of KV-cache state forces requests to restart from scratch.
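The bubble effect can be illustrated with a toy simulation of a linear pipeline in which one microbatch is in the slow prompt phase while the others generate tokens. This is a simplified sketch under stated assumptions (a single forward pass, identical per-stage time for a given microbatch, made-up durations); it is not the paper's performance model.

```python
def stage_idle_fractions(durations, num_stages):
    """Simulate one forward pass of microbatches through a linear pipeline.

    durations[i] is the per-stage compute time of microbatch i (assumed equal
    on every stage). Returns, per stage, the fraction of its active window
    (first start to last finish) that the stage spends idle.
    """
    busy = sum(durations)
    ready = [0.0] * len(durations)  # time each microbatch arrives at the current stage
    fractions = []
    for _ in range(num_stages):
        clock = None
        first_start = None
        finished = []
        for arrival, duration in zip(ready, durations):
            # A stage starts a microbatch once it is free and the input has arrived.
            start = arrival if clock is None else max(clock, arrival)
            if first_start is None:
                first_start = start
            clock = start + duration
            finished.append(clock)
        fractions.append(1.0 - busy / (clock - first_start))
        ready = finished  # outputs of this stage feed the next one
    return fractions


# Illustrative durations: token steps take 1 unit, a prompt step takes 8 units.
print(stage_idle_fractions([1.0, 1.0, 1.0, 1.0], num_stages=3))  # uniform work
print(stage_idle_fractions([1.0, 8.0, 1.0, 1.0], num_stages=3))  # one prompt mixed in
```

With uniform durations no stage idles; once a long prompt step is mixed in, downstream stages wait on it, which is the waste that disaggregation targets.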
Main Contribution
DéjàVuLib: a low-overhead, modular KV-cache streaming library with primitives for stream in/out, scatter/gather, and flush/fetch
Prompt-token disaggregation: dedicate separate machines to prompt processing and to token generation, with a resource planner that partitions the machines between the two roles
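A resource planner of this kind can be sketched as a search over machine splits that balances the two pools. The version below is a deliberately minimal assumption, not the paper's planner: it ignores pipeline depth, memory limits, and streaming overhead, and `plan_split` and its parameters are hypothetical names.

```python
def plan_split(total_machines, prompt_time, token_time, new_tokens):
    """Pick how many machines serve prompts vs. token generation.

    Models each pool's request rate as machines / work-per-request and
    maximizes the slower of the two pools. Returns (prompt, token, rate).
    """
    if total_machines < 2:
        raise ValueError("need at least one machine per pool")
    best = None
    for prompt_machines in range(1, total_machines):
        token_machines = total_machines - prompt_machines
        prompt_rate = prompt_machines / prompt_time
        token_rate = token_machines / (token_time * new_tokens)
        rate = min(prompt_rate, token_rate)  # the slower pool is the bottleneck
        if best is None or rate > best[2]:
            best = (prompt_machines, token_machines, rate)
    return best


# Illustrative timings: 2 s per prompt, 0.1 s per token, 60 new tokens per request.
prompt_machines, token_machines, rate = plan_split(8, 2.0, 0.1, 60)
print(prompt_machines, token_machines, round(rate, 3))
```

The useful property to notice is that the best split depends on the ratio of prompt time to total generation time, so it should be recomputed when the workload's prompt or output lengths shift.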
Key Findings
Disaggregation increases throughput versus a pipeline-parallel baseline
Microbatch swapping lets you use bigger batch sizes and raise throughput
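The memory argument behind microbatch swapping can be worked through with simple arithmetic. The sketch below assumes a standard transformer KV cache (one key and one value vector per layer per token) and assumes that swapping keeps two microbatches resident on the GPU (the active one plus a prefetched one); that residency count, the memory budget, and the model shape are illustrative assumptions rather than figures from the paper.

```python
def kv_bytes_per_sequence(layers_on_stage, hidden_size, seq_len, bytes_per_elem=2):
    """KV-cache bytes for one sequence on one pipeline stage (keys + values)."""
    return 2 * layers_on_stage * hidden_size * seq_len * bytes_per_elem


def max_microbatch_size(kv_budget_bytes, per_seq_bytes, resident_microbatches):
    """Largest microbatch size whose resident KV caches fit in the budget."""
    return kv_budget_bytes // (resident_microbatches * per_seq_bytes)


# Illustrative setup: 16 layers per stage, hidden size 9216, 2048-token
# sequences in fp16, and a 20 GiB KV-cache budget per GPU.
per_seq = kv_bytes_per_sequence(layers_on_stage=16, hidden_size=9216, seq_len=2048)
budget = 20 * 1024**3

no_swap = max_microbatch_size(budget, per_seq, resident_microbatches=4)  # all 4 in-flight
swapped = max_microbatch_size(budget, per_seq, resident_microbatches=2)  # assumed double buffering
print(per_seq, no_swap, swapped)
```

Under these assumptions the microbatch size that fits doubles, because GPU memory no longer has to hold the KV cache of every in-flight microbatch at once.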
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Throughput vs FasterTransformer (pipeline-parallel) | up to 2× | FasterTransformer | ×2 | OPT-66B, BLOOM-176B; LMSys workload | Figure 12; §5.2.1 | §5.2.1, Fig.12 |
| Throughput improvement from microbatch swapping | up to 1.8× | no swapping (same GPUs) | ×1.8 | OPT/BLOOM variants | Figure 13; §5.2.2 | §5.2.2, Fig.13 |
What To Try In 7 Days
Run DéjàVuLib microbenchmarks on a representative model to measure streaming overhead on your PCIe/network
Simulate prompt vs token timings and test disaggregation to see if throughput rises for your workloads
Enable microbatch swapping to see if you can double batch size without extra GPUs and measure throughput change
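Before running real microbenchmarks, a rough estimate can indicate whether your link is fast enough for streaming to hide behind compute. The sketch below assumes transfers overlap perfectly with computation, so only the part of the transfer that outlasts the compute step is exposed. The function name and all numbers are hypothetical, and measured values on your hardware should replace them.

```python
def exposed_streaming_fraction(kv_bytes, bandwidth_bytes_per_s, compute_s):
    """Fraction of extra step time caused by KV streaming, assuming ideal overlap."""
    transfer_s = kv_bytes / bandwidth_bytes_per_s
    exposed_s = max(0.0, transfer_s - compute_s)  # overlap hides up to compute_s
    return exposed_s / compute_s


# Illustrative: 1 GB of KV state per step and 50 ms of compute per step.
fast_link = exposed_streaming_fraction(1e9, 25e9, 0.05)  # 25 GB/s link
slow_link = exposed_streaming_fraction(1e9, 10e9, 0.05)  # 10 GB/s link
print(f"fast link overhead: {fast_link:.0%}, slow link overhead: {slow_link:.0%}")
```

If the estimate is already well above zero for your link speed, the measured overhead is unlikely to land near the ~2% reported, and the gains from disaggregation or swapping should be tested with that in mind.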
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Designed for pipeline-parallel, multi-node deployments; single-GPU cases may not benefit
Benefits shrink when the prompt KV-cache streaming overhead factor m reaches 2 or more, as can happen on high-latency links or slow PCIe
When Not To Use
Models that fit entirely on a single GPU with no pipeline parallelism
Workloads where prompts are tiny and per-token time dominates
Failure Modes
High PCIe or network latency can make streaming overheads dominate and negate disaggregation gains
Replication adds extra network traffic; under very low failure rates this can slightly reduce throughput

