Stream KV-cache to cut pipeline bubbles, reduce GPU memory, and recover fast for pipeline-parallel LLMs

March 4, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

Evidence shows consistent throughput and failure-recovery wins on multiple large models and practical GPU setups, but results depend on PCIe/network speed and pipeline settings.

Citations: 1

Evidence Strength: 0.85

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 75%

Novelty: 60%

Authors

Foteini Strati, Sara Mcallister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic

Links

Abstract / PDF

Why It Matters For Business

DéjàVu cuts wasted GPU time, reduces memory needs, and shortens recovery after node failures—so you can serve larger LLMs cheaper and more reliably in pipeline-parallel clusters.

Who Should Care

Summary TLDR

DéjàVu provides a KV-cache streaming library (DéjàVuLib) plus a serving design that: (1) splits prompt processing from token generation to eliminate pipeline bubbles, (2) swaps KV caches per microbatch between GPU and CPU to save GPU memory and let you use bigger batches, and (3) replicates KV-cache state to enable fast recovery on failures. On pipeline-parallel workloads the system reports up to 2× throughput vs FasterTransformer, up to 1.8× throughput from microbatch swapping, streaming overheads within ~2%, and substantial reduction in failure recovery work.

Problem Statement

Serving autoregressive LLMs at scale wastes GPUs and stalls on failures because prompt processing is much slower than per-token generation, frameworks over-allocate KV-cache memory for all in-flight microbatches, and loss of KV-cache state forces requests to restart from scratch.
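To make the cost of losing KV-cache state concrete, here is a minimal sketch that counts how many token positions must be recomputed after a failure, with and without a replica. The function name `tokens_to_recompute` and the idea of tracking a "last replicated token" index are assumptions for illustration; the paper's actual recovery protocol is not reproduced here.

```python
from typing import Optional


def tokens_to_recompute(tokens_done: int,
                        last_replicated_token: Optional[int] = None) -> int:
    """Count token positions whose KV entries must be rebuilt after a failure.

    tokens_done: positions (prompt + generated) processed before the failure.
    last_replicated_token: highest position whose KV entries survive in a
        replica, or None when no replica exists (hypothetical bookkeeping).
    """
    if tokens_done < 0:
        raise ValueError("tokens_done must be non-negative")
    if last_replicated_token is None:
        # No surviving KV state: the request restarts from scratch.
        return tokens_done
    if not 0 <= last_replicated_token <= tokens_done:
        raise ValueError("last_replicated_token must lie in [0, tokens_done]")
    # Only positions after the last replicated one are redone.
    return tokens_done - last_replicated_token
```

For a request that had processed 900 positions, this model redoes all 900 without a replica, and only 10 if the replica was current up to position 890.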

Main Contribution

DéjàVuLib: a low-overhead, modular KV-cache streaming library with primitives for stream in/out, scatter/gather, and flush/fetch

Prompt-token disaggregation: dedicate machines to prompt processing vs token generation and a resource planner to partition them
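The partitioning decision can be illustrated with a toy model. This is a sketch under stated assumptions, not the planner described in the paper: every machine contributes equal capacity, one request costs `prompt_time_s` machine-seconds of prompt work and `token_time_s * tokens_per_request` machine-seconds of decode work, and the slower phase bounds end-to-end throughput. The name `plan_split` and all parameters are hypothetical.

```python
from typing import Tuple


def plan_split(total_machines: int,
               prompt_time_s: float,
               token_time_s: float,
               tokens_per_request: int) -> Tuple[int, int, float]:
    """Return (prompt_machines, token_machines, requests_per_s) for a toy model."""
    if total_machines < 2:
        raise ValueError("need at least one machine per phase")
    decode_time_s = token_time_s * tokens_per_request
    best = None
    # Try every split and keep the one whose bottleneck phase is fastest.
    for prompt_machines in range(1, total_machines):
        token_machines = total_machines - prompt_machines
        prompt_rate = prompt_machines / prompt_time_s
        token_rate = token_machines / decode_time_s
        rate = min(prompt_rate, token_rate)
        if best is None or rate > best[2]:
            best = (prompt_machines, token_machines, rate)
    return best


if __name__ == "__main__":
    # Illustrative inputs only; measure these on your own workload.
    print(plan_split(total_machines=8, prompt_time_s=2.0,
                     token_time_s=0.5, tokens_per_request=12))
```

A real planner would also need to account for pipeline depth, memory limits, and KV streaming cost, which this model ignores.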

Key Findings

Disaggregation increases throughput versus a pipeline-parallel baseline

Numbers: Up to 2× throughput improvement vs FasterTransformer on OPT-66B and BLOOM-176B

Practical Use: If you run pipeline-parallel LLMs, split prompt and decode pipelines to raise throughput when prompt times are large.

Evidence Ref: Abstract, §5.2.1, Fig.12

Microbatch swapping lets you use bigger batch sizes and raise throughput

Numbers: Up to 1.8× throughput gain by doubling batch size with swapping

Practical Use: If GPU memory limits batch size, implement microbatch swapping to move inactive KV caches to CPU and increase throughput.

Evidence Ref: Abstract, §5.2.2, Fig.13
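The memory effect of microbatch swapping can be sketched with the standard KV-cache size formula for multi-head attention (two tensors, K and V, per layer). The assumption that swapping keeps two microbatches resident per stage (the active one plus one being prefetched) is illustrative, as are all the numbers; none of them come from the paper.

```python
def kv_cache_bytes(layers: int, hidden_size: int, seq_len: int,
                   batch_size: int, bytes_per_value: int = 2) -> int:
    """KV-cache size: K and V tensors of shape batch x seq_len x hidden per layer."""
    return 2 * layers * hidden_size * seq_len * batch_size * bytes_per_value


def per_stage_kv_bytes(layers_per_stage: int, hidden_size: int, seq_len: int,
                       microbatch_size: int, resident_microbatches: int,
                       bytes_per_value: int = 2) -> int:
    """GPU memory one pipeline stage needs for the microbatches it keeps resident."""
    one = kv_cache_bytes(layers_per_stage, hidden_size, seq_len,
                         microbatch_size, bytes_per_value)
    return resident_microbatches * one


if __name__ == "__main__":
    GIB = 2 ** 30
    # Hypothetical stage: 16 layers, hidden size 8192, 2048 tokens, fp16.
    common = dict(layers_per_stage=16, hidden_size=8192,
                  seq_len=2048, microbatch_size=8)
    no_swap = per_stage_kv_bytes(resident_microbatches=4, **common)
    with_swap = per_stage_kv_bytes(resident_microbatches=2, **common)
    print(f"all 4 in-flight microbatches resident: {no_swap / GIB:.0f} GiB")
    print(f"swapping, 2 resident (assumed):        {with_swap / GIB:.0f} GiB")
```

With these illustrative numbers one microbatch's cache is 8 GiB per stage, so four resident microbatches need 32 GiB and two need 16 GiB. The freed memory is what allows a larger batch size on the same GPUs.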

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Throughput vs FasterTransformer (pipeline-parallel) | up to 2× | FasterTransformer | ×2 | OPT-66B, BLOOM-176B; LMSys workload | Figure 12; §5.2.1 | §5.2.1, Fig.12
Throughput improvement from microbatch swapping | up to 1.8× | no swapping (same GPUs) | ×1.8 | OPT/BLOOM variants | Figure 13; §5.2.2 | §5.2.2, Fig.13

What To Try In 7 Days

Run DéjàVuLib microbenchmarks on a representative model to measure streaming overhead on your PCIe/network

Simulate prompt vs token timings and test disaggregation to see if throughput rises for your workloads

Enable microbatch swapping to see if you can double batch size without extra GPUs and measure throughput change
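Before running the measurements above, a back-of-the-envelope estimate can indicate whether streaming is likely to hide behind compute on a given link. The sketch below assumes a transfer fully overlaps with a compute window and only the excess is exposed; the function names, the bandwidth figure, and the timings are placeholders, not measured values or DéjàVuLib APIs.

```python
def exposed_transfer_s(transfer_bytes: float,
                       bandwidth_bytes_per_s: float,
                       overlap_compute_s: float) -> float:
    """Transfer time not hidden behind compute, assuming perfect overlap."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    transfer_s = transfer_bytes / bandwidth_bytes_per_s
    return max(0.0, transfer_s - overlap_compute_s)


def slowdown_percent(step_compute_s: float, exposed_s: float) -> float:
    """Exposed transfer time as a percentage of the compute step."""
    if step_compute_s <= 0:
        raise ValueError("step_compute_s must be positive")
    return 100.0 * exposed_s / step_compute_s


if __name__ == "__main__":
    # Placeholder inputs: 4 GB of KV state over a 16 GB/s link,
    # with 0.2 s of compute available to overlap in a 1.0 s step.
    exposed = exposed_transfer_s(4e9, 16e9, overlap_compute_s=0.2)
    print(f"exposed transfer: {exposed:.3f} s")
    print(f"estimated slowdown: {slowdown_percent(1.0, exposed):.1f}%")
```

Real links rarely reach their nominal bandwidth, so treat the estimate as optimistic and confirm it with the microbenchmarks.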

Agent Features

Memory
microbatch-level KV-cache swapping; KV-cache replication for fault tolerance
Planning
resource allocation planner for prompt vs token pipelines
Tool Use
DéjàVuLib primitives: stream out/in, scatter/gather, flush/fetch
Frameworks
FasterTransformer integration
Architectures
pipeline parallel; tensor model parallel
Collaboration
controller-worker coordination with heartbeats
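
Heartbeat-based failure detection of this kind can be sketched generically as follows. This is not the paper's controller: the class name `HeartbeatMonitor`, the timeout policy, and the injectable clock are all assumptions chosen to keep the example small and testable.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Controller-side bookkeeping: a worker is suspected failed when its
    last heartbeat is older than timeout_s."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def record(self, worker_id: str) -> None:
        """Call whenever a heartbeat arrives from worker_id."""
        self._last_seen[worker_id] = self._clock()

    def failed_workers(self) -> List[str]:
        """Return workers whose heartbeats have timed out, sorted by id."""
        now = self._clock()
        return sorted(worker for worker, seen in self._last_seen.items()
                      if now - seen > self.timeout_s)


if __name__ == "__main__":
    fake_now = [0.0]  # a controllable clock for the demonstration
    monitor = HeartbeatMonitor(timeout_s=5.0, clock=lambda: fake_now[0])
    monitor.record("worker-0")
    monitor.record("worker-1")
    fake_now[0] = 4.0
    monitor.record("worker-1")  # worker-0 goes silent
    fake_now[0] = 6.0
    print(monitor.failed_workers())
```

In a system with KV-cache replication, the controller would use such a signal to trigger recovery from the replica instead of restarting affected requests.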

Optimization Features

Token Efficiency
supports larger batch sizes by freeing GPU KV memory
Infra Optimization
CPU↔GPU swapping over PCIe; support for local SSD and remote CPU streaming; NCCL/OpenMPI/Boost copies for remote transfers
System Optimization
resource allocation planner to balance prompt and token pipelines; KV-cache replication and fast recovery protocol
Inference Optimization
prompt-token disaggregation; microbatch KV-cache swapping; layer-by-layer prompt streaming; buffered aggregation of small KV updates; parallelized token streaming with computation

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Designed for pipeline-parallel, multi-node deployments; single-GPU cases may not benefit

Benefits shrink if prompt KV streaming overhead m ≥ 2 (high-latency links or slow PCIe)

When Not To Use

Models that fit entirely on a single GPU with no pipeline parallelism

Workloads where prompts are tiny and per-token time dominates

Failure Modes

High PCIe or network latency can make streaming overheads dominate and negate disaggregation gains

Replication adds extra network traffic; under very low failure rates this can slightly reduce throughput

Core Entities

Models

GPT2-1.5B, OPT-13B, OPT-30B, OPT-66B, BLOOM-176B

Metrics

Throughput (req/sec); Normalized latency (seconds/token); Makespan; GPU memory usage; Streaming slowdown (%); Recovery latency multiplier (×)

Datasets

LMSys-chat (LMSys dataset of generated token counts)

Benchmarks

Microbenchmarks (single-batch KV streaming); End-to-end pipeline-parallel throughput and latency traces; Simulator-based makespan and cost for traces

Context Entities

Models

Megatron-style tensor-model parallel layers (context for pipeline+tensor mix)