Stream KV-cache to cut pipeline bubbles, reduce GPU memory, and recover fast for pipeline-parallel LLMs

March 4, 2024 · 7 min

Overview

Production Readiness

0.75

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

1

Authors

Foteini Strati, Sara Mcallister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic

Why It Matters For Business

DéjàVu cuts wasted GPU time, reduces memory needs, and shortens recovery after node failures, so you can serve larger LLMs more cheaply and more reliably in pipeline-parallel clusters.

Summary TLDR

DéjàVu provides a KV-cache streaming library (DéjàVuLib) plus a serving design that: (1) splits prompt processing from token generation to eliminate pipeline bubbles, (2) swaps KV caches per microbatch between GPU and CPU to save GPU memory and allow bigger batches, and (3) replicates KV-cache state to enable fast recovery on failures. On pipeline-parallel workloads the system reports up to 2× throughput vs FasterTransformer, up to 1.8× throughput from microbatch swapping, streaming overhead within ~2%, and a smaller failure-induced microbatch latency increase (1.24× vs 1.91× for the baseline).

Problem Statement

Serving autoregressive LLMs at scale wastes GPU time and stalls on failures for three reasons: prompt processing takes much longer than generating a single token, which creates pipeline bubbles; frameworks over-allocate KV-cache memory for all in-flight microbatches; and losing KV-cache state forces requests to restart from scratch.
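To make the bubble problem concrete, here is a minimal toy model, not the paper's simulator. It assumes a lockstep pipeline in which every stage holds a different microbatch and each step lasts as long as the slowest microbatch in flight. The function name and the timing values are hypothetical.

```python
def lockstep_utilization(per_stage_times):
    """Fraction of time each stage is busy in a toy lockstep pipeline.

    per_stage_times[i] is the time microbatch i needs on one stage.
    Assumption: pipeline depth equals the number of microbatches, and every
    step lasts as long as the slowest microbatch currently in the pipeline.
    """
    depth = len(per_stage_times)
    busy = sum(per_stage_times)               # work each stage does per rotation
    elapsed = depth * max(per_stage_times)    # wall time of one full rotation
    return busy / elapsed


# One microbatch is in its prompt phase (10 time units per stage),
# the other three are generating tokens (1 time unit per stage).
mixed = lockstep_utilization([10.0, 1.0, 1.0, 1.0])
# All four microbatches are generating tokens.
uniform = lockstep_utilization([1.0, 1.0, 1.0, 1.0])

print(f"mixed prompt/token utilization: {mixed:.3f}")
print(f"token-only utilization:         {uniform:.3f}")
```

Under these assumed timings, a single prompt-phase microbatch leaves the stages idle most of the time, which is the effect that separating prompt processing from token generation is meant to remove.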

Main Contribution

DéjàVuLib: a low-overhead, modular KV-cache streaming library with primitives for stream in/out, scatter/gather, and flush/fetch

Prompt-token disaggregation: dedicate separate machines to prompt processing and token generation, with a resource planner to partition them

Microbatch-level KV-cache swapping: move inactive microbatch cache to CPU to free GPU memory and enable larger batches

KV-cache replication and fast recovery protocol to resume from the last replicated token on failures

Evaluation across GPT2/OPT/BLOOM variants showing throughput, memory, and failure-recovery improvements

Key Findings

Disaggregation increases throughput versus a pipeline-parallel baseline

Numbers: Up to 2× throughput improvement vs FasterTransformer on OPT-66B and BLOOM-176B

Microbatch swapping lets you use bigger batch sizes and raise throughput

Numbers: Up to 1.8× throughput gain by doubling batch size with swapping
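The memory arithmetic behind this finding can be sketched as follows. This is an illustrative calculation, not the paper's accounting: the model shapes are hypothetical, and the assumption that swapping keeps two microbatches resident on the GPU (the active one plus one being prefetched) is ours.

```python
def kv_cache_bytes(layers, hidden, seq_len, batch, bytes_per_value=2):
    """Bytes of KV cache for one microbatch on one pipeline stage.

    The leading 2 counts one key tensor and one value tensor per layer.
    """
    return 2 * layers * hidden * seq_len * batch * bytes_per_value


def resident_bytes(per_microbatch, in_flight, swapping, resident_slots=2):
    """GPU-resident KV bytes on one stage.

    Assumption: with swapping, only `resident_slots` microbatches stay on the
    GPU; the remaining in-flight microbatches live in CPU memory.
    """
    if swapping:
        return per_microbatch * min(in_flight, resident_slots)
    return per_microbatch * in_flight


GIB = 1024 ** 3
# Illustrative shapes only (hypothetical, not taken from the paper).
small = kv_cache_bytes(layers=16, hidden=9216, seq_len=2048, batch=8)
large = kv_cache_bytes(layers=16, hidden=9216, seq_len=2048, batch=16)

no_swap = resident_bytes(small, in_flight=4, swapping=False)
swap_same_batch = resident_bytes(small, in_flight=4, swapping=True)
swap_double_batch = resident_bytes(large, in_flight=4, swapping=True)

print(f"no swapping, batch 8:  {no_swap / GIB:.1f} GiB")
print(f"swapping, batch 8:     {swap_same_batch / GIB:.1f} GiB")
print(f"swapping, batch 16:    {swap_double_batch / GIB:.1f} GiB")
```

With four microbatches in flight and two resident slots, the doubled batch under swapping fits in the same GPU memory as the original batch without swapping.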

KV-cache streaming is implemented with very low runtime overhead

Numbers: Streaming slowdown within ~2% vs no streaming for a single batch

Buffered copy optimization dramatically reduces small-copy overheads

Numbers: Buffered copies gave 95× improvement vs naive contiguous-copy baseline
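The idea of aggregating many small copies into fewer large ones can be illustrated with a small sketch. It is not DéjàVuLib code: `CountingSink`, `naive_stream`, and `buffered_stream` are hypothetical names, and the sink only counts copy calls instead of moving data between devices.

```python
class CountingSink:
    """Stand-in for a copy destination; counts how many copy calls it receives."""

    def __init__(self):
        self.calls = 0
        self.data = bytearray()

    def copy(self, chunk):
        self.calls += 1
        self.data += chunk


def naive_stream(chunks, sink):
    """Issue one copy call per small chunk."""
    for chunk in chunks:
        sink.copy(chunk)


def buffered_stream(chunks, sink, buffer_size):
    """Gather small chunks in a staging buffer and copy once it is full."""
    staging = bytearray()
    for chunk in chunks:
        staging += chunk
        if len(staging) >= buffer_size:
            sink.copy(bytes(staging))
            staging.clear()
    if staging:  # flush whatever is left at the end
        sink.copy(bytes(staging))


# 1000 small KV updates of 64 bytes each (sizes chosen only for illustration).
chunks = [bytes([i % 256]) * 64 for i in range(1000)]

naive_sink = CountingSink()
naive_stream(chunks, naive_sink)

buffered_sink = CountingSink()
buffered_stream(chunks, buffered_sink, buffer_size=4096)

print("naive copy calls:   ", naive_sink.calls)
print("buffered copy calls:", buffered_sink.calls)
```

Both paths deliver the same bytes; the buffered path simply pays the fixed per-copy cost far fewer times, which is where a large speedup on small transfers would come from.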

Fault-tolerance reduces wasted recomputation after failures

Numbers: Baseline failure caused 1.91× microbatch latency increase; DéjàVu caused 1.24× increase (faster recovery)
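A rough way to see why replication shortens recovery is to count the generation steps that must be redone after a failure. The helper below is a hypothetical sketch under a simplified model: it ignores prompt recomputation and transfer time, and the replication interval is an assumed parameter.

```python
def steps_to_redo(failed_at, replicate_every=None):
    """Generation steps that must be recomputed after a failure.

    failed_at: number of tokens generated when the failure happens.
    replicate_every: replicate the KV cache every N tokens; None means no
    replication, so generation restarts from the first token.
    """
    if replicate_every is None:
        return failed_at
    last_replicated = (failed_at // replicate_every) * replicate_every
    return failed_at - last_replicated


print("no replication:        ", steps_to_redo(137))
print("replicate every 10:    ", steps_to_redo(137, replicate_every=10))
print("replicate every token: ", steps_to_redo(137, replicate_every=1))
```

More frequent replication means less lost work but more network traffic, so the interval is a tuning choice in any real deployment.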

Results

Throughput vs FasterTransformer (pipeline-parallel)

Value: up to 2×

Baseline: FasterTransformer

Throughput improvement from microbatch swapping

Value: up to 1.8×

Baseline: no swapping (same GPUs)

KV streaming runtime overhead

Value: within 2% slowdown

Baseline: no streaming

Buffered copy optimization

Value: 95× faster than naive copy

Baseline: contiguous transfers per chunk

Failure-induced latency multiplier

Value: baseline 1.91× vs DéjàVu 1.24×

Baseline: FasterTransformer

Who Should Care

What To Try In 7 Days

Run DéjàVuLib microbenchmarks on a representative model to measure streaming overhead on your PCIe/network

Simulate prompt vs token timings and test disaggregation to see if throughput rises for your workloads

Enable microbatch swapping to see if you can double batch size without extra GPUs and measure throughput change
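Before running real microbenchmarks, a back-of-the-envelope estimate can indicate whether per-token KV streaming is likely to hide behind compute on your hardware. The sketch below is only a starting point: the model shapes, the 12 GB/s link bandwidth, and the 20 ms per-token step time are placeholder assumptions to replace with your own measurements.

```python
def per_token_kv_bytes(layers, hidden, batch, bytes_per_value=2):
    """KV bytes produced by one generation step on one pipeline stage.

    The leading 2 counts one key vector and one value vector per layer.
    """
    return 2 * layers * hidden * batch * bytes_per_value


def transfer_seconds(num_bytes, gbytes_per_sec):
    """Idealized transfer time; ignores latency and per-copy overheads."""
    return num_bytes / (gbytes_per_sec * 1e9)


# Placeholder inputs (assumptions, not measurements).
delta = per_token_kv_bytes(layers=16, hidden=9216, batch=8)
transfer_s = transfer_seconds(delta, gbytes_per_sec=12.0)
compute_s = 0.020

ratio = transfer_s / compute_s
print(f"KV delta per step: {delta / 1e6:.2f} MB")
print(f"transfer/compute ratio: {ratio:.4f}")
print("streaming can overlap with compute" if ratio < 1 else "streaming is the bottleneck")
```

A ratio well below 1 suggests the transfer can run in the background of each step; a ratio near or above 1 is a signal to measure carefully before adopting streaming.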

Agent Features

Memory

  • microbatch-level KV-cache swapping
  • KV-cache replication for fault tolerance

Planning

  • resource allocation planner for prompt vs token pipelines
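One simple way to think about such a planner is as a search for the split that balances the two pools. The sketch below is a simplified stand-in, not the paper's planner: it assumes each pool's request rate scales linearly with its machine count, and `best_split` and its inputs are hypothetical.

```python
def best_split(total_machines, prompt_time, token_time):
    """Pick the prompt/token machine split with the highest balanced rate.

    prompt_time: seconds one machine needs to process one request's prompt.
    token_time: seconds one machine needs to generate all of that request's tokens.
    Assumption: each pool's rate is machines / time, and the end-to-end rate
    is limited by the slower pool.
    """
    best = None
    for prompt_machines in range(1, total_machines):
        token_machines = total_machines - prompt_machines
        rate = min(prompt_machines / prompt_time, token_machines / token_time)
        if best is None or rate > best[2]:
            best = (prompt_machines, token_machines, rate)
    return best


prompt_machines, token_machines, rate = best_split(
    total_machines=8, prompt_time=1.0, token_time=3.0
)
print(f"prompt machines: {prompt_machines}, token machines: {token_machines}")
print(f"balanced rate: {rate:.2f} requests/sec")
```

With token generation taking three times as long as prompt processing in this example, the balanced split gives the token pool three times as many machines.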

Tool Use

  • DéjàVuLib primitives: stream out/in, scatter/gather, flush/fetch

Frameworks

  • FasterTransformer integration

Architectures

  • pipeline parallel
  • tensor model parallel

Collaboration

  • controller-worker coordination with heartbeats
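Heartbeat-based failure detection can be sketched in a few lines. This is a generic illustration of the pattern, not DéjàVu's controller: the class name, worker IDs, and timeout are hypothetical, and timestamps are passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Controller-side record of the last heartbeat seen from each worker."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def beat(self, worker_id, now):
        """Record a heartbeat from worker_id at time `now` (seconds)."""
        self.last_seen[worker_id] = now

    def failed_workers(self, now):
        """Workers whose last heartbeat is older than the timeout."""
        return sorted(
            worker_id
            for worker_id, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=5.0)
monitor.beat("worker-0", now=0.0)
monitor.beat("worker-1", now=0.0)
monitor.beat("worker-0", now=4.0)   # worker-1 goes silent after t=0

suspects = monitor.failed_workers(now=6.0)
print("suspected failures:", suspects)
```

In a real system the controller would call `failed_workers` periodically with a monotonic clock reading and trigger the recovery protocol for any worker it returns.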

Optimization Features

Token Efficiency

  • supports larger batch sizes by freeing GPU KV memory

Infra Optimization

  • CPU↔GPU swapping over PCIe
  • support for local SSD and remote CPU streaming
  • NCCL/OpenMPI/Boost copies for remote transfers

System Optimization

  • resource allocation planner to balance prompt and token pipelines
  • KV-cache replication and fast recovery protocol

Inference Optimization

  • prompt-token disaggregation
  • microbatch KV-cache swapping
  • layer-by-layer prompt streaming
  • buffered aggregation of small KV updates
  • parallelized token streaming with computation

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Designed for pipeline-parallel, multi-node deployments; single-GPU cases may not benefit
  • Benefits shrink when the prompt KV-cache streaming overhead, m, reaches 2 or more (for example on high-latency links or slow PCIe)
  • Evaluation runs on specific GPUs and 40/32 Gbps links; results may vary on other infra
  • Paper does not link to production code for immediate reuse

When Not To Use

  • Models that fit entirely on a single GPU with no pipeline parallelism
  • Workloads where prompts are tiny and per-token time dominates
  • Environments with very slow CPU↔GPU links where streaming overhead exceeds gains

Failure Modes

  • High PCIe or network latency can make streaming overheads dominate and negate disaggregation gains
  • Replication adds extra network traffic; under very low failure rates this can slightly reduce throughput
  • Complexity of orchestration (controller, heartbeats, background threads) increases operational surface

Core Entities

Models

  • GPT2-1.5B
  • OPT-13B
  • OPT-30B
  • OPT-66B
  • BLOOM-176B

Metrics

  • Throughput (req/sec)
  • Normalized latency (seconds/token)
  • Makespan
  • GPU memory usage
  • Streaming slowdown (%)
  • Recovery latency multiplier (×)

Datasets

  • LMSys-chat (LMSys dataset of generated token counts)

Benchmarks

  • Microbenchmarks (single-batch KV streaming)
  • End-to-end pipeline-parallel throughput and latency traces
  • Simulator-based makespan and cost for traces

Context Entities

Models

  • Megatron-style tensor-model parallel layers (context for pipeline+tensor mix)