Share the prefill and KV cache across fine‑tuned models to cut tail latency and boost throughput in multi‑model agent serving.

February 12, 2026 · 7 min

Overview

Production Readiness

0.75

Novelty Score

0.65

Cost Impact Score

0.8

Citation Count

0

Authors

Sunghyeon Woo, Hoseung Kim, Sunghwan Shim, Minjung Jo, Hyunjoon Jeong, Jeongtae Lee, Joonghoon Kim, Sungjae Lee, Baeseong Park, Se Jung Kwon, Dongsoo Lee

Why It Matters For Business

If your product runs multiple fine‑tuned models over shared prompts (agents, planners, coders), PrefillShare can cut tail latency and GPU cost by reusing one prefill and KV cache across models while keeping task accuracy.

Summary TLDR

PrefillShare splits a model into a frozen prefill module (builds a shared key-value cache from the prompt) and many small task-specific decode modules. Decode modules are fine-tuned to consume the shared cache (cache-conditioned fine‑tuning). This enables safe KV reuse across heterogeneous fine‑tuned models in disaggregated serving, preserving accuracy while cutting p95 tail latency by up to 4.5× and raising throughput by up to 3.9× in multi‑agent workloads.
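The split described above can be illustrated with a toy scalar model. This is only a sketch under loud assumptions: the one-weight "prefill" and "decoder", the names `W_PREFILL`, `prefill`, and `fine_tune_decoder`, and the synthetic tasks are all hypothetical and are not the paper's architecture or training recipe. It shows the one idea that matters here: the prefill weights never change, so a single cache can feed every task-specific decoder.

```python
# Toy illustration only: a scalar stand-in for "frozen prefill + trainable decoder".
# All names and values here are hypothetical, not taken from the paper.

W_PREFILL = 0.5  # frozen base weight shared by every task


def prefill(x: float) -> float:
    """Build a shared cache entry from a prompt feature using frozen weights."""
    return W_PREFILL * x


def fine_tune_decoder(data, steps=200, lr=0.1):
    """Fit only the decoder weight on cached activations (mean squared error)."""
    w_decode = 1.0
    # The cache is computed once from the frozen prefill and reused every step.
    cache = [(prefill(x), y) for x, y in data]
    for _ in range(steps):
        grad = sum(2 * (w_decode * k - y) * k for k, y in cache) / len(cache)
        w_decode -= lr * grad  # W_PREFILL is never updated
    return w_decode


# Two hypothetical tasks share the same prompts but have different targets.
prompts = [1.0, 2.0, 3.0]
task_a = [(x, 2.0 * x) for x in prompts]   # target y = 2x
task_b = [(x, -1.0 * x) for x in prompts]  # target y = -x

w_a = fine_tune_decoder(task_a)
w_b = fine_tune_decoder(task_b)

# One prefill pass produces a cache that both decoders consume.
shared_cache = [prefill(x) for x in prompts]
pred_a = [w_a * k for k in shared_cache]
pred_b = [w_b * k for k in shared_cache]
```

Because each decoder is trained against the cache it will actually see at serving time, the two tasks can diverge in their decoder weights while still reading the same cached values.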

Problem Statement

Multi‑model agent workflows often run several fine‑tuned LLMs over the same shared prompt. Each model repeats the costly prefill stage and keeps its own KV cache, causing duplicated compute, growing memory use, and worse tail latency. Existing disaggregated serving reduces interference but cannot reuse KV caches across different models because caches depend on model parameters.
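To make the duplication concrete, here is a back-of-the-envelope sizing sketch. The helper names are hypothetical, and the shape values (32 layers, 8 KV heads, head dimension 128, 2-byte elements) are the commonly published LLaMA-3.1-8B configuration, used as an assumption that should be checked against the actual checkpoint. It counts only the prompt portion of the cache; tokens generated during decode are still stored per model.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """KV cache bytes for one token; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def prompt_kv_bytes(prompt_tokens: int, n_models: int, shared: bool, **shape) -> int:
    """Prompt KV footprint when each model keeps its own copy vs. one shared copy."""
    copies = 1 if shared else n_models
    return copies * prompt_tokens * kv_bytes_per_token(**shape)


# Assumed shape (verify against your model config before relying on it).
shape = dict(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)

per_token = kv_bytes_per_token(**shape)                       # 131072 bytes = 128 KiB
separate = prompt_kv_bytes(4096, 4, shared=False, **shape)    # 4 copies of the prompt KV
shared = prompt_kv_bytes(4096, 4, shared=True, **shape)       # 1 copy of the prompt KV

MIB = 1024 * 1024
print(f"per token: {per_token // 1024} KiB")
print(f"4 models, separate caches: {separate // MIB} MiB")
print(f"4 models, shared prefill cache: {shared // MIB} MiB")
```

Under these assumptions a 4,096-token prompt costs 512 MiB of KV per model, so four fine-tuned models hold 2 GiB for the same prompt where a shared cache would hold 512 MiB. The prefill compute is repeated in the same proportion.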

Main Contribution

PrefillShare algorithm that factorizes models into a shared frozen prefill module and task-specific decode modules to enable cross‑model KV cache reuse.

Cache‑conditioned fine‑tuning: freeze prefill, fine‑tune only decoders to reliably consume the base KV cache.

A prefix‑aware routing and cache‑handoff mechanism for vLLM‑style disaggregated serving that preserves prefix cache locality across model switches.
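The routing idea in the last item can be sketched as follows. This is a minimal sketch, not the paper's implementation and not a vLLM API: `PrefixAwareRouter`, `block_hashes`, the block size, and the worker names are all hypothetical. The key design point it illustrates is that routing is keyed on the prompt prefix alone, so switching the target model mid-session does not move the request away from the worker that already holds its cache.

```python
import hashlib

BLOCK = 4  # tokens per cache block (hypothetical; real systems use larger blocks)


def block_hashes(tokens):
    """Chained hashes of full token blocks; hash i covers tokens[0:(i + 1) * BLOCK]."""
    hashes, prev = [], b""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        chunk = ",".join(map(str, tokens[i:i + BLOCK])).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes


class PrefixAwareRouter:
    def __init__(self, workers):
        self.workers = list(workers)
        self.owner = {}  # block hash -> prefill worker holding that block's KV
        self.load = {w: 0 for w in self.workers}

    def route(self, tokens, model):
        """Return a handoff record. `model` selects the decoder, not the prefill worker."""
        hashes = block_hashes(tokens)
        worker, hit = None, 0
        for h in hashes:
            w = self.owner.get(h)
            if w is None or (worker is not None and w != worker):
                break
            worker, hit = w, hit + 1
        if worker is None:
            # No cached prefix anywhere: fall back to the least-loaded worker.
            worker = min(self.workers, key=lambda w: self.load[w])
        for h in hashes[hit:]:
            self.owner[h] = worker  # the worker will compute and hold these blocks
        self.load[worker] += 1
        return {"prefill_worker": worker, "decoder": model, "cached_blocks": hit}


router = PrefixAwareRouter(["p0", "p1"])
first = router.route(list(range(1, 9)), model="math")    # new session
second = router.route(list(range(1, 13)), model="code")  # same prefix, different model
third = router.route(list(range(100, 108)), model="math")  # unrelated prompt
```

In this sketch the second request lands on the same prefill worker as the first and reuses two cached blocks even though it targets a different decoder, while the unrelated prompt is sent to the idle worker.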

Key Findings

PrefillShare matches full fine‑tuning accuracy across math, coding, and tool‑calling benchmarks.

Numbers: Accuracy within ≈1% of Full‑FT on evaluated benchmarks (Table 1).

PrefillShare substantially reduces tail latency and increases throughput under high load.

Numbers: Up to 4.5× lower p95 latency and up to 3.9× higher throughput on agent workloads.

Shared prefill dramatically improves prefix cache hit ratio under concurrency.

Numbers: PrefillShare holds the prefix cache hit ratio at ≈89%, whereas the baseline peaks at ~60% and then declines beyond ≈40 concurrent sessions.
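One way to build intuition for the hit-ratio gap is a toy LRU simulation. It is a sketch under stated assumptions and does not reproduce the reported figures: the function `hit_ratio`, the fixed entry-count capacity, and the round-robin access pattern are all hypothetical. With per-model caches, the cache key includes the model, so the same prompt occupies one entry per model; with a shared prefill, the key is the session prefix alone.

```python
from collections import OrderedDict


def hit_ratio(n_sessions, n_models, capacity, rounds, shared):
    """Simulate an LRU prefix cache; each round, every session calls one model."""
    cache = OrderedDict()
    hits = total = 0
    for r in range(rounds):
        model = r % n_models  # sessions move through the models in lockstep
        for s in range(n_sessions):
            # Shared prefill: one entry per session. Per-model: one per (model, session).
            key = s if shared else (model, s)
            total += 1
            if key in cache:
                hits += 1
                cache.move_to_end(key)
            else:
                cache[key] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used entry
    return hits / total


for sessions in (5, 10):
    per_model = hit_ratio(sessions, n_models=4, capacity=20, rounds=8, shared=False)
    shared = hit_ratio(sessions, n_models=4, capacity=20, rounds=8, shared=True)
    print(f"sessions={sessions}: per-model={per_model:.3f} shared={shared:.3f}")
```

The cyclic access pattern is close to a worst case for LRU, so the per-model collapse here is exaggerated relative to a real workload. The qualitative point still holds: multiplying cache entries by the number of models exhausts capacity at a lower session count.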

Results

Accuracy

Value: Full‑FT 71.3% vs PrefillShare 71.4%

Baseline: Full‑FT

Accuracy

Value: Full‑FT 48.2% vs PrefillShare 48.8%

Baseline: Full‑FT

Tail latency (p95)

Value: Up to 4.5× lower p95 latency

Baseline: Disaggregated baseline

Throughput

Value: Up to 3.9× higher throughput

Baseline: Disaggregated baseline

Prefix cache hit ratio

Value: PrefillShare ≈89% vs baseline peak ~60%

Baseline: Disaggregated baseline

Who Should Care

What To Try In 7 Days

Instrument your multi‑model sessions and measure shared prefix rates and KV footprint.

Prototype freezing a base prefill model and fine‑tuning only small decoder heads on a single task dataset.

Run a small disaggregated test: route a session through shared prefill + single specialized decoder and measure p95 and throughput.
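The measurements in the steps above can be sketched with a few stdlib helpers. These are assumptions for illustration, not a prescribed methodology: the function names are hypothetical, the shared-prefix rate is defined here as the fraction of each request's prompt tokens that repeat the previous request's prefix within a session, and p95 uses the nearest-rank method.

```python
def common_prefix_len(a, b):
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def shared_prefix_rate(session_requests):
    """Fraction of prompt tokens (from the 2nd request on) shared with the prior request."""
    shared = total = 0
    for prev, cur in zip(session_requests, session_requests[1:]):
        shared += common_prefix_len(prev, cur)
        total += len(cur)
    return shared / total if total else 0.0


def p95(latencies):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(latencies)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def throughput_tok_per_s(total_tokens, wall_seconds):
    """Generated tokens per second over the measured window."""
    return total_tokens / wall_seconds


# Hypothetical session: three requests, each a list of token ids.
session = [[1, 2, 3, 4], [1, 2, 3, 4, 5, 6], [1, 2, 9]]
rate = shared_prefix_rate(session)

latencies_ms = list(range(1, 21))  # placeholder values, not measurements
tail = p95(latencies_ms)
```

A low shared-prefix rate on real traffic is an early signal that shared prefill will bring little benefit, which is worth knowing before investing in the fine-tuning pipeline.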

Agent Features

Memory

  • shared KV cache across decoders

Planning

  • prefix-aware routing

Tool Use

  • sequential specialized decoders

Frameworks

  • vLLM

Is Agentic

true

Architectures

  • multi-model agent pipelines
  • disaggregated prefill/decode

Collaboration

  • multiple fine‑tuned models in a session

Optimization Features

Token Efficiency

  • reduced repeated prefill compute per token

Infra Optimization

  • reduced GPU memory duplication for KV caches

Model Optimization

  • freeze prefill module

System Optimization

  • disaggregated prefill/decode to avoid prefill-decode interference
  • single shared prefill pool to reduce KV duplication

Training Optimization

  • cache‑conditioned fine‑tuning (fine‑tune only decoders)

Inference Optimization

  • shared prefill KV reuse
  • prefix‑aware routing and cache handoff

Reproducibility

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • At very high concurrency, decode‑side KV staging/reload causes CPU‑GPU traffic and a drop in throughput.
  • Requires a frozen shared prefill and fine‑tuning pipeline for decoders; not a drop‑in for black‑box hosted models.
  • Performance benefit depends on high shared‑prefix rates; small or diverse prompts give less gain.

When Not To Use

  • Your workloads rarely share long prompt prefixes across models.
  • You use third‑party closed models you cannot fine‑tune or host.
  • Your serving stack cannot support disaggregated prefill/decode or handle KV handoff.

Failure Modes

  • Naive KV reuse without cache‑conditioned fine‑tuning collapses accuracy as the sharing ratio increases.
  • Excessive CPU‑GPU KV staging under high concurrency reduces throughput despite high cache hit ratio.
  • Routing mistakes that break prefix locality can force repeated prefill work and remove benefits.

Core Entities

Models

  • LLaMA3.1-8B
  • Qwen3-1.7B
  • Qwen3-8B
  • Qwen3-14B

Metrics

  • p95 latency
  • throughput (tok/s)
  • TTFT (time to first token)
  • prefix cache hit ratio
  • Accuracy

Datasets

  • MetaMathQA-40K
  • EvolInstruct-Code-80K
  • xLAM-function-calling-60K

Benchmarks

  • GSM8K
  • GSM+
  • HumanEval
  • HumanEval+
  • BFCL (Simple Python, Multiple)

Context Entities

Models

  • vLLM disaggregated serving prototype