Share the prefill and KV cache across fine‑tuned models to cut tail latency and boost throughput in multi‑model agent serving.

February 12, 2026 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method is practical and has been tested on multiple backbones and workloads; the main risks are implementation‑level KV staging overheads and integration with your serving stack.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 75%

Novelty: 65%

Authors

Sunghyeon Woo, Hoseung Kim, Sunghwan Shim, Minjung Jo, Hyunjoon Jeong, Jeongtae Lee, Joonghoon Kim, Sungjae Lee, Baeseong Park, Se Jung Kwon, Dongsoo Lee

Why It Matters For Business

If your product runs multiple fine‑tuned models over shared prompts (agents, planners, coders), PrefillShare can cut tail latency and GPU cost by reusing one prefill and KV cache across models while keeping task accuracy.

Summary TLDR

PrefillShare splits a model into a frozen prefill module (builds a shared key-value cache from the prompt) and many small task-specific decode modules. Decode modules are fine-tuned to consume the shared cache (cache-conditioned fine‑tuning). This enables safe KV reuse across heterogeneous fine‑tuned models in disaggregated serving, preserving accuracy while cutting p95 tail latency by up to 4.5× and raising throughput by up to 3.9× in multi‑agent workloads.

Problem Statement

Multi‑model agent workflows often run several fine‑tuned LLMs over the same shared prompt. Each model repeats the costly prefill stage and keeps its own KV cache, causing duplicated compute, growing memory use, and worse tail latency. Existing disaggregated serving reduces interference but cannot reuse KV caches across different models because caches depend on model parameters.
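The duplicated memory cost can be estimated with back-of-the-envelope arithmetic. The sketch below is illustrative only: the layer count, KV-head count, and head dimension are assumed values for a LLaMA3.1-8B-style configuration in bf16, and the session shape (4 models, 8,192 shared prompt tokens) is a hypothetical workload, not a figure from the paper. It counts prompt-side KV only; tokens generated by each decoder still need their own cache entries.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    # Keys and values (the leading factor 2) are stored for every layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def prompt_kv_bytes(prompt_tokens: int, num_models: int, shared: bool,
                    per_token: int) -> int:
    # Without sharing, every model keeps its own copy of the prompt's KV cache.
    copies = 1 if shared else num_models
    return copies * prompt_tokens * per_token


# Assumed config: 32 layers, 8 KV heads (grouped-query attention), head_dim 128, bf16.
PER_TOKEN = kv_bytes_per_token(32, 8, 128)

# Hypothetical session: 4 fine-tuned models over one 8,192-token shared prompt.
separate = prompt_kv_bytes(8192, 4, shared=False, per_token=PER_TOKEN)
shared = prompt_kv_bytes(8192, 4, shared=True, per_token=PER_TOKEN)

print(f"{PER_TOKEN // 1024} KiB of KV per prompt token")
print(f"{separate / 2**30:.1f} GiB with per-model caches")
print(f"{shared / 2**30:.1f} GiB with one shared cache")
```

Under these assumptions the prompt-side footprint scales linearly with the number of models when caches are not shared, which is the duplication that a shared cache removes.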

Main Contribution

PrefillShare algorithm that factorizes models into a shared frozen prefill module and task-specific decode modules to enable cross‑model KV cache reuse.

Cache‑conditioned fine‑tuning: freeze prefill, fine‑tune only decoders to reliably consume the base KV cache.
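The freeze-then-fine-tune idea can be illustrated with a deliberately tiny scalar toy. This is not the paper's training recipe: `prefill`, `decode`, and `finetune_decoder` are hypothetical names, the "cache" is a single number, and the gradient is written out by hand. It only shows the shape of the procedure, where the prefill parameter is never updated and the decoder learns to fit its targets from the cache that the frozen prefill produces.

```python
def prefill(a: float, x: float) -> float:
    """Frozen prefill: maps the prompt x to a 'cache' value."""
    return a * x


def decode(b: float, cache: float) -> float:
    """Task-specific decoder: predicts from the shared cache."""
    return b * cache


def finetune_decoder(a: float, b: float, data, lr: float = 0.05, steps: int = 200):
    """Plain SGD on squared error, updating only the decoder parameter b."""
    for _ in range(steps):
        for x, y in data:
            cache = prefill(a, x)          # a is frozen: no update is ever applied to it
            err = decode(b, cache) - y
            b -= lr * 2.0 * err * cache    # d/db of (b * cache - y) ** 2
    return a, b


A_BASE = 0.5                               # assumed base-model prefill parameter
data = [(1.0, 1.5), (2.0, 3.0)]            # toy task: y = 1.5 * x
a_after, b_after = finetune_decoder(A_BASE, 1.0, data)

print(a_after, round(b_after, 6))
```

Because the prefill parameter never changes, a cache computed once with `A_BASE` stays valid for every decoder trained this way, which is the property that makes cross-model reuse safe in this toy.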

Key Findings

PrefillShare matches full fine‑tuning accuracy across math, coding, and tool‑calling benchmarks.

Numbers: Accuracy within ≈1% of Full‑FT on evaluated benchmarks (Table 1).

Practical Use: You can freeze a shared prefill and fine‑tune only decoders without losing task accuracy on evaluated tasks; this reduces retraining cost and simplifies KV reuse.

Evidence Ref: Table 1

PrefillShare substantially reduces tail latency and increases throughput under high load.

Numbers: Up to 4.5× lower p95 latency and up to 3.9× higher throughput on agent workloads.

Practical Use: Deploying PrefillShare in a disaggregated stack yields much lower tail latency and higher token throughput for multi‑model agent sessions under heavy load.

Evidence Ref: Section 4.3; Figures 3 and 5

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | Full‑FT 71.3% vs PrefillShare 71.4% | Full‑FT | +0.1 pp | GSM8K (Table 1) | Cache-conditioned fine‑tuning matches Full‑FT on GSM8K for LLaMA3.1-8B. | Table 1
Accuracy | Full‑FT 48.2% vs PrefillShare 48.8% | Full‑FT | +0.6 pp | HumanEval (Table 1) | PrefillShare equals or slightly exceeds Full‑FT on evaluated coding tasks. | Table 1

What To Try In 7 Days

Instrument your multi‑model sessions and measure shared prefix rates and KV footprint.

Prototype freezing a base prefill model and fine‑tuning only small decoder heads on one task dataset.

Run a small disaggregated test: route a session through a shared prefill plus a single specialized decoder, then measure p95 latency and throughput.
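The measurements in the steps above can be started with a few lines of instrumentation. The sketch below is a minimal example under stated assumptions: prompts are already tokenized into integer lists, "shared prefix rate" is defined here as the fraction of prompt tokens covered by the prefix common to all prompts in a session (the paper may define it differently), and p95 uses the nearest-rank method. The function names are illustrative.

```python
import math


def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens that two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def shared_prefix_rate(prompts: list[list[int]]) -> float:
    """Fraction of all prompt tokens covered by the prefix common to every prompt."""
    if not prompts:
        return 0.0
    common = len(prompts[0])
    for p in prompts[1:]:
        common = min(common, shared_prefix_len(prompts[0], p))
    total = sum(len(p) for p in prompts)
    return common * len(prompts) / total if total else 0.0


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of latency samples."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical session: three models see the same 4-token prefix.
session = [[1, 2, 3, 4, 9], [1, 2, 3, 4, 7, 8], [1, 2, 3, 4]]
print(shared_prefix_rate(session))
print(p95([float(ms) for ms in range(1, 101)]))
```

A low shared prefix rate on real traffic is an early signal that cross-model KV reuse will have little to share.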

Agent Features

Memory: shared KV cache across decoders
Planning: prefix-aware routing
Tool Use: sequential specialized decoders
Frameworks: vLLM
Is Agentic: Yes
Architectures: multi-model agent pipelines; disaggregated prefill/decode
Collaboration: multiple fine‑tuned models in a session

Optimization Features

Token Efficiency: reduced repeated prefill compute per token
Infra Optimization: reduced GPU memory duplication for KV caches
Model Optimization: freeze prefill module
System Optimization: disaggregated prefill/decode to avoid prefill-decode interference; single shared prefill pool to reduce KV duplication
Training Optimization: cache‑conditioned fine‑tuning (fine‑tune only decoders)
Inference Optimization: shared prefill KV reuse; prefix‑aware routing and cache handoff
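Prefix-aware routing and cache handoff can be pictured with a small in-memory sketch. Everything here is hypothetical: `PrefixRouter`, its methods, and the string "KV handles" are illustrative stand-ins, not vLLM APIs or the paper's implementation. The sketch only shows the bookkeeping idea: requests with the same prompt prefix map to the same prefill worker, prefill runs once, and every decoder receives the same cache handle.

```python
import hashlib


class PrefixRouter:
    """Toy router: same prompt prefix -> same prefill worker -> one cached KV entry."""

    def __init__(self, num_prefill_workers: int):
        self.num_workers = num_prefill_workers
        self.kv_store: dict[str, str] = {}   # prefix hash -> placeholder KV handle
        self.prefill_runs = 0

    @staticmethod
    def prefix_key(prefix_tokens: list[int]) -> str:
        # Stable content hash of the token ids, so equal prefixes share one key.
        raw = ",".join(map(str, prefix_tokens)).encode()
        return hashlib.sha256(raw).hexdigest()

    def worker_for(self, prefix_tokens: list[int]) -> int:
        # Deterministic placement: a given prefix always lands on the same worker.
        return int(self.prefix_key(prefix_tokens), 16) % self.num_workers

    def get_kv(self, prefix_tokens: list[int]) -> str:
        key = self.prefix_key(prefix_tokens)
        if key not in self.kv_store:
            self.prefill_runs += 1           # prefill happens only on a cache miss
            worker = self.worker_for(prefix_tokens)
            self.kv_store[key] = f"kv@worker{worker}:{key[:8]}"
        return self.kv_store[key]


router = PrefixRouter(num_prefill_workers=4)
prompt = [101, 7, 7, 42]

# Three task-specific decoders in one session request the same prompt's cache.
handles = [router.get_kv(prompt) for _decoder in ("math", "code", "tools")]
print(router.prefill_runs, handles[0])
```

A real deployment would also need eviction, cache transfer between devices, and handling of partial prefix matches, none of which this sketch attempts.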

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

At very high concurrency, decode‑side KV staging and reload generate CPU‑GPU traffic that lowers throughput.

Requires a frozen shared prefill and fine‑tuning pipeline for decoders; not a drop‑in for black‑box hosted models.

When Not To Use

Your workloads rarely share long prompt prefixes across models.

You use third‑party closed models you cannot fine‑tune or host.

Failure Modes

Naive KV reuse without cache‑conditioned fine‑tuning collapses accuracy as sharing ratio increases.

Excessive CPU‑GPU KV staging under high concurrency reduces throughput despite high cache hit ratio.
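The first failure mode follows from the fact that cached keys are a function of the weights that produced them. The toy below uses a single attention head with 2-dimensional embeddings and made-up projection matrices (`W_K_BASE` and `W_K_TUNED` are hypothetical values). It does not reproduce the paper's accuracy results; it only shows that a fine-tuned model attending over base-model keys gets a different attention distribution than it would over its own keys.

```python
import math


def matvec(w, x):
    """Multiply a small matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over a list of keys."""
    scale = math.sqrt(len(query))
    return softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])


prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy token embeddings
W_K_BASE = [[1.0, 0.0], [0.0, 1.0]]             # base model key projection
W_K_TUNED = [[0.2, 0.9], [-0.8, 0.3]]           # hypothetical fine-tuned projection

keys_base = [matvec(W_K_BASE, x) for x in prompt]
keys_tuned = [matvec(W_K_TUNED, x) for x in prompt]

query = [2.0, 0.0]                              # query produced by the fine-tuned model
own = attention_weights(query, keys_tuned)      # distribution the tuned model expects
reused = attention_weights(query, keys_base)    # naive reuse of base-model keys
drift = max(abs(a - b) for a, b in zip(own, reused))

print([round(w, 3) for w in own])
print([round(w, 3) for w in reused])
print(round(drift, 3))
```

Cache-conditioned fine-tuning sidesteps this mismatch by training each decoder against the base model's cache from the start, so the decoder never expects keys that the shared prefill does not produce.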

Core Entities

Models

LLaMA3.1-8B, Qwen3-1.7B, Qwen3-8B, Qwen3-14B

Metrics

p95 latency, throughput (tok/s), TTFT (time to first token), prefix cache hit ratio, accuracy

Datasets

MetaMathQA-40K, EvolInstruct-Code-80K, xLAM-function-calling-60K

Benchmarks

GSM8K, GSM+, HumanEval, HumanEval+, BFCL (Simple Python, Multiple)

Context Entities

Models

vLLM disaggregated serving prototype