Share the prefill and KV cache across fine‑tuned models to cut tail latency and boost throughput in multi‑model agent serving.

Overview

Decision Snapshot: Ready For Pilot

The method is practical and has been tested on multiple backbones and workloads. The main risks are implementation‑level KV staging overheads and integration with your serving stack.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 75%

Novelty: 65%

Authors

Sunghyeon Woo, Hoseung Kim, Sunghwan Shim, Minjung Jo, Hyunjoon Jeong, Jeongtae Lee, Joonghoon Kim, Sungjae Lee, Baeseong Park, Se Jung Kwon, Dongsoo Lee

Why It Matters For Business

If your product runs multiple fine‑tuned models over shared prompts (agents, planners, coders), PrefillShare can cut tail latency and GPU cost by reusing one prefill and KV cache across models while keeping task accuracy.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

PrefillShare splits a model into a frozen prefill module (builds a shared key-value cache from the prompt) and many small task-specific decode modules. Decode modules are fine-tuned to consume the shared cache (cache-conditioned fine‑tuning). This enables safe KV reuse across heterogeneous fine‑tuned models in disaggregated serving, preserving accuracy while cutting p95 tail latency by up to 4.5× and raising throughput by up to 3.9× in multi‑agent workloads.
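The flow described above can be sketched in a few lines. This is a minimal toy, not the authors' implementation: `SharedPrefill`, `Decoder`, and `run_session` are hypothetical names, and the "KV cache" is a plain tuple standing in for real key/value tensors. It only shows that the prompt is prefilled once and that every task-specific decoder reads the same cache.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

KVCache = Tuple[int, ...]  # toy stand-in for real key/value tensors


@dataclass
class SharedPrefill:
    """Frozen base module (hypothetical): builds one KV cache per distinct prompt."""
    cache: Dict[Tuple[int, ...], KVCache] = field(default_factory=dict)
    prefill_calls: int = 0

    def get_kv(self, prompt_tokens: List[int]) -> KVCache:
        key = tuple(prompt_tokens)
        if key not in self.cache:
            self.prefill_calls += 1  # the expensive prefill pass happens only here
            self.cache[key] = tuple(t * 2 for t in prompt_tokens)  # toy "KV"
        return self.cache[key]


@dataclass
class Decoder:
    """Task-specific decode module (hypothetical) that consumes the shared cache."""
    name: str
    step: Callable[[KVCache], int]

    def decode(self, kv: KVCache) -> str:
        return f"{self.name}:{self.step(kv)}"


def run_session(prefill: SharedPrefill, decoders: List[Decoder],
                prompt_tokens: List[int]) -> List[str]:
    kv = prefill.get_kv(prompt_tokens)      # one prefill for the whole session
    return [d.decode(kv) for d in decoders]  # every decoder reads the same cache


prefill = SharedPrefill()
decoders = [Decoder("math", sum), Decoder("code", max), Decoder("tools", len)]
outputs = run_session(prefill, decoders, [1, 2, 3])
print(outputs, prefill.prefill_calls)
```

Running a second session with the same prompt leaves `prefill.prefill_calls` at 1, because the cached entry is reused instead of being rebuilt per model.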

Problem Statement

Multi‑model agent workflows often run several fine‑tuned LLMs over the same shared prompt. Each model repeats the costly prefill stage and keeps its own KV cache, causing duplicated compute, growing memory use, and worse tail latency. Existing disaggregated serving reduces interference but cannot reuse KV caches across different models because caches depend on model parameters.
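To make the memory duplication concrete, the prompt KV footprint can be estimated as 2 × layers × KV heads × head dimension × tokens × bytes per value. The sketch below is a back-of-the-envelope calculation under stated assumptions: a LLaMA3.1-8B-style configuration (32 layers, 8 KV heads, head dimension 128), fp16 values, and an illustrative 8192-token prompt shared by four models. The prompt length and model count are hypothetical and do not come from the source.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to hold keys and values for `seq_len` prompt tokens."""
    # The factor 2 covers the separate key and value tensors in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


GIB = 1024 ** 3

# Assumed configuration and workload (illustrative, see the text above).
per_prompt = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)
num_models = 4

duplicated = num_models * per_prompt  # every model keeps its own prompt cache
shared = per_prompt                   # one cache reused by all decoders

print(f"per-model prompt KV: {per_prompt / GIB:.1f} GiB")   # 1.0 GiB
print(f"duplicated across models: {duplicated / GIB:.1f} GiB")  # 4.0 GiB
print(f"shared: {shared / GIB:.1f} GiB")                    # 1.0 GiB
```

The estimate covers only the prompt portion of the cache. Tokens generated during decoding still add KV entries per decoder.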

Main Contribution

PrefillShare algorithm that factorizes models into a shared frozen prefill module and task-specific decode modules to enable cross‑model KV cache reuse.

Cache‑conditioned fine‑tuning: freeze prefill, fine‑tune only decoders to reliably consume the base KV cache.
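The freeze/train split behind cache‑conditioned fine‑tuning can be illustrated with a deliberately tiny example. This is a toy sketch, not the paper's training recipe: the "prefill" is a fixed scalar function, the "decoder" is a two-parameter linear map, and the data and target function are invented for illustration. The point is only that gradient updates touch the decoder parameters while the prefill, and therefore the cache it produces, never changes.

```python
# Frozen "prefill": a fixed map from a prompt to a cache feature (toy stand-in).
PREFILL_WEIGHT = 0.5  # never updated during training


def prefill(prompt):
    return PREFILL_WEIGHT * sum(prompt)


# Trainable "decoder": a linear map applied to the cache feature.
decoder = {"w": 0.0, "b": 0.0}


def decode(cache):
    return decoder["w"] * cache + decoder["b"]


# Invented toy task: the target is a fixed function of the frozen cache.
prompts = [[i % 4, (i // 4) % 4, 1, 2] for i in range(16)]
targets = [3.0 * prefill(p) + 1.0 for p in prompts]


def mse():
    return sum((decode(prefill(p)) - t) ** 2
               for p, t in zip(prompts, targets)) / len(prompts)


loss_before = mse()
lr = 0.01
for _ in range(1000):
    for p, t in zip(prompts, targets):
        cache = prefill(p)           # computed by the frozen module
        err = decode(cache) - t
        # Gradients of the squared error flow only into the decoder parameters.
        decoder["w"] -= lr * 2 * err * cache
        decoder["b"] -= lr * 2 * err
loss_after = mse()

print(loss_before, loss_after, decoder)
```

In a real setup the same idea would presumably be expressed by disabling gradients on the prefill weights and optimizing only the decode modules, so that every decoder learns to consume the cache the base model produces.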

Key Findings

PrefillShare matches full fine‑tuning accuracy across math, coding, and tool‑calling benchmarks.

Numbers: Accuracy within ≈1% of Full‑FT on evaluated benchmarks (Table 1).

Practical Use: You can freeze a shared prefill and fine‑tune only the decoders without losing accuracy on the evaluated tasks, which reduces retraining cost and simplifies KV reuse.

Evidence Ref: Table 1

PrefillShare substantially reduces tail latency and increases throughput under high load.

Numbers: Up to 4.5× lower p95 latency and up to 3.9× higher throughput on agent workloads.

Practical Use: Deploying PrefillShare in a disaggregated stack yields much lower tail latency and higher token throughput for multi‑model agent sessions under heavy load.

Evidence Ref: Section 4.3; Figures 3 and 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	Full‑FT 71.3% vs PrefillShare 71.4%	Full‑FT	+0.1 pp	GSM8K	Cache-conditioned fine‑tuning matches Full‑FT on GSM8K for LLaMA3.1-8B.	Table 1
Accuracy	Full‑FT 48.2% vs PrefillShare 48.8%	Full‑FT	+0.6 pp	HumanEval	PrefillShare equals or slightly exceeds Full‑FT on evaluated coding tasks.	Table 1

What To Try In 7 Days

Instrument your multi‑model sessions and measure shared prefix rates and KV footprint.

Prototype freezing a base prefill model and fine‑tuning only small decoder heads on one task dataset.

Run a small disaggregated test: route a session through shared prefill + single specialized decoder and measure p95 and throughput.
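The measurement steps above can start from two small helpers. This is a minimal sketch under assumptions: requests are available as token-ID lists in session order, "shared prefix rate" is defined here as the fraction of each request's prompt tokens that match the previous request's prefix (other definitions are possible), and p95 uses the nearest-rank method. The function names are hypothetical.

```python
from typing import List, Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def shared_prefix_rate(requests: List[Sequence[int]]) -> float:
    """Share of prompt tokens that repeat the previous request's prefix."""
    shared = total = 0
    for prev, cur in zip(requests, requests[1:]):
        shared += common_prefix_len(prev, cur)
        total += len(cur)
    return shared / total if total else 0.0


def p95(latencies: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


# Example session: three model calls that mostly share one prompt prefix.
session = [[1, 2, 3, 4], [1, 2, 3, 4, 5, 6], [1, 2, 3, 9]]
rate = shared_prefix_rate(session)
tail = p95([float(ms) for ms in range(1, 101)])
print(rate, tail)
```

A high rate on real traffic suggests that a shared prefill is likely to pay off, while a low rate matches the "rarely share long prompt prefixes" case where the method is not a good fit.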

Agent Features

Memory

shared KV cache across decoders

Planning

prefix-aware routing

Tool Use

sequential specialized decoders

Frameworks

vLLM

Is Agentic

Yes

Architectures

multi-model agent pipelines

disaggregated prefill/decode

Collaboration

multiple fine‑tuned models in a session

Optimization Features

Token Efficiency

reduced repeated prefill compute per token

Infra Optimization

reduced GPU memory duplication for KV caches

Model Optimization

freeze prefill module

System Optimization

disaggregated prefill/decode to avoid prefill-decode interference

single shared prefill pool to reduce KV duplication

Training Optimization

cache‑conditioned fine‑tuning (fine‑tune only decoders)

Inference Optimization

shared prefill KV reuse

prefix‑aware routing and cache handoff
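Prefix‑aware routing can be pictured as a lookup from a prompt prefix to the prefill worker that already holds its cache. The sketch below is an assumption-laden illustration rather than the paper's or vLLM's routing logic: `PrefixAwareRouter` and the worker names are hypothetical, the caller is assumed to know the shared prefix, and matching is exact on the whole prefix, whereas a production system would more likely match block by block.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple


def prefix_key(tokens: Sequence[int]) -> str:
    """Stable hash of a token prefix, used as the cache lookup key."""
    return hashlib.sha256(",".join(map(str, tokens)).encode()).hexdigest()


@dataclass
class PrefixAwareRouter:
    workers: List[str]
    owner: Dict[str, str] = field(default_factory=dict)  # prefix hash -> worker
    load: Dict[str, int] = field(default_factory=dict)   # worker -> cached prefixes

    def route(self, shared_prefix: Sequence[int]) -> Tuple[str, bool]:
        """Return (prefill worker, cache_hit) for a request's shared prefix."""
        key = prefix_key(shared_prefix)
        if key in self.owner:
            # Cache hit: hand the decoder a reference to the existing cache.
            return self.owner[key], True
        # Cache miss: place the new prefix on the least-loaded prefill worker.
        worker = min(self.workers, key=lambda w: self.load.get(w, 0))
        self.owner[key] = worker
        self.load[worker] = self.load.get(worker, 0) + 1
        return worker, False


router = PrefixAwareRouter(workers=["prefill-0", "prefill-1"])
first = router.route([1, 2, 3])   # new prefix, triggers a prefill
second = router.route([1, 2, 3])  # same prefix, reuses the cache
third = router.route([7, 8])      # different prefix, goes to the idle worker
print(first, second, third)
```

Keeping requests with the same prefix on one worker is what lets several decoders pick up a single cache instead of each triggering its own prefill.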

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

At very high concurrency, decode‑side KV staging and reloading generate heavy CPU‑GPU traffic and cause throughput to drop.

Requires a frozen shared prefill and fine‑tuning pipeline for decoders; not a drop‑in for black‑box hosted models.

When Not To Use

Your workloads rarely share long prompt prefixes across models.

You use third‑party closed models you cannot fine‑tune or host.

Failure Modes

Naive KV reuse without cache‑conditioned fine‑tuning collapses accuracy as sharing ratio increases.

Excessive CPU‑GPU KV staging under high concurrency reduces throughput despite high cache hit ratio.

Core Entities

Models

LLaMA3.1-8B, Qwen3-1.7B, Qwen3-8B, Qwen3-14B

Metrics

p95 latency, throughput (tok/s), TTFT (time to first token), prefix cache hit ratio, accuracy

Datasets

MetaMathQA-40K, EvolInstruct-Code-80K, xLAM-function-calling-60K

Benchmarks

GSM8K, GSM+, HumanEval, HumanEval+, BFCL (Simple Python, Multiple)

Context Entities

Models

vLLM disaggregated serving prototype
