Overview
Production Readiness
0.75
Novelty Score
0.65
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
If your product runs multiple fine‑tuned models over shared prompts (agents, planners, coders), PrefillShare can cut tail latency and GPU cost by reusing one prefill and KV cache across models while keeping task accuracy.
Summary TLDR
PrefillShare splits a model into a frozen prefill module (builds a shared key-value cache from the prompt) and many small task-specific decode modules. Decode modules are fine-tuned to consume the shared cache (cache-conditioned fine‑tuning). This enables safe KV reuse across heterogeneous fine‑tuned models in disaggregated serving, preserving accuracy while cutting p95 tail latency by up to 4.5× and raising throughput by up to 3.9× in multi‑agent workloads.
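The dataflow described above can be illustrated with a toy sketch. It is not the authors' implementation: SharedPrefill, TaskDecoder, and the placeholder "KV cache" list are hypothetical stand-ins, and the only point shown is that the prompt is prefilled once while several task decoders consume the same cache.

```python
from dataclasses import dataclass, field


@dataclass
class SharedPrefill:
    """Hypothetical stand-in for the frozen prefill module; counts its runs."""
    calls: int = 0
    _cache: dict = field(default_factory=dict)

    def kv_for(self, prompt_tokens):
        key = tuple(prompt_tokens)
        if key not in self._cache:
            self.calls += 1
            # Placeholder cache: a real one holds per-layer key/value tensors.
            self._cache[key] = [("kv", t) for t in prompt_tokens]
        return self._cache[key]


@dataclass
class TaskDecoder:
    """Hypothetical stand-in for a task-specific decode module."""
    name: str

    def decode(self, kv_cache):
        return f"{self.name} decoded from {len(kv_cache)} cached positions"


prefill = SharedPrefill()
decoders = [TaskDecoder("math"), TaskDecoder("coder"), TaskDecoder("tool-caller")]
prompt = [101, 7, 42, 9]

# Every decoder reads the same cache, so prefill runs once for the prompt.
outputs = [d.decode(prefill.kv_for(prompt)) for d in decoders]
```

Without sharing, each of the three models would run its own prefill over the same four tokens; here prefill.calls stays at 1.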
Problem Statement
Multi‑model agent workflows often run several fine‑tuned LLMs over the same shared prompt. Each model repeats the costly prefill stage and keeps its own KV cache, causing duplicated compute, growing memory use, and worse tail latency. Existing disaggregated serving reduces interference but cannot reuse KV caches across different models because caches depend on model parameters.
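To make the duplication concrete, here is a back-of-the-envelope sketch of prompt KV memory with and without sharing. The shape values (32 layers, 8 KV heads, head dimension 128, 2-byte elements), the 8192-token prompt, and the four models are illustrative assumptions, not measurements from the paper.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    # The factor 2 covers keys and values stored at every layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def prompt_kv_footprint(prompt_tokens, num_models, shared, **shape):
    """Bytes of KV cache held for one prompt across all models."""
    copies = 1 if shared else num_models
    return prompt_tokens * kv_bytes_per_token(**shape) * copies


# Assumed shape for an 8B-class model with grouped-query attention.
shape = dict(layers=32, kv_heads=8, head_dim=128)

separate_bytes = prompt_kv_footprint(8192, 4, shared=False, **shape)
shared_bytes = prompt_kv_footprint(8192, 4, shared=True, **shape)

print(f"separate caches: {separate_bytes / 2**30:.1f} GiB")
print(f"shared cache:    {shared_bytes / 2**30:.1f} GiB")
```

Under these assumptions the prompt cache costs 128 KiB per token, so four separate copies of an 8192-token prompt take 4 GiB while one shared copy takes 1 GiB. Prefill compute is duplicated by the same factor.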
Main Contribution
PrefillShare, an algorithm that factorizes each model into a shared frozen prefill module and a task-specific decode module to enable cross‑model KV cache reuse.
Cache‑conditioned fine‑tuning: freeze prefill, fine‑tune only decoders to reliably consume the base KV cache.
A prefix‑aware routing and cache‑handoff mechanism for vLLM‑style disaggregated serving that preserves prefix cache locality across model switches.
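One way to picture prefix-aware routing is the sketch below. It is a simplified assumption, not the paper's mechanism or a vLLM API: PrefixAwareRouter, block_hashes, the worker names, and the block size are all hypothetical. Requests are sent to the prefill worker that already holds the longest run of matching prompt blocks, and the target model name is deliberately left out of the cache key because the prefill module is shared.

```python
import hashlib

BLOCK = 4  # hypothetical block size in tokens


def block_hashes(token_ids, block=BLOCK):
    """Chained hashes: hash i identifies tokens[0:(i + 1) * block]."""
    hashes, prev = [], b""
    for i in range(0, len(token_ids) - len(token_ids) % block, block):
        chunk = ",".join(map(str, token_ids[i:i + block])).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes


class PrefixAwareRouter:
    def __init__(self, workers):
        # worker name -> set of block hashes that worker currently holds
        self.cached = {w: set() for w in workers}

    def _hit_blocks(self, worker, hashes):
        n = 0
        for h in hashes:
            if h not in self.cached[worker]:
                break
            n += 1
        return n

    def route(self, token_ids, model_name):
        # model_name is ignored on purpose: with a shared frozen prefill,
        # the cache depends only on the prompt tokens.
        hashes = block_hashes(token_ids)
        # Prefer the most cached leading blocks, then the emptier worker.
        best = max(
            self.cached,
            key=lambda w: (self._hit_blocks(w, hashes), -len(self.cached[w])),
        )
        hits = self._hit_blocks(best, hashes)
        self.cached[best].update(hashes)
        return best, hits, len(hashes)


router = PrefixAwareRouter(["prefill-0", "prefill-1"])
session_prompt = list(range(12))
first = router.route(session_prompt, "math")
second = router.route(session_prompt + [99, 100, 101, 102], "coder")
other = router.route(list(range(100, 108)), "math")
```

The second request targets a different task model but lands on the same worker and reuses three of its four blocks. The unrelated prompt goes to the emptier worker. A production router would also need eviction and load limits, which this sketch omits.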
Key Findings
PrefillShare matches full fine‑tuning accuracy across math, coding, and tool‑calling benchmarks.
PrefillShare cuts p95 tail latency by up to 4.5× and raises throughput by up to 3.9× under high load in multi‑agent workloads.
Shared prefill dramatically improves prefix cache hit ratio under concurrency.
Results
Accuracy
Tail latency (p95)
Throughput
Prefix cache hit ratio
Who Should Care
What To Try In 7 Days
Instrument your multi‑model sessions and measure shared prefix rates and KV footprint.
Prototype freezing a base prefill model and fine‑tuning only a small task-specific decode module on a single task dataset.
Run a small disaggregated test: route a session through the shared prefill and a single specialized decoder, then measure p95 latency and throughput.
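For the instrumentation and measurement steps above, a minimal sketch might look like the following. The function names and the definition of "shared prefix rate" (the fraction of prompt tokens that repeat a prefix of an earlier prompt in the same session) are assumptions chosen for illustration. The p95 uses the nearest-rank method.

```python
def common_prefix_len(a, b):
    """Number of leading tokens that two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def shared_prefix_rate(prompts):
    """Fraction of prompt tokens already seen as a prefix of an earlier prompt."""
    reusable = total = 0
    for i, p in enumerate(prompts):
        total += len(p)
        reusable += max((common_prefix_len(p, q) for q in prompts[:i]), default=0)
    return reusable / total if total else 0.0


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


# Token-ID prompts from one session, in arrival order (made-up values).
session = [[1, 2, 3, 4], [1, 2, 3, 4, 5, 6], [1, 2, 9]]
rate = shared_prefix_rate(session)

latencies_ms = list(range(1, 101))  # made-up latency samples
tail = p95(latencies_ms)
```

A low rate on real traffic suggests that shared prefill would bring little benefit, which matches the limitation on shared-prefix rates noted later in this summary.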
Agent Features
Memory
- shared KV cache across decoders
Planning
- prefix-aware routing
Tool Use
- sequential specialized decoders
Frameworks
- vLLM
Is Agentic
true
Architectures
- multi-model agent pipelines
- disaggregated prefill/decode
Collaboration
- multiple fine‑tuned models in a session
Optimization Features
Token Efficiency
- reduced repeated prefill compute per token
Infra Optimization
- reduced GPU memory duplication for KV caches
Model Optimization
- freeze prefill module
System Optimization
- disaggregated prefill/decode to avoid prefill-decode interference
- single shared prefill pool to reduce KV duplication
Training Optimization
- cache‑conditioned fine‑tuning (fine‑tune only decoders)
Inference Optimization
- shared prefill KV reuse
- prefix‑aware routing and cache handoff
Reproducibility
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- At very high concurrency, decode‑side KV staging/reload causes CPU‑GPU traffic and a drop in throughput.
- Requires a frozen shared prefill and fine‑tuning pipeline for decoders; not a drop‑in for black‑box hosted models.
- Performance benefit depends on high shared‑prefix rates; small or diverse prompts give less gain.
When Not To Use
- Your workloads rarely share long prompt prefixes across models.
- You use third‑party closed models you cannot fine‑tune or host.
- Your serving stack cannot support disaggregated prefill/decode or handle KV handoff.
Failure Modes
- Naive KV reuse without cache‑conditioned fine‑tuning collapses accuracy as sharing ratio increases.
- Excessive CPU‑GPU KV staging under high concurrency reduces throughput despite high cache hit ratio.
- Routing mistakes that break prefix locality can force repeated prefill work and remove benefits.
Core Entities
Models
- LLaMA3.1-8B
- Qwen3-1.7B
- Qwen3-8B
- Qwen3-14B
Metrics
- p95 latency
- throughput (tok/s)
- TTFT (time to first token)
- prefix cache hit ratio
- Accuracy
Datasets
- MetaMathQA-40K
- EvolInstruct-Code-80K
- xLAM-function-calling-60K
Benchmarks
- GSM8K
- GSM+
- HumanEval
- HumanEval+
- BFCL (Simple Python, Multiple)
Context Entities
Models
- vLLM disaggregated serving prototype

