Overview
The method is practical and has been tested on multiple backbones and workloads. The main risks are implementation‑level KV staging overheads and integration with your serving stack.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 5/5
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 80%
Production readiness: 75%
Novelty: 65%
Why It Matters For Business
If your product runs multiple fine‑tuned models over shared prompts (agents, planners, coders), PrefillShare can cut tail latency and GPU cost by reusing one prefill and KV cache across models while keeping task accuracy.
Summary TLDR
PrefillShare splits a model into a frozen prefill module (builds a shared key-value cache from the prompt) and many small task-specific decode modules. Decode modules are fine-tuned to consume the shared cache (cache-conditioned fine‑tuning). This enables safe KV reuse across heterogeneous fine‑tuned models in disaggregated serving, preserving accuracy while cutting p95 tail latency by up to 4.5× and raising throughput by up to 3.9× in multi‑agent workloads.
Problem Statement
Multi‑model agent workflows often run several fine‑tuned LLMs over the same shared prompt. Each model repeats the costly prefill stage and keeps its own KV cache, causing duplicated compute, growing memory use, and worse tail latency. Existing disaggregated serving reduces interference but cannot reuse KV caches across different models because caches depend on model parameters.
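The memory side of this duplication can be estimated with simple arithmetic. The sketch below is a back-of-envelope calculation, not a measurement from the paper: the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values), the 4096-token prompt, and the count of four models are all illustrative assumptions, and `ModelShape`, `kv_bytes_per_token`, and `kv_footprint_bytes` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Attention shape parameters that determine KV cache size (assumed values)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # 16-bit storage (fp16/bf16), an assumption


def kv_bytes_per_token(shape: ModelShape) -> int:
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return (2 * shape.num_layers * shape.num_kv_heads
            * shape.head_dim * shape.bytes_per_value)


def kv_footprint_bytes(shape: ModelShape, prompt_tokens: int,
                       num_models: int, shared: bool) -> int:
    """KV bytes held for one prompt that num_models models all consume."""
    copies = 1 if shared else num_models
    return copies * prompt_tokens * kv_bytes_per_token(shape)


# Illustrative shape and workload; none of these numbers come from the paper.
shape = ModelShape(num_layers=32, num_kv_heads=8, head_dim=128)
per_model = kv_footprint_bytes(shape, prompt_tokens=4096, num_models=4, shared=False)
shared = kv_footprint_bytes(shape, prompt_tokens=4096, num_models=4, shared=True)

print(per_model // 2**20, "MiB when each model keeps its own cache")
print(shared // 2**20, "MiB with a single shared cache")
```

Under these assumptions the per-model caches grow linearly with the number of models, while the shared cache stays constant. Prefill compute follows the same pattern, since each extra model otherwise reruns the prompt.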
Main Contribution
PrefillShare algorithm that factorizes models into a shared frozen prefill module and task-specific decode modules to enable cross‑model KV cache reuse.
Cache‑conditioned fine‑tuning: freeze prefill, fine‑tune only decoders to reliably consume the base KV cache.
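The serving-side consequence of the two contributions above can be sketched as a toy router: one prefill per distinct prompt, with several task decoders reading the same cache. This is a minimal illustration under stated assumptions, not the authors' implementation. `SharedPrefill`, `toy_prefill`, and `make_decoder` are hypothetical names, and the "KV cache" is a plain tuple standing in for real attention tensors.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

KVCache = Tuple[int, ...]  # stand-in for real per-layer key/value tensors


class SharedPrefill:
    """Runs the frozen prefill once per distinct prompt and reuses the result."""

    def __init__(self, prefill_fn: Callable[[List[int]], KVCache]) -> None:
        self._prefill_fn = prefill_fn
        self._cache: Dict[str, KVCache] = {}
        self.prefill_calls = 0

    def get_kv(self, prompt_tokens: List[int]) -> KVCache:
        # Key the cache by a hash of the exact token sequence.
        key = hashlib.sha256(repr(prompt_tokens).encode()).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._prefill_fn(prompt_tokens)
            self.prefill_calls += 1
        return self._cache[key]


def toy_prefill(tokens: List[int]) -> KVCache:
    # Placeholder: a real prefill would return attention keys and values.
    return tuple(t * 2 for t in tokens)


def make_decoder(task: str) -> Callable[[KVCache], str]:
    # Placeholder decoder. In the described method, each decoder is
    # fine-tuned so that it can consume the shared base-model cache.
    def decode(kv: KVCache) -> str:
        return f"{task}:{sum(kv)}"
    return decode


prefill = SharedPrefill(toy_prefill)
decoders = {name: make_decoder(name) for name in ("math", "code", "tools")}
prompt = [101, 7, 42, 9]

outputs = {name: dec(prefill.get_kv(prompt)) for name, dec in decoders.items()}
print(outputs, "prefill calls:", prefill.prefill_calls)
```

Three decoders are served here with a single prefill call. The sketch leaves out the training step: cache-conditioned fine-tuning is what makes the shared cache usable by each decoder without an accuracy loss.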
Key Findings
PrefillShare matches full fine‑tuning accuracy across math, coding, and tool‑calling benchmarks.
PrefillShare substantially reduces tail latency and increases throughput under high load.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 71.4% (PrefillShare) | 71.3% (Full‑FT) | +0.1 pp | GSM8K | Cache-conditioned fine‑tuning matches Full‑FT on GSM8K for LLaMA3.1-8B. | Table 1 |
| Accuracy | 48.8% (PrefillShare) | 48.2% (Full‑FT) | +0.6 pp | HumanEval | PrefillShare equals or slightly exceeds Full‑FT on evaluated coding tasks. | Table 1 |
What To Try In 7 Days
Instrument your multi‑model sessions and measure shared prefix rates and KV footprint.
Prototype freezing a base prefill model and fine‑tuning only a task-specific decode module on one task dataset.
Run a small disaggregated test: route a session through the shared prefill plus a single specialized decoder, and measure p95 latency and throughput.
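The measurements named in these steps can be prototyped offline before touching the serving stack. The helpers below are a minimal sketch that assumes you can log the token IDs each model receives in a session, along with per-request latencies. All function names are hypothetical, and the nearest-rank percentile is one common definition among several.

```python
from typing import List, Sequence


def common_prefix_len(seqs: Sequence[Sequence[int]]) -> int:
    """Length of the longest token prefix shared by every sequence."""
    if not seqs:
        return 0
    n = 0
    for column in zip(*seqs):  # stops at the shortest sequence
        if any(tok != column[0] for tok in column):
            break
        n += 1
    return n


def shared_prefix_rate(seqs: Sequence[Sequence[int]]) -> float:
    """Fraction of all prompt tokens in a session covered by the common prefix."""
    total = sum(len(s) for s in seqs)
    if total == 0:
        return 0.0
    return common_prefix_len(seqs) * len(seqs) / total


def p95(latencies_ms: List[float]) -> float:
    """Nearest-rank 95th percentile of the given latencies."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def throughput_rps(num_requests: int, wall_seconds: float) -> float:
    """Completed requests per second over a measurement window."""
    return num_requests / wall_seconds


# Made-up session: three models receive prompts that share a 3-token prefix.
session = [[1, 2, 3, 4, 5], [1, 2, 3, 9], [1, 2, 3, 4, 7, 8]]
latencies = [float(i) for i in range(1, 101)]  # synthetic samples, 1..100 ms

print("common prefix:", common_prefix_len(session))
print("shared prefix rate:", shared_prefix_rate(session))
print("p95 ms:", p95(latencies))
print("throughput rps:", throughput_rps(len(latencies), 10.0))
```

A low shared prefix rate on real traffic suggests the approach will not pay off, which matches the guidance on when not to use it.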
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Risks & Boundaries
Limitations
At very high concurrency, decode‑side KV staging and reload generate CPU‑GPU traffic and reduce throughput.
Requires a frozen shared prefill module and a fine‑tuning pipeline for the decode modules; it is not a drop‑in for black‑box hosted models.
When Not To Use
Your workloads rarely share long prompt prefixes across models.
You use third‑party closed models you cannot fine‑tune or host.
Failure Modes
Naive KV reuse without cache‑conditioned fine‑tuning collapses accuracy as the sharing ratio increases.
Excessive CPU‑GPU KV staging under high concurrency reduces throughput despite a high cache hit ratio.

