Overview
LRAgent is practical for multi-LoRA agent pipelines: start by sharing the base cache, adopt shared-A LoRA to also share the LR cache and cut redundant prefill compute, and enable Flash-LoRA-Attention for the best throughput and latency.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
For multi-role LLM products, LRAgent cuts GPU memory and latency by sharing common KV content and keeping tiny per-role adapter state, enabling more agents or longer contexts with the same hardware.
Summary TLDR
Multi-role LLM agents often fine-tune lightweight LoRA adapters on a shared backbone, but each agent builds its own KV cache for long shared contexts. LRAgent splits the value cache into a shared base cache (from frozen pretrained weights) and small adapter-dependent low-rank (LR) caches. Two schemes—BaseShared (share base, keep per-agent LR) and BaseLRShared (share both under shared-A LoRA)—plus a Flash-LoRA-Attention kernel speed up inference and cut memory while keeping accuracy near the non-shared baseline on HotpotQA and ScienceQA.
Problem Statement
Multi-LLM agent systems using multi-LoRA adapters duplicate KV caches and recompute shared context per agent, causing high GPU memory use and redundant compute for long, tool-augmented trajectories.
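To give a sense of the scale of the duplication, here is a rough back-of-the-envelope estimate of a standard, non-shared KV cache. The function name, the 8192-token context, and the four-agent count are illustrative assumptions; the layer and head counts are the public LLaMA-3.1-8B configuration.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2, agents=1):
    """Bytes needed when every agent keeps its own full key and value cache."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # 2 = key + value
    return agents * tokens * per_token

# LLaMA-3.1-8B shape: 32 layers, 8 KV heads (GQA), head dim 128, fp16 elements.
one_agent = kv_cache_bytes(tokens=8192, layers=32, kv_heads=8, head_dim=128)
four_agents = kv_cache_bytes(tokens=8192, layers=32, kv_heads=8, head_dim=128, agents=4)
print(one_agent / 2**30, four_agents / 2**30)  # 1.0 4.0 (GiB)
```

Under these assumptions the cache grows linearly with the number of agents even though all of them read the same context.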
Main Contribution
Decompose the value cache into a shared base cache and a low-rank adapter cache (LR cache) for multi-LoRA agents.
Design BaseShared (share base cache, store per-agent LR caches) and BaseLRShared (share both base and LR caches under shared-A LoRA).
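The decomposition and the two schemes can be sketched for a single layer in NumPy. This is a simplified illustration, not the authors' implementation: all names and sizes are hypothetical, and it assumes every agent sees identical input hidden states, which the paper's own finding only supports approximately.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_v, r = 6, 16, 8, 2   # tokens, hidden size, value dim, LoRA rank

X = rng.standard_normal((T, d_model))                      # shared context hidden states
W_v = rng.standard_normal((d_model, d_v))                  # frozen value projection
A = [rng.standard_normal((d_model, r)) for _ in range(2)]  # per-agent down-projections
B = [rng.standard_normal((r, d_v)) for _ in range(2)]      # per-agent up-projections

base_cache = X @ W_v               # stored once for all agents

# BaseShared: one small LR cache per agent (T x r instead of T x d_v).
lr_caches = [X @ A_i for A_i in A]
values_base_shared = [base_cache + Z @ B_i for Z, B_i in zip(lr_caches, B)]

# BaseLRShared: a single shared A, so one LR cache serves every agent.
A_shared = A[0]
lr_cache_shared = X @ A_shared
values_base_lr_shared = [base_cache + lr_cache_shared @ B_i for B_i in B]

# Both reconstructions match projecting with the merged weight W_v + A B.
for i in range(2):
    assert np.allclose(values_base_shared[i], X @ (W_v + A[i] @ B[i]))
    assert np.allclose(values_base_lr_shared[i], X @ (W_v + A_shared @ B[i]))
```

The per-agent state shrinks from a `T x d_v` value cache to a `T x r` LR cache, and to nothing beyond the `B` matrix once `A` is shared.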
Key Findings
Base cache activations are highly similar across agents while adapter outputs are near-orthogonal.
BaseShared and BaseLRShared preserve accuracy close to non-shared caching.
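A similarity check of this kind could be run on your own agents with a helper like the one below. How the paper aggregates similarity is not stated in this summary, so the per-token averaging and the function name are assumptions; the toy arrays only confirm that the helper behaves as expected at the two extremes.

```python
import numpy as np

def mean_token_cosine(a, b, eps=1e-12):
    """Mean cosine similarity between matching rows (tokens) of two (T, d) arrays."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return float(np.mean(num / den))

same = np.array([[1.0, 2.0], [3.0, 4.0]])
orth_a = np.array([[1.0, 0.0], [0.0, 2.0]])
orth_b = np.array([[0.0, 5.0], [3.0, 0.0]])
print(round(mean_token_cosine(same, same), 6))      # 1.0
print(round(mean_token_cosine(orth_a, orth_b), 6))  # 0.0
```

In practice `a` and `b` would be the base caches (or the adapter outputs) that two agents produce for the same context.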
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| base cache vs adapter cosine similarity | base ≈ 0.9726, adapter ≈ 0.0538 (LLaMA-3.1-8B) | — | — | analysis on same context | Table 1 and Section 3.1 | Table 1 |
| Accuracy | BaseShared 38.60%, BaseLRShared 37.92% | Non-Shared 38.88% | BaseShared −0.28, BaseLRShared −0.97 | HotpotQA (LLaMA-3.1-8B-Instruct) | Table 3 (HotpotQA) | Table 3 |
What To Try In 7 Days
Switch to BaseShared to share base KV cache and store per-role LR caches for immediate memory savings.
If training allows, adopt shared-A LoRA to enable BaseLRShared and reduce redundant prefill compute.
Run the authors' Flash-LoRA-Attention kernel or their implementation to test throughput and TTFT improvements on your long-context traces.
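Before committing to any of these steps, it may help to estimate the value-cache footprint of each scheme for your own setup. The sketch below assumes the LR cache holds `rank` elements per token per layer; the agent count, context length, and rank of 16 are hypothetical, and key-cache handling is deliberately left out because this summary does not describe it.

```python
def value_cache_elems(scheme, agents, tokens, layers, kv_dim, rank):
    """Value-cache elements per scheme (assumes a tokens x rank LR cache per layer)."""
    base = tokens * layers * kv_dim
    lr = tokens * layers * rank
    if scheme == "non_shared":
        return agents * base
    if scheme == "base_shared":
        return base + agents * lr
    if scheme == "base_lr_shared":
        return base + lr
    raise ValueError(scheme)

cfg = dict(agents=4, tokens=8192, layers=32, kv_dim=8 * 128, rank=16)
for scheme in ("non_shared", "base_shared", "base_lr_shared"):
    ratio = value_cache_elems(scheme, **cfg) / value_cache_elems("non_shared", **cfg)
    print(scheme, round(ratio, 4))
# non_shared 1.0
# base_shared 0.2656
# base_lr_shared 0.2539
```

With these illustrative numbers most of the saving already comes from sharing the base cache; sharing the LR cache adds little memory benefit, which is consistent with the summary framing shared-A mainly as a compute saving.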
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Risks & Boundaries
Limitations
BaseLRShared requires shared-A LoRA (same down-projection A); without it accuracy can drop.
LR expansion still adds compute; gains depend on rank r and model head sizes.
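To see why rank and head size matter for the expansion cost, consider the adapter term of the attention output for one decode query. The sketch below only demonstrates that matrix multiplication can be reordered so the LR cache is never expanded to full width; whether the authors' kernel works exactly this way is an assumption, and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_v, r = 1024, 128, 16
P = rng.random((1, T))
P /= P.sum()                        # attention weights for one query
Z = rng.standard_normal((T, r))     # LR cache
B = rng.standard_normal((r, d_v))   # LoRA up-projection

expand_first = P @ (Z @ B)          # materializes a T x d_v matrix
attend_first = (P @ Z) @ B          # stays in rank-r space until the end
assert np.allclose(expand_first, attend_first)

# Multiply-accumulate counts for the two orders.
macs_expand_first = T * r * d_v + T * d_v
macs_attend_first = T * r + r * d_v
print(macs_expand_first, macs_attend_first)  # 2228224 18432
```

The gap between the two counts widens as the context grows and narrows as `r` approaches `d_v`.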
When Not To Use
When you cannot standardize the LoRA down-projection A across agents.
If you cannot run a custom attention kernel (Flash-LoRA-Attention) on your infra.
Failure Modes
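Mismatched A matrices: sharing an LR cache across agents whose A matrices differ causes large errors.
Hidden-state caching (used by other methods) can inflate memory on GQA-style models, where a hidden state is wider than the grouped keys and values it replaces, and cause out-of-memory (OOM) failures.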
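The mismatched-A failure mode is easy to reproduce on toy data. The sketch below uses random, independently drawn matrices as a stand-in for separately trained adapters, which is an assumption; real adapters may be more or less correlated, and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_v, r = 32, 64, 32, 4
X = rng.standard_normal((T, d_model))
A1 = rng.standard_normal((d_model, r))   # agent 1 down-projection
A2 = rng.standard_normal((d_model, r))   # agent 2 down-projection
B2 = rng.standard_normal((r, d_v))       # agent 2 up-projection

true_delta = X @ A2 @ B2                 # what agent 2's adapter should add
reused_delta = (X @ A1) @ B2             # agent 2 reading agent 1's LR cache

rel_err = np.linalg.norm(reused_delta - true_delta) / np.linalg.norm(true_delta)
print(f"relative error with mismatched A: {rel_err:.2f}")
assert rel_err > 0.5                             # far from a small perturbation
assert np.allclose((X @ A2) @ B2, true_delta)    # matching A reproduces it exactly
```

A check like this, run on real adapter weights, is a cheap guard before enabling LR-cache sharing.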

