Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
For multi-role LLM products, LRAgent cuts GPU memory and latency by sharing common KV content and keeping tiny per-role adapter state, enabling more agents or longer contexts with the same hardware.
Summary TLDR
Multi-role LLM agents often fine-tune lightweight LoRA adapters on a shared backbone, yet each agent still builds its own KV cache for the same long shared context. LRAgent splits the value cache into a shared base cache (computed from the frozen pretrained weights) and small adapter-dependent low-rank (LR) caches. It offers two schemes: BaseShared shares the base cache and keeps a per-agent LR cache, while BaseLRShared shares both under shared-A LoRA, where all adapters use the same down-projection A. Combined with a Flash-LoRA-Attention kernel, these schemes speed up inference and cut memory while keeping accuracy near the non-shared baseline on HotpotQA and ScienceQA.
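The split described above can be illustrated for a single projection with a minimal NumPy sketch. It is not the authors' code: all shapes and names (`X`, `W_v`, `A`, `B`, `base_cache`, `lr_cache`) are assumptions made for illustration, the LoRA scaling factor is assumed to be folded into `B`, and every agent is assumed to see the same input `X`.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head, rank = 6, 16, 8, 2  # illustrative sizes only

X = rng.standard_normal((seq_len, d_model))   # hidden states of the shared context
W_v = rng.standard_normal((d_model, d_head))  # frozen pretrained value projection
A = rng.standard_normal((d_model, rank))      # LoRA down-projection
B = rng.standard_normal((rank, d_head))       # LoRA up-projection (scaling folded in)

# Base cache: depends only on the frozen weights, so it can be shared.
base_cache = X @ W_v                          # shape (seq_len, d_head)
# LR cache: adapter-dependent but only `rank` columns wide.
lr_cache = X @ A                              # shape (seq_len, rank)

# Value computed with the merged LoRA weight versus rebuilt from the two caches.
value_full = X @ (W_v + A @ B)
value_from_caches = base_cache + lr_cache @ B

print(np.allclose(value_full, value_from_caches))
```

The equality holds because matrix multiplication distributes over the sum `W_v + A @ B`. The LR cache stores `rank` numbers per token instead of `d_head`, which is where the per-agent saving comes from.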
Problem Statement
Multi-LLM agent systems using multi-LoRA adapters duplicate KV caches and recompute shared context per agent, causing high GPU memory use and redundant compute for long, tool-augmented trajectories.
Main Contribution
Decompose the value cache into a shared base cache and a low-rank adapter cache (LR cache) for multi-LoRA agents.
Design BaseShared (share base cache, store per-agent LR caches) and BaseLRShared (share both base and LR caches under shared-A LoRA).
Introduce Flash-LoRA-Attention, a kernel that reorders attention math to expand LR cache at low rank and avoid full-dimension materialization.
Empirically show memory and latency improvements with small accuracy loss versus non-shared caching on HotpotQA and ScienceQA; release code.
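The reordering behind the Flash-LoRA-Attention contribution can be sketched in NumPy. This is a sketch of the algebraic identity only, not the authors' fused kernel: it assumes a single head with no causal mask, and the function names `attend_materialized` and `attend_reordered` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_materialized(q, k, base_v, lr, B):
    # Naive path: expand the LR cache to full width before attention.
    v = base_v + lr @ B                       # (seq_len, d_head) materialized
    p = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return p @ v

def attend_reordered(q, k, base_v, lr, B):
    # Reordered path: attend over the rank-r LR cache first, expand afterwards.
    p = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    low_rank_out = p @ lr                     # (n_queries, rank)
    return p @ base_v + low_rank_out @ B      # expand once per query row

rng = np.random.default_rng(1)
n_q, seq_len, d_head, rank = 3, 10, 8, 2      # illustrative sizes only
q = rng.standard_normal((n_q, d_head))
k = rng.standard_normal((seq_len, d_head))
base_v = rng.standard_normal((seq_len, d_head))
lr = rng.standard_normal((seq_len, rank))
B = rng.standard_normal((rank, d_head))

print(np.allclose(attend_materialized(q, k, base_v, lr, B),
                  attend_reordered(q, k, base_v, lr, B)))
```

In the reordered form the up-projection `B` is applied to `n_queries` rows rather than `seq_len` rows, so the expansion cost no longer grows with context length during decoding.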
Key Findings
Base cache activations are highly similar across agents while adapter outputs are near-orthogonal.
BaseShared and BaseLRShared preserve accuracy close to non-shared caching.
Memory and throughput gains are substantial at long contexts when using LR sharing and Flash-LoRA-Attention.
Time-to-first-token (TTFT) drops significantly with BaseLRShared.
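The similarity finding above can be checked on your own agents with a small helper. This is a generic cosine-similarity sketch, not the authors' measurement script; the function name and the choice to flatten each cache into one vector are assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two arrays, each flattened to a vector."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(a @ b / denom)

# Hypothetical usage: compare the same layer's activations from two agents.
# sim_base = cosine_similarity(base_cache_agent1, base_cache_agent2)
# sim_adapter = cosine_similarity(adapter_out_agent1, adapter_out_agent2)
```

Values near 1 for the base caches and near 0 for the adapter outputs would match the pattern reported in the first finding.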
Results
- base cache vs. adapter cosine similarity
- accuracy
- memory usage (GB) at long context
- TTFT reduction
What To Try In 7 Days
Switch to BaseShared to share base KV cache and store per-role LR caches for immediate memory savings.
If training allows, adopt shared-A LoRA to enable BaseLRShared and reduce redundant prefill compute.
Run the authors' released implementation, including the Flash-LoRA-Attention kernel, to measure throughput and TTFT improvements on your own long-context traces.
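Before trying the steps above, a back-of-envelope estimate can show whether BaseShared is likely to pay off for a given deployment. The sketch below is an assumption-laden approximation, not a figure from the paper: it counts one cached tensor per layer, ignores allocator overhead, and uses hypothetical function names and example sizes.

```python
def cache_bytes(tokens, layers, width, bytes_per_elem=2):
    # Bytes for one cached tensor per layer: tokens x width elements each.
    return tokens * layers * width * bytes_per_elem

def non_shared_bytes(n_agents, tokens, layers, kv_width, bytes_per_elem=2):
    # Every agent keeps its own full-width cache for the shared context.
    return n_agents * cache_bytes(tokens, layers, kv_width, bytes_per_elem)

def base_shared_bytes(n_agents, tokens, layers, kv_width, rank, bytes_per_elem=2):
    # One shared full-width base cache plus a rank-wide LR cache per agent.
    base = cache_bytes(tokens, layers, kv_width, bytes_per_elem)
    lr = n_agents * cache_bytes(tokens, layers, rank, bytes_per_elem)
    return base + lr

# Illustrative values (assumed): 3 roles, 32k-token context, 32 layers,
# cached width 1024, LoRA rank 8, 2-byte elements.
GIB = 1024 ** 3
before = non_shared_bytes(3, 32768, 32, 1024)
after = base_shared_bytes(3, 32768, 32, 1024, rank=8)
print(f"non-shared: {before / GIB:.2f} GiB, BaseShared: {after / GIB:.2f} GiB")
```

With these assumed sizes the per-agent LR caches add only tens of megabytes on top of the single shared base cache, so the saving scales almost linearly with the number of roles.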
Agent Features
Memory
- KV cache sharing
- LR cache (low-rank intermediate activations)
- hidden-state cache (baseline DroidSpeak comparison)
Planning
- multi-hop planning (plan/action/reflect)
Tool Use
- Web search API
- Wikipedia lookup
- image caption lookup
Frameworks
- AutoAct
Is Agentic
true
Architectures
- LoRA
Collaboration
- role-specialized agents with shared backbone
Optimization Features
Token Efficiency
- reduces redundant prefill across agents
Infra Optimization
- lower GPU memory footprint for long contexts
Model Optimization
- LoRA
System Optimization
- attention-kernel level optimization to minimize LR expand cost
Training Optimization
- shared-A down-projection to improve generalization
Inference Optimization
- BaseShared and BaseLRShared caching strategies
- LoRA
Reproducibility
Data Urls
- HotpotQA: https://aclanthology.org/D18-1259/
- ScienceQA: https://arxiv.org/abs/2210.XXXX (ScienceQA ref in paper)
Open Source Status
- yes
Risks & Boundaries
Limitations
- BaseLRShared requires shared-A LoRA (same down-projection A); without it accuracy can drop.
- LR expansion still adds compute; gains depend on rank r and model head sizes.
- Some baselines, such as FullShared, can change agent behavior and harm accuracy; the effect depends on the dataset and agent setup.
When Not To Use
- When you cannot standardize the LoRA down-projection A across agents.
- If you cannot run a custom attention kernel (Flash-LoRA-Attention) on your infra.
- For very small models where KV cache is not the bottleneck.
Failure Modes
- Mismatched A matrices: sharing an LR cache across agents whose down-projections A differ causes large errors.
- Hidden-state caching, as used by other methods such as DroidSpeak, can inflate memory on GQA-style models and cause out-of-memory (OOM) failures.
- If LoRA rank r is large, LR expansion cost reduces benefits.
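The mismatched-A failure mode can be reproduced on toy data. The sketch below uses random matrices and hypothetical names (`A1`, `A2`, `B2`, `shared_lr`); it illustrates why an LR cache written with one down-projection cannot be reused by an adapter trained with another, and makes no claim about error magnitudes in real models.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head, rank = 6, 16, 8, 2  # illustrative sizes only

X = rng.standard_normal((seq_len, d_model))   # shared-context hidden states
A1 = rng.standard_normal((d_model, rank))     # down-projection used to write the cache
A2 = rng.standard_normal((d_model, rank))     # a different role's down-projection
B2 = rng.standard_normal((rank, d_head))      # that role's up-projection

shared_lr = X @ A1                            # LR cache written by role 1

def relative_error(reference, approx):
    # Frobenius-norm error relative to the reference tensor.
    return float(np.linalg.norm(reference - approx) / np.linalg.norm(reference))

# Shared-A case: role 2 also uses A1, so reusing the cache is exact.
err_shared_a = relative_error(X @ A1 @ B2, shared_lr @ B2)
# Mismatched case: role 2 was trained with A2 but reads the cache built with A1.
err_mismatched = relative_error(X @ A2 @ B2, shared_lr @ B2)

print(f"shared-A error: {err_shared_a:.3g}, mismatched-A error: {err_mismatched:.3g}")
```

A cheap guard in practice is to compare the A matrices of all adapters before enabling BaseLRShared and to fall back to BaseShared when they differ.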
Core Entities
Models
- LLaMA-3.1-8B-Instruct
- Ministral-8B-Instruct
Metrics
- Accuracy
- throughput (tokens/sec)
- TTFT (s)
- memory (GB)
- cosine similarity
Datasets
- HotpotQA
- ScienceQA
Benchmarks
- Multi-hop agent QA (HotpotQA split)
- ScienceQA split
Context Entities
Models
- LLaMA-2-70B-Chat (trajectory synthesis)
Datasets
- AutoAct synthetic agent trajectories

