Share the common KV cache across LoRA-adapted agents and keep tiny low-rank adapters to cut memory and speed up multi-agent inference.

February 1, 2026 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method is practical for multi-LoRA agent pipelines: share base cache first, adopt shared-A to unlock full compute savings, and enable Flash-LoRA-Attention for best throughput and latency.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Hyesung Jeon, Hyeongju Ha, Jae-Joon Kim

Links

Abstract / PDF / Code / Data

Why It Matters For Business

For multi-role LLM products, LRAgent cuts GPU memory and latency by sharing common KV content and keeping tiny per-role adapter state, enabling more agents or longer contexts with the same hardware.

Summary TLDR

Multi-role LLM agent systems often attach lightweight, role-specific LoRA adapters to a shared backbone, yet each agent still builds its own KV cache over the same long context. LRAgent splits the value cache into a shared base cache (computed from the frozen pretrained weights) and small adapter-dependent low-rank (LR) caches. It offers two schemes: BaseShared (share the base cache, keep per-agent LR caches) and BaseLRShared (share both under shared-A LoRA). Combined with a Flash-LoRA-Attention kernel, these speed up inference and cut memory while keeping accuracy near the non-shared baseline on HotpotQA and ScienceQA.
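The split described above can be illustrated with a minimal NumPy sketch for a single layer. It assumes every agent sees the same layer input `X` (the source reports this holds only approximately, with base-cache cosine ≈ 0.97), and all sizes, the `scale` factor, and the agent names are hypothetical choices for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head, rank = 16, 64, 32, 4  # toy sizes (assumptions)

X = rng.standard_normal((seq_len, d_model))          # shared-context layer input
W_v = rng.standard_normal((d_model, d_head)) * 0.1   # frozen pretrained value projection
scale = 2.0                                          # hypothetical LoRA scaling factor

def make_adapter():
    """Return a random (A, B) LoRA pair: A projects down, B projects up."""
    A = rng.standard_normal((d_model, rank)) * 0.1
    B = rng.standard_normal((rank, d_head)) * 0.1
    return A, B

adapters = {"plan": make_adapter(), "act": make_adapter()}  # hypothetical roles

# Base cache: depends only on frozen weights, so it is computed once and shared.
base_cache = X @ W_v

# LR cache: the rank-sized intermediate X @ A, stored per agent.
lr_caches = {name: X @ A for name, (A, _) in adapters.items()}

def value_for(name):
    """Rebuild an agent's value states from the shared base and its LR cache."""
    _, B = adapters[name]
    return base_cache + scale * (lr_caches[name] @ B)

# The reconstruction matches applying the merged weight W_v + scale * A @ B.
reconstruction_ok = all(
    np.allclose(value_for(name), X @ (W_v + scale * A @ B))
    for name, (A, B) in adapters.items()
)
```

Each LR cache here holds `rank` numbers per token instead of `d_head`, which is where the memory saving in this toy setup comes from.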

Problem Statement

Multi-LLM agent systems using multi-LoRA adapters duplicate KV caches and recompute shared context per agent, causing high GPU memory use and redundant compute for long, tool-augmented trajectories.

Main Contribution

Decompose the value cache into a shared base cache and a low-rank adapter cache (LR cache) for multi-LoRA agents.

Design BaseShared (share base cache, store per-agent LR caches) and BaseLRShared (share both base and LR caches under shared-A LoRA).
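To get a feel for how the two schemes differ in memory, the following back-of-the-envelope sketch counts value-cache bytes only. It assumes one rank-sized LR vector per token per layer, fp16 storage, and a hypothetical setup of 3 agents, rank 8, and an 8192-token shared context on a LLaMA-3.1-8B-shaped model (32 layers, 8 KV heads, head dimension 128). The function name and scheme labels are made up for this example, and the results are arithmetic, not measurements from the paper.

```python
def value_cache_bytes(scheme, n_agents, seq_len, n_layers,
                      n_kv_heads, head_dim, rank, bytes_per_elem=2):
    """Estimate value-cache size in bytes under a given sharing scheme."""
    # Full-width value states for one copy of the context.
    base = seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem
    # Rank-sized LR cache for one agent (assumed: one vector per token per layer).
    lr = seq_len * n_layers * rank * bytes_per_elem
    if scheme == "non_shared":
        return n_agents * base        # every agent keeps a full cache
    if scheme == "base_shared":
        return base + n_agents * lr   # one base cache, per-agent LR caches
    if scheme == "base_lr_shared":
        return base + lr              # one base cache, one LR cache (shared-A)
    raise ValueError(f"unknown scheme: {scheme}")

config = dict(n_agents=3, seq_len=8192, n_layers=32,
              n_kv_heads=8, head_dim=128, rank=8)
sizes_mib = {
    scheme: value_cache_bytes(scheme, **config) / 2**20
    for scheme in ("non_shared", "base_shared", "base_lr_shared")
}
```

Under these assumptions the three schemes come to 1536, 524, and 516 MiB respectively, so most of the memory saving already comes from sharing the base cache; sharing the LR cache as well mainly targets compute.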

Key Findings

Base cache activations are highly similar across agents while adapter outputs are near-orthogonal.

Numbers: base cache cosine ≈ 0.9726; adapter output cosine ≈ 0.0538 (LLaMA-3.1-8B)

Practical Use: Share the base cache across agents and keep only compact adapter LR caches to preserve agent-specific behavior with little recomputation.

Evidence Ref: Table 1
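A similarity check of this kind could be reproduced on your own adapters with a few lines of NumPy. The sketch below is an assumption about how such a measurement might look (mean per-token cosine between two agents' tensors), not the authors' evaluation code. The data is random toy data in which agent 2's layer input drifts slightly from agent 1's.

```python
import numpy as np

def mean_cosine(u, v, eps=1e-12):
    """Mean per-token cosine similarity between two (tokens, dim) arrays."""
    num = np.sum(u * v, axis=-1)
    den = np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1) + eps
    return float(np.mean(num / den))

rng = np.random.default_rng(0)
n_tok, d_model, d_head, rank = 256, 64, 64, 4  # toy sizes (assumptions)

X1 = rng.standard_normal((n_tok, d_model))               # agent 1 layer input
X2 = X1 + 0.05 * rng.standard_normal((n_tok, d_model))   # agent 2 input, small drift
W_v = rng.standard_normal((d_model, d_head))             # shared frozen projection
A1, A2 = rng.standard_normal((2, d_model, rank))         # independent down-projections
B1, B2 = rng.standard_normal((2, rank, d_head))          # independent up-projections

# Base component: same frozen weights, nearly the same inputs.
base_sim = mean_cosine(X1 @ W_v, X2 @ W_v)

# Adapter component: independently drawn low-rank updates.
adapter_sim = mean_cosine(X1 @ A1 @ B1, X2 @ A2 @ B2)
```

In this toy setup `base_sim` lands close to 1 and `adapter_sim` close to 0, which is the same qualitative pattern the finding reports for real adapters.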

BaseShared and BaseLRShared preserve accuracy close to non-shared caching.

Numbers: Avg accuracy drop ≤ 0.7% (BaseShared) and ≤ 1.5% (BaseLRShared) vs Non-Shared

Practical Use: Use BaseShared for memory-critical deployments; use BaseLRShared when you can adopt shared-A LoRA to also cut compute.

Evidence Ref: Table 3

Results

Metric: base cache vs adapter cosine similarity
Value: base ≈ 0.9726, adapter ≈ 0.0538 (LLaMA-3.1-8B)
Split / Dataset: analysis on same context
Evidence: Table 1 and Section 3.1
Evidence Ref: Table 1

Metric: Accuracy
Value: Non-Shared 38.88% → BaseShared 38.60% (−0.28) → BaseLRShared 37.92% (−0.97)
Baseline: Non-Shared
Delta: BaseShared −0.28, BaseLRShared −0.97
Split / Dataset: HotpotQA (LLaMA-3.1-8B-Instruct)
Evidence: Table 3 (HotpotQA)
Evidence Ref: Table 3

What To Try In 7 Days

Switch to BaseShared to share base KV cache and store per-role LR caches for immediate memory savings.

If training allows, adopt shared-A LoRA to enable BaseLRShared and reduce redundant prefill compute.

Run the authors' Flash-LoRA-Attention kernel or their implementation to test throughput and TTFT improvements on your long-context traces.
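The page does not describe the kernel's internals. One identity that an attention kernel could plausibly exploit to keep LR expansion cheap is that attention is linear in the values, so the rank-sized LR cache can be attention-weighted first and expanded through `B` once per query instead of once per cached token. The NumPy sketch below checks that identity on toy data; the shapes, `scale`, and variable names are assumptions, and this is not the authors' kernel.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_ctx, d_head, rank = 128, 64, 8  # toy sizes (assumptions)

q = rng.standard_normal((1, d_head))            # one decode-step query
K = rng.standard_normal((n_ctx, d_head))        # cached keys
V_base = rng.standard_normal((n_ctx, d_head))   # shared base value cache
Z = rng.standard_normal((n_ctx, rank))          # LR cache (X @ A)
B = rng.standard_normal((rank, d_head))         # agent-specific up-projection
scale = 0.5                                     # hypothetical LoRA scaling factor

P = softmax(q @ K.T / np.sqrt(d_head))          # attention weights, shape (1, n_ctx)

# Expand first: materialize full-width values for every cached token.
out_expand_first = P @ (V_base + scale * (Z @ B))

# Expand last: weight the rank-sized cache, then apply B once.
out_expand_last = P @ V_base + scale * ((P @ Z) @ B)
```

Both orderings give the same output, but the second never builds the `(n_ctx, d_head)` expanded adapter values, which is the kind of saving worth measuring on long-context traces.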

Agent Features

Memory
KV cache sharing, LR cache (low-rank intermediate activations), hidden-state cache (baseline DroidSpeak comparison)
Planning
multi-hop planning (plan/action/reflect)
Tool Use
Web search API, Wikipedia lookup, image caption lookup
Frameworks
AutoAct
Is Agentic

Yes

Architectures
LoRA
Collaboration
role-specialized agents with shared backbone

Optimization Features

Token Efficiency
reduces redundant prefill across agents
Infra Optimization
lower GPU memory footprint for long contexts
Model Optimization
LoRA
System Optimization
attention-kernel level optimization to minimize LR expand cost
Training Optimization
shared-A down-projection to improve generalization
Inference Optimization
BaseShared and BaseLRShared caching strategies, LoRA

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Data URLs

HotpotQA: https://aclanthology.org/D18-1259/
ScienceQA: https://arxiv.org/abs/2210.XXXX (ScienceQA ref in paper)

Risks & Boundaries

Limitations

BaseLRShared requires shared-A LoRA (same down-projection A); without it accuracy can drop.

LR expansion still adds compute; gains depend on rank r and model head sizes.

When Not To Use

When you cannot standardize the LoRA down-projection A across agents.

If you cannot run a custom attention kernel (Flash-LoRA-Attention) on your infra.

Failure Modes

Mismatched A matrices: sharing an LR cache across agents whose A matrices differ causes large errors.

Hidden-state caching (used by other methods) can exhaust memory on GQA-style models and cause OOM.
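The first failure mode above (mismatched A matrices) is easy to see numerically. In this toy sketch, an LR cache built with one down-projection is reused by an agent whose up-projection `B_agent` was paired with a different `A`. All matrices are random and all names are hypothetical; the point is only that the error is zero when `A` is truly shared and large otherwise.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head, rank = 32, 64, 32, 4  # toy sizes (assumptions)

X = rng.standard_normal((seq_len, d_model))     # shared-context layer input
A_shared = rng.standard_normal((d_model, rank)) # A used to build the shared LR cache
A_other = rng.standard_normal((d_model, rank))  # a different agent's own A
B_agent = rng.standard_normal((rank, d_head))   # the consuming agent's B

def rel_error(cached_lr, true_A):
    """Relative error of the adapter output when reusing a cached X @ A."""
    approx = cached_lr @ B_agent
    exact = (X @ true_A) @ B_agent
    return float(np.linalg.norm(approx - exact) / np.linalg.norm(exact))

shared_lr_cache = X @ A_shared

err_matched = rel_error(shared_lr_cache, A_shared)    # agent trained with shared A
err_mismatched = rel_error(shared_lr_cache, A_other)  # agent trained with its own A
```

With independently drawn matrices the mismatched error is on the order of the adapter output itself, which is why BaseLRShared presupposes shared-A training.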

Core Entities

Models

LLaMA-3.1-8B-Instruct, Ministral-8B-Instruct

Metrics

Accuracy, throughput (tokens/sec), TTFT (s), memory (GB), cosine similarity

Datasets

HotpotQA, ScienceQA

Benchmarks

Multi-hop agent QA (HotpotQA split), ScienceQA split

Context Entities

Models

LLaMA-2-70B-Chat (trajectory synthesis)

Datasets

AutoAct synthetic agent trajectories