Share the common KV cache across LoRA-adapted agents and keep tiny low-rank adapters to cut memory and speed up multi-agent inference.

Overview

Decision Snapshot: Ready For Pilot

The method is practical for multi-LoRA agent pipelines: share base cache first, adopt shared-A to unlock full compute savings, and enable Flash-LoRA-Attention for best throughput and latency.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Hyesung Jeon, Hyeongju Ha, Jae-Joon Kim

Why It Matters For Business

For multi-role LLM products, LRAgent cuts GPU memory and latency by sharing common KV content and keeping tiny per-role adapter state, enabling more agents or longer contexts with the same hardware.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager

Summary TLDR

Multi-role LLM agent systems often attach lightweight, fine-tuned LoRA adapters to a shared backbone, but each agent still builds its own KV cache for the long context they share. LRAgent splits the value cache into a shared base cache (computed from the frozen pretrained weights) and small adapter-dependent low-rank (LR) caches. Two schemes, BaseShared (share the base cache, keep per-agent LR caches) and BaseLRShared (share both under shared-A LoRA), combine with a Flash-LoRA-Attention kernel to speed up inference and cut memory while keeping accuracy near the non-shared baseline on HotpotQA and ScienceQA.
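The split described above can be illustrated with a small NumPy sketch. It assumes the standard LoRA form, where the adapted value projection is the frozen weight plus a low-rank product, and it assumes every agent sees identical hidden states for the shared context (the cosine figures reported later suggest this holds only approximately in practice). The names `X`, `W_v`, `A`, `B`, the toy shapes, and the omission of the LoRA scaling factor are illustrative choices, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_value, rank = 16, 64, 32, 4  # toy sizes (assumed)

X = rng.standard_normal((seq_len, d_model))    # hidden states of the shared context
W_v = rng.standard_normal((d_model, d_value))  # frozen pretrained value projection
A = rng.standard_normal((d_model, rank))       # LoRA down-projection of one agent
B = rng.standard_normal((rank, d_value))       # LoRA up-projection of the same agent

# Non-shared caching: the agent stores its full adapted value cache.
full_value = X @ (W_v + A @ B)

# Decomposed caching: a base cache that does not depend on the adapter,
# plus a low-rank (LR) cache with only `rank` columns per token.
base_cache = X @ W_v   # shape (seq_len, d_value), shareable across agents
lr_cache = X @ A       # shape (seq_len, rank), adapter-dependent

# The adapted values are recovered by expanding the LR cache with B.
reconstructed = base_cache + lr_cache @ B
max_abs_error = float(np.max(np.abs(full_value - reconstructed)))
print(f"max reconstruction error: {max_abs_error:.2e}")
```

Under these assumptions the reconstruction matches the full cache up to floating-point rounding, which is why only the small LR cache needs to be kept per agent.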

Problem Statement

Multi-LLM agent systems using multi-LoRA adapters duplicate KV caches and recompute shared context per agent, causing high GPU memory use and redundant compute for long, tool-augmented trajectories.

Main Contribution

Decompose the value cache into a shared base cache and a low-rank adapter cache (LR cache) for multi-LoRA agents.

Design BaseShared (share base cache, store per-agent LR caches) and BaseLRShared (share both base and LR caches under shared-A LoRA).
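A rough sketch of how the two cache layouts could be organized is shown below. The class names, the `values` and `num_elements` methods, the toy shapes, and the three role names (borrowed from the plan/action/reflect roles mentioned under Planning) are all hypothetical; this is not the authors' implementation, only a way to see what each scheme stores.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class BaseSharedCache:
    """One base cache for all agents, one LR cache per agent (hypothetical layout)."""
    base: np.ndarray
    lr_per_agent: dict = field(default_factory=dict)

    def values(self, agent: str, B: np.ndarray) -> np.ndarray:
        # Expand this agent's LR cache with its own up-projection B.
        return self.base + self.lr_per_agent[agent] @ B

    def num_elements(self) -> int:
        return self.base.size + sum(c.size for c in self.lr_per_agent.values())


@dataclass
class BaseLRSharedCache:
    """One base cache and one LR cache for all agents; needs a shared A."""
    base: np.ndarray
    lr: np.ndarray

    def values(self, B: np.ndarray) -> np.ndarray:
        return self.base + self.lr @ B

    def num_elements(self) -> int:
        return self.base.size + self.lr.size


rng = np.random.default_rng(0)
seq_len, d_model, d_value, rank = 16, 64, 32, 4  # toy sizes (assumed)
agents = ["plan", "action", "reflect"]

X = rng.standard_normal((seq_len, d_model))
W_v = rng.standard_normal((d_model, d_value))
A = {name: rng.standard_normal((d_model, rank)) for name in agents}
B = {name: rng.standard_normal((rank, d_value)) for name in agents}
shared_A = rng.standard_normal((d_model, rank))  # used only by BaseLRShared

base = X @ W_v
base_shared = BaseSharedCache(base, {name: X @ A[name] for name in agents})
base_lr_shared = BaseLRSharedCache(base, X @ shared_A)

# Element counts for the value cache under each scheme.
non_shared_elements = len(agents) * seq_len * d_value
print(non_shared_elements, base_shared.num_elements(), base_lr_shared.num_elements())
```

In this toy setting the per-agent cost drops from a full `seq_len x d_value` cache to a `seq_len x rank` cache under BaseShared, and to nothing extra per agent under BaseLRShared.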

Key Findings

Base cache activations are highly similar across agents while adapter outputs are near-orthogonal.

Numbers: base cache cosine ≈ 0.9726; adapter output cosine ≈ 0.0538 (LLaMA-3.1-8B)

Practical Use: Share the base cache across agents and keep only compact adapter LR caches to preserve agent-specific behavior with little recomputation.

Evidence Ref: Table 1

BaseShared and BaseLRShared preserve accuracy close to non-shared caching.

Numbers: Avg accuracy drop ≤ 0.7% (BaseShared) and ≤ 1.5% (BaseLRShared) vs Non-Shared

Practical Use: Use BaseShared for memory-critical deployments; use BaseLRShared when you can adopt shared-A LoRA to also cut compute.

Evidence Ref: Table 3
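The similarity check behind the first finding can be approximated on your own agents with a few lines of NumPy. The sketch below uses synthetic data only: the perturbation size, shapes, and random adapters are assumptions chosen to mimic "similar base activations, unrelated adapter outputs", and the resulting numbers are not expected to match the values reported in Table 1.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two arrays, compared as flattened vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(a @ b / denom)


rng = np.random.default_rng(0)
seq_len, d_model, d_value, rank = 64, 128, 64, 8  # toy sizes (assumed)

# Two agents reading the same context: hidden states differ only slightly.
X1 = rng.standard_normal((seq_len, d_model))
X2 = X1 + 0.1 * rng.standard_normal((seq_len, d_model))

W_v = rng.standard_normal((d_model, d_value))  # frozen, shared by both agents
A1, A2 = (rng.standard_normal((d_model, rank)) for _ in range(2))
B1, B2 = (rng.standard_normal((rank, d_value)) for _ in range(2))

base_sim = cosine_similarity(X1 @ W_v, X2 @ W_v)
adapter_sim = cosine_similarity(X1 @ A1 @ B1, X2 @ A2 @ B2)
print(f"base cache cosine: {base_sim:.4f}, adapter output cosine: {adapter_sim:.4f}")
```

To run the same check on a real model, replace the synthetic arrays with value-projection activations captured from each agent on an identical prompt.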

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
base cache vs adapter cosine similarity	base ≈ 0.9726, adapter ≈ 0.0538 (LLaMA-3.1-8B)	—	—	analysis on same context	Table 1 and Section 3.1	Table 1
Accuracy	Non-Shared 38.88% → BaseShared 38.60% (−0.28) → BaseLRShared 37.92% (−0.97)	Non-Shared	BaseShared −0.28, BaseLRShared −0.97	HotpotQA (LLaMA-3.1-8B-Instruct)	Table 3 (HotpotQA)	Table 3

What To Try In 7 Days

Switch to BaseShared to share base KV cache and store per-role LR caches for immediate memory savings.

If training allows, adopt shared-A LoRA to enable BaseLRShared and reduce redundant prefill compute.

Run the authors' Flash-LoRA-Attention kernel or their implementation to test throughput and TTFT improvements on your long-context traces.
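For the throughput and TTFT comparison suggested above, a small timing harness is enough to get started. The sketch below measures time to first token and tokens per second for any iterable that yields tokens as they are generated. `measure_generation` and `fake_stream` are hypothetical helpers; `fake_stream` only simulates a model with `time.sleep` and should be replaced with your own serving stack's streaming output, since the authors' API is not described here.

```python
import time
from typing import Iterable, Iterator


def measure_generation(stream: Iterable[str]) -> dict:
    """Consume a token stream and report TTFT and throughput."""
    start = time.perf_counter()
    first_token_time = None
    n_tokens = 0
    for _ in stream:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()
    if n_tokens == 0:
        raise ValueError("the stream produced no tokens")
    return {
        "ttft_s": first_token_time - start,
        "tokens": n_tokens,
        "tokens_per_s": n_tokens / (end - start),
    }


def fake_stream(prefill_s: float, per_token_s: float, n_tokens: int) -> Iterator[str]:
    """Stand-in for a real model: sleeps for prefill, then emits tokens."""
    time.sleep(prefill_s)  # simulated prefill over the long shared context
    for i in range(n_tokens):
        yield f"tok{i}"
        time.sleep(per_token_s)  # simulated decode step


stats = measure_generation(fake_stream(prefill_s=0.05, per_token_s=0.001, n_tokens=10))
print(stats)
```

Running the same harness against non-shared, BaseShared, and BaseLRShared configurations on identical traces gives a like-for-like comparison, provided warm-up runs are excluded.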

Agent Features

Memory

KV cache sharing, LR cache (low-rank intermediate activations), hidden-state cache (baseline DroidSpeak comparison)

Planning

multi-hop planning (plan/action/reflect)

Tool Use

Web search API, Wikipedia lookup, image caption lookup

Frameworks

AutoAct

Is Agentic

Yes

Architectures

LoRA

Collaboration

role-specialized agents with shared backbone

Optimization Features

Token Efficiency

reduces redundant prefill across agents

Infra Optimization

lower GPU memory footprint for long contexts

Model Optimization

LoRA

System Optimization

attention-kernel level optimization to minimize LR expand cost

Training Optimization

shared-A down-projection to improve generalization

Inference Optimization

BaseShared and BaseLRShared caching strategies, LoRA

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/hjeon2k/LRAgent

Data URLs

HotpotQA: https://aclanthology.org/D18-1259/

ScienceQA: https://arxiv.org/abs/2210.XXXX (ScienceQA ref in paper)

Risks & Boundaries

Limitations

BaseLRShared requires shared-A LoRA (same down-projection A); without it accuracy can drop.

LR expansion still adds compute; gains depend on rank r and model head sizes.
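To see why the expansion cost scales with the rank r and the value width, a back-of-envelope FLOP count for one decode step helps. The sketch below compares two orderings of the same product for the adapter term: expanding the LR cache first and then applying attention weights, versus applying attention weights to the LR cache and expanding afterwards. The equivalence is plain matrix associativity; whether Flash-LoRA-Attention uses this ordering is not stated in this summary, so treat it as a generic illustration. Function names and the example sizes are assumptions.

```python
import numpy as np


def expand_then_attend_flops(seq_len: int, rank: int, d_value: int) -> int:
    """FLOPs for probs @ (lr_cache @ B) with a single query row."""
    expand = 2 * seq_len * rank * d_value  # lr_cache @ B
    attend = 2 * seq_len * d_value         # probs @ expanded values
    return expand + attend


def attend_then_expand_flops(seq_len: int, rank: int, d_value: int) -> int:
    """FLOPs for (probs @ lr_cache) @ B with a single query row."""
    attend = 2 * seq_len * rank            # probs @ lr_cache
    expand = 2 * rank * d_value            # (1, rank) @ B
    return attend + expand


# Illustrative sizes only, not tied to a specific model.
seq_len, rank, d_value = 4096, 8, 1024
naive = expand_then_attend_flops(seq_len, rank, d_value)
reordered = attend_then_expand_flops(seq_len, rank, d_value)
print(f"expand first: {naive:,} FLOPs, attend first: {reordered:,} FLOPs")

# Numerical check on small random data that both orderings agree.
rng = np.random.default_rng(0)
probs = rng.random((1, 32))
probs /= probs.sum()                       # attention weights for one query
lr_cache = rng.standard_normal((32, 4))
B = rng.standard_normal((4, 16))
same_result = np.allclose(probs @ (lr_cache @ B), (probs @ lr_cache) @ B)
```

Both counts grow linearly in the rank, so higher-rank adapters shrink the gain; the counts cover only the adapter term, since the base-cache attention cost is the same in every scheme.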

When Not To Use

When you cannot standardize the LoRA down-projection A across agents.

If you cannot run a custom attention kernel (Flash-LoRA-Attention) on your infra.

Failure Modes

Mismatched A matrices: sharing an LR cache across agents whose A matrices differ causes large errors.

Hidden-state caching (other methods) can blow memory on GQA-style models and cause OOM.
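The first failure mode can be reproduced on synthetic data. The sketch below reuses one agent's LR cache for a second agent and measures the error in the second agent's adapter term, first with different A matrices and then with a shared A. All matrices are random and the shapes are toy values, so the error magnitude is illustrative only and says nothing about the accuracy loss on a real model.

```python
import numpy as np


def relative_error(approx: np.ndarray, exact: np.ndarray) -> float:
    """Frobenius-norm error of approx relative to exact."""
    return float(np.linalg.norm(approx - exact) / np.linalg.norm(exact))


rng = np.random.default_rng(0)
seq_len, d_model, d_value, rank = 32, 64, 32, 4  # toy sizes (assumed)

X = rng.standard_normal((seq_len, d_model))
A1 = rng.standard_normal((d_model, rank))  # agent 1 down-projection
A2 = rng.standard_normal((d_model, rank))  # agent 2 down-projection (different)
B2 = rng.standard_normal((rank, d_value))  # agent 2 up-projection

# Case 1: agent 2 reuses an LR cache that was built with agent 1's A.
exact_adapter_term = X @ A2 @ B2
reused_lr_cache = X @ A1
mismatch_error = relative_error(reused_lr_cache @ B2, exact_adapter_term)

# Case 2: both agents were trained with the same A (shared-A LoRA).
A_shared = A1
exact_shared_term = X @ A_shared @ B2
matched_error = relative_error((X @ A_shared) @ B2, exact_shared_term)

print(f"mismatched A: {mismatch_error:.3f}, shared A: {matched_error:.1e}")
```

With unrelated A matrices the reused cache carries essentially no information about the second agent's adapter term, which is consistent with the requirement that BaseLRShared be paired with shared-A LoRA.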

Core Entities

Models

LLaMA-3.1-8B-Instruct, Ministral-8B-Instruct

Metrics

Accuracy, throughput (tokens/sec), TTFT (s), memory (GB), cosine similarity

Datasets

HotpotQA, ScienceQA

Benchmarks

Multi-hop agent QA (HotpotQA split), ScienceQA split

Context Entities

Models

LLaMA-2-70B-Chat (trajectory synthesis)

Datasets

AutoAct synthetic agent trajectories
