Share the common KV cache across LoRA-adapted agents and keep tiny low-rank adapters to cut memory and speed up multi-agent inference.

February 1, 2026 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Hyesung Jeon, Hyeongju Ha, Jae-Joon Kim

Why It Matters For Business

For multi-role LLM products, LRAgent cuts GPU memory and latency by sharing common KV content and keeping tiny per-role adapter state, enabling more agents or longer contexts with the same hardware.

Summary TLDR

Multi-role LLM agents often fine-tune lightweight LoRA adapters on a shared backbone, but each agent builds its own KV cache for long shared contexts. LRAgent splits the value cache into a shared base cache (from frozen pretrained weights) and small adapter-dependent low-rank (LR) caches. Two schemes, BaseShared (share the base cache, keep per-agent LR caches) and BaseLRShared (share both under shared-A LoRA), together with a Flash-LoRA-Attention kernel, speed up inference and cut memory while keeping accuracy near the non-shared baseline on HotpotQA and ScienceQA.
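The split follows from the linearity of the LoRA update. As a sketch, using notation that is assumed here rather than taken from the paper (X for the hidden states of the shared context with n tokens, W_v for the frozen value projection, A_i and B_i for agent i's rank-r LoRA down- and up-projections, with any LoRA scaling folded into B_i):

```latex
V_i \;=\; X\,(W_v + A_i B_i)
    \;=\; \underbrace{X W_v}_{\text{base cache (agent-independent)}}
    \;+\; \underbrace{(X A_i)}_{\text{LR cache, } n \times r}\, B_i
```

Under BaseShared, X W_v would be stored once and each agent would keep its own X A_i. With a shared down-projection (A_i = A for all agents), X A is the same for every agent as well, which corresponds to BaseLRShared. This sketch treats X as identical across agents; the reported base-cache similarity (≈ 0.9726) suggests that holds only approximately in practice.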

Problem Statement

Multi-LLM agent systems using multi-LoRA adapters duplicate KV caches and recompute shared context per agent, causing high GPU memory use and redundant compute for long, tool-augmented trajectories.

Main Contribution

Decompose the value cache into a shared base cache and a low-rank adapter cache (LR cache) for multi-LoRA agents.

Design BaseShared (share base cache, store per-agent LR caches) and BaseLRShared (share both base and LR caches under shared-A LoRA).

Introduce Flash-LoRA-Attention, a kernel that reorders attention math to expand LR cache at low rank and avoid full-dimension materialization.

Empirically show memory and latency improvements with small accuracy loss versus non-shared caching on HotpotQA and ScienceQA; release code.
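The reordering behind Flash-LoRA-Attention can be illustrated with a small numeric check. This is a minimal sketch of the algebraic identity only, not the authors' kernel: all names (`v_base`, `lr`, `b_up`) and the toy sizes are hypothetical, and plain Python lists stand in for GPU tensors. The naive path expands the LR cache to full dimension before attention; the reordered path applies the attention weights to the low-rank cache first and expands once at the end.

```python
import math
import random

def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices with the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d, r = 6, 8, 2                 # cached tokens, head dim, LoRA rank (toy sizes)

q = rand_matrix(1, d, rng)        # one decode-step query
k = rand_matrix(n, d, rng)        # cached keys
v_base = rand_matrix(n, d, rng)   # shared base value cache (n x d)
lr = rand_matrix(n, r, rng)       # low-rank LR cache (n x r)
b_up = rand_matrix(r, d, rng)     # LoRA up-projection B (r x d)

# Attention probabilities for the single query: softmax(q k^T / sqrt(d)).
k_t = [list(col) for col in zip(*k)]
scores = matmul(q, k_t)
probs = [softmax([s / math.sqrt(d) for s in scores[0]])]

# Naive: materialize the full n x d adapter values, then attend.
naive = matmul(probs, add(v_base, matmul(lr, b_up)))

# Reordered: attend over the n x r LR cache, then expand the 1 x r result.
reordered = add(matmul(probs, v_base), matmul(matmul(probs, lr), b_up))

max_diff = max(abs(x - y) for x, y in zip(naive[0], reordered[0]))
```

Both paths give the same output up to floating-point rounding, but the reordered one never builds the n × d product `lr @ b_up`, which is the "avoid full-dimension materialization" point in the contribution list.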

Key Findings

Base cache activations are highly similar across agents while adapter outputs are near-orthogonal.

Numbers: base cache cosine ≈ 0.9726; adapter output cosine ≈ 0.0538 (LLaMA-3.1-8B)

BaseShared and BaseLRShared preserve accuracy close to non-shared caching.

Numbers: Avg accuracy drop ≤ 0.7% (BaseShared) and ≤ 1.5% (BaseLRShared) vs Non-Shared

Memory and throughput gains are substantial at long contexts when using LR sharing and Flash-LoRA-Attention.

Numbers: 66.4k tokens: Non-Shared 39.84GB → BaseLRShared 23.35GB (−41%); throughput up to 2.46× with FLA

Time-to-first-token (TTFT) drops significantly with BaseLRShared.

Numbers: TTFT reduction up to 4.44× compared to baseline
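The first finding compares cache contents across agents by cosine similarity. One way such a comparison could be set up is sketched below; the per-token averaging in `mean_tokenwise_cosine` is an assumption (this summary does not say how the paper aggregates), and the toy rows are illustrative values, not data from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_tokenwise_cosine(cache_a, cache_b):
    """Average cosine similarity over matching token rows of two caches."""
    sims = [cosine(row_a, row_b) for row_a, row_b in zip(cache_a, cache_b)]
    return sum(sims) / len(sims)

# Toy rows only: two agents' base-path outputs that nearly coincide ...
base_agent1 = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
base_agent2 = [[1.1, 1.9, 3.2], [0.4, -1.1, 2.1]]
# ... and adapter-path outputs that point in orthogonal directions.
adapter_agent1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
adapter_agent2 = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

base_sim = mean_tokenwise_cosine(base_agent1, base_agent2)
adapter_sim = mean_tokenwise_cosine(adapter_agent1, adapter_agent2)
```

A high value for the base path and a near-zero value for the adapter path is the pattern that motivates sharing the base cache while treating the adapter contribution separately.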

Results

base cache vs adapter cosine similarity

Value: base ≈ 0.9726, adapter ≈ 0.0538 (LLaMA-3.1-8B)

Accuracy

Value: Non-Shared 38.88% → BaseShared 38.60% (−0.28) → BaseLRShared 37.92% (−0.97)

Baseline: Non-Shared

memory usage (GB) at long context

Value: Non-Shared 39.84GB → BaseLRShared 23.35GB

Baseline: Non-Shared

Throughput speedup

Value: up to 2.46× (BaseLRShared with FLA)

Baseline: BaseLRShared without FLA

TTFT reduction

Value: TTFT reduced up to 4.44× (BaseLRShared)

Baseline: Non-Shared
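The memory figures above can be sanity-checked with simple arithmetic. The sketch below recomputes the reported reduction and gives a rough per-agent cache size estimate. The LLaMA-3.1-8B shape (32 layers, 8 KV heads, head dimension 128) and fp16 storage are used as inputs; the rank of 16, the reading of "66.4k" as 66,400 tokens, and the assumption of one rank-r LR tensor per layer are hypothetical choices for illustration, since this summary does not state them.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes for one agent's full KV cache (keys and values in every layer)."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem

def lr_cache_bytes(tokens, layers, rank, bytes_per_elem=2):
    """Bytes for one agent's LR cache, assuming one rank-r tensor per layer."""
    return tokens * layers * rank * bytes_per_elem

def percent_reduction(baseline, new):
    """Relative saving of `new` versus `baseline`, in percent."""
    return 100.0 * (baseline - new) / baseline

# Reported totals from the results above (GB).
reported_reduction = percent_reduction(39.84, 23.35)

# Rough per-agent estimate under the assumptions named in the lead-in.
tokens = 66_400   # assumed reading of "66.4k tokens"
per_agent_kv = kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128)
per_agent_lr = lr_cache_bytes(tokens, layers=32, rank=16)  # rank is hypothetical
ratio = per_agent_kv / per_agent_lr
```

These estimates are not meant to reproduce the reported 39.84 GB and 23.35 GB totals, which depend on factors this summary does not state, such as the number of agents and what else is resident in GPU memory. They only indicate the order of magnitude: under these assumptions a per-agent LR cache is far smaller than a per-agent KV cache.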

Who Should Care

What To Try In 7 Days

Switch to BaseShared to share base KV cache and store per-role LR caches for immediate memory savings.

If training allows, adopt shared-A LoRA to enable BaseLRShared and reduce redundant prefill compute.

Run the authors' released code, including the Flash-LoRA-Attention kernel, to test throughput and TTFT improvements on your long-context traces.

Agent Features

Memory

  • KV cache sharing
  • LR cache (low-rank intermediate activations)
  • hidden-state cache (baseline DroidSpeak comparison)

Planning

  • multi-hop planning (plan/action/reflect)

Tool Use

  • Web search API
  • Wikipedia lookup
  • image caption lookup

Frameworks

  • AutoAct

Is Agentic

true

Architectures

  • LoRA

Collaboration

  • role-specialized agents with shared backbone

Optimization Features

Token Efficiency

  • reduces redundant prefill across agents

Infra Optimization

  • lower GPU memory footprint for long contexts

Model Optimization

  • LoRA

System Optimization

  • attention-kernel level optimization to minimize LR expand cost

Training Optimization

  • shared-A down-projection to improve generalization

Inference Optimization

  • BaseShared and BaseLRShared caching strategies
  • LoRA

Reproducibility

Data Urls

  • HotpotQA: https://aclanthology.org/D18-1259/
  • ScienceQA: https://arxiv.org/abs/2210.XXXX (ScienceQA ref in paper)

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • BaseLRShared requires shared-A LoRA (same down-projection A); without it accuracy can drop.
  • LR expansion still adds compute; gains depend on rank r and model head sizes.
  • Some baselines like FullShared can change behavior and harm accuracy; accuracy depends on dataset and agent setup.

When Not To Use

  • When you cannot standardize the LoRA down-projection A across agents.
  • If you cannot run a custom attention kernel (Flash-LoRA-Attention) on your infra.
  • For very small models where KV cache is not the bottleneck.

Failure Modes

  • Mismatched A matrices: sharing LR cache with different As causes large errors.
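  • Hidden-state caching (used by other methods such as DroidSpeak) can need more memory than KV caching on GQA-style models, where hidden states are wider than the grouped key/value tensors, and can cause out-of-memory errors.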
  • If LoRA rank r is large, LR expansion cost reduces benefits.
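The mismatched-A failure mode can be reproduced on toy data. The sketch below is illustrative only: the matrices are random, the names (`a_shared`, `a_other`, `b_agent2`) are hypothetical, and it is not the authors' implementation. It compares what agent 2 would compute from its own weights against what it gets when it reuses an LR cache built by agent 1.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

rng = random.Random(1)
n, d, r = 5, 8, 2                    # tokens, model dim, LoRA rank (toy sizes)

x = rand_matrix(n, d, rng)           # shared context hidden states
a_shared = rand_matrix(d, r, rng)    # down-projection used by agent 1
a_other = rand_matrix(d, r, rng)     # a different down-projection
b_agent2 = rand_matrix(r, d, rng)    # agent 2's up-projection

# LR cache written by agent 1 during prefill.
lr_cache_from_agent1 = matmul(x, a_shared)

# Case 1: agent 2 was trained with the same A, so reuse is exact.
exact_same_a = matmul(matmul(x, a_shared), b_agent2)
reused = matmul(lr_cache_from_agent1, b_agent2)
err_shared = max_abs_diff(exact_same_a, reused)

# Case 2: agent 2 was trained with a different A, so reuse is wrong.
exact_other_a = matmul(matmul(x, a_other), b_agent2)
err_mismatch = max_abs_diff(exact_other_a, reused)
```

With a shared A the reused cache matches agent 2's own computation, while with different A matrices the adapter contribution is computed from the wrong low-rank projection. This is why BaseLRShared depends on shared-A LoRA training.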

Core Entities

Models

  • LLaMA-3.1-8B-Instruct
  • Ministral-8B-Instruct

Metrics

  • Accuracy
  • throughput (tokens/sec)
  • TTFT (s)
  • memory (GB)
  • cosine similarity

Datasets

  • HotpotQA
  • ScienceQA

Benchmarks

  • Multi-hop agent QA (HotpotQA split)
  • ScienceQA split

Context Entities

Models

  • LLaMA-2-70B-Chat (trajectory synthesis)

Datasets

  • AutoAct synthetic agent trajectories