G-Memory: a plug‑in three-tier graph memory that helps multi-agent teams learn from past collaborations

June 9, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.65

Citation Count

0

Authors

Guibin Zhang, Muxin Fu, Guancheng Wan, Miao Yu, Kun Wang, Shuicheng Yan

Links

Abstract / PDF

Why It Matters For Business

G‑Memory turns multi‑agent systems from static workflows into systems that learn from past team interactions. The reported gains (up to +20.89% task success and +10.12% QA accuracy) come without redesigning existing MAS frameworks.

Summary TLDR

G-Memory is a plug‑and‑play hierarchical memory layer for LLM-based multi-agent systems (MAS). It stores past interactions at three levels—interaction (utterances), query (task records), and insight (distilled lessons)—and retrieves role‑specific memory via upward (to insights) and downward (to core trajectories) traversals. Across five benchmarks and three MAS frameworks, adding G-Memory improves embodied action success and QA accuracy substantially (up to +20.89% and +10.12% on evaluated benchmarks) while keeping token cost modest. The code is public.

Problem Statement

Current MAS lack expressive, agent‑aware memory. Existing memories are simplistic (only inside‑trial logs or final artifacts) and fail on long multi‑agent trajectories. This prevents MAS from learning from past collaborations and improving over time.

Main Contribution

Diagnose a bottleneck: multi-agent systems lack hierarchical, role‑aware cross‑trial memory.

Propose G-Memory: a three‑tier, graph‑based memory (insight, query, interaction) with bi‑directional traversal and agent‑specific filtering.

Evaluate extensively: plug‑and‑play gains across three MAS frameworks, three LLMs, and five benchmarks, plus ablations and cost analysis.

Key Findings

Largest observed task gain: G-Memory raised embodied action success rate by 20.89 percentage points.

Numbers: ALFWorld (MacNet, Qwen‑2.5‑14b): 58.21% → 79.10% (+20.89%)

Knowledge QA accuracy improved by up to +10.12% on evaluated datasets.

Numbers: HotpotQA (AutoGen, Qwen‑2.5‑14b): 24.49% → 34.61% (+10.12%)

G‑Memory is token‑efficient versus alternatives.

Numbers: PDDL + AutoGen: +10.32% performance for ~1.4e6 extra tokens, versus MetaGPT‑M, which used 2.2e6 extra tokens for a +4.07% gain
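One way to read these figures is as gain per million extra tokens. The short calculation below uses only the numbers quoted above; the helper name is illustrative, and the token counts are the approximate values given here, so treat the result as a rough comparison rather than a precise measurement.

```python
def gain_per_million_tokens(gain: float, extra_tokens: float) -> float:
    """Reported performance gain obtained per one million extra tokens."""
    return gain / (extra_tokens / 1e6)


# Figures as quoted in the finding above (PDDL with AutoGen).
g_memory = gain_per_million_tokens(10.32, 1.4e6)
metagpt_m = gain_per_million_tokens(4.07, 2.2e6)

print(f"G-Memory:  {g_memory:.2f} per 1M extra tokens")
print(f"MetaGPT-M: {metagpt_m:.2f} per 1M extra tokens")
print(f"Ratio:     {g_memory / metagpt_m:.1f}x")
```

On these numbers G‑Memory delivers roughly four times as much gain per extra token as the MetaGPT‑M memory.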

Both high‑level insights and fine‑grained interactions matter; removing interactions hurts most.

Numbers: Ablation (AutoGen): removing interactions → average −4.47%; removing insights → −3.95%

Results

success rate

Value: 79.10%

Baseline: 58.21%

Accuracy

Value: 34.61%

Baseline: 24.49%

avg performance uplift (selected configs)

Value: +6.8%

Baseline: best single/multi-agent baselines

token cost vs performance

Value: ~+1.4e6 tokens

Baseline: no-memory / alternatives

What To Try In 7 Days

Integrate G‑Memory as a plug‑in to one MAS (AutoGen/DyLAN/MacNet) on a development branch.

Run a small replay of past tasks and inspect retrieved insight and trajectory snippets.

Compare end‑to‑end task success and token usage vs current memory baseline over 50 trials.
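The comparison in the last step can be kept very simple. The following is a minimal sketch, assuming each trial is logged as a success flag plus a total token count; the `TrialResult` type, the function names, and the placeholder data are all hypothetical and not part of G‑Memory or any MAS framework.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TrialResult:
    success: bool     # did the agent team complete the task?
    tokens_used: int  # total tokens consumed in the trial


def summarize(results: list[TrialResult]) -> dict[str, float]:
    """Aggregate success rate (in percent) and mean tokens per trial."""
    return {
        "success_rate": 100.0 * mean(1.0 if r.success else 0.0 for r in results),
        "mean_tokens": mean(r.tokens_used for r in results),
    }


def compare(baseline: list[TrialResult], candidate: list[TrialResult]) -> dict[str, float]:
    """Success-rate difference (points) and relative token overhead."""
    b, c = summarize(baseline), summarize(candidate)
    return {
        "success_delta_points": c["success_rate"] - b["success_rate"],
        "token_overhead_ratio": c["mean_tokens"] / b["mean_tokens"],
    }


# Placeholder data only: replace with logs from your own 50-trial runs.
baseline = [TrialResult(i % 2 == 0, 4000) for i in range(50)]
with_memory = [TrialResult(i % 5 != 0, 4600) for i in range(50)]
print(compare(baseline, with_memory))
```

Reporting the success delta next to the token overhead makes it easier to judge whether the memory layer pays for itself on your workload.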

Agent Features

Memory

  • interaction graph (utterances)
  • query graph (task records and topology)
  • insight graph (distilled lessons)
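To make the three tiers concrete, here is a minimal sketch of how such a memory could be laid out, with an upward traversal (queries to insights) and a downward traversal (queries to stored trajectories). All class and method names are assumptions made for illustration; they are not taken from the G‑Memory codebase, which may organize the graphs differently.

```python
from dataclasses import dataclass, field


@dataclass
class Interaction:  # lowest tier: one agent utterance
    agent_role: str
    text: str


@dataclass
class QueryRecord:  # middle tier: one past task and its trajectory
    query_id: str
    task: str
    succeeded: bool
    trajectory: list[Interaction] = field(default_factory=list)


@dataclass
class Insight:  # top tier: a lesson distilled from one or more queries
    text: str
    source_query_ids: set[str] = field(default_factory=set)


class ThreeTierMemory:
    def __init__(self) -> None:
        self.queries: dict[str, QueryRecord] = {}
        self.insights: list[Insight] = []

    def add_query(self, record: QueryRecord) -> None:
        self.queries[record.query_id] = record

    def add_insight(self, insight: Insight) -> None:
        self.insights.append(insight)

    def upward(self, query_ids: list[str]) -> list[Insight]:
        """Query tier to insight tier: insights supported by any given query."""
        wanted = set(query_ids)
        return [i for i in self.insights if i.source_query_ids & wanted]

    def downward(self, query_ids: list[str]) -> dict[str, list[Interaction]]:
        """Query tier to interaction tier: stored trajectories of known queries."""
        return {q: self.queries[q].trajectory for q in query_ids if q in self.queries}


memory = ThreeTierMemory()
memory.add_query(QueryRecord(
    "q1", "heat the mug and put it on the table", True,
    [Interaction("planner", "Find the mug, then use the microwave."),
     Interaction("actor", "go to countertop 1; take mug 1")],
))
memory.add_insight(Insight("Locate the object before choosing an appliance.", {"q1"}))

print([i.text for i in memory.upward(["q1"])])
print([u.agent_role for u in memory.downward(["q1"])["q1"]])
```

Keeping the query tier as the hub means a single retrieval of relevant past tasks can fan out in both directions: up for general lessons, down for concrete dialogue.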

Planning

  • division of labor
  • task decomposition
  • execution guidance from retrieved memory

Tool Use

  • LLM‑facilitated graph sparsifier
  • embedding retrieval with MiniLM
  • role‑specific memory filtering
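Embedding retrieval of this kind usually reduces to a cosine-similarity top-k search over stored query vectors. The sketch below uses hand-written toy vectors so it runs without any model; in a real setup the vectors would come from a sentence-embedding model such as the MiniLM model listed under Models. The function names are illustrative.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], stored: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k stored vectors most similar to query_vec."""
    ranked = sorted(stored.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [query_id for query_id, _ in ranked[:k]]


# Toy embeddings standing in for encoded past queries.
stored = {
    "q1": [0.9, 0.1, 0.0],
    "q2": [0.1, 0.9, 0.0],
    "q3": [0.7, 0.3, 0.1],
}
print(top_k([1.0, 0.0, 0.0], stored, k=2))  # q1 and q3 point closest to the query
```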

Frameworks

  • AutoGen
  • DyLAN
  • MacNet

Is Agentic

true

Architectures

  • three-tier hierarchical graph memory (insight/query/interaction)

Collaboration

  • role-specific memory cues
  • multi-agent coordination via retrieved lessons

Optimization Features

Token Efficiency

  • retrieval + sparsification to limit context fed to LLMs
  • 1‑hop expansion and small k (1–2) recommended
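The 1‑hop expansion can be pictured as adding only the direct neighbors of the retrieved queries. This is a minimal sketch, assuming the query graph is available as an adjacency map from query id to linked query ids; that representation and the function name are assumptions for illustration.

```python
def one_hop_expand(seeds: list[str], edges: dict[str, set[str]]) -> set[str]:
    """Return the seed queries plus every query directly linked to a seed."""
    expanded = set(seeds)
    for node in seeds:
        expanded.update(edges.get(node, ()))
    return expanded


# Toy query graph: q6 is two hops from q1 and must not be pulled in.
edges = {
    "q1": {"q4"},
    "q3": {"q4", "q5"},
    "q4": {"q1", "q3", "q6"},
    "q5": {"q3"},
    "q6": {"q4"},
}
print(sorted(one_hop_expand(["q1", "q3"], edges)))
```

Stopping at one hop with a small seed set keeps the retrieved context bounded; each further hop multiplies the number of candidate memories.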

System Optimization

  • hop expansion and LLM scoring for query relevance
  • agent‑wise filtering Φ to assign role‑specific cues
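Agent‑wise filtering can be thought of as a function that takes an agent's role and the retrieved cues and returns only the cues relevant to that role. The sketch below substitutes a plain word-overlap score for whatever scoring the actual system uses, so it is a stand-in technique and not the method behind Φ; the stop-word list, role descriptions, and function names are all hypothetical.

```python
import re

STOP_WORDS = {"a", "and", "in", "into", "of", "the", "to"}


def content_words(text: str) -> set[str]:
    """Lowercased alphabetic words with a few stop words removed."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOP_WORDS


def filter_for_role(role_description: str, cues: list[str], max_cues: int = 1) -> list[str]:
    """Keep up to max_cues cues that share at least one word with the role."""
    role_words = content_words(role_description)
    ranked = sorted(cues, key=lambda cue: len(content_words(cue) & role_words), reverse=True)
    return [cue for cue in ranked[:max_cues] if content_words(cue) & role_words]


roles = {
    "planner": "decompose the task into ordered steps and plan",
    "executor": "execute actions in the environment",
}
cues = [
    "Plan the steps before acting: locate the object first.",
    "Execute one environment action per turn and check the observation.",
]
for name, description in roles.items():
    print(name, "->", filter_for_role(description, cues))
```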

Inference Optimization

  • context sparsification of long trajectories

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation limited to five benchmarks (ALFWorld, ScienceWorld, PDDL, HotpotQA, FEVER); other domains (e.g., medical QA) were not tested.
  • Performance depends on base LLM quality; retrieval and distillation can amplify LLM errors.
  • Design choices (1‑hop, k∈{1,2}) were tuned; different domains may need retuning.

When Not To Use

  • Small single‑agent tasks where MAS overhead outweighs benefit.
  • Privacy‑sensitive settings where storing interaction transcripts is disallowed.
  • Applications requiring strict, auditable provenance unless additional checks are added.

Failure Modes

  • Retrieval noise from excessive hop expansion introducing irrelevant insights.
  • Compressed interaction graphs missing critical dialogue steps (sparsifier errors).
  • Memory amplifying hallucinated or adversarial LLM outputs.

Core Entities

Models

  • Qwen-2.5-7b
  • Qwen-2.5-14b
  • gpt-4o-mini
  • all-MiniLM-L6-v2

Metrics

  • success rate
  • progress rate
  • Accuracy
  • token cost

Datasets

  • ALFWorld
  • ScienceWorld
  • PDDL (AgentBoard)
  • HotpotQA
  • FEVER

Benchmarks

  • ALFWorld
  • ScienceWorld
  • PDDL
  • HotpotQA
  • FEVER