Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.65
Citation Count
0
Why It Matters For Business
G‑Memory turns multi‑agent systems from static workflows into systems that learn from past team interactions, giving notable accuracy and success gains without redesigning existing MAS frameworks.
Summary TLDR
G-Memory is a plug‑and‑play hierarchical memory layer for LLM-based multi-agent systems (MAS). It stores past interactions at three levels: interaction (utterances), query (task records), and insight (distilled lessons). It retrieves role‑specific memory via upward traversal (to insights) and downward traversal (to core trajectories). Across five benchmarks and three MAS frameworks, adding G-Memory improves embodied action success by up to +20.89% and QA accuracy by up to +10.12% while keeping token cost modest. The code is public.
Problem Statement
Current MAS lack expressive, agent‑aware memory. Existing memories are simplistic (only inside‑trial logs or final artifacts) and fail on long multi‑agent trajectories. This prevents MAS from learning from past collaborations and improving over time.
Main Contribution
Diagnose a bottleneck: multi-agent systems lack hierarchical, role‑aware cross‑trial memory.
Propose G-Memory: a three‑tier, graph‑based memory (insight, query, interaction) with bi‑directional traversal and agent‑specific filtering.
Evaluate extensively across three MAS frameworks, three LLMs, and five benchmarks, showing plug‑and‑play gains, with ablations and cost analysis.
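The three tiers and the two traversal directions can be pictured with a small data-structure sketch. This is an illustration only, not the authors' implementation: the class names (`QueryNode`, `InsightNode`, `HierarchicalMemory`), the field layout, and the sample records are all hypothetical. It assumes each insight keeps the ids of the queries it was distilled from, and each query keeps its own utterance log.

```python
from dataclasses import dataclass, field


@dataclass
class QueryNode:
    """One past task with its outcome and recorded utterances (hypothetical layout)."""
    query_id: str
    text: str
    succeeded: bool
    interactions: list = field(default_factory=list)  # (agent, utterance) pairs


@dataclass
class InsightNode:
    """A distilled lesson and the ids of the queries it was drawn from."""
    text: str
    supporting_queries: set = field(default_factory=set)


@dataclass
class HierarchicalMemory:
    queries: dict = field(default_factory=dict)    # query_id -> QueryNode
    insights: list = field(default_factory=list)   # InsightNode objects

    def upward(self, query_ids):
        """Upward traversal: retrieved queries -> insights they support."""
        wanted = set(query_ids)
        return [i.text for i in self.insights if i.supporting_queries & wanted]

    def downward(self, query_ids):
        """Downward traversal: retrieved queries -> their interaction records."""
        return {q: self.queries[q].interactions for q in query_ids if q in self.queries}


# Made-up sample records, for illustration only.
memory = HierarchicalMemory()
memory.queries["q1"] = QueryNode(
    "q1", "put a clean mug on the shelf", True,
    [("planner", "clean the mug first"), ("actor", "go to the sink")],
)
memory.queries["q2"] = QueryNode("q2", "heat an egg", False)
memory.insights.append(InsightNode("Clean objects before placing them.", {"q1"}))

print(memory.upward(["q1"]))
print(memory.downward(["q1"]))
```

Keeping the query tier as the hub means a single retrieval step over queries can fan out to both general lessons and concrete trajectories.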
Key Findings
Largest observed gain: G-Memory raised embodied action success rate by up to 20.89%.
Knowledge QA accuracy improved by up to +10.12% on evaluated datasets.
G‑Memory is token‑efficient versus alternatives.
Both high‑level insights and fine‑grained interactions matter; removing interactions hurts most.
Results
success rate
Accuracy
avg performance uplift (selected configs)
token cost vs performance
Who Should Care
What To Try In 7 Days
Integrate G‑Memory as a plug‑in to one MAS (AutoGen/DyLAN/MacNet) on a development branch.
Run a small replay of past tasks and inspect retrieved insight and trajectory snippets.
Compare end‑to‑end task success and token usage vs current memory baseline over 50 trials.
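For the comparison step, a small helper along the following lines may be enough. It is a sketch under stated assumptions: the `Trial` record, the function names, and the numbers in the example are hypothetical, and it assumes each trial logs only a success flag and a total token count.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trial:
    succeeded: bool
    tokens: int


def summarize(trials):
    """Return success rate (0..1) and mean tokens per trial."""
    if not trials:
        raise ValueError("need at least one trial")
    return {
        "success_rate": sum(t.succeeded for t in trials) / len(trials),
        "mean_tokens": mean(t.tokens for t in trials),
    }


def compare(baseline, with_memory):
    """Success-rate change in percentage points and relative token cost."""
    base, mem = summarize(baseline), summarize(with_memory)
    return {
        "success_delta_pts": 100 * (mem["success_rate"] - base["success_rate"]),
        "token_ratio": mem["mean_tokens"] / base["mean_tokens"],
    }


# Made-up numbers, for illustration only.
baseline = [Trial(True, 1000), Trial(False, 1000), Trial(True, 1000), Trial(False, 1000)]
with_memory = [Trial(True, 1100), Trial(True, 1100), Trial(True, 1100), Trial(False, 1100)]
report = compare(baseline, with_memory)
print(report)
```

Reporting the token ratio next to the success delta keeps the cost side of the trade-off visible when deciding whether to adopt the memory layer.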
Agent Features
Memory
- interaction graph (utterances)
- query graph (task records and topology)
- insight graph (distilled lessons)
Planning
- division of labor
- task decomposition
- execution guidance from retrieved memory
Tool Use
- LLM‑facilitated graph sparsifier
- embedding retrieval with MiniLM
- role‑specific memory filtering
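The embedding retrieval step can be sketched as a cosine-similarity top-k lookup. The example below is a minimal illustration, not the paper's code: it uses hand-written two-dimensional vectors in place of real MiniLM embeddings, and the names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, stored, k=2):
    """Return the ids of the k stored vectors most similar to query_vec."""
    ranked = sorted(stored, key=lambda qid: cosine(query_vec, stored[qid]), reverse=True)
    return ranked[:k]


# Toy vectors standing in for sentence embeddings (assumption, not real model output).
stored = {"q1": [1.0, 0.0], "q2": [0.0, 1.0], "q3": [0.9, 0.1]}
nearest = top_k([1.0, 0.05], stored, k=2)
print(nearest)
```

In a real deployment the vectors would come from an embedding model, and a vector index would replace the linear scan once the number of stored queries grows.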
Frameworks
- AutoGen
- DyLAN
- MacNet
Is Agentic
true
Architectures
- three-tier hierarchical graph memory (insight/query/interaction)
Collaboration
- role-specific memory cues
- multi-agent coordination via retrieved lessons
Optimization Features
Token Efficiency
- retrieval + sparsification to limit context fed to LLMs
- 1‑hop expansion and small k (1–2) recommended
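To see why a small hop count keeps the retrieved context compact, consider the sketch below. It assumes, hypothetically, that the query graph is stored as an adjacency mapping between related query ids; the function name `expand` and the sample graph are illustrative and not taken from the source.

```python
def expand(seeds, neighbors, hops=1):
    """Return the seed nodes plus every node reachable within `hops` edges."""
    frontier, seen = set(seeds), set(seeds)
    for _ in range(hops):
        # Collect unseen neighbors of the current frontier.
        frontier = {n for node in frontier for n in neighbors.get(node, set())} - seen
        seen |= frontier
    return seen


# Hypothetical chain of related queries: q1 - q2 - q3 - q4.
neighbors = {
    "q1": {"q2"},
    "q2": {"q1", "q3"},
    "q3": {"q2", "q4"},
    "q4": {"q3"},
}
one_hop = expand(["q1"], neighbors, hops=1)
two_hop = expand(["q1"], neighbors, hops=2)
print(sorted(one_hop), sorted(two_hop))
```

Each extra hop pulls in nodes that are less directly related to the seed queries, which is the trade-off behind preferring 1‑hop expansion with a small k.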
System Optimization
- hop expansion and LLM scoring for query relevance
- agent‑wise filtering Φ to assign role‑specific cues
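The idea of agent‑wise filtering can be illustrated with a deliberately simple stand-in. The source does not specify how Φ is computed here, so the sketch below swaps in plain keyword overlap between a role description and each insight; the function name `filter_for_role`, the keyword sets, and the sample insights are all hypothetical.

```python
def filter_for_role(role_keywords, insights, limit=2):
    """Keep up to `limit` insights sharing the most words with the role keywords.

    Keyword overlap is a stand-in scoring rule, not the method from the source.
    """
    def overlap(text):
        return len(role_keywords & set(text.lower().split()))

    scored = [(overlap(text), text) for text in insights]
    # sorted() is stable, so ties keep their original order.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [text for score, text in ranked if score > 0][:limit]


# Made-up insights, for illustration only.
insights = [
    "verify each claim against the retrieved passage",
    "plan the route before moving objects",
    "check the plan for missing steps",
]
planner_cues = filter_for_role({"plan", "route", "steps"}, insights)
verifier_cues = filter_for_role({"verify", "claim", "check"}, insights)
print(planner_cues)
print(verifier_cues)
```

The same retrieved pool yields different cues per role, which is the behavior the filtering step is meant to provide; a production system would likely use a stronger relevance signal than word overlap.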
Inference Optimization
- context sparsification of long trajectories
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation limited to five benchmarks (ALFWorld, ScienceWorld, PDDL, HotpotQA, FEVER); other domains (e.g., medical QA) not tested.
- Performance depends on base LLM quality; retrieval and distillation can amplify LLM errors.
- Design choices (1‑hop, k∈{1,2}) were tuned; different domains may need retuning.
When Not To Use
- Small single‑agent tasks where MAS overhead outweighs benefit.
- Privacy‑sensitive settings where storing interaction transcripts is disallowed.
- Applications requiring strict, auditable provenance unless additional checks are added.
Failure Modes
- Retrieval noise from excessive hop expansion introducing irrelevant insights.
- Compressed interaction graphs missing critical dialogue steps (sparsifier errors).
- Memory amplifying hallucinated or adversarial LLM outputs.
Core Entities
Models
- Qwen-2.5-7b
- Qwen-2.5-14b
- gpt-4o-mini
- all-MiniLM-L6-v2
Metrics
- success rate
- progress rate
- accuracy
- token cost
Datasets
- ALFWorld
- ScienceWorld
- PDDL (AgentBoard)
- HotpotQA
- FEVER
Benchmarks
- ALFWorld
- ScienceWorld
- PDDL
- HotpotQA
- FEVER

