Overview
The method is practical: code is available and experiments show strong gains in text games and QA; real-world use needs engineering to handle noisy extraction, multimodal inputs, and production LLM costs.
Citations: 4
Evidence Strength: 0.78
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 5/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 75%
Why It Matters For Business
Structured, updateable graph memory lets LLM agents remember facts and episodes efficiently, improving long-horizon planning while reducing costly prompt tokens compared to heavy RAG systems.
Summary TLDR
AriGraph builds a dynamic memory graph that fuses a semantic knowledge graph (facts as triplets) with episodic vertices/edges (raw observations). An LLM agent (Ariadne) uses graph-based semantic search plus episodic lookup to plan and act in text games. AriGraph consistently outperforms unstructured memory baselines and strong RL baselines in TextWorld and achieves competitive multi-hop QA results while using far fewer tokens than some graph-RAG systems. Code is available.
Problem Statement
LLM agents need a long-term memory that supports structured retrieval, planning, and updates from interaction. Current unstructured memories (full history, RAG, summaries) scatter facts and limit planning in partially observed environments.
Main Contribution
AriGraph: a dynamic world model that stores semantic triplets (subject, relation, object) and episodic vertices/edges linking triplets to raw observations.
Ariadne agent: a cognitive pipeline that separates memory retrieval, planning (produces sub-goals), and ReAct-based decision making, using AriGraph for memory.
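A minimal sketch of the memory structure described above may make the two layers easier to picture. It is not the AriGraph code: `MemoryGraph`, `add_observation`, `semantic_search`, and `episodic_lookup` are hypothetical names, triplet extraction (done by an LLM in the paper) is assumed to happen elsewhere, and exact entity matching stands in for the graph-based semantic search.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# Hypothetical names throughout; this is an illustration, not the AriGraph codebase.
Triplet = Tuple[str, str, str]  # (subject, relation, object)


@dataclass
class MemoryGraph:
    triplets: Set[Triplet] = field(default_factory=set)           # semantic layer
    episodes: List[str] = field(default_factory=list)             # raw observations
    links: Dict[int, Set[Triplet]] = field(default_factory=dict)  # episode index -> triplets

    def add_observation(self, text: str, extracted: List[Triplet]) -> int:
        """Store a raw observation and link it to the triplets extracted from it."""
        self.episodes.append(text)
        idx = len(self.episodes) - 1
        self.triplets.update(extracted)
        self.links[idx] = set(extracted)
        return idx

    def semantic_search(self, entity: str) -> Set[Triplet]:
        """Return triplets whose subject or object equals the entity (simplified)."""
        return {t for t in self.triplets if entity in (t[0], t[2])}

    def episodic_lookup(self, found: Set[Triplet], k: int = 2) -> List[str]:
        """Return up to k episodes, ranked by how many of the found triplets they link to."""
        scored = [(len(ts & found), i) for i, ts in self.links.items()]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
        return [self.episodes[i] for _, i in ranked[:k]]
```

In use, an agent would call `add_observation` after each environment step, then feed the output of `semantic_search` into `episodic_lookup` to pull back the raw observations that support the retrieved facts.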
Key Findings
On Treasure Hunt (TextWorld), AriGraph achieved the full normalized score of 1.0, while Full History scored 0.47.
With restricted local observations in NetHack, Ariadne using AriGraph reached a score of 593±202, versus 675±130 for an oracle agent with full level information.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Treasure Hunt normalized score | 1.0 (AriGraph) | Full History 0.47 | +0.53 | TextWorld Treasure Hunt (Table 4) | AriGraph solved Treasure Hunt variants; Table 4 | Table 4 |
| Cooking normalized score (hardest) | 0.65 (AriGraph) | Summary 0.52 / RAG 0.36 | +0.13 vs Summary | TextWorld Cooking Hardest (Table 4) | AriGraph retained higher success on complex multi-step tasks; Table 4 | Table 4 |
What To Try In 7 Days
Replace a simple vector DB memory with a compact semantic graph of facts for one agent task and measure retrieval accuracy.
Add episodic links (raw observations attached to facts) to help multi-step tasks where order matters.
Run a small QA workload comparing prompt token use between your current RAG setup and a graph-based retrieval of top triplets.
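The token comparison in the last item could start from something like the sketch below. It assumes whitespace-separated words as a rough proxy for tokens (swap in your model's tokenizer for real measurements), and the helper names and prompt layout are hypothetical, not taken from the paper.

```python
from typing import List, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


def approx_tokens(text: str) -> int:
    """Crude proxy: count whitespace-separated words, not real model tokens."""
    return len(text.split())


def prompt_from_passages(question: str, passages: List[str]) -> str:
    """Prompt layout for a passage-based RAG baseline (assumed layout)."""
    return "\n".join(passages) + "\nQuestion: " + question


def prompt_from_triplets(question: str, triplets: List[Triplet]) -> str:
    """Prompt layout for graph retrieval: one 'subject, relation, object' line per fact."""
    facts = "\n".join(f"{s}, {r}, {o}" for s, r, o in triplets)
    return facts + "\nQuestion: " + question


def token_saving(question: str, passages: List[str], triplets: List[Triplet]) -> float:
    """Fraction of prompt tokens saved by the triplet prompt relative to the passage prompt."""
    base = approx_tokens(prompt_from_passages(question, passages))
    graph = approx_tokens(prompt_from_triplets(question, triplets))
    return 1 - graph / base
```

Averaging `token_saving` over a few dozen questions, alongside answer accuracy for both prompts, gives a first read on whether the graph retrieval pays for itself.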
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Optimization Features
Token Efficiency
Reproducibility
Risks & Boundaries
Limitations
Extraction depends on LLM quality; lower-quality LLMs build worse graphs (the paper shows that graph growth rate varies with the LLM used).
Evaluations are text-only; no multimodal sensors tested.
When Not To Use
Real-time or high-frequency sensor streams where graph extraction latency is too high.
Multimodal environments until multimodal extraction is added.
Failure Modes
Incorrect or missing triplets from noisy observations lead to wrong plans.
Overly aggressive outdated-triplet replacement can drop still-relevant facts.
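One way to guard against the second failure mode is to replace a stored triplet only when it conflicts on a relation that can hold a single value at a time. The sketch below is a rule-based stand-in under that assumption, not the paper's update procedure; `update_triplets` and the `single_valued` set are hypothetical.

```python
from typing import Iterable, Set, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


def update_triplets(
    existing: Set[Triplet],
    new: Iterable[Triplet],
    single_valued: Set[str],
) -> Set[Triplet]:
    """Add new triplets; drop an old one only if it conflicts on a single-valued relation.

    `single_valued` is an assumed, hand-maintained set of relations (e.g. "located in")
    for which a subject can have only one object at a time.
    """
    updated = set(existing)
    for s, r, o in new:
        if r in single_valued:
            # Remove stale facts with the same subject and relation but a different object.
            updated = {t for t in updated if not (t[0] == s and t[1] == r and t[2] != o)}
        updated.add((s, r, o))
    return updated
```

Relations left out of `single_valued` only ever accumulate, so the policy errs toward keeping facts; the cost is that stale multi-valued facts need a separate cleanup step.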

