Overview
Production Readiness
0.6
Novelty Score
0.75
Cost Impact Score
0.7
Citation Count
4
Why It Matters For Business
Structured, updateable graph memory lets LLM agents remember facts and episodes efficiently, improving long-horizon planning while reducing costly prompt tokens compared to heavy RAG systems.
Summary TLDR
AriGraph builds a dynamic memory graph that fuses a semantic knowledge graph (facts as triplets) with episodic vertices/edges (raw observations). An LLM agent (Ariadne) uses graph-based semantic search plus episodic lookup to plan and act in text games. AriGraph consistently outperforms unstructured memory baselines and strong RL baselines in TextWorld and achieves competitive multi-hop QA results while using far fewer tokens than some graph-RAG systems. Code is available.
Problem Statement
LLM agents need a long-term memory that supports structured retrieval, planning, and updates from interaction. Current unstructured memories (full history, RAG, summaries) scatter facts and limit planning in partially observed environments.
Main Contribution
AriGraph: a dynamic world model that stores semantic triplets (subject, relation, object) and episodic vertices/edges linking triplets to raw observations.
Ariadne agent: a cognitive pipeline that separates memory retrieval, planning (produces sub-goals), and ReAct-based decision making, using AriGraph for memory.
Empirical results showing AriGraph improves navigation, planning and exploration in TextWorld and NetHack and is competitive on multi-hop QA with much lower token cost than some baselines.
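The two-layer memory described above can be pictured with a small data structure. This is a minimal sketch, not the authors' implementation: the class name `MemoryGraph`, its methods, and the example observations are hypothetical, and triplet extraction (done by an LLM in the paper) is assumed to have already happened.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Set, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


@dataclass
class MemoryGraph:
    """Toy memory: semantic triplets plus episodic links to raw observations."""

    # Semantic layer: the set of known facts.
    triplets: Set[Triplet] = field(default_factory=set)
    # Episodic layer: one entry per step, pairing the raw observation
    # with the triplets that were extracted from it.
    episodes: List[Tuple[str, FrozenSet[Triplet]]] = field(default_factory=list)

    def add_observation(self, observation: str, extracted: List[Triplet]) -> None:
        """Store the extracted facts and link them to the raw observation."""
        self.triplets.update(extracted)
        self.episodes.append((observation, frozenset(extracted)))

    def episodes_for(self, triplet: Triplet) -> List[str]:
        """Return raw observations linked to a fact, in the order they were seen."""
        return [obs for obs, linked in self.episodes if triplet in linked]


memory = MemoryGraph()
memory.add_observation(
    "You are in the kitchen. A key lies on the table.",
    [("key", "is on", "table"), ("table", "is in", "kitchen")],
)
memory.add_observation(
    "You walk north into the hall.",
    [("hall", "north of", "kitchen")],
)
print(memory.episodes_for(("key", "is on", "table")))
```

Keeping the raw observation next to the facts it produced lets a retrieved triplet lead back to the full episode when the compact fact alone is not enough.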
Key Findings
On Treasure Hunt (TextWorld), AriGraph achieved the maximum normalized score, while the Full History baseline scored 0.47.
In NetHack with restricted local observations, Ariadne with AriGraph scored 593±202, compared with 675±130 for an oracle agent given full level information.
On HotpotQA, AriGraph (GPT-4) obtained EM 68.0 and F1 74.7 using ~11k prompt tokens, versus ~115k prompt tokens for GraphRAG.
Results
Treasure Hunt normalized score
Cooking normalized score (hardest)
NetHack average score
HotpotQA
Prompt tokens (QA)
Who Should Care
What To Try In 7 Days
Replace a simple vector DB memory with a compact semantic graph of facts for one agent task and measure retrieval accuracy.
Add episodic links (raw observations attached to facts) to help multi-step tasks where order matters.
Run a small QA workload comparing prompt token use between your current RAG setup and a graph-based retrieval of top triplets.
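For the token comparison in the last item, a rough harness might look like the following. It is a sketch under stated assumptions: `approx_tokens` is a crude whitespace count standing in for your model's real tokenizer, and the question and context strings are made-up examples, not data from the paper.

```python
from typing import List


def approx_tokens(text: str) -> int:
    """Crude whitespace token count; swap in your model's tokenizer for real numbers."""
    return len(text.split())


def prompt_cost(question: str, context_chunks: List[str]) -> int:
    """Approximate prompt size: the question plus every retrieved context chunk."""
    return approx_tokens(question) + sum(approx_tokens(c) for c in context_chunks)


question = "Where is the key?"

# Hypothetical retrieval results for the same question.
rag_chunks = [
    "You are in the kitchen. A key lies on the table next to a red apple.",
    "The hall is north of the kitchen. It is empty and quiet.",
]
graph_chunks = [
    "key, is on, table",
    "table, is in, kitchen",
]

rag_cost = prompt_cost(question, rag_chunks)
graph_cost = prompt_cost(question, graph_chunks)
print(f"RAG prompt: {rag_cost} tokens, graph prompt: {graph_cost} tokens")
print(f"Reduction factor: {rag_cost / graph_cost:.1f}x")
```

Run the same measurement over the whole QA workload and report answer quality alongside it, since a smaller prompt only helps if accuracy holds.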
Agent Features
Memory
- Semantic graph of triplets (subject, relation, object) holding structured facts
- Episodic vertices store raw observations and connect to triplets
- Graph search: pretrained retriever + BFS-like expansion (depth d, width w)
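The graph search bullet can be sketched as a breadth-first expansion bounded by depth d and width w. This is an illustration only: the `score` function is a token-overlap stand-in for a pretrained retriever such as Contriever, and `graph_search` and the sample triplets are hypothetical names and data, not the paper's code.

```python
from typing import List, Set, Tuple

Triplet = Tuple[str, str, str]


def score(query: str, triplet: Triplet) -> float:
    """Token-overlap stand-in for a pretrained retriever's similarity score."""
    q = set(query.lower().split())
    t = set(" ".join(triplet).lower().split())
    return len(q & t) / max(len(t), 1)


def graph_search(query: str, triplets: List[Triplet], depth: int, width: int) -> List[Triplet]:
    """BFS-like retrieval: keep the top `width` matches per query, then
    use the entities of each match as new queries, for `depth` rounds."""
    found: List[Triplet] = []
    seen: Set[Triplet] = set()
    frontier = [query]
    for _ in range(depth):
        next_frontier: List[str] = []
        for q in frontier:
            candidates = [t for t in triplets if t not in seen]
            ranked = sorted(candidates, key=lambda t: score(q, t), reverse=True)
            for t in ranked[:width]:
                if score(q, t) == 0:
                    continue  # ignore triplets with no relation to the query
                seen.add(t)
                found.append(t)
                next_frontier.extend([t[0], t[2]])  # expand via subject and object
        frontier = next_frontier
    return found


facts = [
    ("key", "is on", "table"),
    ("table", "is in", "kitchen"),
    ("kitchen", "south of", "hall"),
    ("apple", "is in", "fridge"),
]
print(graph_search("key", facts, depth=2, width=2))
```

Raising the depth pulls in facts further from the query (here, a third round would reach the kitchen's location), while the width caps how many triplets each query can add, which bounds the prompt size.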
Planning
- Separate planner creates sub-goals from retrieved memory
- ReAct-style decision module executes actions and explains rationale
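The separation between retrieval, planning, and decision making can be sketched as three pluggable callables. Everything here is hypothetical: `agent_step` and the `toy_*` stubs are illustrative names, and in a real agent each stub would be replaced by a memory search or an LLM call.

```python
from typing import Callable, List, Tuple

Retriever = Callable[[str], List[str]]
Planner = Callable[[str, List[str]], List[str]]
Decider = Callable[[str, List[str], List[str]], Tuple[str, str]]


def agent_step(
    observation: str, retrieve: Retriever, plan: Planner, decide: Decider
) -> Tuple[str, str]:
    """One step of the loop: retrieve memory, plan sub-goals, then pick an action."""
    facts = retrieve(observation)
    sub_goals = plan(observation, facts)
    thought, action = decide(observation, facts, sub_goals)
    return thought, action


# Hypothetical stubs standing in for graph search and LLM calls.
def toy_retrieve(observation: str) -> List[str]:
    return ["key, is on, table", "table, is in, kitchen"]


def toy_plan(observation: str, facts: List[str]) -> List[str]:
    return ["go to kitchen", "take key"]


def toy_decide(observation: str, facts: List[str], sub_goals: List[str]) -> Tuple[str, str]:
    goal = sub_goals[0]
    return (f"My first sub-goal is '{goal}'.", goal)


thought, action = agent_step("You are in the hall.", toy_retrieve, toy_plan, toy_decide)
print(thought)
print(action)
```

Because each stage only sees the previous stage's output, a stage can be swapped or evaluated in isolation, for example by replacing the retriever while keeping the planner fixed.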
Tool Use
- 'go to location' navigation action derived from graph spatial relations
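One way to derive a 'go to location' action from spatial relations is a shortest-path search over the rooms in the graph. The sketch below assumes spatial triplets of the form (room, "<direction> of", room); that relation format, the `OPPOSITE` table, and the function names are assumptions made for illustration.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Triplet = Tuple[str, str, str]
OPPOSITE = {"north": "south", "south": "north", "east": "west", "west": "east"}


def build_exits(triplets: List[Triplet]) -> Dict[str, Dict[str, str]]:
    """Turn ("hall", "north of", "kitchen") into kitchen -north-> hall and hall -south-> kitchen."""
    exits: Dict[str, Dict[str, str]] = {}
    for subj, rel, obj in triplets:
        direction = rel[:-3] if rel.endswith(" of") else rel
        if direction in OPPOSITE:  # skip non-spatial facts such as "is on"
            exits.setdefault(obj, {})[direction] = subj
            exits.setdefault(subj, {})[OPPOSITE[direction]] = obj
    return exits


def go_to(start: str, goal: str, triplets: List[Triplet]) -> Optional[List[str]]:
    """Breadth-first search for the shortest list of moves from start to goal."""
    exits = build_exits(triplets)
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        room, path = queue.popleft()
        if room == goal:
            return path
        for direction, nxt in exits.get(room, {}).items():
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [direction]))
    return None  # goal is not reachable from known spatial facts


world = [
    ("hall", "north of", "kitchen"),
    ("garden", "east of", "hall"),
    ("key", "is on", "table"),
]
print(go_to("kitchen", "garden", world))
```

Returning `None` for an unreachable goal gives the agent a clear signal to explore rather than follow a stale or incomplete map.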
Frameworks
- Contriever (edge retrieval)
- BGE-M3 (QA encoding)
- NetPlay pipeline for NetHack
Is Agentic
true
Architectures
- Ariadne (planning + ReAct decision loop)
- AriGraph memory (semantic graph + episodic vertices/edges)
Optimization Features
Token Efficiency
- Graph retrieval reduces prompt token usage compared with GraphRAG (~11k vs ~115k prompt tokens on HotpotQA)
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Extraction depends on LLM quality; lower-quality LLMs build worse graphs (the paper shows that the graph growth rate varies with the LLM used).
- Evaluations are text-only; no multimodal sensors tested.
- Triplet extraction and replacement heuristics can miss or wrongly update facts (prompt-based parsing).
- Episodic edges are hyperedges which complicate some standard graph tooling.
When Not To Use
- Real-time or high-frequency sensor streams where graph extraction latency is too high.
- Multimodal environments until multimodal extraction is added.
- When only short, stateless tasks are needed, because the graph overhead may not pay off.
Failure Modes
- Incorrect or missing triplets from noisy observations lead to wrong plans.
- Overly aggressive outdated-triplet replacement can drop still-relevant facts.
- Graph growth and synonym proliferation cause retrieval noise if not normalized.
- LLM hallucinations during triplet extraction create false facts in the graph.
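The replacement and normalization failure modes are easier to see with a concrete rule. The sketch below uses a simple hand-written heuristic (a new fact with an "exclusive" relation replaces older facts about the same subject) in place of the prompt-based replacement the source describes; `update_graph`, `normalize`, and the relation sets are hypothetical.

```python
from typing import Iterable, Set, Tuple

Triplet = Tuple[str, str, str]


def normalize(triplet: Triplet) -> Triplet:
    """Lowercase and collapse whitespace so 'Key' and 'key' map to one vertex."""
    subj, rel, obj = (" ".join(part.lower().split()) for part in triplet)
    return (subj, rel, obj)


def update_graph(
    graph: Iterable[Triplet],
    new_triplets: Iterable[Triplet],
    exclusive_relations: Set[str],
) -> Set[Triplet]:
    """Add new facts; a fact with an exclusive relation evicts older
    exclusive facts about the same subject (e.g. an object's location)."""
    updated = {normalize(t) for t in graph}
    for raw in new_triplets:
        subj, rel, obj = normalize(raw)
        if rel in exclusive_relations:
            updated = {
                t for t in updated
                if not (t[0] == subj and t[1] in exclusive_relations)
            }
        updated.add((subj, rel, obj))
    return updated


LOCATION = {"is on", "is in"}
graph = {("key", "is on", "table"), ("table", "is in", "kitchen")}
graph = update_graph(graph, [("Key", "is in", "inventory")], LOCATION)
print(sorted(graph))

# Over-aggressive rule: treating "has" as exclusive drops a still-valid fact.
inventory = update_graph({("player", "has", "map")}, [("player", "has", "key")], {"has"})
print(sorted(inventory))
```

The second call shows the over-aggressive case: once "has" is treated as exclusive, picking up the key silently erases the map from memory.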
Core Entities
Models
- GPT-4
- gpt-4-0125-preview
- GPT-4o
- GPT-4o-mini
- LLaMA-3-70B
- Contriever
- BGE-M3
Metrics
- normalized score
- EM
- F1
- average levels completed
- prompt/completion tokens
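For reference, EM and F1 for QA are commonly computed over normalized answer strings. The sketch below follows the widely used SQuAD-style convention (lowercasing, stripping punctuation and articles, token-level overlap); it is an assumption that the paper's evaluation uses the same normalization.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(f1_score("Eiffel Tower in Paris", "Eiffel Tower"))
```

Dataset-level scores are the mean of these per-question values, usually reported as percentages.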
Datasets
- TextWorld
- NetHack
- MuSiQue
- HotpotQA
Benchmarks
- Text-based games (Treasure Hunt, Cooking, Cleaning)
- NetHack
- Multi-hop QA (MuSiQue, HotpotQA)

