AriGraph: combine a semantic knowledge graph and episodic memory so an LLM agent remembers and plans across long, partially observed text‑en

July 5, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The method is practical: code is available and experiments show strong gains in text games and QA; real-world use needs engineering to handle noisy extraction, multimodal inputs, and production LLM costs.

Citations: 4

Evidence Strength: 0.78

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 75%

Authors

Petr Anokhin, Nikita Semenov, Artyom Sorokin, Dmitry Evseev, Andrey Kravchenko, Mikhail Burtsev, Evgeny Burnaev

Links

Abstract / PDF / Code

Why It Matters For Business

Structured, updateable graph memory lets LLM agents remember facts and episodes efficiently, improving long-horizon planning while reducing costly prompt tokens compared to heavy RAG systems.

Who Should Care

Summary TLDR

AriGraph builds a dynamic memory graph that fuses a semantic knowledge graph (facts stored as triplets) with episodic vertices and edges (raw observations). An LLM agent, Ariadne, uses graph-based semantic search plus episodic lookup to plan and act in text games. AriGraph consistently outperforms unstructured memory baselines and strong RL baselines in TextWorld, and achieves competitive multi-hop QA results while using far fewer prompt tokens than full-context GraphRAG (about 11k versus 115k). Code is available.
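The memory layout described above can be illustrated with a minimal sketch. The class and method names (`Triplet`, `MemoryGraph`, `add_observation`, `episodes_for`) are hypothetical and chosen for illustration; they are not taken from the AriGraph codebase, and triplet extraction (done by an LLM in the paper) is assumed to have happened already.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triplet:
    """One semantic fact: (subject, relation, object)."""
    subject: str
    relation: str
    obj: str


@dataclass
class MemoryGraph:
    """Hypothetical container: semantic triplets plus episodic links."""
    triplets: set = field(default_factory=set)        # semantic edges
    episodes: dict = field(default_factory=dict)      # step -> raw observation
    episode_links: dict = field(default_factory=dict) # step -> triplets seen at that step

    def add_observation(self, step, observation, extracted):
        # Store the raw text as an episodic vertex and link it to its facts.
        self.episodes[step] = observation
        self.episode_links[step] = set(extracted)
        self.triplets.update(extracted)

    def episodes_for(self, triplet):
        # Return raw observations, in step order, that mention the given fact.
        return [self.episodes[s] for s in sorted(self.episode_links)
                if triplet in self.episode_links[s]]


graph = MemoryGraph()
fact = Triplet("key", "is in", "kitchen")
graph.add_observation(0, "You see a key on the kitchen table.", [fact])
print(graph.episodes_for(fact))
```

Keeping the raw observation next to each fact is what lets a retrieved triplet pull back the full episode it came from.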

Problem Statement

LLM agents need a long-term memory that supports structured retrieval, planning, and updates from interaction. Current unstructured memories (full history, RAG, summaries) scatter facts and limit planning in partially observed environments.

Main Contribution

AriGraph: a dynamic world model that stores semantic triplets (subject, relation, object) and episodic vertices/edges linking triplets to raw observations.

Ariadne agent: a cognitive pipeline that separates memory retrieval, planning (produces sub-goals), and ReAct-based decision making, using AriGraph for memory.
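The separation of retrieval, planning, and decision making can be sketched as three pluggable stages. This is only an outline under stated assumptions: the function name `ariadne_step` and the stub stages are hypothetical, and in the actual system each stage would be backed by graph search or an LLM call rather than the hard-coded stubs used here.

```python
from typing import Callable, List


def ariadne_step(
    observation: str,
    retrieve: Callable[[str], List[str]],
    plan: Callable[[str, List[str]], List[str]],
    decide: Callable[[str, List[str], List[str]], str],
) -> str:
    """Run one agent step: memory retrieval, then planning, then action choice."""
    memories = retrieve(observation)              # query the memory graph
    sub_goals = plan(observation, memories)       # planner produces sub-goals
    return decide(observation, memories, sub_goals)  # ReAct-style action selection


# Stub stages standing in for graph search and LLM calls (assumptions).
def stub_retrieve(observation):
    return ["key, is in, kitchen"]


def stub_plan(observation, memories):
    return ["find key"] if memories else ["explore"]


def stub_decide(observation, memories, sub_goals):
    return "go to kitchen" if "find key" in sub_goals else "look"


print(ariadne_step("You are in the hall.", stub_retrieve, stub_plan, stub_decide))
```

Passing the stages as callables keeps the memory backend swappable, which is useful when comparing graph memory against a RAG or summary baseline.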

Key Findings

On Treasure Hunt (TextWorld) AriGraph achieved full normalized score while Full History scored 0.47.

Numbers: AriGraph 1.0 vs Full History 0.47 (Table 4)

Practical Use: Use a structured semantic+episodic graph to solve long multi-room tasks; it outperforms storing raw history.

Evidence Ref: Table 4, Figure 3

With observations restricted to the current room in NetHack, Ariadne using AriGraph reached a score of 593±202, versus 675±130 for the NetPlay agent given full level information.

Numbers: Ariadne (Room obs) 593±202 vs NetPlay (Level obs) 675±130 (Table 1)

Practical Use: AriGraph can compensate for limited percepts; graph memory recovers much of the missing global state.

Evidence Ref: Table 1, Sec 5.2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Treasure Hunt normalized score | 1.0 (AriGraph) | Full History 0.47 | +0.53 | TextWorld Treasure Hunt (Table 4) | AriGraph solved Treasure Hunt variants; Table 4 | Table 4
Cooking normalized score (hardest) | 0.65 (AriGraph) | Summary 0.52 / RAG 0.36 | +0.13 vs Summary | TextWorld Cooking Hardest (Table 4) | AriGraph retained higher success on complex multi-step tasks; Table 4 | Table 4

What To Try In 7 Days

Replace a simple vector DB memory with a compact semantic graph of facts for one agent task and measure retrieval accuracy.

Add episodic links (raw observations attached to facts) to help multi-step tasks where order matters.

Run a small QA workload comparing prompt token use between your current RAG setup and a graph-based retrieval of top triplets.

Agent Features

Memory
Semantic graph of triplets (subject, relation, object): structured facts
Episodic vertices store raw observations and connect to triplets
Graph search: pretrained retriever + BFS-like expansion (depth d, width w)
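A rough sketch of retriever-plus-BFS graph search with depth d and width w follows. It is not the paper's implementation: the word-overlap scorer `entity_overlap` is a toy stand-in for a pretrained retriever such as Contriever, and the function names and expansion details are assumptions made for illustration.

```python
def entity_overlap(query, triplet):
    """Toy relevance score: words shared by the query and the triplet's entities."""
    query_words = set(query.lower().split())
    entity_words = set(f"{triplet[0]} {triplet[2]}".lower().split())
    return len(query_words & entity_words)


def graph_search(query, triplets, depth=2, width=2, score=entity_overlap):
    """Retrieve up to `width` triplets per query, then expand from their entities."""
    found, seen = [], set()
    frontier = [query]
    for _ in range(depth):
        next_frontier = []
        for q in frontier:
            candidates = [t for t in triplets if t not in seen]
            ranked = sorted(candidates, key=lambda t: score(q, t), reverse=True)
            for t in ranked[:width]:
                if score(q, t) == 0:
                    continue  # ignore triplets with no relevance to this query
                seen.add(t)
                found.append(t)
                # The entities of a retrieved triplet become the next queries.
                next_frontier.extend([t[0], t[2]])
        frontier = next_frontier
    return found


facts = [
    ("key", "is in", "kitchen"),
    ("kitchen", "is west of", "hall"),
    ("chest", "is in", "attic"),
]
print(graph_search("where is the key", facts))
```

In this example the first hop finds where the key is, and the second hop follows the `kitchen` entity to a spatial fact that the original query never mentioned, which is the point of the expansion step.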
Planning
Separate planner creates sub-goals from retrieved memory
ReAct-style decision module executes actions and explains rationale
Tool Use
'go to location' navigation action derived from graph spatial relations
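One way such a navigation action could be derived from spatial triplets is a breadth-first search over room adjacency. The sketch below is an assumption about how this might work, not the paper's code: the relation names (`north of`, `west of`, and so on) and the function `path_to` are hypothetical.

```python
from collections import deque

SPATIAL = ("north of", "south of", "east of", "west of")  # assumed relation names


def path_to(start, goal, triplets, spatial=SPATIAL):
    """Return the shortest room sequence from start to goal, or None if unreachable."""
    adjacency = {}
    for subject, relation, obj in triplets:
        if relation in spatial:
            # Spatial relations are treated as two-way connections between rooms.
            adjacency.setdefault(subject, set()).add(obj)
            adjacency.setdefault(obj, set()).add(subject)
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for room in sorted(adjacency.get(path[-1], ())):
            if room not in visited:
                visited.add(room)
                queue.append(path + [room])
    return None


facts = [
    ("kitchen", "west of", "hall"),
    ("hall", "south of", "attic"),
    ("key", "is in", "kitchen"),  # non-spatial fact, ignored by the search
]
print(path_to("kitchen", "attic", facts))
```

Collapsing a multi-step route into a single action saves the agent from spending one LLM call per room transition.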
Frameworks
Contriever (edge retrieval)
BGE-M3 (QA encoding)
NetPlay pipeline for NetHack
Is Agentic

Yes

Architectures
Ariadne (planning + ReAct decision loop)
AriGraph memory (semantic graph + episodic vertices/edges)

Optimization Features

Token Efficiency
Graph retrieval reduces prompt token usage vs full-context GraphRAG (11k vs 115k tokens)
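As a quick worked check of what those figures imply, the snippet below computes the reduction factor and relative saving. It assumes the rounded 11k and 115k values quoted above; the function name `token_saving` is hypothetical.

```python
def token_saving(graph_tokens, baseline_tokens):
    """Return (reduction factor, fractional saving) for two prompt-token counts."""
    ratio = baseline_tokens / graph_tokens
    saving = 1 - graph_tokens / baseline_tokens
    return ratio, saving


ratio, saving = token_saving(11_000, 115_000)
print(f"{ratio:.1f}x fewer prompt tokens, {saving:.0%} saving")
# prints "10.5x fewer prompt tokens, 90% saving"
```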

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Extraction depends on LLM quality; lower-quality LLMs build worse graphs (the paper shows that graph growth rate varies with the LLM used).

Evaluations are text-only; no multimodal sensors tested.

When Not To Use

Real-time or high-frequency sensor streams where graph extraction latency is too high.

Multimodal environments until multimodal extraction is added.

Failure Modes

Incorrect or missing triplets from noisy observations lead to wrong plans.

Overly aggressive outdated-triplet replacement can drop still-relevant facts.
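The replacement failure mode can be illustrated with a simple rule-based stand-in. In the paper an LLM decides which triplets are outdated; the `update` function and the `single_valued` relation set below are assumptions used only to show how an over-broad rule drops a fact that is still true.

```python
def update(triplets, new, single_valued=frozenset({"is in"})):
    """Add `new`; for single-valued relations, drop older facts it supersedes."""
    subject, relation, _ = new
    if relation in single_valued:
        # An object can only be in one place, so the older location is outdated.
        kept = {t for t in triplets if not (t[0] == subject and t[1] == relation)}
    else:
        kept = set(triplets)
    kept.add(new)
    return kept


facts = {("key", "is in", "kitchen"), ("chest", "contains", "map")}

# Reasonable replacement: the key moved, so its old location is removed.
facts = update(facts, ("key", "is in", "hall"))

# Too aggressive: treating "contains" as single-valued deletes the map fact,
# even though the chest can hold both the map and the coin.
aggressive = update(facts, ("chest", "contains", "coin"),
                    single_valued=frozenset({"is in", "contains"}))
print(("chest", "contains", "map") in aggressive)
```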

Core Entities

Models

GPT-4, gpt-4-0125-preview, GPT-4o, GPT-4o-mini, LLaMA-3-70B, Contriever, BGE-M3

Metrics

normalized score, EM, F1, average levels completed, prompt/completion tokens

Datasets

TextWorld, NetHack, MuSiQue, HotpotQA

Benchmarks

Text-based games (Treasure Hunt, Cooking, Cleaning), NetHack, Multi-hop QA (MuSiQue, HotpotQA)