AriGraph: combine a semantic knowledge graph and episodic memory so an LLM agent remembers and plans across long, partially observed text‑en

Overview

Decision Snapshot: Needs Validation

The method is practical: code is available and experiments show strong gains in text games and QA; real-world use needs engineering to handle noisy extraction, multimodal inputs, and production LLM costs.

Citations: 4

Evidence Strength: 0.78

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 75%

Authors

Petr Anokhin, Nikita Semenov, Artyom Sorokin, Dmitry Evseev, Andrey Kravchenko, Mikhail Burtsev, Evgeny Burnaev

Links

Abstract / PDF / Code

Why It Matters For Business

Structured, updateable graph memory lets LLM agents remember facts and episodes efficiently, improving long-horizon planning while reducing costly prompt tokens compared to heavy RAG systems.

Who Should Care

Product Manager, ML Engineer, Founder, Engineering Lead

Summary TLDR

AriGraph builds a dynamic memory graph that fuses a semantic knowledge graph (facts as triplets) with episodic vertices/edges (raw observations). An LLM agent (Ariadne) uses graph-based semantic search plus episodic lookup to plan and act in text games. AriGraph consistently outperforms unstructured memory baselines and strong RL baselines in TextWorld and achieves competitive multi-hop QA results while using far fewer tokens than some graph-RAG systems. Code is available.

Problem Statement

LLM agents need a long-term memory that supports structured retrieval, planning, and updates from interaction. Current unstructured memories (full history, RAG, summaries) scatter facts and limit planning in partially observed environments.

Main Contribution

AriGraph: a dynamic world model that stores semantic triplets (subject, relation, object) and episodic vertices/edges linking triplets to raw observations.

Ariadne agent: a cognitive pipeline that separates memory retrieval, planning (produces sub-goals), and ReAct-based decision making, using AriGraph for memory.
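The memory structure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class and method names (`Triplet`, `MemoryGraph`, `add_observation`, `episodes_for`) are hypothetical, and triplet extraction, which the paper delegates to an LLM, is replaced here by hand-written triplets.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triplet:
    """One semantic fact: (subject, relation, object)."""
    subject: str
    relation: str
    obj: str


@dataclass
class MemoryGraph:
    """Hypothetical container: semantic triplets plus episodic links."""
    triplets: set = field(default_factory=set)         # semantic edges
    episodes: dict = field(default_factory=dict)       # step -> raw observation
    episode_links: dict = field(default_factory=dict)  # step -> triplets from that step

    def add_observation(self, step, text, extracted):
        """Store the raw observation and link it to its extracted triplets."""
        self.episodes[step] = text
        self.episode_links[step] = set(extracted)
        self.triplets.update(extracted)

    def episodes_for(self, triplet):
        """Return raw observations (in step order) that mention a triplet."""
        return [
            self.episodes[step]
            for step in sorted(self.episode_links)
            if triplet in self.episode_links[step]
        ]


# Example with hand-written triplets (an LLM would extract these in practice).
graph = MemoryGraph()
key_fact = Triplet("key", "is in", "kitchen")
graph.add_observation(0, "You see a key on the kitchen table.", [key_fact])
graph.add_observation(
    1, "The kitchen has a door to the east.", [Triplet("kitchen", "has exit", "east")]
)
print(graph.episodes_for(key_fact))
```

Keeping the raw observation next to each fact is what lets an agent fall back to the original wording when a triplet alone is too lossy.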

Key Findings

On Treasure Hunt (TextWorld) AriGraph achieved full normalized score while Full History scored 0.47.

Numbers: AriGraph 1.0 vs Full History 0.47 (Table 4)

Practical Use: Use a structured semantic+episodic graph to solve long multi-room tasks; it outperforms storing raw history.

Evidence Ref: Table 4, Figure 3

With restricted local observations in NetHack, Ariadne using AriGraph reached 593±202 score vs 675±130 for an oracle agent with full level info.

Numbers: Ariadne (Room obs) 593±202 vs NetPlay (Level obs) 675±130 (Table 1)

Practical Use: AriGraph can compensate for limited percepts; graph memory recovers much of the missing global state.

Evidence Ref: Table 1, Sec 5.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Treasure Hunt normalized score	1.0 (AriGraph)	Full History 0.47	+0.53	TextWorld Treasure Hunt (Table 4)	AriGraph solved Treasure Hunt variants; Table 4	Table 4
Cooking normalized score (hardest)	0.65 (AriGraph)	Summary 0.52 / RAG 0.36	+0.13 vs Summary	TextWorld Cooking Hardest (Table 4)	AriGraph retained higher success on complex multi-step tasks; Table 4	Table 4

What To Try In 7 Days

Replace a simple vector DB memory with a compact semantic graph of facts for one agent task and measure retrieval accuracy.

Add episodic links (raw observations attached to facts) to help multi-step tasks where order matters.

Run a small QA workload comparing prompt token use between your current RAG setup and a graph-based retrieval of top triplets.

Agent Features

Memory

Semantic graph of triplets (subject, relation, object): structured facts

Episodic vertices store raw observations and connect to triplets

Graph search: pretrained retriever + BFS-like expansion (depth d, width w)
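The graph search above (retriever plus BFS-like expansion with depth d and width w) might look roughly like the sketch below. It is an assumption-laden illustration: `overlap_score` is a toy word-overlap scorer standing in for the pretrained retriever (Contriever in the paper), and `graph_search` with its parameters is a hypothetical name, not the repository's API.

```python
def overlap_score(query, triplet):
    """Toy relevance score: shared lowercase words (stand-in for a dense retriever)."""
    query_words = set(query.lower().split())
    triplet_words = set(" ".join(triplet).lower().split())
    return len(query_words & triplet_words)


def graph_search(triplets, query, depth, width, score=overlap_score):
    """Retrieve up to `width` triplets per query, then re-query with their entities.

    Repeats for `depth` rounds; triplets with zero score are never returned.
    """
    found, seen = [], set()
    frontier = [query]
    for _ in range(depth):
        next_frontier = []
        for q in frontier:
            ranked = sorted(
                (t for t in triplets if t not in seen),
                key=lambda t: score(q, t),
                reverse=True,
            )
            for t in ranked[:width]:
                if score(q, t) == 0:
                    break  # nothing relevant left for this query
                seen.add(t)
                found.append(t)
                next_frontier.extend([t[0], t[2]])  # expand via subject and object
        frontier = next_frontier
    return found


facts = [
    ("key", "is in", "kitchen"),
    ("kitchen", "east of", "hall"),
    ("hall", "contains", "locked chest"),
    ("garden", "north of", "hall"),
]
print(graph_search(facts, "where is the key", depth=2, width=2))
```

Raising `depth` pulls in facts further from the query, while `width` caps how many triplets each hop may add, which is the lever that keeps the prompt small.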

Planning

Separate planner creates sub-goals from retrieved memory

ReAct-style decision module executes actions and explains rationale

Tool Use

'go to location' navigation action derived from graph spatial relations
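One way to derive such a navigation action from spatial triplets is a breadth-first search over recorded exits. The sketch below assumes a hypothetical encoding in which a triplet `(room_a, direction, room_b)` means that moving `direction` from `room_a` leads to `room_b`; the paper's actual relation vocabulary may differ, and `route` is an illustrative name.

```python
from collections import deque

# Assumed set of relations that count as spatial exits.
DIRECTIONS = {"north", "south", "east", "west"}


def route(triplets, start, goal):
    """Return the shortest list of moves from start to goal, or None if unknown."""
    exits = {}
    for src, rel, dst in triplets:
        if rel in DIRECTIONS:  # ignore non-spatial facts
            exits.setdefault(src, []).append((rel, dst))

    queue = deque([(start, [])])
    visited = {start}
    while queue:
        room, path = queue.popleft()
        if room == goal:
            return path
        for direction, nxt in exits.get(room, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [direction]))
    return None  # goal not reachable through remembered exits


world = [
    ("hall", "east", "kitchen"),
    ("kitchen", "west", "hall"),
    ("kitchen", "north", "pantry"),
    ("key", "is in", "pantry"),
]
print(route(world, "hall", "pantry"))
```

Because the route is computed from memory rather than the current observation, the agent can return to a room it has not seen for many steps.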

Frameworks

Contriever (edge retrieval)

BGE-M3 (QA encoding)

NetPlay pipeline for NetHack

Is Agentic

Yes

Architectures

Ariadne (planning + ReAct decision loop)

AriGraph memory (semantic graph + episodic vertices/edges)

Optimization Features

Token Efficiency

Graph retrieval reduces prompt token usage vs full-context GraphRAG (11k vs 115k tokens)

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/AIRI-Institute/AriGraph

Risks & Boundaries

Limitations

Extraction depends on LLM quality; lower-quality LLMs build worse graphs (paper shows growth rate varies).

Evaluations are text-only; no multimodal sensors tested.

When Not To Use

Real-time or high-frequency sensor streams where graph extraction latency is too high.

Multimodal environments until multimodal extraction is added.

Failure Modes

Incorrect or missing triplets from noisy observations lead to wrong plans.

Overly aggressive outdated-triplet replacement can drop still-relevant facts.

Core Entities

Models

GPT-4, gpt-4-0125-preview, GPT-4o, GPT-4o-mini, LLaMA-3-70B, Contriever, BGE-M3

Metrics

normalized score, EM, F1, average levels completed, prompt/completion tokens

Datasets

TextWorld, NetHack, MuSiQue, HotpotQA

Benchmarks

Text-based games (Treasure Hunt, Cooking, Cleaning), NetHack, Multi-hop QA (MuSiQue, HotpotQA)
