AriGraph: combine a semantic knowledge graph and episodic memory so an LLM agent remembers and plans across long, partially observed text‑en

July 5, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.75

Cost Impact Score

0.7

Citation Count

4

Authors

Petr Anokhin, Nikita Semenov, Artyom Sorokin, Dmitry Evseev, Andrey Kravchenko, Mikhail Burtsev, Evgeny Burnaev

Why It Matters For Business

Structured, updatable graph memory lets LLM agents remember facts and episodes efficiently. It improves long-horizon planning and cuts costly prompt tokens relative to full-context graph-RAG systems (about 11k vs 115k prompt tokens against GraphRAG on multi-hop QA).

Summary TLDR

AriGraph builds a dynamic memory graph that fuses a semantic knowledge graph (facts as triplets) with episodic vertices/edges (raw observations). An LLM agent (Ariadne) uses graph-based semantic search plus episodic lookup to plan and act in text games. AriGraph consistently outperforms unstructured memory baselines and strong RL baselines in TextWorld and achieves competitive multi-hop QA results while using far fewer tokens than some graph-RAG systems. Code is available.

Problem Statement

LLM agents need a long-term memory that supports structured retrieval, planning, and updates from interaction. Current unstructured memories (full history, RAG, summaries) scatter facts and limit planning in partially observed environments.

Main Contribution

AriGraph: a dynamic world model that stores semantic triplets (subject, relation, object) and episodic vertices/edges linking triplets to raw observations.

Ariadne agent: a cognitive pipeline that separates memory retrieval, planning (produces sub-goals), and ReAct-based decision making, using AriGraph for memory.

Empirical results showing AriGraph improves navigation, planning and exploration in TextWorld and NetHack and is competitive on multi-hop QA with much lower token cost than some baselines.
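
The memory structure described above (semantic triplets plus episodic vertices linked to raw observations) can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Triplet`, `MemoryGraph`, `add_observation`, and `episodes_for` are hypothetical, and the triplets are passed in by hand, whereas the paper extracts them with an LLM.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triplet:
    """A semantic fact: (subject, relation, object)."""
    subject: str
    relation: str
    obj: str


@dataclass
class MemoryGraph:
    # Semantic part: the set of known facts.
    triplets: set = field(default_factory=set)
    # Episodic vertices: one raw observation per step, in arrival order.
    episodes: list = field(default_factory=list)
    # Episodic edges: episode index -> triplets extracted from that observation.
    episode_links: dict = field(default_factory=dict)

    def add_observation(self, text, extracted):
        """Store a raw observation and link it to the triplets extracted from it."""
        idx = len(self.episodes)
        self.episodes.append(text)
        self.episode_links[idx] = set(extracted)
        self.triplets.update(extracted)
        return idx

    def episodes_for(self, triplet):
        """Return the raw observations that are linked to the given triplet."""
        return [
            self.episodes[i]
            for i, linked in sorted(self.episode_links.items())
            if triplet in linked
        ]


graph = MemoryGraph()
key_fact = Triplet("key", "is in", "kitchen")
graph.add_observation(
    "You see a key on the kitchen table.",
    [key_fact, Triplet("kitchen", "has", "table")],
)
graph.add_observation(
    "The kitchen is west of the hall.",
    [Triplet("kitchen", "west of", "hall")],
)
print(graph.episodes_for(key_fact))
```

Each episodic link connects one observation to several triplets at once, which is why the episodic edges behave like hyperedges rather than ordinary pairwise edges.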

Key Findings

On Treasure Hunt (TextWorld), AriGraph achieved the maximum normalized score of 1.0, while Full History scored 0.47.

Numbers: AriGraph 1.0 vs Full History 0.47 (Table 4)

With observations restricted to the current room in NetHack, Ariadne using AriGraph scored 593±202 versus 675±130 for NetPlay, an oracle-style agent that sees full level information.

Numbers: Ariadne (Room obs) 593±202 vs NetPlay (Level obs) 675±130 (Table 1)

On HotpotQA, AriGraph (GPT-4) obtained EM 68.0 and F1 74.7 while using ~11k prompt tokens, versus ~115k for GraphRAG.

Numbers: HotpotQA EM 68.0, F1 74.7; prompt tokens AriGraph 11k vs GraphRAG 115k (Table 2, Table 3)

Results

Treasure Hunt normalized score

Value: 1.0 (AriGraph)

Baseline: Full History 0.47

Cooking normalized score (hardest)

Value: 0.65 (AriGraph)

Baseline: Summary 0.52 / RAG 0.36

NetHack average score

Value: 593.00 ± 202.62 (Ariadne, Room obs)

Baseline: NetPlay (Level obs) 675.33 ± 130.27

HotpotQA

Value: EM 68.0, F1 74.7 (AriGraph, GPT-4)

Baseline: HOLMES (GPT-4) EM 66.0, F1 78.0

Prompt tokens (QA)

Value: 11k tokens (AriGraph)

Baseline: 115k tokens (GraphRAG)

Who Should Care

What To Try In 7 Days

Replace a simple vector DB memory with a compact semantic graph of facts for one agent task and measure retrieval accuracy.

Add episodic links (raw observations attached to facts) to help multi-step tasks where order matters.

Run a small QA workload comparing prompt token use between your current RAG setup and a graph-based retrieval of top triplets.
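
For the token comparison in the last suggestion, a rough harness might look like the sketch below. Everything here is an assumption for illustration: the function names are hypothetical, the prompt layouts are invented, and whitespace word counts stand in for a real tokenizer, so absolute numbers will differ from what your model provider bills.

```python
def approx_tokens(text):
    """Crude token proxy: whitespace-separated words (real tokenizers differ)."""
    return len(text.split())


def build_rag_prompt(question, passages):
    """Prompt that pastes whole retrieved passages before the question."""
    return "\n\n".join(passages) + "\n\nQuestion: " + question


def build_triplet_prompt(question, triplets):
    """Prompt that lists only the retrieved triplets before the question."""
    facts = "\n".join(f"{s}, {r}, {o}" for s, r, o in triplets)
    return facts + "\n\nQuestion: " + question


def compare(question, passages, triplets):
    """Return approximate prompt sizes for both setups and their ratio."""
    rag = approx_tokens(build_rag_prompt(question, passages))
    graph = approx_tokens(build_triplet_prompt(question, triplets))
    return {"rag_tokens": rag, "graph_tokens": graph, "ratio": rag / graph}


result = compare(
    "Where is the key?",
    passages=[
        "The key was left on the kitchen table by the cook earlier today.",
        "The kitchen is directly west of the main hall.",
    ],
    triplets=[("key", "is in", "kitchen"), ("kitchen", "west of", "hall")],
)
print(result)
```

Running the same comparison over a full QA workload, with the tokenizer of the model you actually call, gives a number you can set against the 11k vs 115k figure reported above.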

Agent Features

Memory

  • Semantic graph of triplets (subject, relation, object) — structured facts
  • Episodic vertices store raw observations and connect to triplets
  • Graph search: pretrained retriever + BFS-like expansion (depth d, width w)
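
The graph search in the last bullet (retrieve, then expand BFS-style with depth d and width w) can be sketched as follows. This is a toy version under loud assumptions: a word-overlap score is swapped in for the pretrained retriever (the paper uses Contriever), triplets are plain tuples, and the function names are hypothetical.

```python
def overlap_score(query, triplet):
    """Toy relevance: number of lowercase words shared by query and triplet text."""
    q = set(query.lower().split())
    t = set(" ".join(triplet).lower().split())
    return len(q & t)


def semantic_search(triplets, query, depth=2, width=2):
    """Take the top-`width` triplets per query, then expand from their entities.

    Each level uses the subject and object of newly found triplets as the
    queries for the next level, for `depth` levels in total.
    """
    found = []
    seen = set()
    frontier = [query]
    for _ in range(depth):
        next_frontier = []
        for q in frontier:
            candidates = [
                t for t in triplets
                if t not in seen and overlap_score(q, t) > 0
            ]
            candidates.sort(key=lambda t: overlap_score(q, t), reverse=True)
            for t in candidates[:width]:
                if t in seen:
                    continue
                seen.add(t)
                found.append(t)
                next_frontier.extend([t[0], t[2]])
        frontier = next_frontier
    return found


facts = [
    ("key", "is in", "kitchen"),
    ("kitchen", "west of", "hall"),
    ("hall", "north of", "garden"),
    ("apple", "is in", "garden"),
]
hits = semantic_search(facts, "where is the key", depth=2, width=1)
print(hits)
```

With depth 2 and width 1, the first level finds the fact about the key and the second level follows the entity "kitchen" to the fact that connects it to the hall. Larger `width` values pull in more loosely related triplets, which is the usual recall versus noise trade-off.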

Planning

  • Separate planner creates sub-goals from retrieved memory
  • ReAct-style decision module executes actions and explains rationale
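
The separation between retrieval, planning, and decision making can be illustrated with a single agent step. This is only a structural sketch: `agent_step` and the three stub functions are hypothetical, and the stubs use fixed rules where the real pipeline would call an LLM.

```python
def agent_step(observation, memory, retrieve, plan, decide):
    """One step: store the observation, retrieve context, plan, then act."""
    memory.append(observation)
    context = retrieve(memory, observation)
    sub_goals = plan(context)
    thought, action = decide(context, sub_goals)
    return {"sub_goals": sub_goals, "thought": thought, "action": action}


# Stubs standing in for LLM-backed modules (hypothetical, illustration only).
def retrieve(memory, observation):
    # Pretend the two most recent observations are the relevant memories.
    return memory[-2:]


def plan(context):
    # Planner output: a list of sub-goals derived from retrieved memory.
    if any("key" in item for item in context):
        return ["unlock the chest"]
    return ["find the key"]


def decide(context, sub_goals):
    # ReAct-style output: a short rationale plus the chosen action.
    goal = sub_goals[0]
    action = "open chest" if goal == "unlock the chest" else "look around"
    return f"I should work on: {goal}", action


memory = []
first_step = agent_step("You are in the hall.", memory, retrieve, plan, decide)
second_step = agent_step("You pick up the key.", memory, retrieve, plan, decide)
print(first_step["action"], "|", second_step["action"])
```

Keeping the planner separate from the decision module means sub-goals can persist across steps while the action choice reacts to the latest observation.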

Tool Use

  • 'go to location' navigation action derived from graph spatial relations
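
One way a 'go to location' action could be derived from spatial relations in the graph is a breadth-first search over room connections. The sketch below assumes a triplet format of (room, direction, neighbouring room) and that every connection can be walked in both directions; both are assumptions for illustration, and the function names are hypothetical.

```python
from collections import deque

OPPOSITE = {"north": "south", "south": "north", "east": "west", "west": "east"}


def build_exits(spatial_triplets):
    """Turn (room, direction, neighbour) triplets into a two-way exit map."""
    exits = {}
    for src, direction, dst in spatial_triplets:
        exits.setdefault(src, {})[direction] = dst
        exits.setdefault(dst, {})[OPPOSITE[direction]] = src
    return exits


def go_to(exits, start, goal):
    """Breadth-first search; returns the shortest list of moves, or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        room, moves = queue.popleft()
        if room == goal:
            return moves
        for direction, neighbour in exits.get(room, {}).items():
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, moves + [direction]))
    return None


exits = build_exits([
    ("kitchen", "east", "hall"),
    ("hall", "south", "garden"),
    ("hall", "north", "bedroom"),
])
print(go_to(exits, "kitchen", "garden"))
```

Because the route comes from memory rather than from the current observation, the agent can return to a room it saw many steps ago without re-exploring.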

Frameworks

  • Contriever (edge retrieval)
  • BGE-M3 (QA encoding)
  • NetPlay pipeline for NetHack

Is Agentic

true

Architectures

  • Ariadne (planning + ReAct decision loop)
  • AriGraph memory (semantic graph + episodic vertices/edges)

Optimization Features

Token Efficiency

  • Graph retrieval reduces prompt token usage vs full-context GraphRAG (11k vs 115k tokens)

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Extraction depends on LLM quality; lower-quality LLMs build worse graphs (the paper shows that graph growth rate varies by model).
  • Evaluations are text-only; no multimodal sensors tested.
  • Triplet extraction and replacement heuristics can miss or wrongly update facts (prompt-based parsing).
  • Episodic edges are hyperedges, which complicates the use of some standard graph tooling.

When Not To Use

  • Real-time or high-frequency sensor streams where graph extraction latency is too high.
  • Multimodal environments until multimodal extraction is added.
  • When only short, stateless tasks are needed — graph overhead may not pay off.

Failure Modes

  • Incorrect or missing triplets from noisy observations lead to wrong plans.
  • Overly aggressive outdated-triplet replacement can drop still-relevant facts.
  • Graph growth and synonym proliferation cause retrieval noise if not normalized.
  • LLM hallucinations during triplet extraction create false facts in the graph.
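
The replacement failure mode is easy to reproduce with a toy update rule. The sketch below swaps in a simple heuristic for the prompt-based replacement described in this summary: a new triplet overwrites older ones with the same subject and relation, but only for relations listed as single-valued. The rule, the `SINGLE_VALUED` set, and the function name are all assumptions for illustration.

```python
# Hypothetical list of relations that can hold only one object at a time.
SINGLE_VALUED = {"is in"}


def update_facts(facts, new_facts, single_valued=SINGLE_VALUED):
    """Add new triplets, dropping older ones they make outdated."""
    updated = set(facts)
    for subj, rel, obj in new_facts:
        if rel in single_valued:
            # Remove every older fact with the same subject and relation.
            updated = {
                f for f in updated
                if not (f[0] == subj and f[1] == rel)
            }
        updated.add((subj, rel, obj))
    return updated


facts = {("key", "is in", "kitchen"), ("hall", "has exit", "north")}
new_facts = [("key", "is in", "inventory"), ("hall", "has exit", "east")]

careful = update_facts(facts, new_facts)
# Treating a multi-valued relation as single-valued drops a valid fact.
aggressive = update_facts(facts, new_facts, single_valued={"is in", "has exit"})
print(sorted(careful))
print(sorted(aggressive))
```

In the careful run the key's old location is replaced and both exits of the hall survive. In the aggressive run the north exit is lost even though it is still true, which is the kind of error that later surfaces as a wrong navigation plan.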

Core Entities

Models

  • GPT-4
  • gpt-4-0125-preview
  • GPT-4o
  • GPT-4o-mini
  • LLaMA-3-70B
  • Contriever
  • BGE-M3

Metrics

  • normalized score
  • EM
  • F1
  • average levels completed
  • prompt/completion tokens
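
For readers unfamiliar with the QA metrics, the sketch below shows the commonly used SQuAD-style definitions of EM and token-level F1, plus a normalized game score as achieved score over maximum score. It is an assumption that these match the paper's exact evaluation scripts; the function names are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def normalized_score(score, max_score):
    """Game score as a fraction of the maximum achievable score."""
    return score / max_score


print(exact_match("The Eiffel Tower", "eiffel tower."))
print(f1_score("Eiffel Tower in Paris", "the Eiffel Tower"))
print(normalized_score(3, 4))
```

EM rewards only exact answers after normalization, while F1 gives partial credit for overlapping tokens, which is why a system can trail on one metric and lead on the other.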

Datasets

  • TextWorld
  • NetHack
  • MuSiQue
  • HotpotQA

Benchmarks

  • Text-based games (Treasure Hunt, Cooking, Cleaning)
  • NetHack
  • Multi-hop QA (MuSiQue, HotpotQA)