Replace flat context with a graph memory (TME) to cut hallucinations and save tokens in multi-step LLM agents

May 26, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The idea is practical and integrates with existing LLMs without fine-tuning. Evidence comes from controlled case studies (27 turns) and ablations, but results are limited to scripted scenarios and a single LLM (ChatGPT-4o).

Citations: 0

Evidence Strength: 0.60

Confidence: 0.70

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Ye Ye

Links

Abstract / PDF / Code / Data

Why It Matters For Business

TME reduces wrong or conflicting actions from LLM agents and lowers API token costs by pruning irrelevant history, making multi-step assistants more reliable and cheaper to run.

Who Should Care

Summary TLDR

TME is a lightweight memory controller that turns an off-the-shelf LLM into a revision-aware agent. It stores task state in a directed acyclic graph (TMS-DAG) and uses TRIM (an intent parser) to add, update, or check nodes. In four interactive scenarios (trip planning, cooking, meeting scheduling, cart editing) spanning 27 user turns, TME reduced hallucinations from 3 to 0 and confusions from 5 to 0 relative to a ReAct baseline. In a 6-round form-filling test it used 19.4% fewer tokens than a flat-history baseline (725 vs 899 tokens). The code and benchmarks are open-source.

Problem Statement

LLM agents that concatenate conversation history often hallucinate, repeat actions, or misinterpret corrections because linear context cannot track evolving subtasks, dependencies, or revisions. We need a lightweight memory layer that preserves task state, supports edits, filters irrelevant history, and works without fine-tuning.

Main Contribution

Task Memory Engine (TME): a plug-in spatial memory that stores tasks as a DAG instead of linear context.

TRIM: an LLM-prompted intent and subtask parser that maps inputs to DAG operations (new, update, check).
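To make the two contributions concrete, here is a minimal sketch of a task graph that supports the three operations named above (new, update, check). It is an illustration under stated assumptions, not the TME codebase: the class names `TaskNode` and `TaskDAG`, the method signatures, and the trip-planning values are all hypothetical. Because a new node may only depend on nodes that already exist, edges always point from older to newer nodes and the graph stays acyclic by construction.

```python
from dataclasses import dataclass, field


@dataclass
class TaskNode:
    """One task slot; `history` keeps every value the slot held before."""
    node_id: str
    value: str
    history: list[str] = field(default_factory=list)


class TaskDAG:
    """Adjacency-list task graph (hypothetical sketch, not the paper's API)."""

    def __init__(self) -> None:
        self.nodes: dict[str, TaskNode] = {}
        self.children: dict[str, list[str]] = {}

    def new(self, node_id: str, value: str,
            depends_on: tuple[str, ...] = ()) -> None:
        """Add a node; parents must already exist, so no cycle can form."""
        if node_id in self.nodes:
            raise ValueError(f"node {node_id!r} already exists")
        missing = [p for p in depends_on if p not in self.nodes]
        if missing:
            raise KeyError(f"unknown parents: {missing}")
        self.nodes[node_id] = TaskNode(node_id, value)
        self.children[node_id] = []
        for parent in depends_on:
            self.children[parent].append(node_id)

    def update(self, node_id: str, value: str) -> None:
        """Revise a node in place and archive the previous value."""
        node = self.nodes[node_id]
        node.history.append(node.value)
        node.value = value

    def check(self, node_id: str) -> str:
        """Read the current value without changing anything."""
        return self.nodes[node_id].value


dag = TaskDAG()
dag.new("destination", "Paris")
dag.new("hotel", "3 nights, city center", depends_on=("destination",))
dag.update("destination", "Rome")
print(dag.check("destination"), dag.nodes["destination"].history)
# Rome ['Paris']
```

Keeping the old value in `history` rather than in the prompt is what lets a correction replace state instead of piling up as more conversation text.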

Key Findings

TME eliminated observed hallucinations in the evaluated scenarios.

Numbers: Hallucinations: 0 vs ReAct 3 (27 turns)

Practical Use: Use a graph memory for multi-step flows to avoid inconsistent summaries and contradictory updates in interactive agents.

Evidence Ref: Table 1, Section 6.1

TME eliminated observed confusions (misinterpreted checks vs updates) in the evaluated scenarios.

Numbers: Confusions: 0 vs ReAct 5 (27 turns)

Practical Use: Classify intents (new/update/check) before updating state to prevent accidental overwrites on clarifying user queries.

Evidence Ref: Table 1, Section 6.1
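The "classify before you write" advice can be sketched as a small gate in front of the state store. Everything here is an assumption made for illustration: the few-shot examples, the prompt wording, the function names, and the choice to fall back to a read-only `check` when the model's reply is not a clean label. The LLM is passed in as a plain callable so the sketch does not depend on any particular client library.

```python
from typing import Callable

INTENTS = ("new", "update", "check")

# Hypothetical few-shot examples; a real prompt would be tuned per task.
FEW_SHOT = [
    ("Book a table for Friday", "new"),
    ("Actually make it Saturday", "update"),
    ("Which day did I pick?", "check"),
]


def build_prompt(user_input: str) -> str:
    """Assemble a few-shot classification prompt ending at 'Intent:'."""
    parts = ["Classify the user message as one of: new, update, check."]
    for text, label in FEW_SHOT:
        parts.append(f"Message: {text}\nIntent: {label}")
    parts.append(f"Message: {user_input}\nIntent:")
    return "\n\n".join(parts)


def classify(user_input: str, llm: Callable[[str], str]) -> str:
    """Return one of INTENTS; unexpected replies degrade to 'check'."""
    reply = llm(build_prompt(user_input)).strip().lower()
    # Assumption: when in doubt, prefer the intent that cannot overwrite state.
    return reply if reply in INTENTS else "check"


def apply_turn(state: dict[str, str], slot: str, value: str,
               user_input: str, llm: Callable[[str], str]) -> str:
    """Write to `slot` only for new/update; always return its current value."""
    intent = classify(user_input, llm)
    if intent in ("new", "update"):
        state[slot] = value
    return state.get(slot, "<unset>")
```

With a stub such as `lambda prompt: "check"` in place of the model, a clarifying question like "Which day did I pick?" returns the stored value and leaves `state` untouched, which is the overwrite the finding above is meant to prevent.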

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Hallucinations (count) | TME-DAG: 0 | ReAct: 3 | -3 | 27 user turns, 4 scenarios | Table 1, Section 6.1 | Table 1
Confusions (count) | TME-DAG: 0 | ReAct: 5 | -5 | 27 user turns, 4 scenarios | Table 1, Section 6.1 | Table 1

What To Try In 7 Days

Prototype a DAG memory for one multi-turn flow (e.g., scheduling) and route LLM prompts through a retrieved subgraph.

Build a TRIM prompt (few-shot) to classify new/update/check intents and run A/B tests vs linear history.

Measure hallucination counts and token use over 20–50 turns to validate cost and correctness gains.
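For the measurement step, a small tally per arm is usually enough. The sketch below is one way to set it up; `ArmStats` and `percent_saved` are hypothetical names, and how you count tokens or label a hallucination is left to your own tooling. The final line checks the helper against the totals reported for the 6-round form-filling test (725 vs 899 tokens).

```python
from dataclasses import dataclass


@dataclass
class ArmStats:
    """Running totals for one arm of the A/B test (e.g. DAG vs linear)."""
    tokens: int = 0
    hallucinations: int = 0
    turns: int = 0

    def record(self, tokens: int, hallucinated: bool) -> None:
        """Add one turn's token count and its manual hallucination label."""
        self.tokens += tokens
        self.hallucinations += int(hallucinated)
        self.turns += 1


def percent_saved(baseline_tokens: int, candidate_tokens: int) -> float:
    """Token savings of the candidate relative to the baseline, in percent."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - candidate_tokens) / baseline_tokens


# Sanity check against the reported form-filling totals.
print(f"{percent_saved(899, 725):.1f}%")  # 19.4%
```

Call `record` once per turn for each arm, then compare `hallucinations` directly and pass the two `tokens` totals to `percent_saved`.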

Agent Features

Memory
spatial memory (DAG): stores tasks as nodes; revision history per node
Planning
task decomposition; dependency-tracked revisions
Tool Use
LLM prompting (no fine-tuning required); few-shot intent parsing
Frameworks
TME; TRIM; TMS-DAG
Is Agentic

Yes

Architectures
modular memory controller (TME); TMS-DAG forest (directed acyclic graph)

Optimization Features

Token Efficiency
19.4% total token savings in 6-round form-filling (725 vs 899 tokens); up to 42.8% savings in correction-heavy rounds (Round 5 example)
System Optimization
adjacency-list DAG storage for memory efficiency; node-level history to avoid replaying full conversation
Inference Optimization
compact subgraph retrieval to reduce prompt size; LLM prompts at low temperature (0.3) for deterministic parsing
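Compact subgraph retrieval can be pictured as "send only the node being discussed plus whatever it depends on". The sketch below assumes a parent adjacency list and is not taken from the TME code; the function names, the graph, and the slot values are made up for illustration.

```python
def retrieve_subgraph(parents: dict[str, list[str]], focus: str) -> list[str]:
    """Return `focus` and all of its ancestors, dependencies first."""
    order: list[str] = []
    seen: set[str] = set()

    def visit(node: str) -> None:
        if node in seen:
            return
        seen.add(node)
        for parent in parents.get(node, []):
            visit(parent)
        order.append(node)  # appended only after every dependency

    visit(focus)
    return order


def render_context(values: dict[str, str], node_ids: list[str]) -> str:
    """Render the selected nodes as compact 'slot: value' prompt lines."""
    return "\n".join(f"{n}: {values[n]}" for n in node_ids)


# Hypothetical trip-planning graph: child -> list of parents.
parents = {
    "destination": [],
    "hotel": ["destination"],
    "flight": ["destination"],
    "dinner": ["hotel"],
}
values = {
    "destination": "Rome",
    "hotel": "3 nights, city center",
    "flight": "morning departure",
    "dinner": "8pm near hotel",
}
print(retrieve_subgraph(parents, "dinner"))
# ['destination', 'hotel', 'dinner']
```

A question about dinner pulls in the hotel and destination it depends on, while the unrelated `flight` node stays out of the prompt; that omission is where the prompt-size reduction comes from.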

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

TRIM currently relies on few-shot LLM prompts and can be brittle on ambiguous multi-intent inputs.

TMS-DAG caused residual inactive nodes in the cart case before a task-specific TRIM adaptation.

When Not To Use

Simple linear tasks where a flat history suffices (e.g., basic cart edits) and a DAG would only add overhead.

Latency-critical paths where extra intent classification and graph ops would slow responses.

Failure Modes

Misclassified intents (TRIM errors) create wrong node updates or missing updates.

DAG dependency conflicts can leave obsolete nodes active, producing hallucinated items.
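One possible guard against the second failure mode is to deactivate everything downstream of a node whenever that node is revised, so dependents must be reconfirmed before they can reach the prompt again. This is a mitigation sketched for illustration, not something the source describes; the function name and the example graph are hypothetical.

```python
from collections import deque


def deactivate_dependents(children: dict[str, list[str]],
                          active: set[str], changed: str) -> set[str]:
    """Drop every active descendant of `changed` from `active` (in place).

    Returns the set of nodes that were deactivated.
    """
    descendants: set[str] = set()
    queue = deque(children.get(changed, []))
    while queue:  # breadth-first walk over everything downstream
        node = queue.popleft()
        if node in descendants:
            continue
        descendants.add(node)
        queue.extend(children.get(node, []))
    stale = descendants & active
    active.difference_update(stale)
    return stale


# Hypothetical graph: parent -> list of children.
children = {"destination": ["hotel", "flight"], "hotel": ["dinner"]}
active = {"destination", "hotel", "flight", "dinner"}
stale = deactivate_dependents(children, active, "destination")
print(sorted(stale), sorted(active))
# ['dinner', 'flight', 'hotel'] ['destination']
```

The trade-off is extra confirmation turns after a revision, which may or may not be acceptable depending on how often upstream nodes change.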

Core Entities

Models

ChatGPT-4o

Metrics

Hallucinations (count); Confusions (count); Token usage (tokens, % savings); Task consistency (consistent tasks / total tasks)

Datasets

27-turn custom multi-scenario benchmark (trip, cooking, meeting, cart)

Benchmarks

Case-study benchmark across 4 scenarios (27 user turns)

Context Entities

Models

ReAct (baseline prompting style); Chain-of-Thought (baseline); base-flat (linear history baseline)