Replace flat context with a graph memory (TME) to cut hallucinations and save tokens in multi-step LLM agents

Overview

Decision Snapshot: Needs Validation

The idea is practical and integrates with existing LLMs without fine-tuning. Evidence comes from controlled case studies (27 turns) and ablations, but results are limited to scripted scenarios and a single LLM (ChatGPT-4o).

Citations: 0

Evidence Strength: 0.60

Confidence: 0.70

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Ye Ye

Links

Abstract / PDF / Code / Data

Why It Matters For Business

TME reduces wrong or conflicting actions from LLM agents and lowers API token costs by pruning irrelevant history, making multi-step assistants more reliable and cheaper to run.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

TME is a lightweight memory controller that turns an off-the-shelf LLM into a revision-aware agent. It stores task state in a directed acyclic graph (TMS-DAG) and uses TRIM (an intent parser) to add, update, or check nodes. In four interactive scenarios (trip planning, cooking, meeting scheduling, cart editing) across 27 user turns, TME recorded 0 hallucinations and 0 confusions, versus 3 and 5 for the ReAct baseline. In a 6-round form-filling test it used 19.4% fewer tokens than a flat-history baseline (725 vs 899 tokens). The code and benchmarks are open-source.

Problem Statement

LLM agents that concatenate conversation history often hallucinate, repeat actions, or misinterpret corrections because linear context cannot track evolving subtasks, dependencies, or revisions. We need a lightweight memory layer that preserves task state, supports edits, filters irrelevant history, and works without fine-tuning.

Main Contribution

Task Memory Engine (TME): a plug-in spatial memory that stores tasks as a DAG instead of linear context.

TRIM: an LLM-prompted intent and subtask parser that maps inputs to DAG operations (new, update, check).
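The two pieces above can be pictured with a minimal sketch. It is not the TME-Agent code: the class names, field names, and the `apply` dispatcher are all hypothetical, chosen only to show how a parsed (operation, node, value) triple could map onto the three DAG operations named in the source (new, update, check).

```python
from dataclasses import dataclass, field


@dataclass
class TaskNode:
    """One task slot. All field names are illustrative assumptions."""
    node_id: str
    value: str
    parents: list[str] = field(default_factory=list)
    history: list[str] = field(default_factory=list)  # earlier values, oldest first


class TaskMemory:
    def __init__(self) -> None:
        self.nodes: dict[str, TaskNode] = {}

    def new(self, node_id: str, value: str, parents: tuple[str, ...] = ()) -> None:
        # Parents must already exist, so edges only point at older nodes
        # and the graph stays acyclic by construction.
        missing = [p for p in parents if p not in self.nodes]
        if missing:
            raise KeyError(f"unknown parent(s): {missing}")
        if node_id in self.nodes:
            raise ValueError(f"{node_id!r} already exists; use update()")
        self.nodes[node_id] = TaskNode(node_id, value, list(parents))

    def update(self, node_id: str, value: str) -> None:
        # Keep the old value in the node's own history instead of
        # replaying the whole conversation.
        node = self.nodes[node_id]
        node.history.append(node.value)
        node.value = value

    def check(self, node_id: str) -> str:
        # Read-only: a clarifying question never mutates state.
        return self.nodes[node_id].value

    def apply(self, op: str, node_id: str, value: str = "") -> str:
        """Dispatch a parsed triple, as an intent parser might emit it."""
        if op == "new":
            self.new(node_id, value)
        elif op == "update":
            self.update(node_id, value)
        elif op != "check":
            raise ValueError(f"unknown operation: {op!r}")
        return self.check(node_id)


memory = TaskMemory()
memory.apply("new", "destination", "Paris")
memory.apply("update", "destination", "Rome")
print(memory.apply("check", "destination"))  # Rome
print(memory.nodes["destination"].history)   # ['Paris']
```

Keeping `check` free of side effects is the point of the design: a question about current state cannot be mistaken for a revision.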

Key Findings

TME eliminated observed hallucinations in the evaluated scenarios.

Numbers: Hallucinations: 0 vs ReAct 3 (27 turns)

Practical Use: Use a graph memory for multi-step flows to avoid inconsistent summaries and contradictory updates in interactive agents.

Evidence Ref: Table 1, Section 6.1

TME eliminated observed confusions (misinterpreted checks vs updates) in the evaluated scenarios.

Numbers: Confusions: 0 vs ReAct 5 (27 turns)

Practical Use: Classify intents (new/update/check) before updating state to prevent accidental overwrites on clarifying user queries.

Evidence Ref: Table 1, Section 6.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Hallucinations (count)	TME-DAG: 0	ReAct: 3	-3	27 user turns, 4 scenarios	Table 1, Section 6.1	Table 1
Confusions (count)	TME-DAG: 0	ReAct: 5	-5	27 user turns, 4 scenarios	Table 1, Section 6.1	Table 1

What To Try In 7 Days

Prototype a DAG memory for one multi-turn flow (e.g., scheduling) and route LLM prompts through a retrieved subgraph.

Build a TRIM prompt (few-shot) to classify new/update/check intents and run A/B tests vs linear history.

Measure hallucination counts and token use over 20–50 turns to validate cost and correctness gains.
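For the intent-classification step, a few-shot prompt builder and a defensive reply parser might look like the sketch below. The example turns, the prompt wording, and the fallback rule are assumptions made for illustration; the paper's actual TRIM prompt is not reproduced here, and no LLM client is called.

```python
# Hypothetical few-shot examples; replace with turns from your own flow.
FEW_SHOT = [
    ("Book a table for four on Friday.", "new"),
    ("Actually, make it Saturday instead.", "update"),
    ("What day is the booking again?", "check"),
]
LABELS = ("new", "update", "check")


def build_intent_prompt(user_turn: str) -> str:
    """Assemble a few-shot classification prompt ending at the open label."""
    lines = ["Classify the user turn as one of: new, update, check.", ""]
    for text, label in FEW_SHOT:
        lines += [f"User: {text}", f"Intent: {label}", ""]
    lines += [f"User: {user_turn}", "Intent:"]
    return "\n".join(lines)


def parse_intent(reply: str) -> str:
    """Map a raw model reply to a label; fall back to 'check' (read-only)."""
    words = reply.strip().lower().split()
    token = words[0].strip(".,:;!\"'") if words else ""
    return token if token in LABELS else "check"


print(parse_intent(" Update."))          # update
print(parse_intent("I am not sure..."))  # check
```

Falling back to the read-only label is one conservative choice: an unparseable reply then cannot overwrite stored state, at the cost of occasionally missing a real update.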

Agent Features

Memory

spatial memory (DAG): stores tasks as nodes

revision history per node

Planning

task decomposition

dependency-tracked revisions

Tool Use

LLM prompting (no fine-tuning required)

few-shot intent parsing

Frameworks

TME, TRIM, TMS-DAG

Is Agentic

Yes

Architectures

modular memory controller (TME)

TMS-DAG forest (directed acyclic graph)

Optimization Features

Token Efficiency

19.4% total token savings in 6-round form-filling (725 vs 899 tokens)

up to 42.8% savings in correction-heavy rounds (Round 5 example)
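The headline percentage follows directly from the two token totals quoted above. A small helper (the function name is an assumption) makes the arithmetic explicit and can be reused when measuring your own runs:

```python
def token_savings_pct(baseline_tokens: int, candidate_tokens: int) -> float:
    """Percentage of baseline tokens saved by the candidate."""
    return 100.0 * (baseline_tokens - candidate_tokens) / baseline_tokens


# 899 tokens (flat baseline) vs 725 tokens (TME), as reported in the source.
print(round(token_savings_pct(899, 725), 1))  # 19.4
```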

System Optimization

adjacency-list DAG storage for memory efficiency

node-level history to avoid replaying full conversation

Inference Optimization

compact subgraph retrieval to reduce prompt size

LLM prompts at low temperature (0.3) for near-deterministic parsing
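One way to read "compact subgraph retrieval" is: send the model only the node under discussion plus the nodes it depends on. The sketch below assumes a parent-pointer adjacency list and hypothetical function names; the source does not say which retrieval rule TME actually uses.

```python
def retrieve_subgraph(parents: dict[str, list[str]], focus: str) -> list[str]:
    """Return the focus node and all of its ancestors, ancestors first."""
    ordered: list[str] = []
    seen: set[str] = set()

    def visit(node: str) -> None:
        if node in seen:
            return
        seen.add(node)
        for parent in parents.get(node, []):
            visit(parent)
        ordered.append(node)  # appended only after all of its parents

    visit(focus)
    return ordered


def render_context(values: dict[str, str], node_ids: list[str]) -> str:
    """Format the selected nodes as compact prompt context."""
    return "\n".join(f"{n}: {values[n]}" for n in node_ids)


# Illustrative trip-planning state; keys and values are made up.
parents = {"trip": [], "flight": ["trip"], "hotel": ["trip"], "seat": ["flight"]}
values = {"trip": "Rome, 3 nights", "flight": "AZ 611", "hotel": "Hotel A", "seat": "12A"}

selected = retrieve_subgraph(parents, "seat")
print(selected)  # ['trip', 'flight', 'seat']
print(render_context(values, selected))
```

The hotel node is left out of the prompt because the seat question does not depend on it, which is where the prompt-size saving comes from in this sketch.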

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/biubiutomato/TME-Agent

Data URLs

https://github.com/biubiutomato/TME-Agent (case scripts and benchmarks)

Risks & Boundaries

Limitations

TRIM currently relies on few-shot LLM prompts and can be brittle on ambiguous multi-intent inputs.

TMS-DAG caused residual inactive nodes in the cart case before a task-specific TRIM adaptation.

When Not To Use

Simple linear tasks where a flat history suffices (e.g., basic cart edits) and the DAG only adds overhead.

Latency-critical paths where extra intent classification and graph ops would slow responses.

Failure Modes

Misclassified intents (TRIM errors) create wrong node updates or missing updates.

DAG dependency conflicts can leave obsolete nodes active, producing hallucinated items.
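A cheap guard against the second failure mode is a consistency check that flags active nodes whose parent has been deactivated. This is an assumed mitigation, not something the source describes; the function name and the `active` flag map are hypothetical.

```python
def stale_active_nodes(
    parents: dict[str, list[str]], active: dict[str, bool]
) -> list[str]:
    """List active nodes that depend on at least one inactive parent."""
    return sorted(
        node
        for node, is_active in active.items()
        if is_active
        and any(not active.get(p, False) for p in parents.get(node, []))
    )


# Illustrative cart state: "shoes" was removed, but "socks" still hangs off it.
parents = {"cart": [], "shoes": ["cart"], "socks": ["shoes"]}
active = {"cart": True, "shoes": False, "socks": True}
print(stale_active_nodes(parents, active))  # ['socks']
```

The check looks only at direct parents, so after deactivating the flagged nodes it should be run again until it returns an empty list.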

Core Entities

Models

ChatGPT-4o

Metrics

Hallucinations (count)

Confusions (count)

Token usage (tokens, % savings)

Task consistency (consistent tasks / total tasks)

Datasets

27-turn custom multi-scenario benchmark (trip, cooking, meeting, cart)

Benchmarks

Case-study benchmark across 4 scenarios (27 user turns)

Context Entities

Models

ReAct (baseline prompting style)

Chain-of-Thought (baseline)

base-flat (linear history baseline)
