Replace flat context with a graph memory (TME) to cut hallucinations and save tokens in multi-step LLM agents

May 26, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Ye Ye

Links

Abstract / PDF

Why It Matters For Business

TME reduces wrong or conflicting actions from LLM agents and lowers API token costs by pruning irrelevant history, making multi-step assistants more reliable and cheaper to run.

Summary TLDR

TME is a lightweight memory controller that turns an off-the-shelf LLM into a revision-aware agent. It stores task state in a directed acyclic graph (TMS-DAG) and uses TRIM (an intent parser) to add, update, or check nodes. In four interactive scenarios (trip planning, cooking, meeting scheduling, cart editing) spanning 27 user turns, TME produced 0 hallucinations and 0 confusions, against 3 and 5 for a ReAct baseline. In a 6-round form-filling test it cut token use by 19.4% relative to a flat-history baseline (725 vs 899 tokens). The code and benchmarks are open-source.

Problem Statement

LLM agents that concatenate conversation history often hallucinate, repeat actions, or misinterpret corrections because linear context cannot track evolving subtasks, dependencies, or revisions. We need a lightweight memory layer that preserves task state, supports edits, filters irrelevant history, and works without fine-tuning.

Main Contribution

Task Memory Engine (TME): a plug-in spatial memory that stores tasks as a DAG instead of linear context.

TRIM: an LLM-prompted intent and subtask parser that maps inputs to DAG operations (new, update, check).

Empirical validation across 4 multi-step scenarios showing zero hallucinations/confusions in the study and ~19.4% token savings.

Open-source release of code, case scripts, and benchmarks for reproducibility.
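To make the DAG-plus-TRIM idea above concrete, here is a minimal sketch of how the three operations (new, update, check) could act on a task graph. The names `TaskNode`, `TaskMemory`, and `apply` are illustrative assumptions, not the released TME API, and the trip values are invented.

```python
from dataclasses import dataclass, field


@dataclass
class TaskNode:
    """One subtask slot. Field names are hypothetical, not the TME API."""
    key: str
    value: str
    parents: list[str] = field(default_factory=list)
    history: list[str] = field(default_factory=list)  # earlier values


class TaskMemory:
    def __init__(self) -> None:
        self.nodes: dict[str, TaskNode] = {}

    def apply(self, op: str, key: str, value: str = "",
              parents: tuple[str, ...] = ()) -> str:
        """Dispatch one parsed intent (new / update / check) to the graph."""
        if op == "new":
            if key in self.nodes:
                raise KeyError(f"node {key!r} already exists")
            missing = [p for p in parents if p not in self.nodes]
            if missing:
                raise KeyError(f"unknown parents: {missing}")
            self.nodes[key] = TaskNode(key, value, list(parents))
            return value
        node = self.nodes[key]  # update and check both need an existing node
        if op == "update":
            node.history.append(node.value)  # keep the revision trail
            node.value = value
            return value
        if op == "check":
            return node.value  # read-only: must not change state
        raise ValueError(f"unknown op {op!r}")


memory = TaskMemory()
memory.apply("new", "destination", "Paris")
memory.apply("new", "hotel", "Hotel Lumiere", parents=("destination",))
memory.apply("update", "destination", "Rome")
current = memory.apply("check", "destination")
```

In this sketch parents are fixed at creation time and must already exist, so the graph stays acyclic by construction. A check never mutates a node, which is the distinction the confusion metric is about.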

Key Findings

TME eliminated observed hallucinations in the evaluated scenarios.

Numbers: hallucinations 0 (TME) vs 3 (ReAct) over 27 turns

TME eliminated observed confusions (misinterpreted checks vs updates) in the evaluated scenarios.

Numbers: confusions 0 (TME) vs 5 (ReAct) over 27 turns

TME reduced prompt token usage by pruning context to a compact subgraph.

Numbers: 725 tokens (TME) vs 899 (flat baseline), 19.4% savings over 6 rounds
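The savings figure follows directly from the two reported totals. A short check, using only the numbers above:

```python
baseline_tokens = 899  # flat-history baseline, reported total
tme_tokens = 725       # TME, reported total

saved = baseline_tokens - tme_tokens
savings_pct = 100 * saved / baseline_tokens
print(f"{saved} tokens saved, {savings_pct:.1f}%")  # 174 tokens saved, 19.4%
```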

Results

Hallucinations (count)

Value (TME-DAG): 0

Baseline (ReAct): 3

Confusions (count)

Value (TME-DAG): 0

Baseline (ReAct): 5

Token usage (total tokens)

Value (TME): 725

Baseline (base-flat): 899

Task consistency

Value (TME-DAG): 4/4 tasks consistent

Baseline (ReAct): 2/4 tasks consistent

Who Should Care

What To Try In 7 Days

Prototype a DAG memory for one multi-turn flow (e.g., scheduling) and route LLM prompts through a retrieved subgraph.

Build a TRIM prompt (few-shot) to classify new/update/check intents and run A/B tests vs linear history.

Measure hallucination counts and token use over 20–50 turns to validate cost and correctness gains.
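For the TRIM step in the list above, a starting point could look like the sketch below: a few-shot prompt builder plus a strict parser for the model's reply. Everything here is an assumption for illustration: the example turns, the JSON label format, and `fake_llm`, which stands in for a real model call so the code runs offline.

```python
import json
from typing import Callable

# Hypothetical few-shot examples; a real prompt would use domain-specific ones.
FEW_SHOT = [
    ("Book a meeting with Ana on Friday", {"intent": "new", "slot": "meeting"}),
    ("Actually make it Thursday", {"intent": "update", "slot": "meeting"}),
    ("What day is the meeting?", {"intent": "check", "slot": "meeting"}),
]
VALID_INTENTS = {"new", "update", "check"}


def build_prompt(user_turn: str) -> str:
    """Assemble instructions, labeled examples, then the turn to classify."""
    lines = ['Classify the user turn. Reply with JSON: {"intent": ..., "slot": ...}', ""]
    for text, label in FEW_SHOT:
        lines += [f"User: {text}", f"Label: {json.dumps(label)}", ""]
    lines += [f"User: {user_turn}", "Label:"]
    return "\n".join(lines)


def classify(user_turn: str, llm: Callable[[str], str]) -> dict:
    """Call the model and accept only well-formed, known intents."""
    raw = llm(build_prompt(user_turn))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {"intent": "unknown", "slot": None}
    if not isinstance(parsed, dict) or parsed.get("intent") not in VALID_INTENTS:
        return {"intent": "unknown", "slot": None}
    return {"intent": parsed["intent"], "slot": parsed.get("slot")}


def fake_llm(prompt: str) -> str:
    """Keyword-based stand-in for a real model call (assumption)."""
    last_turn = prompt.rsplit("User: ", 1)[1].lower()
    intent = "update" if "actually" in last_turn or "instead" in last_turn else "check"
    return json.dumps({"intent": intent, "slot": "meeting"})


result = classify("Move it to 3pm instead", fake_llm)
```

Returning an explicit `unknown` intent rather than guessing gives the A/B test a clean way to count parser failures separately from wrong classifications.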

Agent Features

Memory

  • spatial memory (DAG) — stores tasks as nodes
  • revision history per node

Planning

  • task decomposition
  • dependency-tracked revisions

Tool Use

  • LLM prompting (no fine-tuning required)
  • few-shot intent parsing

Frameworks

  • TME
  • TRIM
  • TMS-DAG

Is Agentic

true

Architectures

  • modular memory controller (TME)
  • TMS-DAG forest (directed acyclic graph)

Optimization Features

Token Efficiency

  • 19.4% total token savings in 6-round form-filling (725 vs 899 tokens)
  • up to 42.8% savings in correction-heavy rounds (Round 5 example)

System Optimization

  • adjacency-list DAG storage for memory efficiency
  • node-level history to avoid replaying full conversation
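A rough sketch of adjacency-list storage that rejects any edge that would break acyclicity. The class name `AdjacencyDag` and the node names are hypothetical, and the summary does not say how TME itself enforces the DAG property.

```python
from collections import defaultdict


class AdjacencyDag:
    """Hypothetical store: children[u] lists the nodes that depend on u."""

    def __init__(self) -> None:
        self.children: dict[str, list[str]] = defaultdict(list)

    def _reaches(self, start: str, target: str) -> bool:
        """Iterative depth-first search: is target reachable from start?"""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.children.get(node, []))
        return False

    def add_edge(self, parent: str, child: str) -> None:
        # parent -> child closes a cycle if parent is already reachable from child.
        if self._reaches(child, parent):
            raise ValueError(f"edge {parent} -> {child} would create a cycle")
        self.children[parent].append(child)


dag = AdjacencyDag()
dag.add_edge("trip", "flight")
dag.add_edge("flight", "hotel")
try:
    dag.add_edge("hotel", "trip")  # would loop back to the root
    rejected = False
except ValueError:
    rejected = True
```

An adjacency list stores only the edges that exist, so memory grows with the number of dependencies rather than with the square of the node count.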

Inference Optimization

  • compact subgraph retrieval to reduce prompt size
  • LLM prompts at low temperature (0.3) for more consistent parsing
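Compact subgraph retrieval can be pictured as follows. The sketch assumes the relevant context for a turn is the focused node plus its ancestors; the summary does not specify TME's actual selection policy, and the graph contents are invented.

```python
# Hypothetical task graph: node -> (current value, parent nodes it depends on).
GRAPH = {
    "trip": ("Weekend trip", []),
    "destination": ("Rome", ["trip"]),
    "hotel": ("near Termini station", ["destination"]),
    "dinner": ("pasta at home tonight", []),  # unrelated task
}


def relevant_subgraph(graph: dict, focus: str) -> list[str]:
    """Return focus plus all of its ancestors, ancestors first."""
    ordered: list[str] = []
    seen: set[str] = set()

    def visit(node: str) -> None:
        if node in seen:
            return
        seen.add(node)
        for parent in graph[node][1]:
            visit(parent)
        ordered.append(node)  # appended only after its parents

    visit(focus)
    return ordered


def render_context(graph: dict, focus: str) -> str:
    """Serialize only the retrieved nodes into prompt text."""
    names = relevant_subgraph(graph, focus)
    return "\n".join(f"{name}: {graph[name][0]}" for name in names)


context = render_context(GRAPH, "hotel")
```

The unrelated `dinner` node never enters the prompt, which is where the token savings would come from in this toy setup.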

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • TRIM currently relies on few-shot LLM prompts and can be brittle on ambiguous multi-intent inputs.
  • TMS-DAG caused residual inactive nodes in the cart case before a task-specific TRIM adaptation.
  • Rollback and full conflict resolution are noted but not fully implemented.
  • Experiments are small-scale, scripted, and run with a single LLM (ChatGPT-4o) on consumer hardware.

When Not To Use

  • Simple linear tasks where a flat history suffices (e.g., basic cart edits) and a DAG only adds overhead.
  • Latency-critical paths where extra intent classification and graph ops would slow responses.

Failure Modes

  • Misclassified intents (TRIM errors) create wrong node updates or missing updates.
  • DAG dependency conflicts can leave obsolete nodes active, producing hallucinated items.
  • Ambiguous multi-intent user inputs can cause inconsistent subtask splitting and brittle behavior.
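One possible guard against obsolete nodes staying active is to retire a node together with everything that depends on it. This is a sketch of that idea under invented data, not the mitigation used in TME; `retire` and the cart contents are hypothetical.

```python
from collections import deque

# Hypothetical cart-like graph: children[u] lists the nodes that depend on u.
children = {
    "pizza": ["extra cheese", "large size"],
    "extra cheese": [],
    "large size": [],
    "soda": [],
}
active = {name: True for name in children}


def retire(node: str) -> list[str]:
    """Deactivate node and every descendant; return what was switched off."""
    retired, queue = [], deque([node])
    while queue:
        current = queue.popleft()
        if not active[current]:
            continue  # already retired via another path
        active[current] = False
        retired.append(current)
        queue.extend(children[current])
    return retired


retired = retire("pizza")  # the user removes the pizza from the cart
visible = [name for name, is_on in active.items() if is_on]
```

Filtering on the `active` flag before rendering context keeps the retired options out of later prompts.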

Core Entities

Models

  • ChatGPT-4o

Metrics

  • Hallucinations (count)
  • Confusions (count)
  • Token usage (tokens, % savings)
  • Task consistency (consistent tasks / total tasks)

Datasets

  • 27-turn custom multi-scenario benchmark (trip, cooking, meeting, cart)

Benchmarks

  • Case-study benchmark across 4 scenarios (27 user turns)

Context Entities

Models

  • ReAct (baseline prompting style)
  • Chain-of-Thought (baseline)
  • base-flat (linear history baseline)