Overview
TME (Task Memory Engine) is practical and integrates with existing LLMs without fine-tuning. Evidence comes from controlled case studies (27 user turns across four scenarios) and ablations, but results are limited to scripted scenarios and a single LLM (ChatGPT-4o).
Citations: 0
Evidence Strength: 0.60
Confidence: 0.70
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
TME reduces wrong or conflicting actions from LLM agents and lowers API token costs by pruning irrelevant history, making multi-step assistants more reliable and cheaper to run.
Who Should Care
Summary TLDR
TME is a lightweight memory controller that turns an off-the-shelf LLM into a revision-aware agent. It stores task state in a directed acyclic graph (TMS-DAG) and uses TRIM (an intent parser) to add, update, or check nodes. In four interactive scenarios (trip planning, cooking, meeting scheduling, cart editing) across 27 user turns, TME reduced hallucinations and confusions to 0 and cut token use by 19.4% vs a flat baseline. The code and benchmarks are open-source.
Problem Statement
LLM agents that concatenate conversation history often hallucinate, repeat actions, or misinterpret corrections because linear context cannot track evolving subtasks, dependencies, or revisions. We need a lightweight memory layer that preserves task state, supports edits, filters irrelevant history, and works without fine-tuning.
Main Contribution
Task Memory Engine (TME): a plug-in spatial memory that stores tasks as a DAG instead of linear context.
TRIM: an LLM-prompted intent and subtask parser that maps inputs to DAG operations (new, update, check).
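As a rough illustration of the two contributions above, the sketch below stores tasks as nodes in a small DAG and applies the three operations named in the summary (new, update, check). It is a minimal sketch under stated assumptions, not the authors' implementation: `TaskNode`, `TaskDAG`, `apply`, and `context_for` are hypothetical names, and the intent label is passed in directly rather than produced by an LLM-backed parser.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TaskNode:
    node_id: str
    content: str
    depends_on: list[str] = field(default_factory=list)


class TaskDAG:
    """Hypothetical task memory: nodes keyed by id, edges point at prerequisites."""

    def __init__(self) -> None:
        self.nodes: dict[str, TaskNode] = {}

    def apply(self, op: str, node_id: str, content: str = "", depends_on=()) -> TaskNode:
        """Apply one of the three operations: new, update, check."""
        if op == "new":
            if node_id in self.nodes:
                raise ValueError(f"node {node_id!r} already exists")
            missing = [d for d in depends_on if d not in self.nodes]
            if missing:
                raise KeyError(f"unknown dependencies: {missing}")
            # A new node may only depend on existing nodes, so no cycle can form.
            self.nodes[node_id] = TaskNode(node_id, content, list(depends_on))
            return self.nodes[node_id]
        if op == "update":
            node = self.nodes[node_id]
            node.content = content  # revise in place; edges are kept
            return node
        if op == "check":
            return self.nodes[node_id]  # read-only lookup
        raise ValueError(f"unknown operation: {op!r}")

    def context_for(self, node_id: str) -> list[str]:
        """Return the node's transitive prerequisites, then the node itself."""
        ordered: list[str] = []
        seen: set[str] = set()

        def visit(nid: str) -> None:
            if nid in seen:
                return
            seen.add(nid)
            for dep in self.nodes[nid].depends_on:
                visit(dep)
            ordered.append(self.nodes[nid].content)

        visit(node_id)
        return ordered


dag = TaskDAG()
dag.apply("new", "flight", "Fly out on May 3")
dag.apply("new", "hotel", "Book a hotel", depends_on=["flight"])
dag.apply("new", "dinner", "Reserve dinner")
dag.apply("update", "flight", "Fly out on May 5")
print(dag.context_for("hotel"))  # ['Fly out on May 5', 'Book a hotel']
```

Only the revised flight and the hotel reach the prompt context here; the unrelated dinner node is left out, which is the kind of pruning the summary credits for the token savings.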
Key Findings
TME eliminated observed hallucinations in the evaluated scenarios.
TME eliminated observed confusions (misinterpreted checks vs updates) in the evaluated scenarios.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Hallucinations (count) | TME-DAG: 0 | ReAct: 3 | -3 | 27 user turns, 4 scenarios | Table 1, Section 6.1 | Table 1 |
| Confusions (count) | TME-DAG: 0 | ReAct: 5 | -5 | 27 user turns, 4 scenarios | Table 1, Section 6.1 | Table 1 |
What To Try In 7 Days
Prototype a DAG memory for one multi-turn flow (e.g., scheduling) and route LLM prompts through a retrieved subgraph.
Build a TRIM prompt (few-shot) to classify new/update/check intents and run A/B tests vs linear history.
Measure hallucination counts and token use over 20–50 turns to validate cost and correctness gains.
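The steps above could start from something like the following sketch: a few-shot prompt builder for new/update/check classification, a tolerant label parser, and a small tally for comparing runs. Everything here is an assumption for illustration. The few-shot examples, function names, and record fields are hypothetical, the LLM call itself is left out, and the `error` labels are assumed to come from manual annotation of each turn.

```python
from collections import Counter

INTENTS = ("new", "update", "check")

# Hypothetical few-shot examples; not taken from the TME paper or its repository.
FEW_SHOT = [
    ("Add a dentist appointment on Friday.", "new"),
    ("Actually, move the dentist appointment to Monday.", "update"),
    ("When is my dentist appointment?", "check"),
]


def build_intent_prompt(user_turn: str) -> str:
    """Assemble a few-shot classification prompt ending at the open label slot."""
    parts = ["Classify the user message as one of: " + ", ".join(INTENTS) + "."]
    for text, label in FEW_SHOT:
        parts.append(f"Message: {text}\nIntent: {label}")
    parts.append(f"Message: {user_turn}\nIntent:")
    return "\n\n".join(parts)


def parse_intent(model_reply: str) -> str:
    """Normalize a raw model reply to a known intent, or 'unknown'."""
    words = model_reply.strip().lower().split()
    label = words[0].strip(".,:;!\"'") if words else ""
    return label if label in INTENTS else "unknown"


def summarize_run(turns) -> dict:
    """Tally one run. Each turn is a dict with 'prompt_tokens' (int) and
    'error' (None, 'hallucination', or 'confusion')."""
    turns = list(turns)
    errors = Counter(t["error"] for t in turns if t["error"])
    return {
        "turns": len(turns),
        "prompt_tokens": sum(t["prompt_tokens"] for t in turns),
        "hallucinations": errors["hallucination"],
        "confusions": errors["confusion"],
    }


def token_reduction_pct(baseline_tokens: int, candidate_tokens: int) -> float:
    """Percentage of prompt tokens saved relative to the baseline run."""
    return 100.0 * (baseline_tokens - candidate_tokens) / baseline_tokens
```

Running the same scripted turns through a linear-history arm and a DAG-memory arm, then passing both token totals to `token_reduction_pct`, gives a like-for-like cost comparison. Treating unparseable replies as `unknown` rather than guessing keeps misclassifications visible in the tally.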
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Optimization Features
Token Efficiency
System Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
TRIM currently relies on few-shot LLM prompts and can be brittle on ambiguous multi-intent inputs.
In the cart case, TMS-DAG left residual inactive nodes until TRIM was adapted to the task.
When Not To Use
Simple linear tasks where a flat history suffices (e.g., basic cart edits) and the DAG adds overhead without a matching benefit.
Latency-critical paths where extra intent classification and graph ops would slow responses.
Failure Modes
Misclassified intents (TRIM errors) create wrong node updates or missing updates.
DAG dependency conflicts can leave obsolete nodes active, producing hallucinated items.
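One way to guard against the second failure mode is a consistency pass that flags nodes still marked active even though something they depend on has been retired. The sketch below is an assumed interpretation, not a check described in the source: the node layout (`active`, `depends_on`) and the function name `find_stale_active` are hypothetical, and the graph is assumed to be acyclic.

```python
def find_stale_active(nodes: dict) -> list:
    """Return sorted ids of active nodes with an inactive transitive prerequisite.

    `nodes` maps id -> {"active": bool, "depends_on": [ids]} (hypothetical layout).
    """
    memo = {}

    def has_inactive_ancestor(nid) -> bool:
        if nid in memo:
            return memo[nid]
        memo[nid] = False  # placeholder while recursing; input is assumed acyclic
        result = any(
            (not nodes[dep]["active"]) or has_inactive_ancestor(dep)
            for dep in nodes[nid]["depends_on"]
        )
        memo[nid] = result
        return result

    return sorted(
        nid for nid, node in nodes.items()
        if node["active"] and has_inactive_ancestor(nid)
    )


# Illustrative cart: the pizza was removed, but its topping is still active.
cart = {
    "pizza": {"active": False, "depends_on": []},
    "extra_cheese": {"active": True, "depends_on": ["pizza"]},
    "soda": {"active": True, "depends_on": []},
}
print(find_stale_active(cart))  # ['extra_cheese']
```

Running a pass like this after each update gives a cheap signal that a revision was only partly applied, before the stale node can surface in a response.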

