Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
TME reduces wrong or conflicting actions from LLM agents and lowers API token costs by pruning irrelevant history, making multi-step assistants more reliable and cheaper to run.
Summary TLDR
TME is a lightweight memory controller that turns an off-the-shelf LLM into a revision-aware agent. It stores task state in a directed acyclic graph (TMS-DAG) and uses TRIM (an intent parser) to add, update, or check nodes. In four interactive scenarios (trip planning, cooking, meeting scheduling, cart editing) across 27 user turns, TME reduced hallucinations and confusions to 0 and cut token use by 19.4% vs a flat baseline. The code and benchmarks are open-source.
Problem Statement
LLM agents that concatenate conversation history often hallucinate, repeat actions, or misinterpret corrections because linear context cannot track evolving subtasks, dependencies, or revisions. We need a lightweight memory layer that preserves task state, supports edits, filters irrelevant history, and works without fine-tuning.
Main Contribution
Task Memory Engine (TME): a plug-in spatial memory that stores tasks as a DAG instead of linear context.
TRIM: an LLM-prompted intent and subtask parser that maps inputs to DAG operations (new, update, check).
Empirical validation across 4 multi-step scenarios showing zero hallucinations/confusions in the study and ~19.4% token savings.
Open-source release of code, case scripts, and benchmarks for reproducibility.
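To make the DAG-plus-operations idea concrete, here is a minimal sketch of a task graph that supports the three operations named above (new, update, check). The class names `TaskNode` and `TaskDAG`, their fields, and the trip example are illustrative assumptions, not the released TME code.

```python
from dataclasses import dataclass, field


@dataclass
class TaskNode:
    """One subtask; `history` keeps superseded values so revisions stay visible."""
    node_id: str
    value: str
    parents: list = field(default_factory=list)  # ids of nodes this one depends on
    history: list = field(default_factory=list)  # earlier values, oldest first


class TaskDAG:
    """Hypothetical stand-in for a TMS-DAG: nodes plus dependency edges."""

    def __init__(self):
        self.nodes = {}

    def new(self, node_id, value, parents=()):
        # Parents must already exist and ids are unique, so a new edge
        # always points at an older node and can never close a cycle.
        missing = [p for p in parents if p not in self.nodes]
        if missing:
            raise KeyError(f"unknown parent(s): {missing}")
        if node_id in self.nodes:
            raise ValueError(f"node {node_id!r} already exists")
        self.nodes[node_id] = TaskNode(node_id, value, list(parents))

    def update(self, node_id, value):
        # Record the old value instead of overwriting it silently.
        node = self.nodes[node_id]
        node.history.append(node.value)
        node.value = value

    def check(self, node_id):
        # Read-only: a check must never change task state.
        return self.nodes[node_id].value


dag = TaskDAG()
dag.new("trip", "plan a weekend trip")
dag.new("destination", "Paris", parents=["trip"])
dag.update("destination", "Rome")
print(dag.check("destination"), dag.nodes["destination"].history)
```

Keeping `check` free of side effects is the point of separating it from `update`: a question about the current state should not be able to rewrite that state.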
Key Findings
TME eliminated observed hallucinations in the evaluated scenarios.
TME eliminated observed confusions (misinterpreted checks vs updates) in the evaluated scenarios.
TME reduced prompt token usage by pruning context to a compact subgraph.
Results
Hallucinations (count)
Confusions (count)
Token usage (total tokens)
Task consistency
Who Should Care
What To Try In 7 Days
Prototype a DAG memory for one multi-turn flow (e.g., scheduling) and route LLM prompts through a retrieved subgraph.
Build a TRIM prompt (few-shot) to classify new/update/check intents and run A/B tests vs linear history.
Measure hallucination counts and token use over 20–50 turns to validate cost and correctness gains.
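For the second step, a few-shot intent prompt might be assembled and its reply validated roughly as follows. This is a sketch under assumptions: the example turns, the JSON reply format, and the function names `build_prompt` and `parse_reply` are invented for illustration, and the call to the LLM itself is left out.

```python
import json

# Hypothetical few-shot examples; a real prompt would use turns from your own flow.
FEW_SHOT = [
    ("Book a table for Friday at 7pm", {"intent": "new", "subtask": "dinner_time"}),
    ("Actually make it 8pm", {"intent": "update", "subtask": "dinner_time"}),
    ("What time is dinner again?", {"intent": "check", "subtask": "dinner_time"}),
]
VALID_INTENTS = {"new", "update", "check"}


def build_prompt(user_turn):
    """Assemble a few-shot classification prompt ending at the slot to fill."""
    lines = [
        "Classify the user turn as new, update, or check and name the subtask.",
        "Reply with one JSON object with keys 'intent' and 'subtask'.",
        "",
    ]
    for text, label in FEW_SHOT:
        lines.append(f"User: {text}")
        lines.append(f"Label: {json.dumps(label)}")
    lines.append(f"User: {user_turn}")
    lines.append("Label:")
    return "\n".join(lines)


def parse_reply(reply):
    """Return (intent, subtask), or raise ValueError on anything unexpected."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not JSON: {reply!r}") from exc
    if not isinstance(data, dict):
        raise ValueError(f"reply is not a JSON object: {reply!r}")
    intent = data.get("intent")
    subtask = data.get("subtask")
    if not isinstance(intent, str) or intent not in VALID_INTENTS:
        raise ValueError(f"unknown intent: {intent!r}")
    if not isinstance(subtask, str) or not subtask:
        raise ValueError(f"missing subtask: {subtask!r}")
    return intent, subtask


print(parse_reply('{"intent": "update", "subtask": "dinner_time"}'))
```

Rejecting malformed replies outright, rather than guessing, keeps a parsing error from turning into a wrong graph update; the rejected turns can also be logged as data for the A/B comparison.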
Agent Features
Memory
- spatial memory (DAG) — stores tasks as nodes
- revision history per node
Planning
- task decomposition
- dependency-tracked revisions
Tool Use
- LLM prompting (no fine-tuning required)
- few-shot intent parsing
Frameworks
- TME
- TRIM
- TMS-DAG
Is Agentic
true
Architectures
- modular memory controller (TME)
- TMS-DAG forest (directed acyclic graph)
Optimization Features
Token Efficiency
- 19.4% total token savings in 6-round form-filling (725 vs 899 tokens)
- up to 42.8% savings in correction-heavy rounds (Round 5 example)
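The 19.4% figure follows directly from the two token totals quoted above. A short check, where the function name `token_savings_pct` is just an illustrative helper:

```python
def token_savings_pct(baseline_tokens, pruned_tokens):
    """Percentage of baseline tokens saved by the pruned prompt."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - pruned_tokens) / baseline_tokens


# Totals from the 6-round form-filling case: 899 (flat baseline) vs 725 (TME).
print(round(token_savings_pct(899, 725), 1))  # 19.4
```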
System Optimization
- adjacency-list DAG storage for memory efficiency
- node-level history to avoid replaying full conversation
Inference Optimization
- compact subgraph retrieval to reduce prompt size
- LLM prompts at low temperature (0.3) for more consistent parsing
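Compact subgraph retrieval can be pictured as collecting only the node under discussion and the nodes it depends on, then rendering just those into the prompt. The sketch below assumes a plain adjacency list mapping each node to its parents; the traversal choice, the function names, and the example graph are assumptions, since the summary does not specify how TME selects the subgraph.

```python
def retrieve_subgraph(parents, focus):
    """Return the focus node followed by all of its ancestors."""
    ordered, seen, stack = [], set(), [focus]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        ordered.append(node)
        stack.extend(parents.get(node, []))
    return ordered


def render_context(values, node_ids):
    """Render the selected nodes as compact 'id: value' prompt lines."""
    return "\n".join(f"{node_id}: {values[node_id]}" for node_id in node_ids)


# Hypothetical task graph: each key lists the nodes it depends on.
parents = {
    "trip": [],
    "destination": ["trip"],
    "hotel": ["destination"],
    "dinner": ["trip"],
    "recipe": [],
}
values = {
    "trip": "weekend trip",
    "destination": "Rome",
    "hotel": "near the station",
    "dinner": "Friday 8pm",
    "recipe": "lasagna",
}

selected = retrieve_subgraph(parents, "hotel")
print(selected)  # ['hotel', 'destination', 'trip']
print(render_context(values, selected))
```

Here a question about the hotel pulls in the destination and the trip but leaves the unrelated dinner and recipe nodes out of the prompt, which is where the token savings would come from.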
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- TRIM currently relies on few-shot LLM prompts and can be brittle on ambiguous multi-intent inputs.
- TMS-DAG caused residual inactive nodes in the cart case before a task-specific TRIM adaptation.
- Rollback and full conflict resolution are noted but not fully implemented.
- Experiments are small-scale, scripted, and run with a single LLM (ChatGPT-4o) on consumer hardware.
When Not To Use
- Simple linear tasks where a flat history suffices (e.g., basic cart edits) and the DAG only adds overhead.
- Latency-critical paths where extra intent classification and graph ops would slow responses.
Failure Modes
- Misclassified intents (TRIM errors) cause wrong or missing node updates.
- DAG dependency conflicts can leave obsolete nodes active, producing hallucinated items.
- Ambiguous multi-intent user inputs can cause inconsistent subtask splitting and brittle behavior.
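The second failure mode, obsolete nodes left active, can at least be detected with a consistency check over the graph. The sketch below flags any active node that still depends on a deactivated ancestor. It is an illustrative diagnostic under assumed data structures (a parent adjacency list and an `active` flag per node), not a mechanism the source describes.

```python
def find_orphaned_active(parents, active):
    """Return active nodes that have at least one inactive ancestor."""

    def has_inactive_ancestor(node, seen):
        for parent in parents.get(node, []):
            if parent in seen:
                continue  # already explored without finding an inactive node
            seen.add(parent)
            if not active.get(parent, False) or has_inactive_ancestor(parent, seen):
                return True
        return False

    return sorted(
        node
        for node, is_active in active.items()
        if is_active and has_inactive_ancestor(node, set())
    )


# Hypothetical cart: the pizza was removed but its topping is still marked active.
parents = {"cart": [], "pizza": ["cart"], "extra_cheese": ["pizza"], "soda": ["cart"]}
active = {"cart": True, "pizza": False, "extra_cheese": True, "soda": True}

print(find_orphaned_active(parents, active))  # ['extra_cheese']
```

Running a check like this after each update would surface stale items before they reach the prompt, where they could otherwise show up as hallucinated cart contents.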
Core Entities
Models
- ChatGPT-4o
Metrics
- Hallucinations (count)
- Confusions (count)
- Token usage (tokens, % savings)
- Task consistency (consistent tasks / total tasks)
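These metrics are simple tallies over hand-labeled results. As a sketch, assuming each evaluated item is labeled with three booleans (the `Outcome` record and `summarize` helper are hypothetical, and the sample labels are made up purely to exercise the function):

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """Hand-assigned labels for one evaluated item (assumed labeling scheme)."""
    hallucinated: bool
    confused: bool
    consistent: bool


def summarize(outcomes):
    """Count hallucinations and confusions; compute consistent / total."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("no outcomes to summarize")
    return {
        "hallucinations": sum(o.hallucinated for o in outcomes),
        "confusions": sum(o.confused for o in outcomes),
        "task_consistency": sum(o.consistent for o in outcomes) / total,
    }


# Made-up labels for illustration only; these are not results from the study.
sample = [
    Outcome(hallucinated=False, confused=False, consistent=True),
    Outcome(hallucinated=True, confused=False, consistent=False),
    Outcome(hallucinated=False, confused=True, consistent=True),
    Outcome(hallucinated=False, confused=False, consistent=True),
]
print(summarize(sample))
```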
Datasets
- 27-turn custom multi-scenario benchmark (trip, cooking, meeting, cart)
Benchmarks
- Case-study benchmark across 4 scenarios (27 user turns)
Context Entities
Models
- ReAct (baseline prompting style)
- Chain-of-Thought (baseline)
- base-flat (linear history baseline)

