Amory: build narrative episodic memory that matches full-context quality while halving latency

Overview

Decision Snapshot: Needs Validation

Results on LOCOMO and an agentic scenario show strong gains, but evaluation is limited to public benchmarks and a single base LLM; engineering cost rises from offline agentic processing and LLM-based retrieval.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/6

Reproducibility

Status: Partial assets available

Open source: No

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 65%

Authors

Yue Zhou, Xiaobo Guo, Belhassen Bayar, Srinivasan H. Sengamedu

Links

Abstract / PDF / Data

Why It Matters For Business

Amory raises long-conversation answer quality substantially while avoiding full-history cost; that improves product usefulness for persistent assistants with acceptable latency.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, CTO, Founder

Summary TLDR

Amory is a working-memory system that turns long conversations into coherent story-like episodic threads plus a small semantic graph. It forms memory offline via agentic LLM reasoning (segmenting, binding, momentum-aware consolidation, semanticization) and retrieves by reasoning over narratives instead of plain embedding similarity. On the LOCOMO long-conversation benchmark, Amory (episodic+semantic) raises overall LLM-as-a-Judge accuracy to 87.7% vs Mem0 59.9% (+27.8% abs) while cutting response latency roughly in half compared to full-context reasoning. Improvements are largest on temporal and multi-hop queries; costs are extra offline processing and an agentic retriever that adds online-l

Problem Statement

Long conversations blow up compute if every turn reprocesses full history. Existing memory systems usually store fragmented embeddings or noisy graphs and then retrieve by similarity. That is fast but loses narrative context and hurts multi-hop and temporal reasoning. We need a memory that keeps coherent, chronological context while staying efficient.

Main Contribution

Amory: a working-memory framework that constructs episodic narratives and a peripheral semantic graph using offline agentic LLM procedures.

Momentum-aware consolidation: wait for topic inactivity to create subplots and update main headlines, reducing premature or noisy summaries.
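The inactivity rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Thread` class, the `idle_seconds` threshold, and the `summarize` callback (a stand-in for an LLM call) are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Thread:
    """Hypothetical narrative thread holding raw turns that await consolidation."""
    topic: str
    pending_turns: list[str] = field(default_factory=list)
    last_active: float = 0.0  # timestamp (seconds) of the most recent turn
    subplots: list[str] = field(default_factory=list)


def add_turn(thread: Thread, turn: str, now: float) -> None:
    """Record a new turn and mark the thread as active."""
    thread.pending_turns.append(turn)
    thread.last_active = now


def consolidate_if_inactive(thread, now, idle_seconds, summarize):
    """Summarize pending turns into a subplot only once the topic has gone quiet.

    `summarize` is an assumed callback standing in for an LLM summarizer.
    Returns True if a subplot was created, False if consolidation was deferred.
    """
    if not thread.pending_turns:
        return False
    if now - thread.last_active < idle_seconds:
        return False  # topic still has momentum, so defer summarization
    thread.subplots.append(summarize(thread.pending_turns))
    thread.pending_turns.clear()
    return True
```

Deferring the summary until the thread is idle means a topic that is still developing is summarized once, rather than repeatedly on partial content.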

Key Findings

Combining episodic and semantic memory yields large quality gains over prior working-memory baselines.

Numbers: EM+SM overall J-score 87.7% vs Mem0 59.9% (+27.8% abs)

Practical Use: If you need higher answer correctness on long conversations, use narrative episodic memory plus a small semantic store rather than raw embeddings.

Evidence Ref: Table 1 (EM+SM vs Mem0)

Amory matches or exceeds full-context quality on multi-hop and temporal questions while using much less context.

Numbers: Multi-hop: EM 85.6% vs FC 82.6% (+3.0% abs); Temporal: EM+SM 90.4% vs FC 76.6% (+11.0%)

Practical Use: For multi-step or time-based queries, structured narratives improve accuracy over passing the full history, letting you avoid full-context costs.

Evidence Ref: Table 1 (task breakdown)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Overall J-score (EM+SM vs Mem0)	87.7% vs 59.9%	Mem0 59.9%	+27.8% abs	LOCOMO (overall)	Table 1 shows EM+SM 87.7% vs Mem0 59.9%	Table 1
Multi-hop J-score	EM 85.6% vs FC 82.6%	Full Context (FC) 82.6%	+3.0% abs	LOCOMO (multi-hop)	Table 1 multi-hop numbers	Table 1

What To Try In 7 Days

Prototype narrative binding: segment a long chat into story threads using an LLM and compare single-turn answers with vs without narrative context.

Implement inactive consolidation: consolidate threads only after a pause to see if temporal question accuracy improves.

Add a tiny semantic graph for peripheral facts and test whether single-fact lookups improve single-hop accuracy.
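For the third experiment, an in-memory triplet store is enough to test the idea before committing to a graph database. The sketch below is an assumption-laden stand-in: the `TripletStore` class and its methods are hypothetical, and it does not use Neo4j or Cypher as the paper's system does.

```python
from collections import defaultdict


class TripletStore:
    """A tiny in-memory store of (subject, relation, object) facts.

    Hypothetical stand-in for a graph database, meant only for prototyping.
    """

    def __init__(self):
        self._by_subject = defaultdict(set)

    def add(self, subject, relation, obj):
        """Store one fact; subjects and relations are matched case-insensitively."""
        self._by_subject[subject.lower()].add((relation.lower(), obj))

    def lookup(self, subject, relation=None):
        """Return the sorted objects for a subject, optionally filtered by relation."""
        facts = self._by_subject.get(subject.lower(), set())
        return sorted(o for r, o in facts if relation is None or r == relation.lower())


store = TripletStore()
store.add("Alice", "lives_in", "Lisbon")
store.add("Alice", "owns", "a cat")
print(store.lookup("alice", "lives_in"))
```

A single-hop question such as "Where does Alice live?" then reduces to one `lookup` call, which makes it easy to compare single-hop accuracy with and without the store.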

Agent Features

Memory

episodic memory: hierarchical narrative threads

semantic memory: peripheral facts as graph triplets

momentum-aware consolidation (inactive triggers)

Planning

offline agentic reasoning for MemInit/MemBinding

momentum-aware consolidation strategy

coherence selection of top-k leaf nodes

Tool Use

LLM workers for segmentation, consolidation, retrieval

Neo4j + Cypher for semantic memory

embedding retriever as baseline comparison

Frameworks

Amory

Is Agentic

Yes

Architectures

episodic narrative tree (plot → subplots)

semantic graph (Neo4j triplets)

coherence-driven retriever
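A plot-to-subplot tree with top-k leaf selection might look like the sketch below. The `Node` class and `top_k_leaves` function are hypothetical names, and the word-overlap scorer is a deliberately simple substitute for the LLM-based coherence reasoning the summary describes, swapped in so the example runs without a model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a narrative tree: a plot headline with optional subplots."""
    headline: str
    children: list["Node"] = field(default_factory=list)

    def leaves(self):
        """Return leaf nodes in left-to-right (chronological) order."""
        if not self.children:
            return [self]
        out = []
        for child in self.children:
            out.extend(child.leaves())
        return out


def overlap_score(query, text):
    """Count shared lowercase words; a stand-in for an LLM coherence judgment."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def top_k_leaves(root, query, k, score=overlap_score):
    """Rank all leaves against the query and keep the k best."""
    ranked = sorted(root.leaves(), key=lambda n: score(query, n.headline), reverse=True)
    return ranked[:k]
```

Because `sorted` is stable, leaves with equal scores keep their chronological order, which preserves some temporal structure even with a crude scorer.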

Collaboration

asynchronous offline workers update memories while system serves queries

Optimization Features

Token Efficiency

context compression via top-k narrative retrieval
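One way to quantify this compression is sketched below. The summary does not give the paper's exact formula, so the definition used here (one minus the ratio of retrieved tokens to full-history tokens) is an assumption, as is the whitespace-based token count in the hypothetical `approx_tokens` helper.

```python
def approx_tokens(text):
    """Rough token count by whitespace splitting (assumption, not a real tokenizer)."""
    return len(text.split())


def compression_rate(full_history, retrieved):
    """Fraction of full-history tokens avoided by sending only retrieved context.

    Assumed definition: 1 - retrieved_tokens / full_tokens.
    """
    full = sum(approx_tokens(t) for t in full_history)
    kept = sum(approx_tokens(t) for t in retrieved)
    if full == 0:
        raise ValueError("full history is empty")
    return 1.0 - kept / full


history = ["a b c d", "e f g h", "i j"]
print(compression_rate(history, ["a b c d"]))
```

Swapping `approx_tokens` for the tokenizer of the deployed model would give a figure that maps more directly to cost.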

System Optimization

asynchronous offline memory processing to avoid blocking online latency
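The pattern of keeping memory construction off the request path can be illustrated with a queue and a background worker. This is a minimal sketch using Python's `asyncio`; the function names are hypothetical, and the uppercase transform is a placeholder for the LLM-based segmentation and binding steps.

```python
import asyncio


async def memory_worker(queue, memory):
    """Drain turns from the queue and fold them into memory in the background."""
    while True:
        turn = await queue.get()
        try:
            memory.append(turn.upper())  # placeholder for LLM-based processing
        finally:
            queue.task_done()


async def answer(query, memory):
    """Serve a query from whatever memory has been built so far."""
    return f"{query} (using {len(memory)} memory items)"


async def handle_turn(turn, queue, memory):
    """Enqueue the turn for background processing, then answer without waiting."""
    queue.put_nowait(turn)
    return await answer(turn, memory)


async def demo():
    queue = asyncio.Queue()
    memory = []
    worker = asyncio.create_task(memory_worker(queue, memory))
    replies = [await handle_turn(t, queue, memory) for t in ["hi", "plan trip"]]
    await queue.join()  # wait until the worker has processed every queued turn
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return replies, memory
```

Running `asyncio.run(demo())` returns the replies alongside the memory that the worker built after the replies were produced. The trade-off is that a query may be answered from memory that lags the latest turns.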

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: No

License: Unknown

Data URLs

LOCOMO (Maharana et al., 2024)

AgentIF (Qi et al., 2025)

Risks & Boundaries

Limitations

Evaluation relies largely on LOCOMO and synthetic AgentIF constructions; real-world diversity is limited.

System uses free-text LLM procedures rather than learned neural memory representations.

When Not To Use

When you can afford full-context reasoning and need the absolute best single-hop recall without extra engineering.

When strict ultra-low latency is mandatory and any LLM-based retrieval is too slow.

Failure Modes

LLM mis-segmentation or misbinding can merge unrelated turns into the same narrative.

Semanticization via OpenIE-style extraction may produce noisy graph facts from casual dialogue.

Core Entities

Models

Claude 3.5 Sonnet V2 (used as base LLM)

Metrics

Accuracy

Latency percentiles (p50, p90, p95, p99)

Memory coverage rate

Context compression rate
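Latency percentiles like those listed above can be computed from raw timings with the nearest-rank method. The summary does not say which percentile method the paper uses, so nearest-rank is an assumption, and `percentile` and `latency_report` are hypothetical helper names.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of data at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_report(samples_ms):
    """Summarize latency samples at the four percentiles tracked in the metrics list."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 90, 95, 99)}
```

Comparing `latency_report` output for a memory-backed pipeline against a full-context baseline shows whether a median improvement also holds in the tail.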

Datasets

LOCOMO

AgentIF (constructed agentic conversations)

Benchmarks

LOCOMO (long-term conversational reasoning)
