Amory: build narrative episodic memory that matches full-context quality while halving latency

January 9, 2026 · 8 min

Overview

Decision Snapshot: Needs Validation

Results on LOCOMO and an agentic scenario show strong gains, but evaluation is limited to public benchmarks and a single base LLM; engineering cost rises from offline agentic processing and LLM-based retrieval.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/6

Reproducibility

Status: Partial assets available

Open source: No

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 65%

Authors

Yue Zhou, Xiaobo Guo, Belhassen Bayar, Srinivasan H. Sengamedu

Links

Abstract / PDF / Data

Why It Matters For Business

Amory raises long-conversation answer quality substantially while avoiding full-history cost; that improves product usefulness for persistent assistants with acceptable latency.

Summary TLDR

Amory is a working-memory system that turns long conversations into coherent story-like episodic threads plus a small semantic graph. It forms memory offline via agentic LLM reasoning (segmenting, binding, momentum-aware consolidation, semanticization) and retrieves by reasoning over narratives instead of plain embedding similarity. On the LOCOMO long-conversation benchmark, Amory (episodic+semantic) raises overall LLM-as-a-Judge accuracy to 87.7% vs Mem0 59.9% (+27.8% abs) while cutting response latency roughly in half compared to full-context reasoning. Improvements are largest on temporal and multi-hop queries; costs are extra offline processing and an agentic retriever that adds online-l

Problem Statement

Long conversations blow up compute if every turn reprocesses full history. Existing memory systems usually store fragmented embeddings or noisy graphs and then retrieve by similarity. That is fast but loses narrative context and hurts multi-hop and temporal reasoning. We need a memory that keeps coherent, chronological context while staying efficient.

Main Contribution

Amory: a working-memory framework that constructs episodic narratives and a peripheral semantic graph using offline agentic LLM procedures.

Momentum-aware consolidation: wait for topic inactivity to create subplots and update main headlines, reducing premature or noisy summaries.
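The inactivity trigger described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Thread` class, the `idle_seconds` threshold, and the `summarize` callback (a stand-in for an LLM summarization call) are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Thread:
    """A story thread: raw turns awaiting consolidation plus finished subplots."""
    topic: str
    pending: List[str] = field(default_factory=list)
    subplots: List[str] = field(default_factory=list)
    last_active: float = 0.0

    def add_turn(self, text: str, now: float) -> None:
        self.pending.append(text)
        self.last_active = now


def consolidate_idle(
    threads: List[Thread],
    now: float,
    idle_seconds: float,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Consolidate only threads that have been quiet for at least idle_seconds.

    Returns the topics consolidated on this pass.
    """
    done = []
    for t in threads:
        if t.pending and now - t.last_active >= idle_seconds:
            t.subplots.append(summarize(t.pending))
            t.pending = []
            done.append(t.topic)
    return done


def join_summary(turns: List[str]) -> str:
    # Placeholder for an LLM summarizer (assumption): just joins the turns.
    return " / ".join(turns)
```

A thread that is still receiving turns keeps its raw turns pending, so the summary is written once the topic has settled rather than after every message.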

Key Findings

Combining episodic and semantic memory yields large quality gains over prior working-memory baselines.

Numbers: EM+SM overall J-score 87.7% vs Mem0 59.9% (+27.8% abs)

Practical Use: If you need higher answer correctness on long conversations, use narrative episodic memory plus a small semantic store rather than raw embeddings.

Evidence Ref: Table 1 (EM+SM vs Mem0)

Amory matches or exceeds full-context quality on multi-hop and temporal questions while using much less context.

Numbers: Multi-hop: EM 85.6% vs FC 82.6% (+3.0% abs); Temporal: EM+SM 90.4% vs FC 76.6% (+11.0%)

Practical Use: For multi-step or time-based queries, structured narratives improve accuracy versus passing full history, letting you avoid full-context costs.

Evidence Ref: Table 1 (task breakdown)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Overall J-score (EM+SM vs Mem0) | 87.7% vs 59.9% | Mem0 59.9% | +27.8% abs | LOCOMO (overall) | Table 1 shows EM+SM 87.7% vs Mem0 59.9% | Table 1
Multi-hop J-score | EM 85.6% vs FC 82.6% | Full Context (FC) 82.6% | +3.0% abs | LOCOMO (multi-hop) | Table 1 multi-hop numbers | Table 1
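The "abs" deltas in the two rows above are differences in percentage points, not relative improvements. A short sketch of the arithmetic, using only the scores listed in those rows (the function names are illustrative, and the relative gain is a derived figure that the summary itself does not report):

```python
def abs_delta(system: float, baseline: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(system - baseline, 1)


def rel_gain(system: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent, one decimal."""
    return round(100 * (system - baseline) / baseline, 1)


# (label, system score, baseline score) taken from the two rows above.
rows = [
    ("Overall J-score, EM+SM vs Mem0", 87.7, 59.9),
    ("Multi-hop J-score, EM vs FC", 85.6, 82.6),
]

for label, system, baseline in rows:
    print(f"{label}: {abs_delta(system, baseline):+.1f} points abs, "
          f"{rel_gain(system, baseline):+.1f}% relative")
```

Keeping the two measures apart matters when comparing against papers that report relative gains.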

What To Try In 7 Days

Prototype narrative binding: segment a long chat into story threads using an LLM and compare single-turn answers with vs without narrative context.

Implement inactivity-triggered consolidation: consolidate a thread only after it has gone quiet for a while, then check whether temporal-question accuracy improves.

Add a tiny semantic graph for peripheral facts and test whether single-fact lookups improve single-hop accuracy.
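For the first experiment above, a rough prototype of narrative binding might look like the sketch below. The binding decision is a pluggable function so an LLM call can be swapped in later; the `keyword_binder` shown here is only a word-overlap stand-in (an assumption for illustration, not the method used in the paper), and all names are hypothetical.

```python
from typing import Callable, Dict, List, Optional

# A binder sees the new turn and the existing threads, and returns the name
# of the thread the turn belongs to, or None to open a new thread.
Binder = Callable[[str, Dict[str, List[str]]], Optional[str]]


def keyword_binder(turn: str, threads: Dict[str, List[str]]) -> Optional[str]:
    """Stand-in for an LLM binding call: pick the thread sharing the most words."""
    words = set(turn.lower().split())
    best, best_overlap = None, 0
    for name, turns in threads.items():
        seen = set(" ".join(turns).lower().split())
        overlap = len(words & seen)
        if overlap > best_overlap:
            best, best_overlap = name, overlap
    return best


def bind_turns(turns: List[str], bind: Binder) -> Dict[str, List[str]]:
    """Assign each turn to an existing thread, or open a new one when none fits."""
    threads: Dict[str, List[str]] = {}
    for turn in turns:
        name = bind(turn, threads)
        if name is None or name not in threads:
            name = f"thread-{len(threads) + 1}"
            threads[name] = []
        threads[name].append(turn)
    return threads
```

To run the comparison, answer the same questions once with the raw turn list as context and once with only the matching thread, and score both with the same judge.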

Agent Features

Memory
episodic memory: hierarchical narrative threads; semantic memory: peripheral facts as graph triplets; momentum-aware consolidation (inactive triggers)
Planning
offline agentic reasoning for MemInit/MemBinding; momentum-aware consolidation strategy; coherence selection of top-k leaf nodes
Tool Use
LLM workers for segmentation, consolidation, retrieval; Neo4j + Cypher for semantic memory; embedding retriever as baseline comparison
Frameworks
Amory
Is Agentic

Yes

Architectures
episodic narrative tree (plot → subplots); semantic graph (Neo4j triplets); coherence-driven retriever
Collaboration
asynchronous offline workers update memories while system serves queries
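The semantic side of the features listed above stores peripheral facts as (subject, relation, object) triplets. The summary names Neo4j as the backing store; the sketch below is a plain in-memory stand-in for experimenting with the data shape before adding a database, and the `TripletStore` class and its methods are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


class TripletStore:
    """In-memory stand-in for a graph database holding peripheral facts."""

    def __init__(self) -> None:
        self._by_subject: Dict[str, Set[Triplet]] = defaultdict(set)

    def add(self, subject: str, relation: str, obj: str) -> None:
        # A set drops exact duplicates extracted from repeated mentions.
        self._by_subject[subject.lower()].add((subject, relation, obj))

    def facts_about(self, subject: str) -> List[Triplet]:
        # .get avoids creating empty entries for unknown subjects.
        return sorted(self._by_subject.get(subject.lower(), set()))
```

Single-fact lookups such as "where does Alice live?" can then be answered from `facts_about("Alice")` without touching the episodic narratives.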

Optimization Features

Token Efficiency
context compression via top-k narrative retrieval
System Optimization
asynchronous offline memory processing to avoid blocking online latency
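One way to keep memory processing off the online path is a background worker fed by a queue. The sketch below uses Python's `asyncio` to illustrate the pattern only; the function names are hypothetical, and `asyncio.sleep(0)` stands in for the slow LLM processing step.

```python
import asyncio
from typing import List, Tuple


async def memory_worker(queue: asyncio.Queue, memory: List[str]) -> None:
    """Background task: drain new turns and fold them into memory."""
    while True:
        turn = await queue.get()
        await asyncio.sleep(0)  # placeholder for slow LLM processing (assumption)
        memory.append(f"processed: {turn}")
        queue.task_done()


async def handle_turn(turn: str, queue: asyncio.Queue, memory: List[str]) -> str:
    """Online path: enqueue the turn and answer from current memory without waiting."""
    queue.put_nowait(turn)
    return f"answer using {len(memory)} memory items"


async def demo() -> Tuple[str, str, List[str]]:
    queue: asyncio.Queue = asyncio.Queue()
    memory: List[str] = []
    worker = asyncio.create_task(memory_worker(queue, memory))
    first = await handle_turn("hello", queue, memory)
    await queue.join()  # only the demo waits here; a live server would not
    second = await handle_turn("again", queue, memory)
    await queue.join()
    worker.cancel()
    return first, second, memory
```

The first answer is produced before the worker has touched the new turn, which is the trade-off of this design: responses stay fast, but memory lags the conversation by however long processing takes.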

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: No
License: Unknown

Data URLs

LOCOMO (Maharana et al., 2024); AgentIF (Qi et al., 2025)

Risks & Boundaries

Limitations

Evaluation relies largely on LOCOMO and synthetic AgentIF constructions; real-world diversity is limited.

System uses free-text LLM procedures rather than learned neural memory representations.

When Not To Use

When you can afford full-context reasoning and need the absolute best single-hop recall without extra engineering.

When strict ultra-low latency is mandatory and any LLM-based retrieval is too slow.

Failure Modes

LLM mis-segmentation or misbinding can merge unrelated turns into the same narrative.

Semanticization via OpenIE-style extraction may produce noisy graph facts from casual dialogue.

Core Entities

Models

Claude 3.5 Sonnet V2 (used as base LLM)

Metrics

Accuracy; Latency percentiles (p50, p90, p95, p99); Memory coverage rate; Context compression rate
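For the latency percentiles listed above, a small helper is enough when reproducing the measurement on your own traces. This sketch uses the nearest-rank definition, which is one common convention; the summary does not say which definition the paper uses, so treat that choice as an assumption.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(samples: Sequence[float]) -> Dict[str, float]:
    """Summarize response latencies at the four percentiles named above."""
    return {f"p{p}": percentile(samples, p) for p in (50, 90, 95, 99)}
```

Tail percentiles are the ones to watch when an LLM-based retriever sits on the online path, since a few slow retrievals barely move the median.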

Datasets

LOCOMO; AgentIF (constructed agentic conversations)

Benchmarks

LOCOMO (long-term conversational reasoning)