Amory: build narrative episodic memory that matches full-context quality while halving latency

January 9, 2026 | 8 min

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.5

Citation Count

0

Authors

Yue Zhou, Xiaobo Guo, Belhassen Bayar, Srinivasan H. Sengamedu

Why It Matters For Business

Amory substantially improves answer quality on long conversations without reprocessing the full history on every turn. That makes persistent assistants more useful while keeping latency acceptable.

Summary TLDR

Amory is a working-memory system that turns long conversations into coherent, story-like episodic threads plus a small semantic graph. It forms memory offline via agentic LLM reasoning (segmenting, binding, momentum-aware consolidation, semanticization) and retrieves by reasoning over narratives instead of relying on plain embedding similarity. On the LOCOMO long-conversation benchmark, Amory (episodic+semantic) raises overall LLM-as-a-Judge accuracy to 87.7% vs 59.9% for Mem0 (+27.8 percentage points) while cutting response latency roughly in half compared to full-context reasoning. Improvements are largest on temporal and multi-hop queries; the costs are extra offline processing and an agentic retriever that adds online latency.

Problem Statement

Long conversations blow up compute if every turn reprocesses full history. Existing memory systems usually store fragmented embeddings or noisy graphs and then retrieve by similarity. That is fast but loses narrative context and hurts multi-hop and temporal reasoning. We need a memory that keeps coherent, chronological context while staying efficient.

Main Contribution

Amory: a working-memory framework that constructs episodic narratives and a peripheral semantic graph using offline agentic LLM procedures.

Momentum-aware consolidation: wait for topic inactivity to create subplots and update main headlines, reducing premature or noisy summaries.

Coherence-driven retriever: an LLM-based retriever that reasons over narrative structure (headlines, characters) instead of only embedding similarity.

Evaluation on LOCOMO and constructed AgentIF agentic scenarios showing large quality gains and moderate latency.

Key Findings

Combining episodic and semantic memory yields large quality gains over prior working-memory baselines.

Numbers: EM+SM overall J-score 87.7% vs Mem0 59.9% (+27.8 points absolute)

Amory matches or exceeds full-context quality on multi-hop and temporal questions while using much less context.

Numbers: Multi-hop: EM 85.6% vs FC 82.6% (+3.0 points); Temporal: EM+SM 90.4% vs FC 76.6% (+13.8 points)

Amory compresses the usable context dramatically, which shrinks the online context and cuts latency roughly in half relative to full-context reasoning.

Numbers: Median context compression at top-k=2 >96.3%; p99 latency EM+SM 4.18s vs FC 9.35s (~55% lower)
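The "about half" latency claim can be checked directly from the two p99 figures quoted above. The short sketch below only redoes that arithmetic; the helper name `relative_reduction` is illustrative and not from the paper.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Fraction by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline


fc_p99 = 9.35     # full-context p99 latency in seconds (from the summary)
amory_p99 = 4.18  # Amory EM+SM p99 latency in seconds (from the summary)

latency_cut = relative_reduction(fc_p99, amory_p99)
speedup = fc_p99 / amory_p99

print(f"{latency_cut:.1%}")  # 55.3%
print(f"{speedup:.2f}x")     # 2.24x
```

Note that this compares tail (p99) latency only; the summary does not give the corresponding median figures.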

Consolidation timing matters: inactive consolidation improves temporal reasoning more than rapid consolidation.

Numbers: Overall J: no consolidation 81.7% → inactive consolidation 87.7% (Table 2); Temporal: rapid 82.3% vs inactive 87.7%

Agentic coherence retrieval achieves higher coverage than embedding retrieval, especially on multi-hop queries.

Numbers: Coverage saturates around k=4; embedding retriever coverage is significantly lower on multi-hop queries (Figure 4)

Results

Overall J-score (EM+SM vs Mem0)

Value: 87.7% vs 59.9%

Baseline: Mem0 59.9%

Multi-hop J-score

Value: EM 85.6% vs FC 82.6%

Baseline: Full Context (FC) 82.6%

Temporal J-score

Value: EM+SM 90.4% vs FC 76.6%

Baseline: Full Context (FC) 76.6%

Latency p99

Value: EM+SM 4.18s vs FC 9.35s

Baseline: Full Context (FC) 9.35s

Constraint recall (agentic long conversation)

Value: EM+SM 47.4%

Baseline: ReadAgent (best baseline) 36.8%

Context compression at top-k=2

Value: >96.3% compressed

Who Should Care

What To Try In 7 Days

Prototype narrative binding: segment a long chat into story threads using an LLM and compare single-turn answers with vs without narrative context.

Implement inactive consolidation: consolidate threads only after a pause to see if temporal question accuracy improves.

Add a tiny semantic graph for peripheral facts and test whether single-fact lookups improve single-hop accuracy.
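For the inactive-consolidation experiment, a minimal sketch might look like the following. It is not the paper's implementation: the `Thread` class, the `idle_seconds` threshold, and the `summarize` callback (a stand-in for an LLM call) are all assumptions made for illustration. The idea shown is only that a thread's pending turns are summarized once the thread has been quiet for a while, rather than after every turn.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Thread:
    """Hypothetical narrative thread: a headline plus raw and consolidated content."""
    headline: str
    pending: list[str] = field(default_factory=list)   # turns not yet consolidated
    subplots: list[str] = field(default_factory=list)  # consolidated summaries
    last_active: float = 0.0                           # time of the latest turn


def add_turn(thread: Thread, text: str, now: float) -> None:
    thread.pending.append(text)
    thread.last_active = now


def consolidate_inactive(
    threads: list[Thread],
    now: float,
    idle_seconds: float,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Consolidate only threads that have been idle for at least `idle_seconds`."""
    consolidated = []
    for t in threads:
        if t.pending and now - t.last_active >= idle_seconds:
            t.subplots.append(summarize(t.pending))
            t.pending.clear()
            consolidated.append(t.headline)
    return consolidated


def stub_summarize(turns: list[str]) -> str:
    # Stand-in for an LLM summarization call (assumption).
    return " / ".join(turns)


trip = Thread("Trip to Lisbon")
job = Thread("Job search")
add_turn(trip, "Booked flights", now=0.0)
add_turn(job, "Sent two applications", now=50.0)

done = consolidate_inactive([trip, job], now=60.0, idle_seconds=30.0,
                            summarize=stub_summarize)
print(done)  # ['Trip to Lisbon']
```

Here the job-search thread was active 10 seconds ago, so it is left untouched, while the trip thread has been idle for 60 seconds and gets a subplot. A "rapid" baseline for comparison could simply call the same function with `idle_seconds=0`.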

Agent Features

Memory

  • episodic memory: hierarchical narrative threads
  • semantic memory: peripheral facts as graph triplets
  • momentum-aware consolidation (inactive triggers)

Planning

  • offline agentic reasoning for MemInit/MemBinding
  • momentum-aware consolidation strategy
  • coherence selection of top-k leaf nodes

Tool Use

  • LLM workers for segmentation, consolidation, retrieval
  • Neo4j + Cypher for semantic memory
  • embedding retriever as baseline comparison
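
As a rough illustration of storing peripheral facts as triplets with Cypher, the sketch below builds parameterized queries as plain strings. The `Entity` label, the `FACT` relationship type, and the property names are assumptions chosen for this example; the summary does not describe Amory's actual graph schema. The predicate is kept as a relationship property because Cypher does not accept a parameter in place of a relationship type.

```python
# Hypothetical schema: (:Entity {name})-[:FACT {predicate, source_turn}]->(:Entity {name})
UPSERT_TRIPLET = """
MERGE (s:Entity {name: $subject})
MERGE (o:Entity {name: $object})
MERGE (s)-[r:FACT {predicate: $predicate}]->(o)
SET r.source_turn = $source_turn
"""

LOOKUP_FACTS = """
MATCH (s:Entity {name: $subject})-[r:FACT]->(o:Entity)
RETURN r.predicate AS predicate, o.name AS object
"""


def triplet_params(subject: str, predicate: str, obj: str, source_turn: int) -> dict:
    """Bundle one (subject, predicate, object) fact as Cypher query parameters."""
    return {
        "subject": subject,
        "predicate": predicate,
        "object": obj,
        "source_turn": source_turn,
    }


params = triplet_params("Alice", "owns_pet", "Biscuit", source_turn=42)
print(UPSERT_TRIPLET.strip())
print(params)
```

The query text and parameter dictionary would then be handed to whichever Neo4j client the system uses. Using `MERGE` rather than `CREATE` keeps repeated mentions of the same fact from producing duplicate nodes or edges.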

Frameworks

  • Amory

Is Agentic

true

Architectures

  • episodic narrative tree (plot → subplots)
  • semantic graph (Neo4j triplets)
  • coherence-driven retriever
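
A toy version of the plot → subplots tree and top-k leaf selection is sketched below. The `Node` fields and the `keyword_score` function are assumptions: in Amory the ranking is done by an LLM reasoning over headlines and characters, which this word-overlap scorer only stands in for so that the example runs without a model.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    """Hypothetical narrative node; leaves are subplots, inner nodes are plots."""
    headline: str
    characters: list[str] = field(default_factory=list)
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def leaves(node: Node) -> list[Node]:
    """Collect leaf nodes in left-to-right order."""
    if not node.children:
        return [node]
    out: list[Node] = []
    for child in node.children:
        out.extend(leaves(child))
    return out


def select_top_k(root: Node, query: str, k: int,
                 score: Callable[[str, Node], float]) -> list[Node]:
    """Rank leaves by `score` (highest first) and keep the top k."""
    ranked = sorted(leaves(root), key=lambda n: score(query, n), reverse=True)
    return ranked[:k]


def keyword_score(query: str, node: Node) -> float:
    # Stand-in for LLM coherence reasoning (assumption): count shared words.
    query_words = set(query.lower().split())
    node_words = set((node.headline + " " + " ".join(node.characters)).lower().split())
    return len(query_words & node_words)


root = Node("Alice plans a move", children=[
    Node("Alice tours apartments in Austin", ["Alice"]),
    Node("Bob adopts a dog", ["Bob"]),
    Node("Alice signs the lease", ["Alice", "landlord"]),
])

top = select_top_k(root, "when did alice sign the lease", k=2, score=keyword_score)
print([n.headline for n in top])
# ['Alice signs the lease', 'Alice tours apartments in Austin']
```

Only the text of the selected leaves would be placed in the answering prompt, which is where the context compression comes from.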

Collaboration

  • asynchronous offline workers update memories while system serves queries
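
The split between online serving and offline memory formation could be sketched with an `asyncio` queue, as below. This is an illustrative pattern only, not Amory's architecture: `memory_worker`, `answer`, and the in-memory list are hypothetical, and `asyncio.sleep(0)` stands in for a slow LLM call.

```python
import asyncio


async def memory_worker(queue: asyncio.Queue, memory: list[str]) -> None:
    """Background worker: turns raw conversation turns into memory items."""
    while True:
        turn = await queue.get()
        await asyncio.sleep(0)  # stand-in for slow LLM memory formation (assumption)
        memory.append(f"processed: {turn}")
        queue.task_done()


async def answer(query: str, memory: list[str]) -> str:
    """Online path: answers with whatever memory exists right now."""
    return f"answer to {query!r} using {len(memory)} memory items"


async def main() -> tuple[str, str]:
    queue: asyncio.Queue = asyncio.Queue()
    memory: list[str] = []
    worker = asyncio.create_task(memory_worker(queue, memory))

    # New turns are enqueued without waiting for memory formation.
    queue.put_nowait("turn 1")
    queue.put_nowait("turn 2")

    first = await answer("q1", memory)   # served before the worker has caught up
    await queue.join()                   # wait here only to make the demo deterministic
    second = await answer("q2", memory)  # served after both turns are processed

    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return first, second


first, second = asyncio.run(main())
print(first)
print(second)
```

The trade-off this makes visible is staleness: a query that arrives before the worker finishes is answered from older memory, in exchange for not blocking on memory updates.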

Optimization Features

Token Efficiency

  • context compression via top-k narrative retrieval

System Optimization

  • asynchronous offline memory processing to avoid blocking online latency

Reproducibility

Data Urls

  • LOCOMO (Maharana et al., 2024)
  • AgentIF (Qi et al., 2025)

Data Available

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Evaluation relies largely on LOCOMO and synthetic AgentIF constructions; real-world diversity is limited.
  • System uses free-text LLM procedures rather than learned neural memory representations.
  • Agentic retrieval adds online latency compared to pure embedding retrieval.
  • Semantic graphs are restricted to peripheral facts to avoid noisy dense graphs.

When Not To Use

  • When you can afford full-context reasoning and need the absolute best single-hop recall without extra engineering.
  • When strict ultra-low latency is mandatory and any LLM-based retrieval is too slow.
  • When you require learned, dense neural memory representations for downstream training pipelines.

Failure Modes

  • LLM mis-segmentation or misbinding can merge unrelated turns into the same narrative.
  • Semanticization via OpenIE-style extraction may produce noisy graph facts from casual dialogue.
  • Agentic retriever latency can spike under heavy load, reducing real-time responsiveness.
  • LLM-as-a-Judge may introduce evaluation bias despite rubric tightening.

Core Entities

Models

  • Claude 3.5 Sonnet V2 (used as base LLM)

Metrics

  • Accuracy
  • Latency percentiles (p50, p90, p95, p99)
  • Memory coverage rate
  • Context compression rate
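
The summary does not spell out how these metrics are defined, so the sketch below uses common definitions as assumptions: nearest-rank percentiles for latency, one minus the retrieved-to-full token ratio for compression, and the fraction of gold evidence items present in the retrieved set for coverage. All sample values are made up for illustration.

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def compression_rate(full_tokens: int, retrieved_tokens: int) -> float:
    """Share of the full history that is NOT sent to the model."""
    return 1 - retrieved_tokens / full_tokens


def coverage_rate(retrieved_ids: list[str], gold_ids: list[str]) -> float:
    """Share of gold evidence items that appear among the retrieved items."""
    gold = set(gold_ids)
    return len(gold & set(retrieved_ids)) / len(gold)


# Hypothetical per-request latencies in seconds.
latencies = [1.2, 1.5, 1.1, 3.9, 1.4, 1.3, 2.0, 1.6, 1.8, 4.2]

print(percentile(latencies, 50))   # 1.5
print(percentile(latencies, 90))   # 3.9
print(percentile(latencies, 99))   # 4.2
print(coverage_rate(["a", "b", "c"], ["a", "d"]))  # 0.5
print(round(compression_rate(20_000, 700), 3))     # 0.965
```

With few samples, p99 collapses to the maximum under the nearest-rank rule, so tail-latency comparisons need a reasonably large number of requests to be meaningful.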

Datasets

  • LOCOMO
  • AgentIF (constructed agentic conversations)

Benchmarks

  • LOCOMO (long-term conversational reasoning)