Overview
Results on LOCOMO and an agentic scenario show strong gains, but evaluation is limited to public benchmarks and a single base LLM; engineering cost rises from offline agentic processing and LLM-based retrieval.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/6
Reproducibility
Status: Partial assets available
Open source: No
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 65%
Why It Matters For Business
Amory substantially raises answer quality on long conversations while avoiding the cost of reprocessing the full history. That makes persistent assistants more useful at acceptable latency.
Summary TLDR
Amory is a working-memory system that turns long conversations into coherent story-like episodic threads plus a small semantic graph. It forms memory offline via agentic LLM reasoning (segmenting, binding, momentum-aware consolidation, semanticization) and retrieves by reasoning over narratives instead of plain embedding similarity. On the LOCOMO long-conversation benchmark, Amory (episodic+semantic) raises overall LLM-as-a-Judge accuracy to 87.7% vs Mem0 59.9% (+27.8% abs) while cutting response latency roughly in half compared to full-context reasoning. Improvements are largest on temporal and multi-hop queries; costs are extra offline processing and an agentic retriever that adds online-l
Problem Statement
Long conversations blow up compute if every turn reprocesses full history. Existing memory systems usually store fragmented embeddings or noisy graphs and then retrieve by similarity. That is fast but loses narrative context and hurts multi-hop and temporal reasoning. We need a memory that keeps coherent, chronological context while staying efficient.
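To make the compute argument concrete, here is a minimal arithmetic sketch. It assumes every turn adds the same number of tokens and that a memory system reads a fixed-size context per turn. The function names and all numbers are illustrative assumptions, not figures from the paper.

```python
def full_history_tokens(num_turns: int, tokens_per_turn: int) -> int:
    """Total prompt tokens when turn k re-reads all k turns seen so far.

    Sum over k = 1..n of k * t = t * n * (n + 1) / 2, i.e. quadratic in n.
    """
    return tokens_per_turn * num_turns * (num_turns + 1) // 2


def bounded_memory_tokens(num_turns: int, tokens_per_turn: int, memory_budget: int) -> int:
    """Total prompt tokens when each turn reads only the new turn plus a
    fixed-size memory context (hypothetical budget), i.e. linear in n."""
    return num_turns * (tokens_per_turn + memory_budget)


if __name__ == "__main__":
    # Illustrative values: 50 tokens per turn, 500-token memory context.
    for n in (10, 100, 1000):
        print(n, full_history_tokens(n, 50), bounded_memory_tokens(n, 50, 500))
```

Under these assumptions the bounded-memory variant costs more for very short chats (10 turns) but far less once the conversation is long, which is the regime the problem statement is about.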
Main Contribution
Amory: a working-memory framework that constructs episodic narratives and a peripheral semantic graph using offline agentic LLM procedures.
Momentum-aware consolidation: wait for topic inactivity to create subplots and update main headlines, reducing premature or noisy summaries.
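The inactivity-gated idea behind momentum-aware consolidation can be sketched as follows. This is a simplified illustration, not the paper's implementation. The `Thread` fields, the `idle_turns` threshold, and the `summarize` placeholder (standing in for an LLM call) are all hypothetical, and headline updates are omitted.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Thread:
    """One episodic thread; every field name here is an assumption."""
    topic: str
    pending: list[str] = field(default_factory=list)   # turns not yet summarized
    subplots: list[str] = field(default_factory=list)  # consolidated summaries
    last_active_turn: int = 0


def summarize(turns: list[str]) -> str:
    """Placeholder for an LLM summarization call (assumption)."""
    return " / ".join(turns)


def add_turn(thread: Thread, text: str, turn_index: int) -> None:
    """Attach a new turn to a thread and mark the thread as active."""
    thread.pending.append(text)
    thread.last_active_turn = turn_index


def consolidate_inactive(threads: list[Thread], current_turn: int, idle_turns: int = 3) -> list[str]:
    """Summarize only threads that have been idle for at least `idle_turns`.

    Returns the topics of the threads consolidated in this pass.
    """
    consolidated = []
    for thread in threads:
        idle = current_turn - thread.last_active_turn
        if thread.pending and idle >= idle_turns:
            thread.subplots.append(summarize(thread.pending))
            thread.pending.clear()
            consolidated.append(thread.topic)
    return consolidated
```

Threads that are still active keep their raw turns, so a summary is only written once the topic has gone quiet. The turn-count threshold is one possible inactivity signal; a wall-clock pause would work the same way.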
Key Findings
Combining episodic and semantic memory yields large quality gains over prior working-memory baselines.
Amory matches or exceeds full-context quality on multi-hop and temporal questions while using much less context.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Overall J-score (EM+SM vs Mem0) | 87.7% vs 59.9% | Mem0 59.9% | +27.8% abs | LOCOMO (overall) | Table 1 shows EM+SM 87.7% vs Mem0 59.9% | Table 1 |
| Multi-hop J-score | EM 85.6% vs FC 82.6% | Full Context (FC) 82.6% | +3.0% abs | LOCOMO (multi-hop) | Table 1 multi-hop numbers | Table 1 |
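The Delta column reports absolute differences in percentage points. The short sketch below recomputes them from the table values and, for comparison, derives the relative improvement. The relative figures are plain arithmetic on the table and are not reported in the source.

```python
def abs_delta(value: float, baseline: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(value - baseline, 1)


def rel_delta(value: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent, rounded to one decimal."""
    return round(100 * (value - baseline) / baseline, 1)


if __name__ == "__main__":
    # Values taken from the table rows above.
    print(abs_delta(87.7, 59.9), rel_delta(87.7, 59.9))  # overall: EM+SM vs Mem0
    print(abs_delta(85.6, 82.6), rel_delta(85.6, 82.6))  # multi-hop: EM vs FC
```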
What To Try In 7 Days
Prototype narrative binding: segment a long chat into story threads using an LLM and compare single-turn answers with vs without narrative context.
Implement inactive consolidation: consolidate threads only after a pause to see if temporal question accuracy improves.
Add a tiny semantic graph for peripheral facts and test whether single-fact lookups improve single-hop accuracy.
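For the third experiment, a tiny semantic graph can start as an in-memory triple store. The sketch below is one possible starting point under that assumption. The class name, its methods, and the example facts are hypothetical and do not come from the paper.

```python
from collections import defaultdict


class TinySemanticGraph:
    """Minimal (subject, relation) -> objects store for peripheral facts."""

    def __init__(self) -> None:
        self._edges: dict[tuple[str, str], set[str]] = defaultdict(set)

    def add(self, subject: str, relation: str, obj: str) -> None:
        """Store one fact; keys are lowercased so lookups are case-insensitive."""
        self._edges[(subject.lower(), relation.lower())].add(obj)

    def lookup(self, subject: str, relation: str) -> list[str]:
        """Return all known objects for a subject and relation, sorted."""
        key = (subject.lower(), relation.lower())
        return sorted(self._edges.get(key, set()))


if __name__ == "__main__":
    graph = TinySemanticGraph()
    # Made-up facts of the kind a casual dialogue might contain.
    graph.add("Alice", "has_pet", "a beagle")
    graph.add("Alice", "allergic_to", "peanuts")
    print(graph.lookup("alice", "has_pet"))
```

A single-hop question such as "What pet does Alice have?" then becomes a direct lookup, which gives a simple baseline to compare against narrative-only retrieval.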
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Evaluation relies largely on LOCOMO and synthetic AgentIF constructions; real-world diversity is limited.
System uses free-text LLM procedures rather than learned neural memory representations.
When Not To Use
When you can afford full-context reasoning and need the absolute best single-hop recall without extra engineering.
When strict ultra-low latency is mandatory and any LLM-based retrieval is too slow.
Failure Modes
LLM mis-segmentation or misbinding can merge unrelated turns into the same narrative.
Semanticization via OpenIE-style extraction may produce noisy graph facts from casual dialogue.

