Overview
Production Readiness
0.6
Novelty Score
0.65
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
Amory substantially raises answer quality in long conversations without paying the cost of reprocessing the full history. That makes persistent assistants more useful while keeping latency acceptable.
Summary TLDR
Amory is a working-memory system that turns long conversations into coherent, story-like episodic threads plus a small semantic graph. It forms memory offline via agentic LLM reasoning (segmenting, binding, momentum-aware consolidation, semanticization) and retrieves by reasoning over narratives instead of plain embedding similarity. On the LOCOMO long-conversation benchmark, Amory (episodic+semantic) raises overall LLM-as-a-Judge accuracy to 87.7%, versus 59.9% for Mem0 (+27.8 percentage points), while cutting response latency roughly in half compared to full-context reasoning. Improvements are largest on temporal and multi-hop queries; costs are extra offline processing and an agentic retriever that adds online-l
Problem Statement
Long conversations blow up compute if every turn reprocesses full history. Existing memory systems usually store fragmented embeddings or noisy graphs and then retrieve by similarity. That is fast but loses narrative context and hurts multi-hop and temporal reasoning. We need a memory that keeps coherent, chronological context while staying efficient.
Main Contribution
Amory: a working-memory framework that constructs episodic narratives and a peripheral semantic graph using offline agentic LLM procedures.
Momentum-aware consolidation: wait for topic inactivity to create subplots and update main headlines, reducing premature or noisy summaries.
Coherence-driven retriever: an LLM-based retriever that reasons over narrative structure (headlines, characters) instead of only embedding similarity.
Evaluation on LOCOMO and constructed AgentIF agentic scenarios showing large quality gains and moderate latency.
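The momentum-aware consolidation idea above can be sketched as follows. This is a minimal illustration, not Amory's implementation: the `Thread` class, the 300-second idle threshold, and the `summarize` callable (a stand-in for an LLM summarizer) are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Thread:
    """Hypothetical episodic thread; field names are illustrative."""
    topic: str
    turns: List[str] = field(default_factory=list)
    last_active: float = 0.0
    consolidated_upto: int = 0  # number of turns already folded into subplots


def add_turn(thread: Thread, text: str, now: float) -> None:
    """Record a new turn and refresh the thread's activity timestamp."""
    thread.turns.append(text)
    thread.last_active = now


def ready_to_consolidate(thread: Thread, now: float, idle_seconds: float = 300.0) -> bool:
    """True only if there are unconsolidated turns AND the topic has gone quiet."""
    has_new = len(thread.turns) > thread.consolidated_upto
    return has_new and (now - thread.last_active) >= idle_seconds


def consolidate(thread: Thread, summarize: Callable[[List[str]], str]) -> str:
    """Summarize pending turns into one subplot; `summarize` stands in for an LLM call."""
    pending = thread.turns[thread.consolidated_upto:]
    subplot = summarize(pending)
    thread.consolidated_upto = len(thread.turns)
    return subplot
```

Waiting for inactivity means a summary is written once per burst of discussion rather than after every turn, which is the behavior the contribution describes as reducing premature or noisy summaries.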
Key Findings
Combining episodic and semantic memory yields large quality gains over prior working-memory baselines.
Amory matches or exceeds full-context quality on multi-hop and temporal questions while using much less context.
Amory compresses the usable context dramatically, shrinking the online context and roughly halving latency relative to full-context reasoning.
Consolidation timing matters: consolidating after a topic goes inactive improves temporal reasoning more than consolidating rapidly.
Agentic coherence retrieval achieves higher coverage than embedding retrieval, especially on multi-hop queries.
Results
Overall J-score (EM+SM vs Mem0)
Multi-hop J-score
Temporal J-score
Latency p99
Constraint recall (agentic long conversation)
Context compression at top-k=2
Who Should Care
What To Try In 7 Days
Prototype narrative binding: segment a long chat into story threads using an LLM and compare single-turn answers with vs without narrative context.
Implement inactivity-triggered consolidation: consolidate threads only after a pause and check whether temporal-question accuracy improves.
Add a tiny semantic graph for peripheral facts and test whether single-fact lookups improve single-hop accuracy.
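A first narrative-binding prototype could look like the sketch below. The `assign_thread` callable is where an LLM call would go; `keyword_assigner` is a hypothetical keyword-based stand-in so the example runs without a model, and the thread labels are invented for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def bind_turns(
    turns: List[str],
    assign_thread: Callable[[str, List[str]], str],
) -> Dict[str, List[str]]:
    """Group turns into story threads, preserving turn order inside each thread."""
    threads: Dict[str, List[str]] = defaultdict(list)
    for turn in turns:
        # The assigner sees the turn plus the labels of threads created so far.
        label = assign_thread(turn, list(threads))
        threads[label].append(turn)
    return dict(threads)


def keyword_assigner(turn: str, known_threads: List[str]) -> str:
    """Stand-in for an LLM: route by keyword, else continue the newest thread."""
    text = turn.lower()
    if "flight" in text or "hotel" in text:
        return "travel"
    if "deadline" in text or "report" in text:
        return "work"
    return known_threads[-1] if known_threads else "misc"
```

Swapping `keyword_assigner` for a model-backed function keeps the rest of the harness unchanged, which makes the with-versus-without-narrative comparison easy to set up.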
Agent Features
Memory
- episodic memory: hierarchical narrative threads
- semantic memory: peripheral facts as graph triplets
- momentum-aware consolidation (inactive triggers)
Planning
- offline agentic reasoning for MemInit/MemBinding
- momentum-aware consolidation strategy
- coherence selection of top-k leaf nodes
Tool Use
- LLM workers for segmentation, consolidation, retrieval
- Neo4j + Cypher for semantic memory
- embedding retriever as baseline comparison
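As a rough illustration of storing peripheral facts as triplets, the sketch below builds parameterized Cypher `MERGE` statements. The `Entity` label, `RELATES` relationship type, and `predicate` property are an assumed schema, not one described in the source; the code only prepares query text and parameters and does not talk to a database.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed schema: one node label and one relationship type carrying the predicate.
MERGE_TRIPLET = (
    "MERGE (s:Entity {name: $subj}) "
    "MERGE (o:Entity {name: $obj}) "
    "MERGE (s)-[r:RELATES {predicate: $pred}]->(o)"
)


def triplet_statements(
    triplets: Iterable[Tuple[str, str, str]],
) -> List[Tuple[str, Dict[str, str]]]:
    """Turn (subject, predicate, object) facts into (query, params) pairs."""
    return [
        (MERGE_TRIPLET, {"subj": s, "pred": p, "obj": o})
        for s, p, o in triplets
    ]
```

Using `MERGE` rather than `CREATE` keeps repeated facts from duplicating nodes or edges, and passing values as parameters avoids string-building the query from dialogue text.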
Frameworks
- Amory
Is Agentic
true
Architectures
- episodic narrative tree (plot → subplots)
- semantic graph (Neo4j triplets)
- coherence-driven retriever
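The tree-plus-retriever arrangement can be pictured with the sketch below. It is an assumption-laden toy: `Node`, `leaves`, and `select_top_k` are hypothetical names, and `word_overlap` is a crude lexical stand-in for the LLM coherence judgment that the retriever is described as making.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Plot or subplot in a narrative tree; leaves carry the detailed text."""
    headline: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node) -> List[Node]:
    """Collect leaf nodes in left-to-right (chronological) order."""
    if not node.children:
        return [node]
    out: List[Node] = []
    for child in node.children:
        out.extend(leaves(child))
    return out


def select_top_k(root: Node, query: str, score: Callable[[str, Node], float], k: int = 2) -> List[Node]:
    """Rank leaves with a pluggable scorer and keep the k best."""
    ranked = sorted(leaves(root), key=lambda n: score(query, n), reverse=True)
    return ranked[:k]


def word_overlap(query: str, node: Node) -> float:
    """Stand-in scorer: count query words that appear in the leaf."""
    q = set(query.lower().split())
    return float(len(q & set((node.headline + " " + node.text).lower().split())))
```

Only the selected leaves would be placed in the prompt, which is how a small top-k keeps the online context short.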
Collaboration
- asynchronous offline workers update memories while system serves queries
Optimization Features
Token Efficiency
- context compression via top-k narrative retrieval
System Optimization
- asynchronous offline memory processing to avoid blocking online latency
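One way to keep memory formation off the request path is a background worker fed by a queue, sketched here with `asyncio`. The function names are hypothetical, `turn.upper()` is a placeholder for the slow LLM processing, and a `None` item is used as an assumed shutdown signal.

```python
import asyncio
from typing import List, Optional, Tuple


async def memory_worker(queue: "asyncio.Queue[Optional[str]]", memory: List[str]) -> None:
    """Drain turns in the background; None is the shutdown signal."""
    while True:
        turn = await queue.get()
        if turn is None:
            queue.task_done()
            break
        await asyncio.sleep(0)       # placeholder for slow offline LLM work
        memory.append(turn.upper())  # placeholder for the formed memory
        queue.task_done()


async def handle_turn(queue: "asyncio.Queue[Optional[str]]", memory: List[str], turn: str) -> str:
    """Online path: enqueue the turn and answer from memory formed so far."""
    queue.put_nowait(turn)
    return f"answered with {len(memory)} memories"


async def demo() -> Tuple[str, str, List[str]]:
    queue: "asyncio.Queue[Optional[str]]" = asyncio.Queue()
    memory: List[str] = []
    worker = asyncio.create_task(memory_worker(queue, memory))
    first = await handle_turn(queue, memory, "hello")
    await queue.join()               # let the worker catch up
    second = await handle_turn(queue, memory, "again")
    queue.put_nowait(None)
    await worker
    return first, second, memory
```

The first reply is produced before any memory exists, which shows the trade-off: answers never wait on memory formation, but they may be based on memory that lags the latest turns.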
Reproducibility
Data Urls
- LOCOMO (Maharana et al., 2024)
- AgentIF (Qi et al., 2025)
Data Available
Open Source Status
- no
Risks & Boundaries
Limitations
- Evaluation relies largely on LOCOMO and synthetic AgentIF constructions; real-world diversity is limited.
- System uses free-text LLM procedures rather than learned neural memory representations.
- Agentic retrieval adds online latency compared to pure embedding retrieval.
- Semantic graphs are restricted to peripheral facts to avoid noisy dense graphs.
When Not To Use
- When you can afford full-context reasoning and need the absolute best single-hop recall without extra engineering.
- When strict ultra-low latency is mandatory and any LLM-based retrieval is too slow.
- When you require learned, dense neural memory representations for downstream training pipelines.
Failure Modes
- LLM mis-segmentation or misbinding can merge unrelated turns into the same narrative.
- Semanticization via OpenIE-style extraction may produce noisy graph facts from casual dialogue.
- Agentic retriever latency can spike under heavy load, reducing real-time responsiveness.
- LLM-as-a-Judge may introduce evaluation bias despite rubric tightening.
Core Entities
Models
- Claude 3.5 Sonnet V2 (used as base LLM)
Metrics
- Accuracy
- Latency percentiles (p50, p90, p95, p99)
- Memory coverage rate
- Context compression rate
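The metrics above could be computed along the lines of this sketch. The definitions are assumptions, since this summary does not spell them out: percentiles use the nearest-rank method, coverage is the fraction of gold evidence items present in the retrieved set, and compression is one minus the ratio of retrieved tokens to full-history tokens.

```python
import math
from typing import Iterable, Sequence


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in (0, 100]."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def coverage_rate(retrieved: Iterable[str], gold: Iterable[str]) -> float:
    """Assumed definition: share of gold evidence items that were retrieved."""
    gold_set = set(gold)
    if not gold_set:
        raise ValueError("gold must be non-empty")
    return len(gold_set & set(retrieved)) / len(gold_set)


def compression_rate(retrieved_tokens: int, full_tokens: int) -> float:
    """Assumed definition: fraction of the full history NOT sent to the model."""
    if full_tokens <= 0:
        raise ValueError("full_tokens must be positive")
    return 1 - retrieved_tokens / full_tokens
```

Reporting p99 alongside p50 matters for a retriever that calls an LLM, because occasional slow calls affect tail latency far more than the median.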
Datasets
- LOCOMO
- AgentIF (constructed agentic conversations)
Benchmarks
- LOCOMO (long-term conversational reasoning)

