Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
If you build agents that must remember users and multi-session facts, a structured, timeline-aware memory reduces identity and temporal drift and improves preference stability across sessions.
Summary TLDR
BMAM is a modular, brain-inspired memory system for language-agent pipelines. It splits memory into specialized components (episodic, semantic, salience, control), organizes episodic traces on explicit timelines (StoryArc), and fuses lexical/dense/graph/temporal signals with reciprocal rank fusion. On long-horizon benchmarks BMAM achieves 78.45% on LoCoMo and shows a 24.6 percentage-point drop when its hippocampus-like episodic module is removed, highlighting episodic storage as critical for temporal consistency.
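The fusion step named in the summary can be illustrated with a minimal sketch of reciprocal rank fusion. The retriever outputs, the memory IDs, and the constant k = 60 (a conventional default for RRF) are assumptions for illustration; the summary does not state which settings BMAM uses.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of memory IDs into one ranking.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, conventional default.
    """
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, mem_id in enumerate(ranked_ids, start=1):
            scores[mem_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda m: (-scores[m], m))

# Hypothetical outputs from four retrievers over the same memory store.
lexical = ["m3", "m1", "m7"]
dense = ["m1", "m3", "m9"]
graph = ["m1", "m4"]
temporal = ["m7", "m1"]

fused = reciprocal_rank_fusion([lexical, dense, graph, temporal])
```

Because RRF uses only ranks, it needs no score calibration across retrievers, which is one reason it is a common choice when lexical, dense, graph, and temporal signals live on different scales.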
Problem Statement
LLM agents struggle to keep consistent, time-grounded behavior across long interactions. Context windows and plain RAG treat memory as text blobs and fail at persistent organization, temporal queries, and identity preservation. BMAM aims to manage what to store, how to index time, and how to retrieve evidence across sessions.
Main Contribution
Define "soul erosion": gradual loss of temporal coherence, semantic consistency, or user identity in long-horizon agents.
Propose BMAM: a multi-agent memory architecture with episodic timelines (StoryArc), semantic consolidation, salience tagging, and a central coordinator.
Show empirical gains on long-horizon benchmarks (e.g., 78.45% on LoCoMo) and ablations that identify episodic memory as critical.
Key Findings
BMAM achieves 78.45% long-horizon dialogue accuracy on LoCoMo.
Removing the hippocampus-like episodic module causes a 24.6 percentage-point drop in accuracy.
Temporal questions remain the hardest subcategory for BMAM.
BMAM preserves preferences well in adversarial tests.
Results
Accuracy (LoCoMo): 78.45%
PrefEval personalized rate
Hippocampus ablation delta: 24.6 percentage-point drop
Who Should Care
What To Try In 7 Days
Add timestamped episodic logs for user interactions; keep minimal narrative units.
Fuse lexical and dense retrieval with a lightweight rank fusion step to improve evidence coverage.
Tag high-salience events (milestones, preferences) to protect them from pruning.
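The first and third suggestions above can be sketched together as a small timestamped episodic log whose pruning step skips protected entries. The class names, the `salience` score, and the `protected` flag are hypothetical; they are not BMAM's data model, only one simple way to try the idea.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Episode:
    timestamp: datetime
    text: str
    salience: float = 0.0    # hypothetical score in [0, 1]
    protected: bool = False  # milestones and preferences are never pruned

@dataclass
class EpisodicLog:
    episodes: list = field(default_factory=list)

    def add(self, episode):
        self.episodes.append(episode)
        self.episodes.sort(key=lambda e: e.timestamp)  # keep timeline order

    def between(self, start, end):
        """Return episodes with start <= timestamp < end."""
        return [e for e in self.episodes if start <= e.timestamp < end]

    def prune(self, max_size):
        """Drop the lowest-salience unprotected episodes until max_size is met."""
        removable = sorted(
            (e for e in self.episodes if not e.protected),
            key=lambda e: e.salience,
        )
        excess = max(len(self.episodes) - max_size, 0)
        to_drop = {id(e) for e in removable[:excess]}
        self.episodes = [e for e in self.episodes if id(e) not in to_drop]

utc = timezone.utc
log = EpisodicLog()
log.add(Episode(datetime(2024, 3, 1, tzinfo=utc), "User prefers vegetarian recipes", 0.9, True))
log.add(Episode(datetime(2024, 3, 2, tzinfo=utc), "Small talk about the weather", 0.1))
log.add(Episode(datetime(2024, 3, 5, tzinfo=utc), "User started a new job", 0.8, True))
log.add(Episode(datetime(2024, 3, 6, tzinfo=utc), "Asked for a pasta recipe", 0.4))
log.prune(max_size=3)  # removes only the low-salience weather episode
```

Storing timezone-aware timestamps at write time keeps later range queries such as `between` unambiguous across sessions.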
Agent Features
Memory
- episodic (timeline-indexed)
- semantic (consolidated KG)
- salience-aware tagging
- working-memory buffer (10 items)
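A fixed-size working-memory buffer like the 10-item one listed above can be sketched with a bounded deque. The first-in, first-out eviction policy and the class name are assumptions; the summary gives only the capacity.

```python
from collections import deque

class WorkingMemory:
    """Fixed-capacity buffer; the oldest item is evicted when full (assumed FIFO)."""

    def __init__(self, capacity=10):
        self._items = deque(maxlen=capacity)

    def push(self, item):
        self._items.append(item)

    def snapshot(self):
        """Return the buffered items, oldest first."""
        return list(self._items)

wm = WorkingMemory(capacity=10)
for turn in range(12):
    wm.push(f"turn-{turn}")
recent = wm.snapshot()  # holds the 10 most recent turns
```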
Planning
- uncertainty-driven multi-round retrieval
Tool Use
- LLM backend (gpt-4o-mini)
- embedding service (text-embedding-3-small)
Frameworks
- Reciprocal Rank Fusion
- StoryArc timeline indexing
Is Agentic
true
Architectures
- multi-agent coordinator
- timeline-indexed episodic store (StoryArc)
- hybrid retrieval (lexical+dense+KG+temporal)
Collaboration
- central coordinator routes queries and consolidation
- separate agents for encoding, consolidation, retrieval, revision
Optimization Features
Token Efficiency
- compact episodic summaries to reduce context size
Infra Optimization
- use of vector store + knowledge graph + key-value episodic store
System Optimization
- pruning low-value memories
- salience-prioritized consolidation
Training Optimization
- background consolidation (asynchronous reconsolidation)
Inference Optimization
- fast-path vs slow-path retrieval to reduce runtime retrieval costs
- working-memory buffer for immediate context
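The fast-path versus slow-path idea can be sketched as a router that answers from the working-memory buffer when a cheap match is good enough and otherwise falls back to the expensive retrieval. The keyword-overlap score, the 0.5 threshold, and the stand-in `slow_lookup` are all assumptions for illustration; the summary does not describe BMAM's routing criterion.

```python
def keyword_overlap(query, text):
    """Fraction of query words that also occur in text (toy relevance score)."""
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0

def fast_lookup(query, buffer):
    """Cheap path: score only the items in the working-memory buffer."""
    scored = [(text, keyword_overlap(query, text)) for text in buffer]
    return sorted(scored, key=lambda pair: -pair[1])

def retrieve(query, buffer, slow_lookup, min_score=0.5):
    """Return (path, hits); use the slow path only when the fast path is weak."""
    hits = fast_lookup(query, buffer)
    if hits and hits[0][1] >= min_score:
        return "fast", hits
    return "slow", slow_lookup(query)

def slow_lookup(query):
    # Stand-in for full hybrid retrieval over the long-term stores.
    return [("User started a new job on 2024-03-05", 1.0)]

buffer = ["User prefers vegetarian recipes"]
path_a, _ = retrieve("vegetarian recipes", buffer, slow_lookup)
path_b, hits_b = retrieve("new job start", buffer, slow_lookup)
```

Here the first query is served from the buffer and the second falls through to the slow path, so the expensive retrieval runs only when the cheap one has nothing relevant.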
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation limited to four benchmarks; domain and multi-modal validation is future work.
- Code and implementation not yet released; reproducibility depends on releasing artifacts.
- Temporal normalization and date math remain error-prone (38% of sampled LoCoMo errors).
- Persona-style exact surface matching (PersonaMem) remains weak because retrieval is tuned for open-ended answers.
When Not To Use
- For very simple single-hop retrieval under tight latency constraints, because BMAM introduces routing overhead.
- When you need immediate multi-modal memory; BMAM is evaluated on text only.
- If you require turnkey open-source code now; the implementation release is pending.
Failure Modes
- Temporal confusion (inaccurate date/duration/order): 38% of manually analyzed errors
- Entity ambiguity (wrong-entity retrieval): 28% of manually analyzed errors
- Retrieval coverage gaps (evidence stored but not retrieved): 22% of manually analyzed errors
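Since temporal confusion is the largest failure category above, one possible mitigation is to resolve relative time expressions to absolute dates in code when a memory is written, rather than leaving date math to the model. The sketch below is an assumption about how that could look, not a description of BMAM; the function name and the small set of supported phrasings are hypothetical.

```python
import re
from datetime import date, timedelta

def resolve_relative_date(expression, session_date):
    """Map a small set of relative expressions to an absolute date, or None."""
    text = expression.strip().lower()
    if text == "today":
        return session_date
    if text == "yesterday":
        return session_date - timedelta(days=1)
    match = re.fullmatch(r"(\d+) (day|week)s? ago", text)
    if match:
        count, unit = int(match.group(1)), match.group(2)
        days = count * (7 if unit == "week" else 1)
        return session_date - timedelta(days=days)
    return None  # unknown phrasing: leave unresolved rather than guess

session = date(2023, 5, 8)  # hypothetical date of the conversation session
resolved = resolve_relative_date("2 weeks ago", session)
```

Returning `None` for unsupported phrasings keeps the original text available for later handling instead of storing a wrong date.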
Core Entities
Models
- gpt-4o-mini (response/judge)
- text-embedding-3-small (embeddings)
Metrics
- Accuracy
- Personalized response rate
- PrefEval inconsistency
- Ablation delta
Datasets
- LoCoMo
- LongMemEval
- PersonaMem
- PrefEval
Benchmarks
- LoCoMo
- LongMemEval
- PersonaMem
- PrefEval
Context Entities
Models
- MemOS (re-run baseline with GPT-4o-mini)

