BMAM: brain-inspired multi-agent memory that improves long-horizon agent consistency

January 28, 2026 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Yang Li, Jiaxiang Liu, Yusong Wang, Yujie Wu, Mingkun Xu

Why It Matters For Business

If you build agents that must remember users and multi-session facts, a structured, timeline-aware memory reduces identity and temporal drift and improves preference stability across sessions.

Summary TLDR

BMAM is a modular, brain-inspired memory system for language-agent pipelines. It splits memory into specialized components (episodic, semantic, salience, control), organizes episodic traces on explicit timelines (StoryArc), and fuses lexical, dense, graph, and temporal retrieval signals with reciprocal rank fusion. On long-horizon benchmarks BMAM reaches 78.45% accuracy on LoCoMo. Removing its hippocampus-like episodic module lowers accuracy by 24.6 percentage points on a LoCoMo subset, which marks episodic storage as critical for temporal consistency.

Problem Statement

LLM agents struggle to keep consistent, time-grounded behavior across long interactions. Context windows and plain RAG treat memory as text blobs and fail at persistent organization, temporal queries, and identity preservation. BMAM aims to manage what to store, how to index time, and how to retrieve evidence across sessions.

Main Contribution

Define "soul erosion": gradual loss of temporal coherence, semantic consistency, or user identity in long-horizon agents.

Propose BMAM: a multi-agent memory architecture with episodic timelines (StoryArc), semantic consolidation, salience tagging, and a central coordinator.

Show empirical gains on long-horizon benchmarks (e.g., 78.45% on LoCoMo) and ablations that identify episodic memory as critical.

Key Findings

BMAM achieves strong long-horizon dialogue accuracy on LoCoMo.

Numbers: 78.45% (1558/1986)

Removing the hippocampus-like episodic module causes a large drop in accuracy.

Numbers: -24.62 percentage points (absolute) on a LoCoMo subset

Temporal questions remain the hardest subcategory for BMAM.

Numbers: Temporal accuracy 62.3% on LoCoMo

BMAM preserves preferences well in adversarial tests.

Numbers: Personalized response rate 72.9%; inconsistency 0.1%

Results

Accuracy (LoCoMo): 78.45%

Accuracy: 67.60%

PrefEval personalized rate: 72.90%

Accuracy: 48.9%

Hippocampus ablation delta: -24.62 percentage points (baseline: full BMAM)

Who Should Care

What To Try In 7 Days

Add timestamped episodic logs for user interactions; keep minimal narrative units.

Fuse lexical and dense retrieval with a lightweight rank fusion step to improve evidence coverage.

Tag high-salience events (milestones, preferences) to protect them from pruning.
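
The first and third suggestions above can be sketched together as a small timestamped log whose pruning step never drops salience-tagged entries. This is a minimal illustration under stated assumptions, not BMAM's implementation (which the summary notes is unreleased): the names `Episode`, `EpisodicLog`, `salient`, and `score` are hypothetical, and the pruning rule (drop the lowest-scoring untagged episodes first) is an assumed policy.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Episode:
    timestamp: datetime
    text: str               # a short narrative unit, not a full transcript
    salient: bool = False   # hypothetical flag: milestone or stated preference
    score: float = 0.0      # hypothetical usefulness score used only for pruning


class EpisodicLog:
    """Keeps episodes sorted by timestamp so time-range queries stay simple."""

    def __init__(self) -> None:
        self._episodes: List[Episode] = []

    def _keys(self) -> List[datetime]:
        return [e.timestamp for e in self._episodes]

    def add(self, episode: Episode) -> None:
        # Insert at the position that keeps the list ordered by timestamp.
        idx = bisect_right(self._keys(), episode.timestamp)
        self._episodes.insert(idx, episode)

    def between(self, start: datetime, end: datetime) -> List[Episode]:
        # Return episodes with start <= timestamp <= end, in time order.
        keys = self._keys()
        return self._episodes[bisect_left(keys, start):bisect_right(keys, end)]

    def prune(self, max_size: int) -> None:
        # Drop the lowest-scoring non-salient episodes until at most
        # max_size remain. Salient episodes are never dropped, so the log
        # can stay above max_size if too many episodes are tagged.
        excess = len(self._episodes) - max_size
        if excess <= 0:
            return
        candidates = sorted(
            (e for e in self._episodes if not e.salient),
            key=lambda e: e.score,
        )
        to_drop = {id(e) for e in candidates[:excess]}
        self._episodes = [e for e in self._episodes if id(e) not in to_drop]
```

In this sketch a preference such as "likes tea" survives pruning even with a low score, because the `salient` flag takes priority over the score.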

Agent Features

Memory

  • episodic (timeline-indexed)
  • semantic (consolidated KG)
  • salience-aware tagging
  • working-memory buffer (10 items)

Planning

  • uncertainty-driven multi-round retrieval

Tool Use

  • LLM backend (gpt-4o-mini)
  • embedding service (text-embed-3-small)

Frameworks

  • Reciprocal Rank Fusion
  • StoryArc timeline indexing
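
For readers unfamiliar with Reciprocal Rank Fusion, the sketch below shows the standard formulation: each ranked list contributes 1 / (k + rank) to an item's fused score. The constant `k = 60` is the commonly used default and an assumption here, since the summary does not state the value BMAM uses. The function name and the tie-breaking rule are likewise illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of memory IDs into one ranking.

    Each list contributes 1 / (k + rank) per item, with ranks starting at 1.
    k = 60 is an assumed default, not a value taken from the BMAM paper.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings from three retrievers over the same memory store.
lexical = ["m1", "m2", "m3"]
dense = ["m2", "m4", "m1"]
temporal = ["m2", "m1"]
fused = reciprocal_rank_fusion([lexical, dense, temporal])
```

Because only ranks are used, the retrievers do not need comparable score scales, which is what makes the fusion step lightweight.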

Is Agentic

true

Architectures

  • multi-agent coordinator
  • timeline-indexed episodic store (StoryArc)
  • hybrid retrieval (lexical+dense+KG+temporal)

Collaboration

  • central coordinator routes queries and consolidation
  • separate agents for encoding, consolidation, retrieval, revision

Optimization Features

Token Efficiency

  • compact episodic summaries to reduce context size

Infra Optimization

  • use of vector store + knowledge graph + key-value episodic store

System Optimization

  • pruning low-value memories
  • salience-prioritized consolidation

Training Optimization

  • background consolidation (asynchronous reconsolidation)

Inference Optimization

  • fast-path vs slow-path retrieval to reduce runtime retrieval costs
  • working-memory buffer for immediate context
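
One way to picture the fast-path versus slow-path split is a bounded working-memory buffer that is checked first, with the full retrieval stack consulted only on a miss. The sketch below is an assumption-heavy illustration: the buffer capacity of 10 comes from the summary, while the term-overlap match, the class and function names, and the `slow_search` callback are hypothetical stand-ins for whatever BMAM actually does.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


class WorkingMemory:
    """Fixed-size buffer of the most recent items (10 by default)."""

    def __init__(self, capacity: int = 10) -> None:
        # deque(maxlen=...) evicts the oldest item automatically when full.
        self._items: Deque[str] = deque(maxlen=capacity)

    def push(self, item: str) -> None:
        self._items.append(item)

    def lookup(self, query: str) -> List[str]:
        # Crude stand-in for a relevance check: any shared lowercase token.
        terms = set(query.lower().split())
        return [it for it in self._items if terms & set(it.lower().split())]


def retrieve(
    query: str,
    wm: WorkingMemory,
    slow_search: Callable[[str], List[str]],
) -> Tuple[str, List[str]]:
    """Try the cheap buffer first; fall back to the expensive search on a miss."""
    hits = wm.lookup(query)
    if hits:
        return "fast", hits
    return "slow", slow_search(query)
```

The returned path label makes it easy to log how often the slow path is taken, which is the cost this design tries to reduce.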

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation limited to four benchmarks; domain and multi-modal validation is future work.
  • Code and implementation not yet released; reproducibility depends on releasing artifacts.
  • Temporal normalization and date math remain error-prone (38% of sampled LoCoMo errors).
  • Persona-style exact surface matching (PersonaMem) is still weak because the system focuses on open-ended retrieval.

When Not To Use

  • For very simple single-hop retrieval at extreme latency constraints — BMAM introduces routing overhead.
  • When you need immediate multi-modal memory; BMAM is evaluated on text only.
  • If you require turnkey open-source code now — implementation release is pending.

Failure Modes

  • Temporal confusion (inaccurate date/duration/order): 38% of manually analyzed errors
  • Entity ambiguity (wrong-entity retrieval): 28% of manually analyzed errors
  • Retrieval coverage gaps (evidence stored but not retrieved): 22% of manually analyzed errors

Core Entities

Models

  • gpt-4o-mini (response/judge)
  • text-embed-3-small (embeddings)

Metrics

  • Accuracy
  • Personalized response rate
  • PrefEval inconsistency
  • Ablation delta

Datasets

  • LoCoMo
  • LongMemEval
  • PersonaMem
  • PrefEval

Benchmarks

  • LoCoMo
  • LongMemEval
  • PersonaMem
  • PrefEval

Context Entities

Models

  • MemOS (re-run baseline with GPT-4o-mini)