Overview
The system is modular and shows consistent gains on the provided long-dialog benchmark; ablations quantify component value, but production work remains for latency, privacy, and broader benchmarks.
Citations: 0
Evidence Strength: 0.90
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 4/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
Editable multimodal memory lets assistants evolve with users (names, images, preferences) and yields measurable accuracy gains on long conversations—helpful for retention and personalized UX.
Summary TLDR
M2A is a two-agent system (ChatAgent + MemoryManager) that maintains an editable multimodal memory across long, multi-session dialogs. Memory has two layers: an append-only RawMessageStore (full logs) and a SemanticMemoryStore (high-level summaries), linked by evidence IDs. Retrieval runs three parallel paths (dense text, BM25 sparse text, and cross-modal image embeddings) and fuses their results with Reciprocal Rank Fusion. M2A updates memory during conversation (Query → Generate → Update) and improves average accuracy on an enhanced LoCoMo benchmark: 44.64% vs 33.27% for RAG on GPT-4o-mini (+11.37 pp), and 54.69% vs 43.95% for the best baseline on Qwen3-VL-8B (+10.74 pp). Ablations show that the dual-layer memory, iterative retrieval, and tri-path retrieval each contribute to these gains.
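The fusion step can be illustrated with a minimal sketch of Reciprocal Rank Fusion, in which each item scores the sum of 1 / (k + rank) over the lists it appears in. The constant k = 60 is the value commonly used in the RRF literature, not one confirmed by this summary, and the message IDs and three hit lists are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of item IDs into one list, best first.

    k=60 is an assumed default; the summary does not state M2A's value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


# Hypothetical results from the three retrieval paths (best hit first).
dense_hits = ["m3", "m1", "m7"]
bm25_hits = ["m1", "m3", "m9"]
image_hits = ["m7", "m1"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits, image_hits])
print(fused)
```

Because RRF uses only ranks, the three paths do not need comparable score scales, which is one plausible reason to prefer it when mixing dense, sparse, and image retrievers.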
Problem Statement
Personalized multimodal assistants must remember and evolve user-specific concepts, names, images, and preferences across weeks or months. Current methods either bake concepts into fixed model tokens or store static profiles; both fail when users refine or correct concepts over time or when long conversations exceed the model context window. The practical need is an editable, multimodal memory that can be queried and updated autonomously during long-term interactions.
Main Contribution
Agentic online personalized memory: two cooperating agents let the system decide when to read or write user memory during conversation.
Dual-layer hybrid memory: RawMessageStore (immutable logs) + SemanticMemoryStore (high-level entries) linked by evidence IDs for progressive narrowing.
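A minimal sketch of the dual-layer idea follows, assuming in-memory Python containers. The store names come from the summary, but every method name, field, and the example dialog are hypothetical; the actual M2A interfaces are not described here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RawMessage:
    msg_id: int
    role: str
    text: str


class RawMessageStore:
    """Append-only log: messages can be added and read, never edited."""

    def __init__(self):
        self._messages = []

    def append(self, role, text):
        msg = RawMessage(len(self._messages), role, text)
        self._messages.append(msg)
        return msg.msg_id

    def get(self, msg_id):
        return self._messages[msg_id]

    def __len__(self):
        return len(self._messages)


@dataclass
class SemanticEntry:
    summary: str
    evidence_ids: list = field(default_factory=list)


class SemanticMemoryStore:
    """Editable high-level entries, each pointing back at raw evidence."""

    def __init__(self):
        self._entries = {}
        self._next_id = 0

    def add(self, summary, evidence_ids):
        entry_id = self._next_id
        self._next_id += 1
        self._entries[entry_id] = SemanticEntry(summary, list(evidence_ids))
        return entry_id

    def update(self, entry_id, summary, extra_evidence_ids=()):
        entry = self._entries[entry_id]
        entry.summary = summary
        entry.evidence_ids.extend(extra_evidence_ids)

    def delete(self, entry_id):
        del self._entries[entry_id]

    def get(self, entry_id):
        return self._entries[entry_id]


def evidence_for(entry, raw_store):
    """Narrow from a summary to the raw messages that support it."""
    return [raw_store.get(i).text for i in entry.evidence_ids]


raw = RawMessageStore()
semantic = SemanticMemoryStore()

first = raw.append("user", "My dog is called Biscuit.")
entry_id = semantic.add("User's dog is named Biscuit.", [first])

# A later correction edits the summary; the raw log only grows.
second = raw.append("user", "Actually, we renamed him Bolt.")
semantic.update(entry_id, "User's dog is named Bolt (formerly Biscuit).", [second])

supporting = evidence_for(semantic.get(entry_id), raw)
```

Keeping the raw log immutable means a bad edit to a semantic entry can, in principle, be audited or rebuilt from the evidence it cites.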
Key Findings
M2A improves average accuracy on the enhanced LoCoMo benchmark over a single-pass RAG baseline: 44.64% vs 33.27% on GPT-4o-mini (+11.37 pp).
Ablations show that dual-layer memory and iterative retrieval each materially improve accuracy.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | GPT-4o-mini: M2A 44.64%; RAG 33.27%; Mem0 34.73%; A-MEM 36.26% | RAG | +11.37 pp vs RAG | enhanced LoCoMo (all categories) | Table 1, Section 5.2 | Table 1 |
| Accuracy | Qwen3-VL-8B: M2A 54.69%; best baseline 43.95% | best baseline | +10.74 pp vs best baseline | enhanced LoCoMo (all categories) | Table 1, Section 5.2 | Table 1 |
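The reported deltas are percentage-point differences and can be checked directly from the table values. The short sketch below recomputes them and also derives a relative gain; the relative figure is a derived quantity for illustration and is not reported in the source.

```python
# Accuracy values (percent) copied from the results table.
results = {
    "GPT-4o-mini": {"m2a": 44.64, "baseline": 33.27},  # baseline: RAG
    "Qwen3-VL-8B": {"m2a": 54.69, "baseline": 43.95},  # baseline: best baseline
}


def gains(m2a, baseline):
    """Return (percentage-point delta, relative gain in percent)."""
    delta_pp = round(m2a - baseline, 2)
    relative_pct = round(100 * (m2a - baseline) / baseline, 1)
    return delta_pp, relative_pct


summary = {model: gains(**scores) for model, scores in results.items()}

for model, (delta_pp, relative_pct) in summary.items():
    print(f"{model}: +{delta_pp} pp ({relative_pct}% relative to baseline)")
```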
What To Try In 7 Days
Prototype a dual-layer store: append-only raw logs + semantic summaries linked by evidence IDs.
Add tri-path retrieval (dense text, BM25, image embeddings) and fuse the results with RRF for more robust recall.
Implement a simple ChatAgent that triggers updates only on clear user-introduced facts to limit write-backs.
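For the last item, one way to prototype the Query -> Generate -> Update loop with a conservative write gate is sketched below. The regex patterns, the keyword-overlap retrieval, and the `generate` callback are all hypothetical stand-ins; in M2A the agents themselves decide when to read or write, so this rule-based gate is only a cheap first approximation.

```python
import re

# Hypothetical patterns for "clear user-introduced facts".
FACT_PATTERNS = [
    re.compile(r"\bmy [\w ]+? is [^.!?]+", re.IGNORECASE),
    re.compile(r"\bi (?:prefer|like|love) [^.!?]+", re.IGNORECASE),
]


def keywords(text):
    """Lowercased words longer than two characters (crude stopword filter)."""
    return {w for w in re.findall(r"\w+", text.lower()) if len(w) > 2}


def extract_fact(user_text):
    """Return the first fact-like span in the message, or None."""
    for pattern in FACT_PATTERNS:
        match = pattern.search(user_text)
        if match:
            return match.group(0).strip()
    return None


def chat_turn(user_text, memory, generate):
    """Run one Query -> Generate -> Update step over a list-of-strings memory."""
    # Query: keyword overlap stands in for real retrieval.
    query_words = keywords(user_text)
    recalled = [item for item in memory if query_words & keywords(item)]
    # Generate: delegate to the caller-supplied model stub.
    reply = generate(user_text, recalled)
    # Update: write back only when a clear, new fact was stated.
    fact = extract_fact(user_text)
    wrote = fact is not None and fact not in memory
    if wrote:
        memory.append(fact)
    return reply, wrote


def stub_model(user_text, recalled):
    """Placeholder for an LLM call; reports how much memory was recalled."""
    return f"recalled {len(recalled)} item(s)"


memory = []
reply1, wrote1 = chat_turn("My dog is called Biscuit.", memory, stub_model)
reply2, wrote2 = chat_turn("What is the weather today?", memory, stub_model)
reply3, wrote3 = chat_turn("Who is my dog?", memory, stub_model)
```

Only the first turn writes to memory; the two questions read from it without triggering an update, which keeps write-backs rare.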
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Evaluation is on an enhanced, synthetic LoCoMo dataset with injected sessions; real-world distributions may differ.
LLM-as-a-judge binary scoring can be lenient on temporal phrasing and may mask fine-grained errors.
When Not To Use
When strict privacy rules forbid storing or editing user logs in external memory.
For latency-sensitive, single-turn tasks where full long-term memory provides no benefit.
Failure Modes
Stale or contradictory semantic entries if update/delete logic misfires.
Missed retrievals for rare aliases if BM25 or embeddings fail to match.

