Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Editable multimodal memory lets assistants evolve with users (names, images, preferences) and yields measurable accuracy gains on long conversations, which supports retention and a personalized UX.
Summary TLDR
M2A is a two-agent system (ChatAgent + MemoryManager) that keeps an editable multimodal memory across long, multi-session dialogs. Memory has two layers: an append-only RawMessageStore (full logs) and a SemanticMemoryStore (high-level summaries), linked by evidence IDs. Retrieval uses three parallel paths (dense text, BM25 sparse text, and cross-modal image embeddings) fused by Reciprocal Rank Fusion. M2A updates memory during conversation (Query → Generate → Update) and shows sizable accuracy gains on an enhanced LoCoMo benchmark (e.g., 44.64% vs 33.27% avg on GPT-4o-mini; 54.69% vs 43.95% on Qwen3-VL-8B). Ablations show that dual-layer memory, iterative retrieval, and tri-path retrieval each contribute.
Problem Statement
Personalized multimodal assistants must remember and evolve user-specific concepts, names, images, and preferences across weeks or months. Current methods either bake concepts into fixed model tokens or store static profiles; both fail when users refine or correct concepts over time or when long conversations exceed the model context window. The practical need is an editable, multimodal memory that can be queried and updated autonomously during long-term interactions.
Main Contribution
Agentic online personalized memory: two cooperating agents let the system decide when to read or write user memory during conversation.
Dual-layer hybrid memory: RawMessageStore (immutable logs) + SemanticMemoryStore (high-level entries) linked by evidence IDs for progressive narrowing.
Tri-path multimodal retrieval: dense text, BM25 sparse text, and cross-modal image embeddings fused by Reciprocal Rank Fusion.
Reusable multimodal data synthesis: injects concept-grounded sessions into long dialogs to train and evaluate memory-driven personalization.
Key Findings
M2A improves average correctness on the enhanced LoCoMo benchmark versus a single-pass RAG baseline
Dual-layer memory and iterative retrieval materially boost accuracy
Tri-path retrieval improves robustness to names, semantics, and images
Results
Accuracy
Ablation: remove dual-layer memory
Ablation: remove iterative retrieval
Dataset size
Who Should Care
What To Try In 7 Days
Prototype a dual-layer store: append-only raw logs + semantic summaries linked by evidence IDs.
Add tri-path retrieval (dense text, BM25, image embeddings) and fuse the ranked lists with RRF for more robust recall.
Implement a simple ChatAgent that triggers updates only on clear user-introduced facts to limit write-backs.
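One cheap way to prototype the "update only on clear user-introduced facts" rule is a keyword gate in front of the write path. The sketch below is an assumption for illustration only: the summary does not say how the ChatAgent decides to write, and in M2A that decision is made by the agent itself rather than by regular expressions. The patterns and the `should_write` name are hypothetical.

```python
import re

# Hypothetical patterns for "the user just stated a fact about themselves".
# A real system would likely let the LLM make this call; this is only a gate
# to keep write-backs rare during early prototyping.
FACT_PATTERNS = [
    re.compile(r"\bmy (\w+) is (called|named)\b", re.IGNORECASE),
    re.compile(r"\b(this is|meet) my \w+\b", re.IGNORECASE),
    re.compile(r"\bi (prefer|like|love|hate)\b", re.IGNORECASE),
    re.compile(r"\bactually\b.*\b(is|are)\b", re.IGNORECASE),  # corrections
]


def should_write(message: str) -> bool:
    """Return True if the message looks like a user-introduced fact."""
    return any(pattern.search(message) for pattern in FACT_PATTERNS)


examples = [
    "My dog is named Biscuit.",
    "What's the weather today?",
    "Actually, her name is Bisky.",
    "Can you summarize that?",
]
decisions = [should_write(text) for text in examples]
```

A gate like this trades recall for fewer writes: it will miss facts phrased in unexpected ways, so it is best treated as a starting point before handing the decision to the model.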
Agent Features
Memory
- RawMessageStore (append-only full logs)
- SemanticMemoryStore (editable high-level entries)
- evidence IDs linking semantic entries to raw logs
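The three memory components above can be sketched as plain data structures. This is a minimal in-memory illustration, not the authors' implementation: the class names follow the summary, but the method names (`append`, `fetch_range`, `evidence_for`), the integer message IDs, and the string entry keys are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class RawMessageStore:
    """Append-only log of every message; entries are never edited."""
    _messages: list = field(default_factory=list)

    def append(self, text: str) -> int:
        self._messages.append(text)
        return len(self._messages) - 1  # message ID = position in the log

    def fetch_range(self, start: int, end: int) -> list:
        """Return messages with IDs in [start, end], inclusive."""
        return self._messages[start:end + 1]


@dataclass
class SemanticEntry:
    summary: str
    evidence_ids: list  # IDs of the raw messages that support this summary


@dataclass
class SemanticMemoryStore:
    """Editable high-level entries, keyed by a hypothetical entry ID."""
    entries: dict = field(default_factory=dict)

    def evidence_for(self, entry_id: str, raw: RawMessageStore) -> list:
        """Narrow from a summary back to the raw messages behind it."""
        ids = self.entries[entry_id].evidence_ids
        return [raw.fetch_range(i, i)[0] for i in ids]


raw = RawMessageStore()
first = raw.append("User: This is my dog, Biscuit.")
raw.append("Assistant: Nice to meet Biscuit!")
second = raw.append("User: Biscuit is a corgi, by the way.")

semantic = SemanticMemoryStore()
semantic.entries["pet"] = SemanticEntry("User's dog Biscuit is a corgi.", [first, second])
evidence = semantic.evidence_for("pet", raw)
```

Keeping the raw log immutable means a bad summary can always be rebuilt from its evidence, while the semantic layer stays small enough to search and edit.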
Planning
- ReAct-style Query → Generate → Update workflow
- iterative multi-round retrieval and refinement
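The Query → Generate → Update workflow with multi-round retrieval can be outlined as a small control loop. The sketch below is a simplification under stated assumptions: the four callables stand in for LLM and memory calls, `max_rounds` and the stopping rule are hypothetical, and the update step is invoked every turn even though the real ChatAgent decides when to write.

```python
from typing import Callable, Optional


def run_turn(
    user_message: str,
    query_memory: Callable[[str], list],
    refine_query: Callable[[str, list], Optional[str]],
    generate: Callable[[str, list], str],
    update_memory: Callable[[str, str], None],
    max_rounds: int = 3,
) -> str:
    """One conversational turn: iterative Query, then Generate, then Update."""
    context: list = []
    query = user_message
    for _ in range(max_rounds):
        for hit in query_memory(query):          # Query
            if hit not in context:
                context.append(hit)
        next_query = refine_query(user_message, context)
        if next_query is None:                   # enough context gathered
            break
        query = next_query
    reply = generate(user_message, context)      # Generate
    update_memory(user_message, reply)           # Update (may be a no-op)
    return reply


# Toy stand-ins for the memory and model calls.
facts = {"dog": "User's dog is named Biscuit.", "biscuit": "Biscuit is a corgi."}
writes = []


def query_memory(query):
    return [fact for key, fact in facts.items() if key in query.lower()]


def refine_query(message, context):
    # Follow up on the dog's name if its breed has not been retrieved yet.
    knows_name = any("Biscuit" in c for c in context)
    knows_breed = any("corgi" in c for c in context)
    return "Biscuit" if knows_name and not knows_breed else None


def generate(message, context):
    return " ".join(context)


def update_memory(message, reply):
    writes.append((message, reply))


reply = run_turn("What breed is my dog?", query_memory, refine_query, generate, update_memory)
```

In this toy run the first round only finds the dog's name; the refined second query retrieves the breed, which is the kind of multi-hop lookup that a single-pass retriever would miss.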
Tool Use
- memory query (read)
- memory update (create/delete/replace)
- fetch raw messages by ID ranges
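The create/delete/replace update operations map naturally onto a small dispatcher. The following is a hedged sketch over a plain dictionary; the `apply_update` name, the string operation codes, and the error behavior are assumptions, since the summary does not specify the MemoryManager's tool schema.

```python
from typing import Optional


def apply_update(store: dict, op: str, entry_id: str, text: Optional[str] = None) -> None:
    """Apply one semantic-memory update to an in-memory store."""
    if op in ("create", "replace") and text is None:
        raise ValueError(f"{op} requires text")
    if op == "create":
        if entry_id in store:
            raise KeyError(f"{entry_id} already exists; use replace")
        store[entry_id] = text
    elif op == "replace":
        if entry_id not in store:
            raise KeyError(f"{entry_id} does not exist; use create")
        store[entry_id] = text
    elif op == "delete":
        del store[entry_id]  # raises KeyError if the entry is missing
    else:
        raise ValueError(f"unknown operation: {op}")


memory: dict = {}
apply_update(memory, "create", "pet", "User has a dog named Biscuit.")
apply_update(memory, "create", "city", "User lives in Lyon.")
apply_update(memory, "replace", "pet", "User's dog Biscuit is a corgi.")
apply_update(memory, "delete", "city")
```

Failing loudly on a create that collides or a replace that targets nothing makes misfired updates visible, which matters because silent overwrites are one of the failure modes of an editable memory.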
Frameworks
- ReAct-inspired agent workflow
- Progressive narrowing retrieval (semantic → raw)
Is Agentic
true
Architectures
- two-agent (ChatAgent + MemoryManager)
- dual-layer hybrid memory bank
Collaboration
- ChatAgent decides when to query or update
- MemoryManager executes read-write operations and reasoning
Optimization Features
Token Efficiency
- semantic summaries reduce prompt size compared to full logs
System Optimization
- Milvus vector store for semantic vectors
- RRF fusion (k=60) to combine retrieval paths
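Reciprocal Rank Fusion scores each candidate as the sum over retrieval paths of 1 / (k + rank), where rank is the candidate's 1-based position in that path's list. A minimal sketch with k = 60 follows; the function name, the message IDs, and the alphabetical tie-break are illustrative assumptions rather than details from the source.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse ranked ID lists; returns (id, score) pairs, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense = ["m3", "m1", "m7"]   # dense text path
sparse = ["m1", "m9", "m3"]  # BM25 path
image = ["m1", "m4"]         # cross-modal image path

fused = reciprocal_rank_fusion([dense, sparse, image])
fused_ids = [doc_id for doc_id, _ in fused]
```

Because RRF uses only ranks, it needs no score normalization across paths, which is convenient when BM25 scores and embedding similarities live on different scales.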
Inference Optimization
- vLLM for efficient local inference
- retrieve top-10 per path then fuse to limit candidates
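Capping each path at its top 10 bounds the candidate pool at 30 items for three paths before any fusion step runs. The sketch below runs three stub retrievers concurrently and truncates their output; the retriever functions, the `retrieve_all_paths` name, and the use of a thread pool are assumptions for illustration, not details from the paper.

```python
from concurrent.futures import ThreadPoolExecutor

TOP_K = 10  # per-path cutoff reported in the summary


def retrieve_all_paths(query: str, retrievers: dict, top_k: int = TOP_K) -> dict:
    """Run every retriever on the query in parallel and keep top_k per path."""
    with ThreadPoolExecutor(max_workers=len(retrievers)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in retrievers.items()}
        return {name: future.result()[:top_k] for name, future in futures.items()}


# Stub retrievers returning ranked IDs; real ones would query a vector
# store, a BM25 index, and an image-embedding index respectively.
def dense_search(query):
    return [f"d{i}" for i in range(25)]


def bm25_search(query):
    return [f"b{i}" for i in range(15)]


def image_search(query):
    return [f"i{i}" for i in range(5)]


paths = retrieve_all_paths(
    "photo of my dog",
    {"dense": dense_search, "bm25": bm25_search, "image": image_search},
)
candidates = set().union(*paths.values())
```

The per-path lists keep their rank order, so they can be handed directly to a rank-based fusion method such as RRF.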
Reproducibility
Code Urls
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation is on an enhanced, synthetic LoCoMo dataset with injected sessions; real-world distributions may differ.
- LLM-as-a-judge binary scoring can be lenient on temporal phrasing and may mask fine-grained errors.
- Memory operations add compute and storage overhead (vector indices, captioning, iterative retrieval).
- Quality of image captioning and cross-modal embeddings affects recall for visual-centric queries.
When Not To Use
- When strict privacy rules forbid storing or editing user logs in external memory.
- For latency-sensitive, single-turn tasks where full long-term memory provides no benefit.
- If deployment cannot support vector stores, image embedding pipelines, or iterative agent loops.
Failure Modes
- Stale or contradictory semantic entries if update/delete logic misfires.
- Missed retrievals for rare aliases if BM25 or embeddings fail to match.
- Hallucinated captions or misaligned cross-modal embeddings causing wrong image-based recalls.
- Overwriting correct long-term facts if automated updates are too aggressive.
Core Entities
Models
- M2A
- Yo'LLaVA
- MC-LLaVA
- A-MEM
- Mem0
- LoCoMo
- Qwen3-VL-32B
- Qwen3-VL-8B
- GPT-4o
- GPT-4o-mini
- GLM4.6V-Flash
Metrics
- Accuracy
Datasets
- LoCoMo (enhanced)
- Yo'LLaVA sessions
- MC-LLaVA sessions
Benchmarks
- enhanced LoCoMo

