M2A: editable dual-layer multimodal memory for evolving personalization

February 7, 2026 · 8 min

Overview

Decision Snapshot: Needs Validation

The system is modular and shows consistent gains on the provided long-dialog benchmark; ablations quantify component value, but production work remains for latency, privacy, and broader benchmarks.

Citations: 0

Evidence Strength: 0.90

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Junyu Feng, Binxiao Xu, Jiayi Chen, Mengyu Dai, Cenyang Wu, Haodong Li, Bohan Zeng, Yunliu Xie, Hao Liang, Ming Lu, Wentao Zhang

Links

Abstract / PDF / Code

Why It Matters For Business

Editable multimodal memory lets assistants evolve with users (names, images, preferences) and yields measurable accuracy gains on long conversations—helpful for retention and personalized UX.

Who Should Care

Summary TLDR

M2A is a two-agent system (ChatAgent + MemoryManager) that keeps an editable multimodal memory across long, multi-session dialogs. Memory has two layers: an append-only RawMessageStore (full logs) and a SemanticMemoryStore (high-level summaries) linked by evidence IDs. Retrieval uses three parallel paths—dense text, BM25 sparse text, and cross-modal image embeddings—fused by Reciprocal Rank Fusion. M2A updates memory during conversation (Query → Generate → Update) and shows sizable accuracy gains on an enhanced LoCoMo benchmark (e.g., 44.64% vs 33.27% avg on GPT-4o-mini; 54.69% vs 43.95% on Qwen3-VL-8B). Ablations show dual-layer, iterative retrieval, and tri-path retrieval each contribute (

Problem Statement

Personalized multimodal assistants must remember and evolve user-specific concepts, names, images, and preferences across weeks or months. Current methods either bake concepts into fixed model tokens or store static profiles; both fail when users refine or correct concepts over time or when long conversations exceed the model context window. The practical need is an editable, multimodal memory that can be queried and updated autonomously during long-term interactions.

Main Contribution

Agentic online personalized memory: two cooperating agents let the system decide when to read or write user memory during conversation.

Dual-layer hybrid memory: RawMessageStore (immutable logs) + SemanticMemoryStore (high-level entries) linked by evidence IDs for progressive narrowing.
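The dual-layer idea above can be sketched as two small stores. This is a minimal illustration, not the authors' implementation: the class names `RawMessageStore` and `SemanticMemoryStore` come from the summary, but every method name, field, and the substring-based `search` are assumptions made for the example. The sample dialog is invented.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RawMessage:
    msg_id: int
    role: str
    text: str


@dataclass
class SemanticEntry:
    entry_id: int
    summary: str
    # IDs of the raw messages that support this summary (the "evidence IDs").
    evidence_ids: list[int] = field(default_factory=list)


class RawMessageStore:
    """Append-only log: messages are added, never edited or removed."""

    def __init__(self) -> None:
        self._messages: list[RawMessage] = []

    def append(self, role: str, text: str) -> int:
        msg_id = len(self._messages)
        self._messages.append(RawMessage(msg_id, role, text))
        return msg_id

    def fetch_range(self, start: int, end: int) -> list[RawMessage]:
        """Return messages with start <= msg_id <= end."""
        return self._messages[max(start, 0):end + 1]


class SemanticMemoryStore:
    """Editable high-level entries that point back to raw evidence."""

    def __init__(self) -> None:
        self._entries: dict[int, SemanticEntry] = {}
        self._next_id = 0

    def create(self, summary: str, evidence_ids: list[int]) -> int:
        entry_id = self._next_id
        self._next_id += 1
        self._entries[entry_id] = SemanticEntry(entry_id, summary, list(evidence_ids))
        return entry_id

    def replace(self, entry_id: int, summary: str, evidence_ids: list[int]) -> None:
        if entry_id not in self._entries:
            raise KeyError(entry_id)
        self._entries[entry_id] = SemanticEntry(entry_id, summary, list(evidence_ids))

    def delete(self, entry_id: int) -> None:
        del self._entries[entry_id]

    def search(self, keyword: str) -> list[SemanticEntry]:
        # Hypothetical stand-in for real retrieval: naive substring match.
        needle = keyword.lower()
        return [e for e in self._entries.values() if needle in e.summary.lower()]


raw = RawMessageStore()
sem = SemanticMemoryStore()

m0 = raw.append("user", "This is my dog Biscuit.")
m1 = raw.append("user", "Actually, I renamed him to Mochi.")

entry = sem.create("User's dog is named Biscuit", [m0])
# The user corrected the concept, so the semantic entry is edited in place
# while both raw messages remain untouched in the log.
sem.replace(entry, "User's dog is named Mochi (formerly Biscuit)", [m0, m1])

# Progressive narrowing: find the summary first, then pull its raw evidence.
hit = sem.search("dog")[0]
evidence = raw.fetch_range(min(hit.evidence_ids), max(hit.evidence_ids))
```

The design point the sketch tries to show is that edits only ever touch the semantic layer; the raw log stays available as ground truth when a summary is too coarse.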

Key Findings

M2A improves average correctness on the enhanced LoCoMo benchmark versus a single-pass RAG baseline

Numbers: GPT-4o-mini Avg: M2A 44.64% vs RAG 33.27% (≈+11.4 pp)

Practical Use: Expect noticeably better personalized answers on long, multi-session dialogs by adding editable multimodal memory and agentic updates.

Evidence Ref: Table 1; Section 5.2

Dual-layer memory and iterative retrieval materially boost accuracy

Numbers: Ablation on Qwen3-VL-8B: w/o Dual-layer −13.31 pp; w/o Iterative −16.02 pp

Practical Use: Keep both semantic summaries and raw logs, and use multi-round retrieval rather than single-pass search to recover fine details.

Evidence Ref: Table 2; Section 5.3

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | GPT-4o-mini: M2A 44.64%; RAG 33.27%; Mem0 34.73%; A-MEM 36.26% | RAG | +11.37 pp vs RAG | enhanced LoCoMo (all categories) | Table 1, Section 5.2 | Table 1 |
| Accuracy | Qwen3-VL-8B: M2A 54.69%; best baseline 43.95% | best baseline | +10.74 pp vs best baseline | enhanced LoCoMo (all categories) | Table 1, Section 5.2 | Table 1 |
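The deltas in the table can be rechecked with a few lines of arithmetic. The sketch below uses only the accuracy figures quoted above. The relative-gain column is derived here, not taken from the paper, and the helper names are illustrative. Note that the two rows use different reference points: the GPT-4o-mini delta is measured against RAG, while the strongest baseline listed for that model is A-MEM.

```python
def pp_delta(system: float, baseline: float) -> float:
    """Absolute difference in percentage points."""
    return round(system - baseline, 2)


def relative_gain(system: float, baseline: float) -> float:
    """Improvement relative to the baseline, in percent (derived, not reported)."""
    return round(100 * (system - baseline) / baseline, 1)


# Accuracy values (%) as quoted in the Results table.
gpt4o_mini = {"M2A": 44.64, "RAG": 33.27, "Mem0": 34.73, "A-MEM": 36.26}
qwen3_vl_8b = {"M2A": 54.69, "best baseline": 43.95}

delta_vs_rag = pp_delta(gpt4o_mini["M2A"], gpt4o_mini["RAG"])

# Strongest GPT-4o-mini baseline among those listed in the table.
best_name = max((k for k in gpt4o_mini if k != "M2A"), key=gpt4o_mini.get)
delta_vs_best = pp_delta(gpt4o_mini["M2A"], gpt4o_mini[best_name])

delta_qwen = pp_delta(qwen3_vl_8b["M2A"], qwen3_vl_8b["best baseline"])

print(f"GPT-4o-mini vs RAG: +{delta_vs_rag} pp")
print(f"GPT-4o-mini vs {best_name}: +{delta_vs_best} pp")
print(f"Qwen3-VL-8B vs best baseline: +{delta_qwen} pp")
print(f"Relative gain over RAG: {relative_gain(gpt4o_mini['M2A'], gpt4o_mini['RAG'])}%")
```

Against the strongest listed GPT-4o-mini baseline the margin is smaller than the headline figure against RAG, which is worth keeping in mind when comparing the two rows.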

What To Try In 7 Days

Prototype a dual-layer store: append-only raw logs + semantic summaries linked by evidence IDs.

Add tri-path retrieval (dense text, BM25, image embeddings) and fuse the ranked lists with RRF for robust recall.

Implement a simple ChatAgent that triggers updates only on clear user-introduced facts to limit write-backs.

Agent Features

Memory
RawMessageStore (append-only full logs); SemanticMemoryStore (editable high-level entries); evidence IDs linking semantic entries to raw logs
Planning
ReAct-style Query → Generate → Update workflow; iterative multi-round retrieval and refinement
Tool Use
memory query (read); memory update (create/delete/replace); fetch raw messages by ID ranges
Frameworks
ReAct-inspired agent workflow; progressive narrowing retrieval (semantic → raw)
Is Agentic

Yes

Architectures
two-agent (ChatAgent + MemoryManager); dual-layer hybrid memory bank
Collaboration
ChatAgent decides when to query or update; MemoryManager executes read-write operations and reasoning
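The division of labor listed above (ChatAgent decides, MemoryManager executes) and the Query → Generate → Update order can be sketched as a toy loop. This is only an illustration under strong assumptions: in the described system both roles are LLM-driven, whereas here a regular expression stands in for the ChatAgent's decision to write, and a keyword lookup stands in for retrieval. All method names and the keyed dictionary store are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MemoryManager:
    """Executes read and write operations on a toy keyed store."""
    entries: dict[str, str] = field(default_factory=dict)

    def query(self, text: str) -> list[str]:
        # Stand-in for retrieval: return entries whose key occurs in the text.
        lowered = text.lower()
        return [value for key, value in self.entries.items() if key in lowered]

    def update(self, op: str, key: str, value: Optional[str] = None) -> None:
        if op in ("create", "replace"):
            if value is None:
                raise ValueError(f"{op} requires a value")
            self.entries[key] = value
        elif op == "delete":
            self.entries.pop(key, None)
        else:
            raise ValueError(f"unknown op: {op}")


class ChatAgent:
    """Decides when to read or write; delegates execution to MemoryManager."""

    # Placeholder for an LLM judgment: only "my X is [named|called] Y" counts
    # as a clear user-introduced fact worth writing back.
    _FACT = re.compile(r"\bmy (\w+) is (?:called |named )?(\w+)", re.IGNORECASE)

    def __init__(self, memory: MemoryManager) -> None:
        self.memory = memory

    def respond(self, user_msg: str) -> str:
        # 1. Query: read memory relevant to the incoming message.
        recalled = self.memory.query(user_msg)

        # 2. Generate: a real system would prompt a model with `recalled`.
        if recalled:
            reply = "From memory: " + "; ".join(recalled)
        else:
            reply = "I don't have anything stored about that yet."

        # 3. Update: write back only when a clear fact was stated.
        match = self._FACT.search(user_msg)
        if match:
            key, value = match.group(1).lower(), match.group(2)
            op = "replace" if key in self.memory.entries else "create"
            self.memory.update(op, key, f"User's {key} is {value}")
        return reply


agent = ChatAgent(MemoryManager())
first = agent.respond("My dog is named Mochi")
second = agent.respond("What do you know about my dog?")
agent.respond("My dog is called Biscuit")  # correction triggers a replace
third = agent.respond("Remind me about my dog")
```

Gating the write step on an explicit fact pattern mirrors the suggestion to limit write-backs; a production system would replace the pattern with a model decision and add iterative retrieval rounds.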

Optimization Features

Token Efficiency
semantic summaries reduce prompt size compared to full logs
System Optimization
Milvus vector store for semantic vectors; RRF fusion (k=60) to combine retrieval paths
Inference Optimization
vLLM for efficient local inference; retrieve top-10 per path, then fuse to limit candidates
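Reciprocal Rank Fusion scores each candidate as the sum of 1/(k + rank) over the ranked lists it appears in. The sketch below applies the standard formula with the settings mentioned above (k=60, top-10 per path). The function name, the tie-breaking rule, and the message IDs in the example are assumptions for illustration, not details from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Fuse ranked ID lists; each list is ordered best-first.

    Only the first `top_n` items of each list contribute. Ties are broken
    by ID so the output is deterministic (an assumption of this sketch).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking[:top_n], start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the three retrieval paths.
dense_hits = ["m3", "m1", "m7"]   # dense text embeddings
bm25_hits = ["m1", "m9", "m3"]    # sparse BM25
image_hits = ["m5", "m1"]         # cross-modal image embeddings

fused = reciprocal_rank_fusion([dense_hits, bm25_hits, image_hits])
print(fused)
```

Because RRF uses only ranks, the three paths do not need comparable score scales, which is convenient when mixing BM25 scores with embedding similarities.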

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluation is on an enhanced, synthetic LoCoMo dataset with injected sessions; real-world distributions may differ.

LLM-as-a-judge binary scoring can be lenient on temporal phrasing and may mask fine-grained errors.

When Not To Use

When strict privacy rules forbid storing or editing user logs in external memory.

For latency-sensitive, single-turn tasks where full long-term memory provides no benefit.

Failure Modes

Stale or contradictory semantic entries if update/delete logic misfires.

Missed retrievals for rare aliases if BM25 or embeddings fail to match.

Core Entities

Models

M2A, Yo'LLaVA, MC-LLaVA, A-MEM, Mem0, LoCoMo, Qwen3-VL-32B, Qwen3-VL-8B, GPT-4o, GPT-4o-mini, GLM4.6V-Flash

Metrics

Accuracy

Datasets

LoCoMo (enhanced), Yo'LLaVA sessions, MC-LLaVA sessions

Benchmarks

enhanced LoCoMo