M2A: editable dual-layer multimodal memory for evolving personalization

Overview

Decision Snapshot: Needs Validation

The system is modular and shows consistent gains on the provided long-dialog benchmark; ablations quantify component value, but production work remains for latency, privacy, and broader benchmarks.

Citations: 0

Evidence Strength: 0.90

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Junyu Feng, Binxiao Xu, Jiayi Chen, Mengyu Dai, Cenyang Wu, Haodong Li, Bohan Zeng, Yunliu Xie, Hao Liang, Ming Lu, Wentao Zhang

Links

Abstract / PDF / Code

Why It Matters For Business

Editable multimodal memory lets assistants evolve with users (names, images, preferences) and yields measurable accuracy gains on long conversations—helpful for retention and personalized UX.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

M2A is a two-agent system (ChatAgent + MemoryManager) that keeps an editable multimodal memory across long, multi-session dialogs. Memory has two layers: an append-only RawMessageStore (full logs) and a SemanticMemoryStore (high-level summaries) linked by evidence IDs. Retrieval uses three parallel paths—dense text, BM25 sparse text, and cross-modal image embeddings—fused by Reciprocal Rank Fusion. M2A updates memory during conversation (Query → Generate → Update) and shows sizable accuracy gains on an enhanced LoCoMo benchmark (e.g., 44.64% vs 33.27% avg on GPT-4o-mini; 54.69% vs 43.95% on Qwen3-VL-8B). Ablations show dual-layer, iterative retrieval, and tri-path retrieval each contribute (

Problem Statement

Personalized multimodal assistants must remember and evolve user-specific concepts, names, images, and preferences across weeks or months. Current methods either bake concepts into fixed model tokens or store static profiles; both fail when users refine or correct concepts over time or when long conversations exceed the model context window. The practical need is an editable, multimodal memory that can be queried and updated autonomously during long-term interactions.

Main Contribution

Agentic online personalized memory: two cooperating agents let the system decide when to read or write user memory during conversation.

Dual-layer hybrid memory: RawMessageStore (immutable logs) + SemanticMemoryStore (high-level entries) linked by evidence IDs for progressive narrowing.
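A minimal sketch of how the two layers might be linked, assuming plain in-memory Python containers. The class and method names (`DualLayerMemory`, `append_raw`, `add_entry`, `narrow`) and the example messages are hypothetical and are not taken from the M2A code; only the idea of an append-only raw log, editable semantic entries, and evidence IDs connecting them comes from the summary above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RawMessage:
    msg_id: int
    role: str
    text: str


@dataclass
class SemanticEntry:
    entry_id: int
    summary: str
    evidence_ids: list  # IDs of the raw messages that support this summary


@dataclass
class DualLayerMemory:
    raw: list = field(default_factory=list)       # append-only layer (full logs)
    semantic: dict = field(default_factory=dict)  # editable layer (summaries)

    def append_raw(self, role: str, text: str) -> int:
        """Append a message to the raw log and return its ID; nothing is removed."""
        msg_id = len(self.raw)
        self.raw.append(RawMessage(msg_id, role, text))
        return msg_id

    def add_entry(self, summary: str, evidence_ids: list) -> int:
        """Create a semantic entry that points back at its supporting raw messages."""
        entry_id = max(self.semantic, default=-1) + 1
        self.semantic[entry_id] = SemanticEntry(entry_id, summary, list(evidence_ids))
        return entry_id

    def narrow(self, entry_id: int) -> list:
        """Progressive narrowing: follow evidence IDs from a summary to raw messages."""
        return [self.raw[i] for i in self.semantic[entry_id].evidence_ids]


memory = DualLayerMemory()
first = memory.append_raw("user", "This is my dog Biscuit.")
second = memory.append_raw("user", "Biscuit is a corgi, not a beagle.")
entry = memory.add_entry("User's dog Biscuit is a corgi.", [first, second])
evidence_texts = [m.text for m in memory.narrow(entry)]
```

Keeping the raw layer immutable means a summary can be rewritten or deleted later without losing the messages it was derived from.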

Key Findings

M2A improves average correctness on the enhanced LoCoMo benchmark versus a single-pass RAG baseline

Numbers: GPT-4o-mini Avg: M2A 44.64% vs RAG 33.27% (≈+11.4 pp)

Practical Use: Expect noticeably better personalized answers on long, multi-session dialogs by adding editable multimodal memory and agentic updates.

Evidence Ref: Table 1; Section 5.2

Dual-layer memory and iterative retrieval materially boost accuracy

Numbers: Ablation on Qwen3-VL-8B: w/o Dual-layer −13.31 pp; w/o Iterative −16.02 pp

Practical Use: Keep both semantic summaries and raw logs, and use multi-round retrieval rather than single-pass search to recover fine details.

Evidence Ref: Table 2; Section 5.3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	GPT-4o-mini: M2A 44.64%; RAG 33.27%; Mem0 34.73%; A-MEM 36.26%	RAG	+11.37 pp vs RAG	enhanced LoCoMo (all categories)	Table 1, Section 5.2	Table 1
Accuracy	Qwen3-VL-8B: M2A 54.69%; best baseline 43.95%	best baseline	+10.74 pp vs best baseline	enhanced LoCoMo (all categories)	Table 1, Section 5.2	Table 1

What To Try In 7 Days

Prototype a dual-layer store: append-only raw logs + semantic summaries linked by evidence IDs.

Add tri-path retrieval (dense text, BM25, image embeddings) and fuse the results with RRF for robust recall.

Implement a simple ChatAgent that triggers updates only on clear user-introduced facts to limit write-backs.

Agent Features

Memory

RawMessageStore (append-only full logs)

SemanticMemoryStore (editable high-level entries)

evidence IDs linking semantic entries to raw logs

Planning

ReAct-style Query → Generate → Update workflow

iterative multi-round retrieval and refinement
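The Query → Generate → Update workflow with multi-round retrieval could be sketched roughly as below. This is an illustration under stated assumptions, not the authors' implementation: `run_turn`, its four callables, the `max_rounds` limit, and the toy stubs are all hypothetical, and the sketch always calls the update step, whereas the summary says the ChatAgent decides when to update.

```python
from typing import Callable, Optional


def run_turn(
    user_message: str,
    query_memory: Callable[[str], list],
    refine_query: Callable[[str, list], Optional[str]],
    generate: Callable[[str, list], str],
    update_memory: Callable[[str, str], None],
    max_rounds: int = 3,  # assumed cap on retrieval rounds
) -> str:
    context: list = []
    query = user_message
    # Query: retrieve, then optionally issue a refined follow-up query.
    for _ in range(max_rounds):
        for hit in query_memory(query):
            if hit not in context:
                context.append(hit)
        follow_up = refine_query(user_message, context)
        if follow_up is None:
            break
        query = follow_up
    # Generate: answer using everything gathered so far.
    reply = generate(user_message, context)
    # Update: hand the turn to the memory writer.
    update_memory(user_message, reply)
    return reply


# Toy stand-ins for the agents (hypothetical, keyword-based).
facts = {"dog": "User's dog is named Biscuit.", "Biscuit": "Biscuit is a corgi."}
written = []


def toy_query(q):
    return [fact for key, fact in facts.items() if key in q]


def toy_refine(message, context):
    # Ask one follow-up about Biscuit once the name has surfaced.
    if any("Biscuit" in c for c in context) and facts["Biscuit"] not in context:
        return "Biscuit"
    return None


def toy_generate(message, context):
    return " ".join(context)


def toy_update(message, reply):
    written.append(message)


reply = run_turn("What breed is my dog?", toy_query, toy_refine, toy_generate, toy_update)
```

In the toy run, the first round only surfaces the dog's name; the second round, driven by the refined query, recovers the breed. That is the kind of detail a single-pass search would miss.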

Tool Use

memory query (read)

memory update (create/delete/replace)

fetch raw messages by ID ranges
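As a rough illustration of the update and fetch tools listed above, the sketch below applies create/delete/replace operations to a dictionary of semantic entries and fetches raw messages by inclusive ID ranges. The function names `apply_update` and `fetch_raw`, the operation strings, and the error handling are assumptions made for this example; the actual tool signatures in M2A are not given in this summary.

```python
from typing import Optional


def apply_update(store: dict, op: str, entry_id: int, text: Optional[str] = None) -> None:
    """Apply one edit to the semantic layer (hypothetical operation names)."""
    if op == "create":
        if entry_id in store:
            raise KeyError(f"entry {entry_id} already exists")
        store[entry_id] = text
    elif op == "replace":
        if entry_id not in store:
            raise KeyError(f"entry {entry_id} does not exist")
        store[entry_id] = text
    elif op == "delete":
        del store[entry_id]
    else:
        raise ValueError(f"unknown operation: {op}")


def fetch_raw(raw_log: list, id_ranges: list) -> list:
    """Return raw messages for inclusive (start, end) ID ranges, in log order, deduplicated."""
    wanted = sorted(
        {i for start, end in id_ranges for i in range(start, end + 1) if 0 <= i < len(raw_log)}
    )
    return [raw_log[i] for i in wanted]


semantic = {}
apply_update(semantic, "create", 0, "User's dog Biscuit is a beagle.")
apply_update(semantic, "replace", 0, "User's dog Biscuit is a corgi.")  # user correction
apply_update(semantic, "create", 1, "User likes hiking.")
apply_update(semantic, "delete", 1)

raw_log = ["msg 0", "msg 1", "msg 2", "msg 3", "msg 4"]
fetched = fetch_raw(raw_log, [(0, 1), (1, 3)])  # overlapping ranges are merged
```

Only the semantic layer is edited here; the raw log is read but never modified, which matches the append-only description of RawMessageStore.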

Frameworks

ReAct-inspired agent workflow

Progressive narrowing retrieval (semantic → raw)

Is Agentic

Yes

Architectures

two-agent (ChatAgent + MemoryManager)

dual-layer hybrid memory bank

Collaboration

ChatAgent decides when to query or update

MemoryManager executes read-write operations and reasoning

Optimization Features

Token Efficiency

semantic summaries reduce prompt size compared to full logs

System Optimization

Milvus vector store for semantic vectors

RRF fusion (k=60) to combine retrieval paths

Inference Optimization

vLLM for efficient local inference

retrieve top-10 per path then fuse to limit candidates
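Reciprocal Rank Fusion with k=60 over the top-10 results of each path can be sketched as follows. Each document receives a score of 1/(k + rank) from every path that returns it, and the scores are summed. The function name `rrf_fuse`, the message IDs, and the alphabetical tie-break are assumptions for illustration; the three input lists stand in for the dense-text, BM25, and image-embedding rankings.

```python
from collections import defaultdict


def rrf_fuse(rankings: list, k: int = 60, top_n: int = 10) -> list:
    """Fuse ranked ID lists with Reciprocal Rank Fusion.

    Only the top_n results of each path are considered; ranks start at 1.
    Ties are broken by ID so the output is deterministic (an assumption).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking[:top_n], start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["m3", "m1", "m7"]   # dense text path
sparse = ["m1", "m3", "m9"]  # BM25 path
image = ["m1", "m5"]         # cross-modal image path
fused = rrf_fuse([dense, sparse, image])
# fused == ["m1", "m3", "m5", "m7", "m9"]
```

Because RRF uses only ranks, the three paths do not need comparable score scales, which is convenient when mixing BM25 scores with embedding similarities.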

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/Little-Fridge/M2A

Risks & Boundaries

Limitations

Evaluation is on an enhanced, synthetic LoCoMo dataset with injected sessions; real-world distributions may differ.

LLM-as-a-judge binary scoring can be lenient on temporal phrasing and may mask fine-grained errors.

When Not To Use

When strict privacy rules forbid storing or editing user logs in external memory.

For latency-sensitive, single-turn tasks where full long-term memory provides no benefit.

Failure Modes

Stale or contradictory semantic entries if update/delete logic misfires.

Missed retrievals for rare aliases if BM25 or embeddings fail to match.

Core Entities

Models

M2A, Yo'LLaVA, MC-LLaVA, A-MEM, Mem0, LoCoMo, Qwen3-VL-32B, Qwen3-VL-8B, GPT-4o, GPT-4o-mini, GLM4.6V-Flash

Metrics

Accuracy

Datasets

LoCoMo (enhanced), Yo'LLaVA sessions, MC-LLaVA sessions

Benchmarks

enhanced LoCoMo
