M2A: editable dual-layer multimodal memory for evolving personalization

February 7, 2026 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Junyu Feng, Binxiao Xu, Jiayi Chen, Mengyu Dai, Cenyang Wu, Haodong Li, Bohan Zeng, Yunliu Xie, Hao Liang, Ming Lu, Wentao Zhang

Why It Matters For Business

Editable multimodal memory lets assistants evolve with users (names, images, preferences) and yields measurable accuracy gains on long conversations—helpful for retention and personalized UX.

Summary TLDR

M2A is a two-agent system (ChatAgent + MemoryManager) that keeps an editable multimodal memory across long, multi-session dialogs. Memory has two layers: an append-only RawMessageStore (full logs) and a SemanticMemoryStore (high-level summaries) linked by evidence IDs. Retrieval uses three parallel paths (dense text, BM25 sparse text, and cross-modal image embeddings) fused by Reciprocal Rank Fusion. M2A updates memory during conversation (Query → Generate → Update) and shows sizable accuracy gains on an enhanced LoCoMo benchmark (e.g., 44.64% vs 33.27% avg on GPT-4o-mini; 54.69% vs 43.95% on Qwen3-VL-8B). Ablations show that the dual-layer memory, iterative retrieval, and tri-path retrieval each contribute.

Problem Statement

Personalized multimodal assistants must remember and evolve user-specific concepts, names, images, and preferences across weeks or months. Current methods either bake concepts into fixed model tokens or store static profiles; both fail when users refine or correct concepts over time or when long conversations exceed the model context window. The practical need is an editable, multimodal memory that can be queried and updated autonomously during long-term interactions.

Main Contribution

Agentic online personalized memory: two cooperating agents let the system decide when to read or write user memory during conversation.

Dual-layer hybrid memory: RawMessageStore (immutable logs) + SemanticMemoryStore (high-level entries) linked by evidence IDs for progressive narrowing.

Tri-path multimodal retrieval: dense text, BM25 sparse text, and cross-modal image embeddings fused by Reciprocal Rank Fusion.

Reusable multimodal data synthesis: injects concept-grounded sessions into long dialogs to train and evaluate memory-driven personalization.

Key Findings

M2A improves average correctness on the enhanced LoCoMo benchmark versus a single-pass RAG baseline

Numbers: GPT-4o-mini Avg: M2A 44.64% vs RAG 33.27% (≈+11.4 pp)

Dual-layer memory and iterative retrieval materially boost accuracy

Numbers: Ablation on Qwen3-VL-8B: w/o Dual-layer −13.31 pp; w/o Iterative −16.02 pp

Tri-path retrieval improves robustness to names, semantics, and images

Numbers: Ablation: w/o Tri-path −4.10 pp on Qwen3-VL-8B

Results

Accuracy

Value: GPT-4o-mini: M2A 44.64% | RAG 33.27% | Mem0 34.73% | A-MEM 36.26%

Baseline: RAG

Accuracy

Value: Qwen3-VL-8B: M2A 54.69% | best baseline 43.95%

Baseline: best baseline

Ablation: remove dual-layer memory

Value: Avg drops by 13.31 percentage points

Baseline: M2A full

Ablation: remove iterative retrieval

Value: Avg drops by 16.02 percentage points

Baseline: M2A full

Dataset size

Value: 10 long conversations, avg 621 turns, ~10k tokens, 214 images injected
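
The two headline accuracy comparisons above can be restated as absolute gains (percentage points) and relative gains. The short calculation below is only arithmetic on the reported figures. The dictionary layout and the `gains` helper are illustrative names, not something from the paper.

```python
# Accuracy figures (percent) copied from the results listed above.
results = {
    "GPT-4o-mini": {"m2a": 44.64, "baseline": 33.27},  # baseline = RAG
    "Qwen3-VL-8B": {"m2a": 54.69, "baseline": 43.95},  # baseline = best baseline
}


def gains(m2a: float, baseline: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = m2a - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 1)


summary = {name: gains(**scores) for name, scores in results.items()}
for name, (pp, rel) in summary.items():
    print(f"{name}: +{pp} pp absolute, +{rel}% relative")
```

The absolute gain is the figure the digest quotes (≈+11.4 pp on GPT-4o-mini). The relative gain is larger because the baselines start from a low accuracy.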

Who Should Care

What To Try In 7 Days

Prototype a dual-layer store: append-only raw logs + semantic summaries linked by evidence IDs.

Add tri-path retrieval (dense text, BM25, image embeddings) and fuse with RRF for robust recall.

Implement a simple ChatAgent that triggers updates only on clear user-introduced facts to limit write-backs.
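
For the last item, a cheap first prototype might gate write-backs with a few text patterns before any LLM-based decision is added. This is a minimal sketch under stated assumptions: in M2A the ChatAgent itself decides when to update, so the regex patterns and the `should_update_memory` function here are a hypothetical stand-in, not the paper's mechanism.

```python
import re

# Hypothetical trigger phrases where a user introduces or corrects a fact.
# These patterns are illustrative assumptions, not rules from the paper.
FACT_PATTERNS = [
    re.compile(r"\b(this is my|call me|my name is|i prefer)\b", re.IGNORECASE),
    re.compile(r"\b(actually|correction)\b.*\b(is|are)\b", re.IGNORECASE),
]


def should_update_memory(user_message: str) -> bool:
    """Return True only when the message clearly states or corrects a fact."""
    text = user_message.strip()
    if text.endswith("?"):
        # Questions read memory; they do not write to it.
        return False
    return any(pattern.search(text) for pattern in FACT_PATTERNS)


decisions = {
    msg: should_update_memory(msg)
    for msg in [
        "This is my dog Biscuit.",
        "Actually, her name is Mocha, not Coco.",
        "What's the weather like?",
        "Thanks, that helps.",
    ]
}
print(decisions)
```

A conservative gate like this trades recall for safety: it will miss facts phrased in unexpected ways, but it keeps automated updates from overwriting memory on every turn.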

Agent Features

Memory

  • RawMessageStore (append-only full logs)
  • SemanticMemoryStore (editable high-level entries)
  • evidence IDs linking semantic entries to raw logs
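
The three memory components listed above could be prototyped roughly as follows. This is a sketch, assuming in-memory Python containers and integer IDs. Apart from the two store names, every class, field, and method name is hypothetical, and the paper's actual schema may differ.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RawMessage:
    msg_id: int
    role: str
    text: str


class RawMessageStore:
    """Append-only log: messages are added, never edited or removed."""

    def __init__(self) -> None:
        self._log: list[RawMessage] = []

    def append(self, role: str, text: str) -> int:
        msg_id = len(self._log)  # IDs are positions in the log
        self._log.append(RawMessage(msg_id, role, text))
        return msg_id

    def get(self, msg_id: int) -> RawMessage:
        return self._log[msg_id]


@dataclass
class SemanticEntry:
    summary: str
    evidence_ids: list[int] = field(default_factory=list)  # links into the raw log


class SemanticMemoryStore:
    """Editable high-level entries, each pointing back at its raw evidence."""

    def __init__(self) -> None:
        self.entries: dict[int, SemanticEntry] = {}
        self._next_id = 0

    def create(self, summary: str, evidence_ids: list[int]) -> int:
        entry_id = self._next_id
        self._next_id += 1
        self.entries[entry_id] = SemanticEntry(summary, list(evidence_ids))
        return entry_id


raw = RawMessageStore()
first = raw.append("user", "This is my dog Biscuit.")
second = raw.append("assistant", "Nice to meet Biscuit!")

semantic = SemanticMemoryStore()
entry_id = semantic.create("User's dog is named Biscuit", [first])

# Progressive narrowing: start from the summary, then follow evidence IDs to the log.
evidence = [raw.get(i).text for i in semantic.entries[entry_id].evidence_ids]
print(evidence)
```

Keeping the raw log immutable means a wrong or stale summary can always be rebuilt from its evidence, which is the main reason to link the two layers by ID rather than copy text between them.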

Planning

  • ReAct-style Query → Generate → Update workflow
  • iterative multi-round retrieval and refinement
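
The Query → Generate → Update workflow with multi-round retrieval might look like the loop below. It is a toy sketch: keyword overlap stands in for the real retrievers, a string template stands in for the LLM, and all function names (`answer_turn`, `keyword_retrieve`, `template_generate`, `extract_fact`) are hypothetical.

```python
from typing import Callable, Optional


def answer_turn(
    user_msg: str,
    memory: list[str],
    retrieve: Callable[[str, list[str]], list[str]],
    generate: Callable[[str, list[str]], str],
    extract_fact: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> str:
    # Query: retrieve repeatedly, folding found evidence back into the query.
    query, evidence = user_msg, []
    for _ in range(max_rounds):
        hits = [h for h in retrieve(query, memory) if h not in evidence]
        if not hits:
            break
        evidence.extend(hits)
        query = user_msg + " " + " ".join(evidence)
    # Generate: answer from the gathered evidence.
    reply = generate(user_msg, evidence)
    # Update: write back only if the turn carries a new fact.
    fact = extract_fact(user_msg)
    if fact is not None:
        memory.append(fact)
    return reply


def keyword_retrieve(query: str, memory: list[str]) -> list[str]:
    words = set(query.lower().split())
    return [m for m in memory if words & set(m.lower().split())]


def template_generate(user_msg: str, evidence: list[str]) -> str:
    return f"Based on {len(evidence)} memory entries: " + "; ".join(evidence)


def extract_fact(user_msg: str) -> Optional[str]:
    return user_msg if user_msg.lower().startswith("remember") else None


memory = ["my dog is biscuit", "biscuit likes the red ball"]
reply = answer_turn("what toy does my dog like", memory,
                    keyword_retrieve, template_generate, extract_fact)
print(reply)
answer_turn("Remember that biscuit also likes naps", memory,
            keyword_retrieve, template_generate, extract_fact)
print(len(memory))
```

In the first call, round one only finds the entry that mentions "dog"; the second entry is reached in round two, once "biscuit" has entered the refined query. That second hop is what a single-pass retriever would miss.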

Tool Use

  • memory query (read)
  • memory update (create/delete/replace)
  • fetch raw messages by ID ranges
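
The three tools above could be exposed to the agent through a small interface like the one sketched here. The class name, method signatures, substring-based `query`, and inclusive ID ranges are all assumptions made for illustration; the source only names the operations.

```python
from itertools import count
from typing import Optional


class MemoryTools:
    """Hypothetical tool surface: query, update (create/delete/replace), fetch_raw."""

    def __init__(self, raw_log: list[str]) -> None:
        self.raw_log = list(raw_log)       # raw message ID = list index
        self.entries: dict[int, str] = {}  # semantic entry ID -> summary
        self._ids = count()                # entry IDs are never reused

    def query(self, keyword: str) -> dict[int, str]:
        # Substring match is a placeholder for real retrieval.
        return {i: t for i, t in self.entries.items() if keyword.lower() in t.lower()}

    def update(self, op: str, entry_id: Optional[int] = None,
               text: Optional[str] = None) -> int:
        if op == "create":
            entry_id = next(self._ids)
            self.entries[entry_id] = text
        elif op == "replace":
            if entry_id not in self.entries:
                raise KeyError(entry_id)
            self.entries[entry_id] = text
        elif op == "delete":
            del self.entries[entry_id]
        else:
            raise ValueError(f"unknown operation: {op}")
        return entry_id

    def fetch_raw(self, start: int, end: int) -> list[str]:
        # Assumed inclusive on both ends.
        return self.raw_log[start:end + 1]


tools = MemoryTools(["hi", "This is my cat Coco", "Cute!", "Actually her name is Mocha"])
eid = tools.update("create", text="User's cat is named Coco")
tools.update("replace", entry_id=eid, text="User's cat is named Mocha")
hits = tools.query("cat")
evidence = tools.fetch_raw(1, 3)
print(hits, evidence)
```

Routing every write through one `update` entry point makes it easier to log or veto aggressive edits, which matters given the overwrite risk noted under failure modes.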

Frameworks

  • ReAct-inspired agent workflow
  • Progressive narrowing retrieval (semantic → raw)

Is Agentic

true

Architectures

  • two-agent (ChatAgent + MemoryManager)
  • dual-layer hybrid memory bank

Collaboration

  • ChatAgent decides when to query or update
  • MemoryManager executes read-write operations and reasoning

Optimization Features

Token Efficiency

  • semantic summaries reduce prompt size compared to full logs

System Optimization

  • Milvus vector store for semantic vectors
  • RRF fusion (k=60) to combine retrieval paths

Inference Optimization

  • vLLM for efficient local inference
  • retrieve top-10 per path then fuse to limit candidates
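
The settings mentioned above (top-10 per path, RRF with k=60) combine as in the following sketch of standard Reciprocal Rank Fusion, where each item scores the sum of 1/(k + rank) over the lists it appears in. The function name, the alphabetical tie-break, and the sample IDs are illustrative assumptions rather than details from the paper.

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60, top_n: int = 10) -> list[str]:
    """Fuse ranked ID lists with Reciprocal Rank Fusion (ranks start at 1)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        # Only the top_n candidates of each path take part in the fusion.
        for rank, item_id in enumerate(ranking[:top_n], start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


# Made-up result lists for the three retrieval paths.
dense_hits = ["m3", "m1", "m7"]
bm25_hits = ["m1", "m9", "m3"]
image_hits = ["m1", "m4"]

fused = rrf_fuse([dense_hits, bm25_hits, image_hits])
print(fused)
```

Because RRF uses only ranks, the three paths need no score calibration against each other. An item such as `m1` that appears in all three lists rises to the top even though it was not first in the dense ranking.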

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation is on an enhanced, synthetic LoCoMo dataset with injected sessions; real-world distributions may differ.
  • LLM-as-a-judge binary scoring can be lenient on temporal phrasing and may mask fine-grained errors.
  • Memory operations add compute and storage overhead (vector indices, captioning, iterative retrieval).
  • Quality of image captioning and cross-modal embeddings affects recall for visual-centric queries.

When Not To Use

  • When strict privacy rules forbid storing or editing user logs in external memory.
  • For latency-sensitive, single-turn tasks where full long-term memory provides no benefit.
  • If deployment cannot support vector stores, image embedding pipelines, or iterative agent loops.

Failure Modes

  • Stale or contradictory semantic entries if update/delete logic misfires.
  • Missed retrievals for rare aliases if BM25 or embeddings fail to match.
  • Hallucinated captions or misaligned cross-modal embeddings causing wrong image-based recalls.
  • Overwriting correct long-term facts if automated updates are too aggressive.

Core Entities

Models

  • M2A
  • Yo'LLaVA
  • MC-LLaVA
  • A-MEM
  • Mem0
  • LoCoMo
  • Qwen3-VL-32B
  • Qwen3-VL-8B
  • GPT-4o
  • GPT-4o-mini
  • GLM4.6V-Flash

Metrics

  • Accuracy

Datasets

  • LoCoMo (enhanced)
  • Yo'LLaVA sessions
  • MC-LLaVA sessions

Benchmarks

  • enhanced LoCoMo