Overview
Production Readiness
0.3
Novelty Score
0.7
Cost Impact Score
0.35
Citation Count
2
Why It Matters For Business
Memory across many sessions matters for user retention and personalization; current LLMs make many factual and temporal errors, so products should combine retrieval of compact facts with human oversight for critical flows.
Summary TLDR
The paper introduces LOCOMO, a dataset and benchmark of 50 very-long multimodal conversations (≈300 turns, ≈9K tokens, up to 35 sessions) generated by LLM-based agents and cleaned by humans. It evaluates models on three tasks—question answering, event summarization, and multimodal dialogue generation—to test long-term memory, temporal/causal understanding, and multi-session consistency. Long-context LLMs and RAG help but still fall far below humans; models hallucinate, misattribute speakers, and struggle with temporal reasoning. The dataset and code are planned for public release.
Problem Statement
Existing benchmarks test short multi-session dialogs (≈1K tokens, ≈5 sessions). We lack a standardized way to measure whether LLMs remember and reason across many sessions and multimodal signals. LOCOMO aims to fill that gap with very long, multi-session, multimodal conversations and tailored tasks to probe long-term memory, temporal/causal reasoning, and multimodal consistency.
Main Contribution
LOCOMO: a dataset of 50 very long multimodal conversations (avg. 300 turns, 9K tokens, up to 35 sessions).
A human–machine pipeline: LLM-based generative agents (reflect-and-respond with image sharing), combined with temporal event graphs and followed by human verification and editing.
A benchmark and tasks targeting long-term memory: QA (single/multi-hop, temporal, open-domain, adversarial), event-graph summarization, and multimodal dialog generation.
Extensive baselines: base LLMs, long-context LLMs, and RAG variants, plus a study of error modes and practical limitations.
Key Findings
Humans far outperform models on long-term QA.
Long-context LLMs and RAG improve QA but still lag substantially.
Temporal reasoning is especially weak.
RAG works best when retrieving compact 'observations' rather than raw dialog logs or summaries.
Long-context models are prone to hallucination and adversarial failure.
Multimodal dialog models improve when trained with retrieved observations.
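The observation-based retrieval finding above can be sketched in a few lines. This is a minimal illustration, not the paper's retriever: a bag-of-words cosine score stands in for a dense retriever, and the observation records, field names, and the `top_k_observations` helper are all hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k_observations(query, observations, k=5):
    """Return up to k observations with a non-zero score, best first."""
    q = Counter(tokenize(query))
    scored = [
        (cosine(q, Counter(tokenize(obs["text"]))), idx, obs)
        for idx, obs in enumerate(observations)
    ]
    # Sort by descending score; break ties by original order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [obs for score, _, obs in scored[:k] if score > 0]

# Hypothetical observations: one short, speaker-tagged fact per record.
observations = [
    {"speaker": "Alice", "session": 3, "text": "Alice adopted a dog named Max."},
    {"speaker": "Bob", "session": 5, "text": "Bob started a pottery class."},
    {"speaker": "Alice", "session": 9, "text": "Alice moved to Denver for a new job."},
]

results = top_k_observations("What is the name of Alice's dog?", observations, k=5)
for obs in results:
    print(obs["session"], obs["text"])
```

Only the retrieved observations, rather than the full transcript, would then be passed to the reader model. A production system would likely swap the cosine scorer for an embedding model.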
Results
QA overall F1 (best model)
Observation-based RAG overall F1
Event summarization FactScore F1
Multimodal MM-Relevance
Who Should Care
What To Try In 7 Days
Index user facts as short 'observations' and test RAG retrieval of top-5 observations.
Run event-based unit tests on your chatbot: date/sequence questions and speaker attribution checks.
Swap long raw transcripts for compact session summaries or observations before feeding the reader model.
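The event-based unit tests suggested above could look roughly like the following. This is a sketch under stated assumptions: `MemoryCase`, `run_cases`, and `stub_chatbot` are hypothetical names, the stub stands in for a real chatbot call, and substring matching is a deliberately crude check.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryCase:
    """One memory probe: a question plus strings the answer must or must not contain."""
    question: str
    must_contain: list = field(default_factory=list)
    must_not_contain: list = field(default_factory=list)

def run_cases(chatbot, cases):
    """Run every case and return (question, missing, forbidden) for each failure."""
    failures = []
    for case in cases:
        answer = chatbot(case.question).lower()
        missing = [s for s in case.must_contain if s.lower() not in answer]
        forbidden = [s for s in case.must_not_contain if s.lower() in answer]
        if missing or forbidden:
            failures.append((case.question, missing, forbidden))
    return failures

def stub_chatbot(question):
    """Hypothetical stand-in for the system under test (one answer is deliberately wrong)."""
    canned = {
        "When did Alice adopt her dog?": "Alice adopted her dog in March 2023.",
        "Who started the pottery class?": "Alice started the pottery class.",
    }
    return canned.get(question, "I don't know.")

cases = [
    # Date/sequence check.
    MemoryCase("When did Alice adopt her dog?", must_contain=["March 2023"]),
    # Speaker attribution check: the event belongs to Bob, not Alice.
    MemoryCase("Who started the pottery class?", must_contain=["Bob"], must_not_contain=["Alice"]),
]

failures = run_cases(stub_chatbot, cases)
for question, missing, forbidden in failures:
    print(f"FAIL: {question} missing={missing} forbidden={forbidden}")
```

Replacing `stub_chatbot` with a call into the real system turns this into a regression suite that flags temporal and speaker-attribution errors separately.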
Agent Features
Memory
- short-term session summaries
- long-term observations database
Planning
- temporal event graph (simple causal timeline)
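One way to picture a temporal event graph of this kind is as dated events with optional causal links. The sketch below is an assumption about a plausible representation, not the paper's data format; the `Event` fields and both helper functions are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Event:
    """A dated life event for one speaker, with ids of the events that caused it."""
    event_id: str
    speaker: str
    when: date
    description: str
    caused_by: list = field(default_factory=list)

def timeline(events):
    """Return events in chronological order."""
    return sorted(events, key=lambda e: e.when)

def causal_violations(events):
    """Return (cause_id, effect_id) pairs where the cause is dated after its effect."""
    by_id = {e.event_id: e for e in events}
    return [
        (cause_id, e.event_id)
        for e in events
        for cause_id in e.caused_by
        if by_id[cause_id].when > e.when
    ]

# Hypothetical events for one speaker.
events = [
    Event("e2", "Alice", date(2023, 5, 10), "Moves to Denver", caused_by=["e1"]),
    Event("e1", "Alice", date(2023, 4, 2), "Accepts a new job"),
    Event("e3", "Alice", date(2023, 6, 1), "Adopts a dog", caused_by=["e2"]),
]

ordered_ids = [e.event_id for e in timeline(events)]
violations = causal_violations(events)
print(ordered_ids, violations)
```

Keeping the causal links explicit makes it possible to check generated conversations, or model summaries, against the intended order of events.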
Tool Use
- web image search (icrawler)
- captioning (BLIP-2)
Frameworks
- Generative-agent-style reflect-and-respond (Park et al.)
Is Agentic
true
Architectures
- generative agent (reflect & respond)
Collaboration
- two-agent conversational setup with image sharing
Reproducibility
License
- CC BY-NC 4.0
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Dataset is LLM-generated then human-edited; may not capture all real-world conversational nuance.
- Images are web-searched and lack personal visual continuity (no real photo album behavior).
- Experiments rely on closed-source commercial LLMs; open-source parity not shown.
- Automatic evaluation of long-form answers remains noisy despite design choices to extract exact answers.
When Not To Use
- When you need real personal photo sequences or real-world longitudinal visual data.
- When legal or privacy constraints require real human conversational consent and provenance.
- As the sole validation for safety-critical memory features in production (add human review and rule-based checks).
Failure Modes
- Hallucination: models invent facts or mix events.
- Wrong speaker attribution: events assigned to incorrect person.
- Missing temporal links: causal or time sequence omitted.
- Adversarial brittleness: with long contexts, models accept the false premises of adversarial questions as true.
- Image limitations: captions substitute for images and may miss visual specifics.
Core Entities
Models
- gpt-3.5-turbo
- gpt-4-turbo
- gpt-3.5-turbo-16k
- Llama-2-Chat-70B
- Mistral-Instruct-7B
- MiniGPT-5
- BLIP-2
- DRAGON
Metrics
- F1 (answer prediction)
- Recall@k
- FactScore (precision/recall/F1)
- ROUGE
- BLEU
- MM-Relevance
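For reference, the first two metrics can be computed as below. This is a minimal sketch assuming the common SQuAD-style token-overlap F1 and a set-based Recall@k; the paper's exact normalization and recall definition may differ, and the function names are hypothetical.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, and split into tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k retrieved items."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved_ids[:k])) / len(relevant)

f1 = token_f1("in March 2023", "March 2023")
r_at_2 = recall_at_k(["o3", "o1", "o7"], ["o1", "o2"], k=2)
print(round(f1, 3), r_at_2)
```

Token-level F1 gives partial credit for answers that overlap the reference, while Recall@k measures the retriever alone, independent of the reader model.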
Datasets
- LOCOMO
- MSC
- MMDialog
- Conversation Chronicles
- DailyDialog
Benchmarks
- LOCOMO benchmark (QA, Event Summarization, Multimodal Dialog)

