Overview
The benchmark is a useful diagnostic for long-term memory. Models show clear weaknesses, and fixing them requires retrieval, structure, and human checks rather than a single model swap.
Citations: 2
Evidence Strength: 0.85
Confidence: 0.85
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
License: CC BY-NC 4.0
At A Glance
Cost impact: 35%
Production readiness: 30%
Novelty: 70%
Why It Matters For Business
Memory across many sessions matters for user retention and personalization; current LLMs make many factual and temporal errors, so products should combine retrieval of compact facts with human oversight for critical flows.
Who Should Care
Summary TLDR
The paper introduces LOCOMO, a dataset and benchmark of 50 very-long multimodal conversations (≈300 turns, ≈9K tokens, up to 35 sessions) generated by LLM-based agents and cleaned by humans. It evaluates models on three tasks—question answering, event summarization, and multimodal dialogue generation—to test long-term memory, temporal/causal understanding, and multi-session consistency. Long-context LLMs and RAG help but still fall far below humans; models hallucinate, misattribute speakers, and struggle with temporal reasoning. The dataset and code are planned for public release.
Problem Statement
Existing benchmarks test short multi-session dialogs (≈1K tokens, ≈5 sessions). We lack a standardized way to measure whether LLMs remember and reason across many sessions and multimodal signals. LOCOMO aims to fill that gap with very long, multi-session, multimodal conversations and tailored tasks to probe long-term memory, temporal/causal reasoning, and multimodal consistency.
Main Contribution
LOCOMO: a dataset of 50 very-long multimodal conversations (avg. 300 turns, 9K tokens, up to 35 sessions).
A human-machine pipeline that combines LLM-based generative agents (reflect & respond, plus image sharing), temporal event graphs, and human verification/editing.
Key Findings
Humans far outperform models on long-term QA.
Long-context LLMs and RAG improve QA but still lag substantially.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| QA overall F1 (best model) | 37.8 | Human 87.9 | -50.1 | LOCOMO QA (all categories) | Table 2: gpt-3.5-turbo-16k overall F1 37.8; Human 87.9 | Table 2 |
| Observation-based RAG overall F1 | 41.4 | No retrieval 22.4 | +19.0 | LOCOMO QA (Observation top-5) | Table 3: Observation top-5 overall F1 41.4 vs none 22.4 | Table 3 |
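The F1 and Delta columns above can be reproduced with a few lines. This is a minimal sketch, assuming token-overlap F1 as commonly used for short-answer QA; the summary does not state the exact variant or answer normalization the paper applies, and the names `token_f1` and `delta` are illustrative rather than taken from the paper's code. The function returns a value in [0, 1], while the table presumably reports the same quantity scaled to 0-100.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and keep alphanumeric runs (an assumed, simple normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        # Two empty answers count as a match; one empty answer counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def delta(value: float, baseline: float) -> float:
    """Signed difference as shown in the Delta column, rounded to one decimal."""
    return round(value - baseline, 1)
```

For example, `delta(37.8, 87.9)` gives the human gap in the first row and `delta(41.4, 22.4)` gives the retrieval gain in the second.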
What To Try In 7 Days
Index user facts as short 'observations' and test RAG retrieval of top-5 observations.
Run event-based unit tests on your chatbot: date/sequence questions and speaker attribution checks.
Swap long raw transcripts for compact session summaries or observations before feeding the reader model.
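The first and third suggestions can be prototyped without any infrastructure. The sketch below stores short observations and returns the top-5 matches for a question. It uses bag-of-words cosine similarity as a stand-in for whatever retriever the paper uses, so it illustrates the workflow and not the reported setup. The `Observation` fields and the function names are hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    speaker: str  # who the fact is about (hypothetical field)
    session: int  # session in which the fact was stated (hypothetical field)
    text: str     # one short, self-contained fact


def _vector(text: str) -> Counter:
    """Bag-of-words term counts; a stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, observations: list[Observation], k: int = 5) -> list[Observation]:
    """Return up to k observations with non-zero similarity, best first."""
    q = _vector(query)
    scored = [(_cosine(q, _vector(o.text)), o) for o in observations]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [o for _, o in scored[:k]]


def build_context(query: str, observations: list[Observation], k: int = 5) -> str:
    """Format the retrieved observations as compact context for a reader model."""
    hits = retrieve(query, observations, k)
    return "\n".join(f"[session {o.session}] {o.speaker}: {o.text}" for o in hits)
```

The output of `build_context` replaces the raw transcript in the prompt. Swapping `_vector` and `_cosine` for a dense embedding model is the natural next step once the workflow is in place.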
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Collaboration
Risks & Boundaries
Limitations
Dataset is LLM-generated then human-edited; may not capture all real-world conversational nuance.
Images are web-searched and lack personal visual continuity (no real photo album behavior).
When Not To Use
When you need real personal photo sequences or real-world longitudinal visual data.
When legal or privacy constraints require real human conversational consent and provenance.
Failure Modes
Hallucination: models invent facts or mix events.
Wrong speaker attribution: events assigned to incorrect person.
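
Both failure modes can be turned into regression checks against a chatbot. The sketch below generates speaker-attribution and event-ordering questions from a small list of known events, then reports the questions the bot gets wrong. Everything here is an assumption for illustration: the `Event` schema, the question templates, and the substring match used for grading are not from the paper, and a production check would need a stricter grader.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable

Case = tuple[str, str]  # (question, expected substring in the answer)


@dataclass(frozen=True)
class Event:
    speaker: str      # who the event belongs to
    description: str  # verb phrase, e.g. "adopted a dog"
    when: date        # when the event happened


def attribution_cases(events: list[Event]) -> list[Case]:
    """One 'who did X' question per event; the expected answer is the speaker."""
    return [(f"Who {e.description}?", e.speaker) for e in events]


def ordering_cases(events: list[Event]) -> list[Case]:
    """One 'which came first' question per pair of chronologically adjacent events."""
    ordered = sorted(events, key=lambda e: e.when)
    cases = []
    for earlier, later in zip(ordered, ordered[1:]):
        if earlier.when == later.when:
            continue  # no defined order, so no question
        question = (
            f"Which happened first: '{earlier.description}' "
            f"or '{later.description}'?"
        )
        cases.append((question, earlier.description))
    return cases


def run_checks(answer: Callable[[str], str], cases: list[Case]) -> list[str]:
    """Return the questions whose answer does not contain the expected string."""
    return [q for q, expected in cases if expected.lower() not in answer(q).lower()]
```

Here `answer` is any callable that sends a question to the system under test and returns its reply, so the same cases can be rerun after each change to retrieval or prompting.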

