LOCOMO: a benchmark of very long, multimodal conversations to test LLM memory

February 27, 2024 (8 min read)

Overview

Decision Snapshot: Needs Validation

The benchmark is a useful diagnostic for long-term memory; models show clear weaknesses but fixes require retrieval, structure, and human checks rather than a single model swap.

Citations: 2

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

License: CC BY-NC 4.0

At A Glance

Cost impact: 35%

Production readiness: 30%

Novelty: 70%

Authors

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, Yuwei Fang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Memory across many sessions matters for user retention and personalization; current LLMs make many factual and temporal errors, so products should combine retrieval of compact facts with human oversight for critical flows.

Summary TLDR

The paper introduces LOCOMO, a dataset and benchmark of 50 very long multimodal conversations (≈300 turns, ≈9K tokens, up to 35 sessions) generated by LLM-based agents and cleaned by humans. It evaluates models on three tasks (question answering, event summarization, and multimodal dialogue generation) to test long-term memory, temporal/causal understanding, and multi-session consistency. Long-context LLMs and RAG help but still fall far below humans: models hallucinate, misattribute speakers, and struggle with temporal reasoning. The dataset and code are planned for public release.

Problem Statement

Existing benchmarks test short multi-session dialogs (~1K tokens, ~5 sessions). We lack a standardized way to measure whether LLMs remember and reason across many sessions and multimodal signals. LOCOMO aims to fill that gap with very long, multi-session, multimodal conversations and tailored tasks to probe long-term memory, temporal/causal reasoning, and multimodal consistency.

Main Contribution

LOCOMO: a dataset of 50 very long multimodal conversations (avg. 300 turns, 9K tokens, up to 35 sessions).

A human-machine pipeline: LLM-based generative agents (reflect-and-respond plus image sharing), combined with temporal event graphs and human verification/editing.

Key Findings

Humans far outperform models on long-term QA.

Numbers: Human overall F1 87.9 vs. best model 37.8 (gpt-3.5-turbo-16k)

Practical Use: Do not expect current LLMs to match humans on very long multi-session memory tasks; use human oversight for critical memory-sensitive flows.

Evidence Ref: Table 2

Long-context LLMs and RAG improve QA but still lag substantially.

Numbers: RAG/long-context gains of ~22-66% on some QA slices; still ~56% below human level

Practical Use: Use long-context models or RAG to boost recall, but validate with task-specific checks; improvements are helpful but incomplete.

Evidence Ref: Abstract, Table 2, Table 3

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| QA overall F1 (best model) | 37.8 | Human 87.9 | -50.1 | LOCOMO QA (all categories) | Table 2: gpt-3.5-turbo-16k overall F1 37.8; Human 87.9 | Table 2 |
| Observation-based RAG overall F1 | 41.4 | No retrieval 22.4 | +19.0 | LOCOMO QA (Observation top-5) | Table 3: Observation top-5 overall F1 41.4 vs none 22.4 | Table 3 |
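
The deltas in the results rows are plain differences in F1 points. As a quick check, the sketch below recomputes them from the reported values and also expresses the model-to-human gap as a fraction of the human score. The helper names are illustrative only, and the relative gap computed here is an approximation from these two numbers, not necessarily the basis the paper used for its own percentage.

```python
# Values copied from the results rows (LOCOMO QA, overall F1).
human_f1 = 87.9
best_model_f1 = 37.8     # gpt-3.5-turbo-16k, Table 2
rag_f1 = 41.4            # observation-based RAG, top-5, Table 3
no_retrieval_f1 = 22.4   # no retrieval, Table 3


def absolute_delta(value, baseline):
    """Difference in F1 points, rounded to one decimal place."""
    return round(value - baseline, 1)


def relative_gap(value, baseline):
    """Shortfall of `value` as a fraction of `baseline`."""
    return (baseline - value) / baseline


model_vs_human = absolute_delta(best_model_f1, human_f1)
rag_vs_none = absolute_delta(rag_f1, no_retrieval_f1)
gap_fraction = relative_gap(best_model_f1, human_f1)

print(model_vs_human)   # F1 points behind the human score
print(rag_vs_none)      # F1 points gained from observation retrieval
print(f"{gap_fraction:.0%} of the human score is missing")
```

Read together, the two rows say that retrieval over observations recovers about 19 F1 points, while the remaining distance to human performance is still more than twice that.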

What To Try In 7 Days

Index user facts as short 'observations' and test RAG retrieval of top-5 observations.

Run event-based unit tests on your chatbot: date/sequence questions and speaker attribution checks.

Swap long raw transcripts for compact session summaries or observations before feeding the reader model.
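
One way to prototype the first and third suggestions is a small in-memory index of observations with top-k retrieval. The sketch below is a minimal illustration, not the paper's pipeline: it uses bag-of-words cosine similarity instead of a trained retriever, and the `ObservationIndex` class, its methods, and the sample observations are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ObservationIndex:
    """Hypothetical store of short, speaker-tagged facts ("observations")."""

    def __init__(self):
        self.items = []  # list of (observation dict, token-count vector)

    def add(self, speaker, session, text):
        obs = {"speaker": speaker, "session": session, "text": text}
        self.items.append((obs, Counter(tokenize(text))))

    def top_k(self, question, k=5):
        query = Counter(tokenize(question))
        ranked = sorted(self.items, key=lambda it: cosine(query, it[1]), reverse=True)
        return [obs for obs, _ in ranked[:k]]


# Made-up observations for illustration only.
index = ObservationIndex()
index.add("Alice", 1, "Alice adopted a dog named Pepper.")
index.add("Bob", 2, "Bob started a new job at a bakery.")
index.add("Alice", 7, "Alice moved to Lisbon in March.")
index.add("Bob", 9, "Bob ran his first marathon.")

hits = index.top_k("When did Alice move to Lisbon?", k=2)

# Compact context for the reader model, in place of the raw transcript.
context = "\n".join(f"[session {o['session']}] {o['text']}" for o in hits)
print(context)
```

In a real test, the bag-of-words scorer would be swapped for whatever retriever the product already uses, and `k=5` would match the top-5 setting reported for the observation-based RAG result.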

Agent Features

Memory
short-term session summaries, long-term observations database
Planning
temporal event graph (simple causal timeline)
Tool Use
web image search (icrawler), captioning (BLIP-2)
Frameworks
Park et al. (generative agent) style reflect-and-respond
Is Agentic

Yes

Architectures
generative agent (reflect & respond)
Collaboration
two-agent conversational setup with image sharing

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: CC BY-NC 4.0

Risks & Boundaries

Limitations

Dataset is LLM-generated then human-edited; may not capture all real-world conversational nuance.

Images are web-searched and lack personal visual continuity (no real photo album behavior).

When Not To Use

When you need real personal photo sequences or real-world longitudinal visual data.

When legal or privacy constraints require real human conversational consent and provenance.

Failure Modes

Hallucination: models invent facts or mix events.

Wrong speaker attribution: events assigned to incorrect person.
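
Both failure modes can be caught with small checks against a ground-truth event timeline. The sketch below assumes a hand-written timeline and model answers already parsed into structured predictions. The `Event` type, the `TIMELINE` data, and the prediction format are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    speaker: str
    when: date
    description: str


# Hypothetical ground-truth timeline for one conversation.
TIMELINE = [
    Event("Alice", date(2023, 3, 4), "adopted a dog"),
    Event("Bob", date(2023, 5, 20), "ran a marathon"),
    Event("Alice", date(2023, 8, 1), "moved to Lisbon"),
]


def find_event(description):
    """Look up a timeline event by its exact description."""
    return next(e for e in TIMELINE if e.description == description)


def check_speaker(description, predicted_speaker):
    """True if the model attributed the event to the right person."""
    expected = find_event(description).speaker
    return expected.lower() == predicted_speaker.strip().lower()


def check_order(first_description, second_description):
    """True if the model's claimed order (first before second) matches the timeline."""
    return find_event(first_description).when < find_event(second_description).when


def run_checks(predictions):
    """Return the predictions that fail their check."""
    failures = []
    for p in predictions:
        if p["kind"] == "speaker":
            ok = check_speaker(p["event"], p["answer"])
        else:  # "order": answer is a (first, second) pair of descriptions
            ok = check_order(*p["answer"])
        if not ok:
            failures.append(p)
    return failures


# Made-up model outputs: one wrong attribution, one correct ordering.
predictions = [
    {"kind": "speaker", "event": "ran a marathon", "answer": "Alice"},
    {"kind": "order", "answer": ("adopted a dog", "moved to Lisbon")},
]
failures = run_checks(predictions)
print(failures)
```

Checks like these only cover events that exist in the timeline. Detecting invented events would additionally require matching free-text answers against the timeline, which this sketch does not attempt.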

Core Entities

Models

gpt-3.5-turbo, gpt-4-turbo, gpt-3.5-turbo-16k, Llama-2-Chat-70B, Mistral-Instruct-7B, MiniGPT-5, BLIP-2, DRAGON

Metrics

F1 (answer prediction), Recall@k, FactScore (precision/recall/F1), ROUGE, BLEU, MM-Relevance
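
For readers unfamiliar with the first two metrics, the sketch below shows one common way to compute them: token-overlap F1 between a predicted and a reference answer, and Recall@k over retrieved item IDs. The normalization (lowercasing, stripping punctuation and English articles) is an assumption borrowed from common QA evaluation practice. The paper's exact scoring script may differ.

```python
import re
from collections import Counter

ARTICLES = {"a", "an", "the"}


def normalize(text):
    """Lowercase, replace punctuation with spaces, drop English articles."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [t for t in cleaned.split() if t not in ARTICLES]


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two answers."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k retrieved list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


exact = token_f1("The Lisbon", "Lisbon")
verbose = token_f1("moved to Lisbon in March", "Lisbon")
r_at_2 = recall_at_k(["o3", "o1", "o9"], ["o1", "o4"], k=2)
print(exact, round(verbose, 3), r_at_2)
```

The verbose answer contains the right token but scores well below the exact one, which is why long, hedged model outputs tend to be penalized under answer-level F1.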

Datasets

LOCOMO, MSC, MMDialog, Conversation Chronicles, Daily Dialog

Benchmarks

LOCOMO benchmark (QA, Event Summarization, Multimodal Dialog)