LOCOMO: a benchmark of very long, multimodal conversations to test LLM memory

Overview

Decision Snapshot: Needs Validation

The benchmark is a useful diagnostic for long-term memory; models show clear weaknesses but fixes require retrieval, structure, and human checks rather than a single model swap.

Citations: 2

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

License: CC BY-NC 4.0

At A Glance

Cost impact: 35%

Production readiness: 30%

Novelty: 70%

Authors

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, Yuwei Fang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Memory across many sessions matters for user retention and personalization; current LLMs make many factual and temporal errors, so products should combine retrieval of compact facts with human oversight for critical flows.

Who Should Care

Product Manager, ML Engineer, CTO, Founder

Summary TLDR

The paper introduces LOCOMO, a dataset and benchmark of 50 very long multimodal conversations (on average about 300 turns and 9K tokens, spanning up to 35 sessions) generated by LLM-based agents and then verified and edited by humans. It evaluates models on three tasks (question answering, event summarization, and multimodal dialogue generation) to test long-term memory, temporal and causal understanding, and multi-session consistency. Long-context LLMs and RAG help but still fall far short of human performance: models hallucinate, misattribute speakers, and struggle with temporal reasoning. The dataset and code are available from the project page.

Problem Statement

Existing benchmarks test short multi-session dialogs (~1K tokens, ~5 sessions). We lack a standardized way to measure whether LLMs remember and reason across many sessions and multimodal signals. LOCOMO aims to fill that gap with very long, multi-session, multimodal conversations and tailored tasks to probe long-term memory, temporal/causal reasoning, and multimodal consistency.

Main Contribution

LOCOMO: a dataset of 50 very-long multimodal conversations (avg. 300 turns, 9K tokens, up to 35 sessions).

A human–machine pipeline: LLM-based generative agents (reflect & respond + image sharing) + temporal event graphs + human verification/editing.
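
To make the temporal event graph in the pipeline more concrete, here is a minimal sketch in Python. It assumes each event carries an id, a date, a description, and the ids of earlier events that caused it. The `Event` class, its field names, the `timeline` helper, and the sample events are hypothetical illustrations, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Event:
    """One node in a hypothetical temporal event graph."""
    event_id: str
    when: date
    description: str
    caused_by: List[str] = field(default_factory=list)  # ids of earlier events


def timeline(events: List[Event]) -> List[Event]:
    """Check that every cause precedes its effect, then sort chronologically."""
    by_id = {e.event_id: e for e in events}
    for e in events:
        for cause_id in e.caused_by:
            cause = by_id[cause_id]  # raises KeyError if a link points nowhere
            if cause.when > e.when:
                raise ValueError(f"{cause_id} happens after its effect {e.event_id}")
    return sorted(events, key=lambda e: e.when)


# Illustrative data only, not taken from LOCOMO.
events = [
    Event("e2", date(2023, 3, 10), "Starts a new job", caused_by=["e1"]),
    Event("e1", date(2023, 1, 5), "Finishes a certification course"),
    Event("e3", date(2023, 6, 1), "Moves to a new city", caused_by=["e2"]),
]
ordered_ids = [e.event_id for e in timeline(events)]
print(ordered_ids)
```

A structure like this gives human reviewers something explicit to verify: each conversation session can be checked against the dated events it is supposed to mention.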

Key Findings

Humans far outperform models on long-term QA.

Numbers: Human overall F1 87.9 vs. best model ~37.8 (gpt-3.5-turbo-16k)

Practical Use: Do not expect current LLMs to match humans on very long multi-session memory tasks; use human oversight for critical memory-sensitive flows.

Evidence Ref: Table 2

Long-context LLMs and RAG improve QA but still lag substantially.

Numbers: RAG/long-context gains ~22–66% on some QA slices; still ~56% below human level

Practical Use: Use long-context models or RAG to boost recall, but validate with task-specific checks; improvements are helpful but incomplete.

Evidence Ref: Abstract, Table 2, Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
QA overall F1 (best model)	37.8	Human 87.9	-50.1	LOCOMO QA (all categories)	Table 2: gpt-3.5-turbo-16k overall F1 37.8; Human 87.9	Table 2
Observation-based RAG overall F1	41.4	No retrieval 22.4	+19.0	LOCOMO QA (Observation top-5)	Table 3: Observation top-5 overall F1 41.4 vs none 22.4	Table 3
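
The F1 values in the table are answer-level scores. As a rough illustration of how such a score can be computed, the sketch below implements token-overlap F1 with SQuAD-style normalization (lowercasing, stripping punctuation and articles). That normalization is an assumption made for this example; the paper's exact scoring script may differ, and the function names are hypothetical.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against one reference."""
    pred = normalize(prediction)
    ref = normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# A verbose but correct answer is penalized on precision.
score = token_f1("She adopted the dog in May 2023", "May 2023")
print(round(score, 2))  # 0.5
```

Because verbose answers lose precision, a gap in F1 can reflect answer style as well as memory errors, which is one reason to pair the metric with task-specific checks.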

What To Try In 7 Days

Index user facts as short 'observations' and test RAG retrieval of top-5 observations.

Run event-based unit tests on your chatbot: date/sequence questions and speaker attribution checks.

Swap long raw transcripts for compact session summaries or observations before feeding the reader model.
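
The first and third suggestions above can be prototyped in a few lines. The sketch below stores short observations and retrieves the top-k most similar ones for a question. It uses bag-of-words cosine similarity purely so the example runs with the standard library; a dense retriever would normally take its place. The function names and the sample observations are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Count lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k_observations(question: str, observations: List[str], k: int = 5) -> List[str]:
    """Rank observations by similarity to the question and keep the best k."""
    q = bag_of_words(question)
    ranked = sorted(observations, key=lambda obs: cosine(q, bag_of_words(obs)), reverse=True)
    return ranked[:k]


# Illustrative observations only, not taken from LOCOMO.
observations = [
    "Session 3 (2023-02-11): Maya adopted a dog named Biscuit.",
    "Session 5 (2023-03-02): Jon started training for a marathon.",
    "Session 9 (2023-05-20): Maya moved to Denver for a new job.",
]
question = "When did Maya adopt a dog?"
context = top_k_observations(question, observations, k=2)

# The reader model sees compact observations instead of the raw transcript.
prompt = "Observations:\n" + "\n".join(context) + "\n\nQuestion: " + question
print(prompt)
```

Keeping the speaker name and session date inside each observation is a deliberate choice in this sketch: it gives the reader model the facts it needs for date and attribution questions without the full transcript.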

Agent Features

Memory

short-term session summaries, long-term observations database

Planning

temporal event graph (simple causal timeline)

Tool Use

web image search (icrawler), captioning (BLIP-2)

Frameworks

Park et al. (generative agent) style reflect-and-respond

Is Agentic

Yes

Architectures

generative agent (reflect & respond)

Collaboration

two-agent conversational setup with image sharing

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: CC BY-NC 4.0

Code URLs

https://snap-research.github.io/locomo/

Data URLs

https://snap-research.github.io/locomo/

Risks & Boundaries

Limitations

Dataset is LLM-generated then human-edited; may not capture all real-world conversational nuance.

Images are web-searched and lack personal visual continuity (no real photo album behavior).

When Not To Use

When you need real personal photo sequences or real-world longitudinal visual data.

When legal or privacy constraints require real human conversational consent and provenance.

Failure Modes

Hallucination: models invent facts or mix events.

Wrong speaker attribution: events assigned to incorrect person.
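
The speaker-attribution failure mode lends itself to a simple regression check. The sketch below is a minimal example under stated assumptions: `fake_chatbot` is a hypothetical stand-in for a real model call, the test cases are invented, and the substring match is deliberately naive (it would need name normalization in practice).

```python
from typing import List


def attribution_ok(answer: str, expected: str, others: List[str]) -> bool:
    """True if the answer names the expected speaker and none of the others."""
    text = answer.lower()
    return expected.lower() in text and not any(o.lower() in text for o in others)


# Hypothetical test cases: (question, speaker who should be credited).
cases = [
    ("Who adopted a dog?", "Maya"),
    ("Who is training for a marathon?", "Jon"),
]
speakers = ["Maya", "Jon"]


def fake_chatbot(question: str) -> str:
    """Stand-in for a real model call; it gets the second case wrong on purpose."""
    canned = {
        "Who adopted a dog?": "Maya adopted a dog.",
        "Who is training for a marathon?": "Maya is training for a marathon.",
    }
    return canned[question]


# Collect every question whose answer credits the wrong person.
failures = [
    question
    for question, expected in cases
    if not attribution_ok(
        fake_chatbot(question), expected, [s for s in speakers if s != expected]
    )
]
print(failures)  # ['Who is training for a marathon?']
```

Replacing `fake_chatbot` with a call to the system under test turns this into a small suite that can run after every prompt or retrieval change.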

Core Entities

Models

gpt-3.5-turbo, gpt-4-turbo, gpt-3.5-turbo-16k, Llama-2-Chat-70B, Mistral-Instruct-7B, MiniGPT-5, BLIP-2, DRAGON

Metrics

F1 (answer prediction), Recall@k, FactScore (precision/recall/F1), ROUGE, BLEU, MM-Relevance

Datasets

LOCOMO, MSC, MMDialog, Conversation Chronicles, Daily Dialog

Benchmarks

LOCOMO benchmark (QA, Event Summarization, Multimodal Dialog)

