LOCOMO: a benchmark of very long, multimodal conversations to test LLM memory

February 27, 2024 · 8 min read

Overview

Production Readiness

0.3

Novelty Score

0.7

Cost Impact Score

0.35

Citation Count

2

Authors

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, Yuwei Fang

Links

Abstract / PDF

Why It Matters For Business

Memory across many sessions matters for user retention and personalization; current LLMs make many factual and temporal errors, so products should combine retrieval of compact facts with human oversight for critical flows.

Summary TLDR

The paper introduces LOCOMO, a dataset and benchmark of 50 very long multimodal conversations (on average ≈300 turns and ≈9K tokens, spread over up to 35 sessions), generated by LLM-based agents and cleaned by humans. It evaluates models on three tasks (question answering, event summarization, and multimodal dialogue generation) to test long-term memory, temporal/causal understanding, and multi-session consistency. Long-context LLMs and RAG help but still fall far below human performance: models hallucinate, misattribute speakers, and struggle with temporal reasoning. The dataset and code are planned for public release.

Problem Statement

Existing benchmarks test short multi-session dialogs (~1K tokens, ~5 sessions). We lack a standardized way to measure whether LLMs remember and reason across many sessions and multimodal signals. LOCOMO aims to fill that gap with very long, multi-session, multimodal conversations and tailored tasks to probe long-term memory, temporal/causal reasoning, and multimodal consistency.

Main Contribution

LOCOMO: a dataset of 50 very long multimodal conversations (avg. 300 turns, 9K tokens, up to 35 sessions).

A human–machine pipeline: LLM-based generative agents (reflect-and-respond with image sharing), grounded in temporal event graphs, followed by human verification and editing.

A benchmark and tasks targeting long-term memory: QA (single/multi-hop, temporal, open-domain, adversarial), event-graph summarization, and multimodal dialog generation.

Extensive baselines: base LLMs, long-context LLMs, and RAG variants, plus a study of error modes and practical limitations.

Key Findings

Humans far outperform models on long-term QA.

Numbers: Human overall F1 87.9 vs best model ~37.8 (gpt-3.5-turbo-16k)

Long-context LLMs and RAG improve QA but still lag substantially.

Numbers: RAG/long-context gains ~22–66% on some QA slices; still ~56% below human level

Temporal reasoning is especially weak.

Numbers: Model F1 on temporal QA is ~73% below human F1

RAG works best when retrieving compact 'observations' rather than raw dialog logs or summaries.

Numbers: Observation-based RAG (top-5) overall F1 ≈41.4; dialog-based retrieval gives lower or noisier gains
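The observation-based retrieval idea can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: it ranks observations with a bag-of-words cosine similarity as a stand-in for a dense retriever such as DRAGON, and the function names and sample observations (Alice, Bob) are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve_top_k(question, observations, k=5):
    """Return up to k observations with non-zero similarity, best first."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(o))), o) for o in observations]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [o for score, o in scored[:k] if score > 0]

# Hypothetical observations: one short, self-contained fact per entry.
observations = [
    "Alice adopted a dog named Max in March 2023.",
    "Bob started a pottery class in May 2023.",
    "Alice moved to Denver in June 2023.",
    "Bob ran a half marathon in October 2023.",
]
hits = retrieve_top_k("When did Alice adopt her dog?", observations, k=2)
print(hits)
```

The retrieved observations, rather than the raw dialog, would then be placed in the reader model's prompt. In practice an embedding-based retriever would likely replace the word-overlap scorer used here.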

Long-context models are prone to hallucination and adversarial failure.

Numbers: gpt-3.5-turbo-16k adversarial F1 dropped to 2.1% vs 70.2% for GPT-4 (4K)

Multimodal dialog models improve when trained with retrieved observations.

Numbers: MiniGPT-5 + observation (top-5) MM-Relevance 57.8 vs base 56.1; BLEU-1 rises to 59.7 vs 57.1

Results

QA overall F1 (best model)

Value: 37.8

Baseline: Human 87.9

Observation-based RAG overall F1

Value: 41.4

Baseline: No retrieval 22.4

Event summarization FactScore F1

Value: 45.9

Baseline: gpt-3.5-turbo-16k 39.9

Multimodal MM-Relevance

Value: 57.8

Baseline: Base MiniGPT-5 56.1

Who Should Care

What To Try In 7 Days

Index user facts as short 'observations' and test RAG retrieval of top-5 observations.

Run event-based unit tests on your chatbot: date/sequence questions and speaker attribution checks.

Swap long raw transcripts for compact session summaries or observations before feeding the reader model.
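The event-based unit tests suggested above could start as a small harness like the one below. It is only a sketch under stated assumptions: `MemoryCheck`, `run_checks`, and `stub_bot` are hypothetical names, the substring match is a deliberately crude correctness check, and `stub_bot` stands in for a real chatbot call.

```python
from dataclasses import dataclass

@dataclass
class MemoryCheck:
    kind: str      # "date", "sequence", or "speaker"
    question: str
    expected: str  # substring a correct answer must contain

def run_checks(ask, checks):
    """Return (check, answer) pairs where the answer lacks the expected text."""
    failures = []
    for check in checks:
        answer = ask(check.question)
        if check.expected.lower() not in answer.lower():
            failures.append((check, answer))
    return failures

def stub_bot(question):
    """Hypothetical stand-in for a chatbot; replace with a real call."""
    canned = {
        "When did Alice adopt Max?": "Alice adopted Max in March 2023.",
        "Which came first, Alice's move or the adoption?": "The adoption came first.",
        "Who ran the half marathon?": "Alice ran it.",  # wrong speaker on purpose
    }
    return canned.get(question, "I don't know.")

checks = [
    MemoryCheck("date", "When did Alice adopt Max?", "March 2023"),
    MemoryCheck("sequence", "Which came first, Alice's move or the adoption?", "adoption"),
    MemoryCheck("speaker", "Who ran the half marathon?", "Bob"),
]
failures = run_checks(stub_bot, checks)
for check, answer in failures:
    print(f"[{check.kind}] {check.question!r} -> {answer!r} (expected {check.expected!r})")
```

Grouping checks by `kind` makes it easy to see whether failures cluster around dates, ordering, or speaker attribution, which are the error types the paper reports.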

Agent Features

Memory

  • short-term session summaries
  • long-term observations database

Planning

  • temporal event graph (simple causal timeline)
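One way to picture a temporal event graph is as dated events with links to the earlier events that caused them. The sketch below is an assumed representation, not the paper's data format: the `Event` fields, the helper names, and the sample events are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Event:
    event_id: str
    speaker: str
    when: date
    description: str
    caused_by: list = field(default_factory=list)  # ids of earlier events

def timeline(events):
    """Return the events in chronological order."""
    return sorted(events, key=lambda e: e.when)

def causal_violations(events):
    """Return (cause_id, effect_id) pairs where the cause is not strictly earlier.

    Assumes every id listed in caused_by refers to an event in `events`.
    """
    by_id = {e.event_id: e for e in events}
    return [(c, e.event_id)
            for e in events
            for c in e.caused_by
            if by_id[c].when >= e.when]

# Hypothetical events for one speaker, listed out of order on purpose.
events = [
    Event("e2", "Alice", date(2023, 6, 10), "Moves to Denver", caused_by=["e1"]),
    Event("e1", "Alice", date(2023, 4, 2), "Accepts a job offer in Denver"),
    Event("e3", "Alice", date(2023, 7, 1), "Adopts a dog", caused_by=["e2"]),
]
ordered_ids = [e.event_id for e in timeline(events)]
violations = causal_violations(events)
print(ordered_ids, violations)
```

A structure like this also doubles as ground truth for date and sequence questions, since the expected order can be read directly off the timeline.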

Tool Use

  • web image search (icrawler)
  • captioning (BLIP-2)

Frameworks

  • Reflect-and-respond in the style of Park et al.'s generative agents

Is Agentic

true

Architectures

  • generative agent (reflect & respond)

Collaboration

  • two-agent conversational setup with image sharing

Reproducibility

License

  • CC BY-NC 4.0

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Dataset is LLM-generated then human-edited; may not capture all real-world conversational nuance.
  • Images are web-searched and lack personal visual continuity (no real photo album behavior).
  • Experiments rely on closed-source commercial LLMs; open-source parity not shown.
  • Automatic evaluation of long-form answers remains noisy despite design choices to extract exact answers.

When Not To Use

  • When you need real personal photo sequences or real-world longitudinal visual data.
  • When legal or privacy constraints require real human conversational consent and provenance.
  • As the sole validation for production safety-critical memory (use humans and rules).

Failure Modes

  • Hallucination: models invent facts or mix events.
  • Wrong speaker attribution: events assigned to incorrect person.
  • Missing temporal links: causal or time sequence omitted.
  • Adversarial brittleness: given long context, models accept the false premises of trap questions instead of flagging them.
  • Image limitations: captions substitute images and may miss visual specifics.

Core Entities

Models

  • gpt-3.5-turbo
  • gpt-4-turbo
  • gpt-3.5-turbo-16k
  • Llama-2-Chat-70B
  • Mistral-Instruct-7B
  • MiniGPT-5
  • BLIP-2
  • DRAGON

Metrics

  • F1 (answer prediction)
  • Recall@k
  • FactScore (precision/recall/F1)
  • ROUGE
  • BLEU
  • MM-Relevance
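For readers unfamiliar with the first two metrics, the sketch below shows a common way to compute them. The normalization follows the widely used SQuAD-style recipe (lowercase, strip punctuation and articles); whether LOCOMO's evaluation scripts normalize answers in exactly this way is an assumption, and the function names are illustrative.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant items that appear among the top-k retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)

f1 = token_f1("In March 2023", "March 2023")
r_at_2 = recall_at_k(["o3", "o1", "o7"], ["o1", "o2"], k=2)
print(round(f1, 2), r_at_2)
```

Token-level F1 gives partial credit for answers that overlap with the reference, which is why short, exact answers are easier to score reliably than long-form ones.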

Datasets

  • LOCOMO
  • MSC
  • MMDialog
  • Conversation Chronicles
  • Daily Dialog

Benchmarks

  • LOCOMO benchmark (QA, Event Summarization, Multimodal Dialog)