LIFESTATE-BENCH: fact-based episodic tests that measure whether LLMs form and keep story-like memory

March 30, 2025 · 6 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.35

Citation Count

1

Authors

Siqi Fan, Xiusheng Huang, Yiqun Yao, Xuezhi Fang, Kang Liu, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang

Links

Abstract / PDF

Why It Matters For Business

If your product runs chat agents over long sessions, supplying history as external context or retrieval keeps factual state better than one-off parameter edits. Without it, agents lose facts and relationship context as conversations grow.

Summary TLDR

This paper introduces LIFESTATE-BENCH, a benchmark of episodic, multi-agent narratives (Hamlet + synthetic scripts) that measures whether LLMs form and retain a changing internal "state" across episodes. It tests three state dimensions: self-awareness, factual episode memory, and relationship shifts. Experiments across Llama3.1-8B, GPT-4-turbo, and DeepSeek-R1 show that non-parametric memory (direct or summary concatenation) outperforms parameter edits (knowledge editing, LoRA). Best overall accuracies ranged from ~67% (DeepSeek-R1, Hamlet, direct concat) to ~76% (GPT-4-turbo, Synthetic, direct concat). All models degrade over episodes and struggle most with relationship-shift questions, indicating a)

Problem Statement

Existing benchmarks focus on static or open-ended dialogue and miss whether an LLM can develop and keep a changing internal state during long, multi-agent interactions. LIFESTATE-BENCH fills that gap by supplying episodic timelines and fact-checked questions to evaluate self-awareness, long-term factual memory, and relationship changes over multiple episodes.

Main Contribution

LIFESTATE-BENCH: an episodic benchmark (Hamlet + synthetic scripts) that forces cumulative experience and fact-checked evaluation.

A three-dimension test suite: self-awareness, factual episode memory retrieval, and relationship-shift questions with ground-truth answers.

A controlled comparison of memory methods: non-parametric (direct/summary concatenation) vs parametric (knowledge editing, LoRA).
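
The two non-parametric methods named above can be pictured as two ways of building a prompt. The following is a minimal sketch, not the paper's implementation: the function names, the bracketed prompt layout, the toy episodes, and the `first_sentence` summarizer (a stand-in for the GPT-driven summaries the paper uses) are all assumptions made for illustration.

```python
from typing import Callable, List


def direct_concat(episodes: List[str], question: str) -> str:
    """Direct concatenation: paste every prior episode verbatim before the question."""
    history = "\n\n".join(
        f"[Episode {i}]\n{text}" for i, text in enumerate(episodes, start=1)
    )
    return f"{history}\n\n[Question]\n{question}"


def summary_concat(
    episodes: List[str], question: str, summarize: Callable[[str], str]
) -> str:
    """Summary concatenation: paste a compressed version of each prior episode."""
    history = "\n".join(
        f"[Episode {i} summary] {summarize(text)}"
        for i, text in enumerate(episodes, start=1)
    )
    return f"{history}\n\n[Question]\n{question}"


def first_sentence(text: str) -> str:
    """Placeholder summarizer (assumption): keep only the first sentence."""
    return text.split(".")[0].strip() + "."


# Toy episodes invented for this sketch; they are not benchmark data.
episodes = [
    "Ana lends Ben her notebook. Ben promises to return it on Friday.",
    "Ben loses the notebook. Ana stops speaking to Ben.",
]
question = "How does Ana feel about Ben now?"

full_prompt = direct_concat(episodes, question)
short_prompt = summary_concat(episodes, question, first_sentence)
```

In this toy case the summarized prompt is shorter but no longer contains the sentence about Ana and Ben falling out, which is the kind of trade-off between token cost and retained state that the two methods are meant to expose.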

Key Findings

Non-parametric context methods beat parametric tuning for episodic memory tasks

Numbers: DeepSeek-R1 Hamlet direct concat 67.3% vs Llama3.1 LoRA ~25% (on same dataset)

Models forget as episodes accumulate (catastrophic forgetting)

Numbers: Performance on Hamlet drops across episodes; parametric edits decline fastest (plots in Figure 3)

Tracking relationship shifts is the hardest subtask

Numbers: Relation-shift scores were lowest across methods (e.g., Llama3.1 relation shift ~19–45% vs self-awareness 67–86%)
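
To reproduce this kind of breakdown on your own logs, graded answers can be grouped by state dimension and by episode. The sketch below assumes a simple record layout of (episode, dimension, correct) that is not taken from the paper, and the records themselves are made-up values for illustration, not benchmark results.

```python
from collections import defaultdict

# (episode index, state dimension, answered correctly?)
# Made-up records for illustration only; these are not results from the paper.
records = [
    (1, "self_awareness", True),
    (1, "factual_memory", True),
    (1, "relationship_shift", True),
    (2, "self_awareness", True),
    (2, "factual_memory", True),
    (2, "relationship_shift", False),
    (3, "self_awareness", True),
    (3, "factual_memory", False),
    (3, "relationship_shift", False),
]


def accuracy_by(rows, key_index):
    """Group rows by the field at key_index and return the fraction correct per group."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        key = row[key_index]
        totals[key] += 1
        hits[key] += int(row[2])
    return {key: hits[key] / totals[key] for key in totals}


by_dimension = accuracy_by(records, 1)  # one score per state dimension
by_episode = accuracy_by(records, 0)    # one score per episode, to spot decay

for name, score in by_dimension.items():
    print(f"{name}: {score:.2f}")
for episode, score in sorted(by_episode.items()):
    print(f"episode {episode}: {score:.2f}")
```

Reading the per-episode scores in order shows whether accuracy decays as history grows, and the per-dimension scores show which question type is weakest.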

Results

Accuracy

Value: 67.3%

Accuracy

Value: 75.6%

Self-awareness (Hamlet), DeepSeek-R1, Direct Concatenation

Value: 86.4%

LoRA

Value: Direct concat ~58.0% vs LoRA ~25.6%

Who Should Care

What To Try In 7 Days

Run a short episodic test: feed 3–6 prior interactions via direct concatenation and compare answers vs no history.

Measure relationship-tracking by asking targeted relation-change questions after simulated episodes.

Replace weight-edit attempts with a summarized context layer (GPT-driven summaries) and compare accuracy and latency.
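
The first experiment above can be prototyped before any real model is wired in. This is a minimal sketch under stated assumptions: `stub_model` is a hypothetical stand-in for your LLM call, the "Q:/A:" prompt layout and substring-match scoring are illustrative choices, and the episodes are invented.

```python
from typing import Callable, List, Tuple


def stub_model(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption): it only 'knows' what the prompt contains."""
    if "moved to Lisbon" in prompt:
        return "Lisbon"
    return "unknown"


def build_prompt(history: List[str], question: str) -> str:
    """Direct concatenation: prior interactions first, then the question."""
    return "\n".join(history + [f"Q: {question}", "A:"])


def accuracy(
    model: Callable[[str], str],
    history: List[str],
    qa_pairs: List[Tuple[str, str]],
) -> float:
    """Fraction of questions whose expected answer appears in the model output."""
    correct = 0
    for question, expected in qa_pairs:
        answer = model(build_prompt(history, question))
        correct += int(expected.lower() in answer.lower())
    return correct / len(qa_pairs)


# Invented episodes and question; replace with your own interaction logs.
history = [
    "Episode 1: Mara tells Joss she is looking for a new job.",
    "Episode 2: Mara says she moved to Lisbon for the job.",
    "Episode 3: Joss offers to visit Mara next month.",
]
qa_pairs = [("Where does Mara live now?", "Lisbon")]

with_history = accuracy(stub_model, history, qa_pairs)
no_history = accuracy(stub_model, [], qa_pairs)
print(f"with history: {with_history:.2f}, no history: {no_history:.2f}")
```

Swapping `stub_model` for a function that calls your deployed model keeps the rest of the harness unchanged, so the same question set can be reused when comparing against a summarized-context variant.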

Agent Features

Memory

  • short-term (context window)
  • long-term (episodic summaries)
  • LoRA
  • non-parametric memory (direct concatenation)

Tool Use

  • external summarization (GPT)

Frameworks

  • LIFESTATE-BENCH

Is Agentic

true

Architectures

  • instruct/chat LLM
  • MoE

Collaboration

  • multi-agent interactions

Optimization Features

Token Efficiency

  • summary concatenation (context compression)

Training Optimization

  • LoRA

Reproducibility

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Overall dataset size is limited, which may reduce diversity of scenarios.
  • Hamlet samples risk data contamination despite name replacement; pretraining leakage is likely.
  • Question-answer annotations were author-generated, introducing potential bias.

When Not To Use

  • For large-scale production tuning where diverse domain data is required (dataset is small).
  • To claim general lifelong learning across domains without further validation.

Failure Modes

  • Catastrophic forgetting when using parametric edits across episodes.
  • Poor accuracy on relationship-shift questions.
  • Possible overestimation of capability on canonical stories due to pretraining leakage.

Core Entities

Models

  • Llama3.1-8B
  • GPT-4-turbo
  • DeepSeek-R1

Metrics

  • Accuracy
  • Std (per-question)
  • LLM-as-judge score (1-100)
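
How the three metrics relate is not spelled out in this summary, so the following sketch rests on assumptions: each question receives one judge score on the 1-100 scale, the headline accuracy is read as the mean of those scores as a percentage, and "Std (per-question)" is read as the population standard deviation across questions. The scores are made up for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, Tuple


def aggregate(judge_scores: Dict[str, int]) -> Tuple[float, float]:
    """Return (mean score as a percentage, population std across questions)."""
    values = list(judge_scores.values())
    if any(not 1 <= v <= 100 for v in values):
        raise ValueError("judge scores are expected on a 1-100 scale")
    return mean(values), pstdev(values)


# Made-up judge scores keyed by question id; not taken from the paper.
judge_scores = {"q1": 90, "q2": 40, "q3": 75, "q4": 55}

accuracy_pct, std_across_questions = aggregate(judge_scores)
print(f"accuracy: {accuracy_pct:.1f}%  std: {std_across_questions:.1f}")
```

A high standard deviation alongside a moderate mean suggests the model does well on some questions and fails others outright, which a single accuracy figure would hide.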

Datasets

  • LIFESTATE-BENCH-Hamlet
  • LIFESTATE-BENCH-Synth

Benchmarks

  • LIFESTATE-BENCH

Context Entities

Models

  • Meta Llama 3.1

Metrics

  • PPL
  • ROUGE
  • F1

Datasets

  • Persona-Chat
  • RoleLLM
  • Character-LLM
  • SocialBench

Benchmarks

  • LongBench
  • L-Eval
  • MT-Bench