LIFESTATE-BENCH: fact-based episodic tests that measure whether LLMs form and keep story-like memory

March 30, 2025 · 6 min

Overview

Decision Snapshot: Needs Validation

The benchmark is a practical diagnostic: evidence shows non-parametric context helps, but limited sample size and potential Hamlet contamination lower generality; treat results as a directional guide rather than definitive.

Citations: 1

Evidence Strength: 0.65

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 1/4

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 35%

Production readiness: 40%

Novelty: 60%

Authors

Siqi Fan, Xiusheng Huang, Yiqun Yao, Xuezhi Fang, Kang Liu, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang

Why It Matters For Business

If your product uses chat agents over long sessions, external context or retrieval beats one-off parameter edits for keeping factual state; otherwise agents will lose facts and relationship context as conversations grow.

Summary TLDR

This paper introduces LIFESTATE-BENCH, a benchmark of episodic, multi-agent narratives (Hamlet + synthetic scripts) that measures whether LLMs form and retain a changing internal "state" across episodes. It tests three state dimensions: self-awareness, factual episode memory, and relationship shifts. Experiments across Llama3.1-8B, GPT-4-turbo, and DeepSeek-R1 show that non-parametric memory (direct or summary concatenation) outperforms parameter edits (knowledge editing, LoRA). The best overall accuracies ranged from about 67% (DeepSeek-R1, Hamlet, direct concat) to about 76% (GPT-4-turbo, Synthetic, direct concat). All models degrade over episodes and struggle most with relationship-shift questions.

Problem Statement

Existing benchmarks focus on static or open-ended dialogue and miss whether an LLM can develop and keep a changing internal state during long, multi-agent interactions. LIFESTATE-BENCH fills that gap by supplying episodic timelines and fact-checked questions to evaluate self-awareness, long-term factual memory, and relationship changes over multiple episodes.

Main Contribution

LIFESTATE-BENCH: an episodic benchmark (Hamlet + synthetic scripts) that forces cumulative experience and fact-checked evaluation.

A three-dimension test suite: self-awareness, factual episode memory retrieval, and relationship-shift questions with ground-truth answers.
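
One way to picture the benchmark's structure is as a list of episodes, each carrying questions tagged with one of the three dimensions and a ground-truth answer. The sketch below is only an illustration: the class names, field names, and the `history_up_to` helper are assumptions made for this example, not the paper's data schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Dimension(Enum):
    """The three state dimensions named in the summary."""
    SELF_AWARENESS = "self-awareness"
    FACTUAL_MEMORY = "factual episode memory"
    RELATIONSHIP_SHIFT = "relationship shift"


@dataclass
class Question:
    episode: int            # episode after which the question is asked
    dimension: Dimension    # which state dimension it probes
    prompt: str
    ground_truth: str       # fact-checked reference answer


@dataclass
class Episode:
    index: int
    script: str             # dialogue / narration for this episode
    questions: list[Question] = field(default_factory=list)


def history_up_to(episodes: list[Episode], k: int) -> list[str]:
    """Return the scripts of all episodes with index <= k, in episode order.

    This models "cumulative experience": a question asked after episode k
    may depend on anything that happened in episodes 1..k.
    """
    ordered = sorted(episodes, key=lambda e: e.index)
    return [e.script for e in ordered if e.index <= k]
```

Keeping the episode index on every question makes it possible to slice accuracy by episode or by dimension later.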

Key Findings

Non-parametric context methods beat parametric tuning for episodic memory tasks

Numbers: DeepSeek-R1 Hamlet direct concat 67.3% vs Llama3.1 LoRA ~25% (on same dataset)

Practical Use: Prefer feeding history (direct or summarized) at inference time before trying weight edits for episodic memory; it gives substantially higher factual accuracy on evaluated stories.

Evidence Ref: Section 5.2, Table 3
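
The two non-parametric strategies named above, direct concatenation and summary concatenation, can be sketched as plain prompt builders. This is a minimal sketch under stated assumptions: the bracketed prompt layout, the function names, and the `first_sentence` toy summarizer are all hypothetical; the paper's actual prompt templates and its GPT-based summarizer are not reproduced here.

```python
from typing import Callable, Sequence


def direct_concat(history: Sequence[str], question: str) -> str:
    """Prepend every prior episode verbatim to the question."""
    blocks = [f"[Episode {i}]\n{text}" for i, text in enumerate(history, start=1)]
    return "\n\n".join([*blocks, f"[Question]\n{question}"])


def summary_concat(
    history: Sequence[str],
    question: str,
    summarize: Callable[[str], str],
) -> str:
    """Prepend a summary of every prior episode instead of the full text.

    `summarize` is injected so that a real LLM summarizer can be swapped in;
    it is an assumption of this sketch, not a specific library call.
    """
    blocks = [
        f"[Episode {i} summary]\n{summarize(text)}"
        for i, text in enumerate(history, start=1)
    ]
    return "\n\n".join([*blocks, f"[Question]\n{question}"])


def first_sentence(text: str) -> str:
    """Toy stand-in summarizer: keep only the first sentence."""
    return text.split(".")[0].strip() + "."
```

Direct concatenation keeps every detail but grows linearly with the number of episodes; summary concatenation trades some detail for a shorter prompt.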

Models forget as episodes accumulate (catastrophic forgetting)

Numbers: Performance on Hamlet drops across episodes; parametric edits decline fastest (plots in Figure 3)

Practical Use: Expect accuracy loss over long interactions; add explicit external memory or retrieval for long-running agents rather than relying on single-shot parameter edits.

Evidence Ref: Section 5.2 Episode-wise Performance, Figure 3
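
To check for this kind of degradation in your own logs, accuracy can be grouped by episode and compared against the first episode. The sketch below assumes each graded answer is stored as an `(episode_index, is_correct)` pair; that record format and both function names are assumptions for illustration, and no numbers from the paper are used.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_episode(records: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Compute the fraction of correct answers for each episode index."""
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for episode, correct in records:
        totals[episode] += 1
        hits[episode] += int(correct)
    return {ep: hits[ep] / totals[ep] for ep in sorted(totals)}


def change_from_first(acc: Dict[int, float]) -> Dict[int, float]:
    """Accuracy of each episode minus the accuracy of the earliest episode.

    Negative values mean the model answers worse than it did at the start.
    """
    episodes = sorted(acc)
    base = acc[episodes[0]]
    return {ep: acc[ep] - base for ep in episodes}
```

A steadily more negative `change_from_first` across episodes is the pattern the finding describes; a flat curve would suggest the memory method is holding up.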

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 67.3% | n/a | n/a | LIFESTATE-BENCH Hamlet | Table 3; Section 5.2 | Table 3
Accuracy | 75.6% | n/a | n/a | LIFESTATE-BENCH Synthetic | Table 3; Section 5.2 | Table 3

What To Try In 7 Days

Run a short episodic test: feed 3–6 prior interactions via direct concatenation and compare answers vs no history.

Measure relationship-tracking by asking targeted relation-change questions after simulated episodes.

Replace weight-edit attempts with a summarized context layer (GPT-driven summaries) and compare accuracy and latency.
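
The first two experiments above can be run with a very small harness: answer each question once with history and once without, then report accuracy per question dimension. This is a sketch under several assumptions: the item dictionary keys, the exact-match grader, and `toy_model` (a stand-in for a real LLM call) are all hypothetical, and exact match is a simplification compared with the accuracy and LLM-as-judge metrics the benchmark lists.

```python
from typing import Callable, Dict, List


def exact_match(prediction: str, truth: str) -> bool:
    """Case-insensitive exact match; a deliberately simple grader."""
    return prediction.strip().lower() == truth.strip().lower()


def evaluate(
    items: List[dict],
    answer_fn: Callable[[str], str],
    use_history: bool,
) -> Dict[str, float]:
    """Return accuracy per dimension.

    Each item is a dict with keys 'history' (list of str), 'question',
    'truth', and 'dimension'. `answer_fn` maps a prompt to an answer.
    """
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for item in items:
        context = "\n\n".join(item["history"]) if use_history else ""
        prompt = f"{context}\n\nQuestion: {item['question']}".strip()
        ok = exact_match(answer_fn(prompt), item["truth"])
        dim = item["dimension"]
        total[dim] = total.get(dim, 0) + 1
        correct[dim] = correct.get(dim, 0) + int(ok)
    return {dim: correct[dim] / total[dim] for dim in total}


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: it only 'knows' what the prompt says."""
    return "enemy" if "betrays" in prompt else "friend"


# Illustrative item, invented for this example (not taken from the benchmark).
items = [
    {
        "history": ["A meets B.", "A betrays B."],
        "question": "What is A to B now?",
        "truth": "enemy",
        "dimension": "relationship shift",
    }
]

with_history = evaluate(items, toy_model, use_history=True)
without_history = evaluate(items, toy_model, use_history=False)
```

To run this against a real model, replace `toy_model` with a function that sends the prompt to your LLM and returns its text answer; the third experiment only needs the history strings replaced by summaries before calling `evaluate`.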

Agent Features

Memory
short-term (context window), long-term (episodic summaries), LoRA, non-parametric memory (direct concatenation)
Tool Use
external summarization (GPT)
Frameworks
LIFESTATE-BENCH
Is Agentic

Yes

Architectures
instruct/chat LLM, MoE
Collaboration
multi-agent interactions

Optimization Features

Token Efficiency
summary concatenation (context compression)
Training Optimization
LoRA

Reproducibility

Code Available: No
Data Available: No
Open Source Status: No
License: Unknown

Risks & Boundaries

Limitations

Overall dataset size is limited, which may reduce diversity of scenarios.

Hamlet samples risk data contamination despite name replacement; pretraining leakage is likely.

When Not To Use

For large-scale production tuning where diverse domain data is required (dataset is small).

To claim general lifelong learning across domains without further validation.

Failure Modes

Catastrophic forgetting when using parametric edits across episodes.

Poor accuracy on relationship-shift questions.

Core Entities

Models

Llama3.1-8B, GPT-4-turbo, DeepSeek-R1

Metrics

Accuracy, Std (per-question), LLM-as-judge score (1-100)

Datasets

LIFESTATE-BENCH-Hamlet, LIFESTATE-BENCH-Synth

Benchmarks

LIFESTATE-BENCH

Context Entities

Models

Meta Llama 3.1

Metrics

PPL, ROUGE, F1

Datasets

Persona-Chat, RoleLLM, Character-LLM, SocialBench

Benchmarks

LongBench, L-Eval, MT-Bench