Giving LLM agents a memory 'norm' can sometimes make their emotions more human-like — but results are mixed

February 6, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The architecture is simple and implementable, but evidence is preliminary (428 short stories, single scoring LLM) and lacks human PANAS validation, so treat results as exploratory.

Citations: 1

Evidence Strength: 0.50

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 60%

Authors

Ciaran Regan, Nanami Iwahashi, Shogo Tanaka, Mizuki Oka

Why It Matters For Business

Adding an explicit memory-summary step can make agent responses more context-aware and slightly more willing to register negative emotions, which matters for chatbots, virtual characters, and user-state tracking but needs human validation before deployment.

Summary TLDR

The authors add a simple episodic-memory step to generative agents: for each new text experience, the agent summarizes relevant past memories into a 'norm', compares the new input to that norm, and then has its affect scored with PANAS (administered via GPT-3.5-Turbo). They convert 428 EmotionBench scenarios into five-scene stories and run agents with and without the norm. Context sometimes improves emotional alignment (notably by increasing negative affect), but the effects are small and inconsistent. The results are limited by the use of GPT-3.5-Turbo for scoring, the short five-scene stories, and the lack of a human PANAS comparison.

Problem Statement

LLMs can guess emotions but lack episode-based context that humans use to appraise events. The paper asks whether adding a memory-derived "norm" (a summary of past episodic memories) helps generative agents form emotions that match human expectations.

Main Contribution

A simple agent architecture that builds a per-event "norm" from past episodic memories, compares the new event to that norm, and uses the comparison when scoring affect.

A dataset of 428 five-scene stories derived from EmotionBench to test how emotions evolve as agents perceive sequential experiences.
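
The architecture described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the function names (`build_norm`, `contextualize`, `perceive`), and the prompt wording are all assumptions, and retrieval of "relevant" memories is left to the caller, so every memory passed in is summarized.

```python
from typing import Callable, List

# Hypothetical type: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def build_norm(llm: LLM, memories: List[str]) -> str:
    """Summarize past episodic memories into a short 'norm'."""
    if not memories:
        # Nothing to summarize yet, so skip the LLM call.
        return "No relevant past experiences."
    joined = "\n".join(f"- {m}" for m in memories)
    return llm(f"Summarize what these past experiences have in common:\n{joined}")


def contextualize(llm: LLM, norm: str, event: str) -> str:
    """Ask the LLM to compare the new event against the norm."""
    return llm(
        "Norm from past experiences:\n"
        f"{norm}\n\nNew experience:\n{event}\n\n"
        "Describe how the new experience differs from the norm."
    )


def perceive(llm: LLM, memories: List[str], event: str) -> str:
    """Run one perception step, then store the event as a new memory."""
    norm = build_norm(llm, memories)
    context = contextualize(llm, norm, event)
    memories.append(event)
    return context
```

The returned context string is what would be passed to the affect-scoring step. Injecting the model as a plain callable keeps the sketch independent of any particular API client and makes it easy to substitute a stub during testing.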

Key Findings

Adding the norm increased average negative affect across EmotionBench compared to no-norm agents.

Numbers: +1.6 (overall negative affect increase with norm)

Practical Use: If you want agents to register slightly more negative emotion in negative scenarios, provide a memory-based context; expect a small effect size.

Evidence Ref: Table 6 (Overall row)

Positive affect dropped substantially when agents saw negative scenarios.

Numbers: −18.0 (overall positive affect decrease with norm)

Practical Use: Context tends to reduce positive affect more than it raises negative affect; measure both sides when evaluating emotional alignment.

Evidence Ref: Table 6 (Overall row)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Overall positive affect change (with norm) | −18.0 | Default PANAS baseline 42.3 ±1.9 | −18.0 vs default | All EmotionBench-derived stories (n=1750 PANAS runs) | Table 6 overall | Table 6
Overall negative affect change (with norm) | +1.6 | Default PANAS baseline 22.9 ±2.5 | +1.6 vs default | All EmotionBench-derived stories (n=1750 PANAS runs) | Table 6 overall | Table 6
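
To put the two deltas on a common footing, the arithmetic can be worked through directly. The sketch below assumes each delta is measured against the default PANAS baseline in the same row; the "implied" absolute scores are derived here for illustration and are not reported values.

```python
# Values copied from the two table rows; the implied scores are derived, not reported.
baseline = {"positive": 42.3, "negative": 22.9}
delta_with_norm = {"positive": -18.0, "negative": 1.6}

# Implied absolute score = default baseline + reported delta (assumption).
implied = {k: round(baseline[k] + delta_with_norm[k], 1) for k in baseline}

# How many times larger the positive-affect drop is than the negative-affect rise.
ratio = abs(delta_with_norm["positive"]) / abs(delta_with_norm["negative"])

print(implied)
print(round(ratio, 2))
```

Under that assumption the two scores end up close to each other (roughly 24.3 positive and 24.5 negative), and the drop in positive affect is about eleven times the rise in negative affect.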

What To Try In 7 Days

Implement a simple 'norm' summary: fetch recent logs, summarize them with an LLM, and compare each new input against that summary.

Run PANAS-like affect scoring on a small scenario set with and without the norm to measure change.

Swap the scoring model (e.g., open-source Llama or Mistral) to check judge bias quickly.
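
The with/without comparison in the second step reduces to a paired difference per scenario. The following is a sketch under stated assumptions: the `Score` shape, the `mean_delta` helper, and the example numbers are all made up for illustration; only the idea of comparing the same scenarios with and without the norm comes from the text above.

```python
from statistics import mean
from typing import Dict, List

Score = Dict[str, float]  # hypothetical shape: {"positive": ..., "negative": ...}


def mean_delta(with_norm: List[Score], without_norm: List[Score]) -> Score:
    """Mean per-scenario change (with norm minus without norm)."""
    if len(with_norm) != len(without_norm) or not with_norm:
        raise ValueError("need two non-empty, equally long score lists")
    return {
        key: mean(w[key] - wo[key] for w, wo in zip(with_norm, without_norm))
        for key in ("positive", "negative")
    }


# Made-up scores for three scenarios, for illustration only.
without_norm = [
    {"positive": 30, "negative": 20},
    {"positive": 28, "negative": 22},
    {"positive": 35, "negative": 18},
]
with_norm = [
    {"positive": 26, "negative": 23},
    {"positive": 27, "negative": 22},
    {"positive": 31, "negative": 21},
]

print(mean_delta(with_norm, without_norm))
```

Running the same function on scores from a second judge model gives a quick read on judge bias: if the deltas change sign or size markedly between judges, the effect is likely an artifact of the scorer.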

Agent Features

Memory: Episodic memories stored as nodes; perception-triggered 'norm' summaries linked one-to-one to memories

Tool Use: GPT-4 Turbo for norm and contextual prompts; GPT-3.5-Turbo for PANAS scoring; graph DB for memory storage

Frameworks: Prompting (few-shot norm extraction); chain-of-thought for story generation

Is Agentic: Yes

Architectures: Norm-based episodic memory (per-event summary); graph database storage for memories and norms; contextual understanding via norm vs. new-experience comparison
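
The one-to-one link between memories and norms can be modeled with two node tables and one edge table. This is a hypothetical schema using plain dictionaries as a stand-in for a graph database; the class name `MemoryGraph`, the `has_norm` edge name, and the integer ids are assumptions, not details from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MemoryGraph:
    """In-memory stand-in for a graph database (hypothetical schema)."""

    memories: Dict[int, str] = field(default_factory=dict)  # memory nodes
    norms: Dict[int, str] = field(default_factory=dict)     # norm nodes
    # One edge per memory: memory id -> norm id.
    has_norm: Dict[int, int] = field(default_factory=dict)

    def add_memory(self, text: str, norm_text: str) -> int:
        """Store a memory node, its norm node, and the edge between them."""
        memory_id = len(self.memories)
        norm_id = len(self.norms)
        self.memories[memory_id] = text
        self.norms[norm_id] = norm_text
        self.has_norm[memory_id] = norm_id
        return memory_id

    def norm_for(self, memory_id: int) -> str:
        """Follow the edge from a memory to the norm built when it was perceived."""
        return self.norms[self.has_norm[memory_id]]
```

Because every call to `add_memory` writes exactly one node of each kind and one edge, the three tables always have the same size, which is the one-to-one property described above.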

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

PANAS was administered by GPT-3.5-Turbo, which the authors note has a positive interpretation bias.

Tests use five-scene synthetic stories, and PANAS was not administered to humans for each scene, so there is no human reference to compare against.

When Not To Use

When you need validated human-level emotion labels without LLM scoring bias.

For large-scale long-term memory without a proper retrieval weighting mechanism.

Failure Modes

Positive scoring bias from the PANAS judge LLM leads to under-reporting negative affect.

Ambiguous scenes remain open to multiple plausible appraisals and may be misread as positive.

Core Entities

Models

GPT-4 (used for norm/context prompts)

GPT-3.5-Turbo (used to administer PANAS and score affect)

Metrics

PANAS positive affect score

PANAS negative affect score
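
For reference, the two PANAS scores are simple sums. The sketch below uses the standard 20-item PANAS (ten positive and ten negative items, each rated 1 to 5, so each score ranges from 10 to 50); whether the paper's prompts use exactly this item list is an assumption, and the function name `panas_scores` is hypothetical.

```python
from typing import Dict

POSITIVE_ITEMS = ("interested", "excited", "strong", "enthusiastic", "proud",
                  "alert", "inspired", "determined", "attentive", "active")
NEGATIVE_ITEMS = ("distressed", "upset", "guilty", "scared", "hostile",
                  "irritable", "ashamed", "nervous", "jittery", "afraid")


def panas_scores(ratings: Dict[str, int]) -> Dict[str, int]:
    """Sum 1-5 item ratings into positive and negative affect scores."""
    for item in POSITIVE_ITEMS + NEGATIVE_ITEMS:
        if not 1 <= ratings[item] <= 5:
            raise ValueError(f"rating for {item!r} must be between 1 and 5")
    return {
        "positive": sum(ratings[i] for i in POSITIVE_ITEMS),
        "negative": sum(ratings[i] for i in NEGATIVE_ITEMS),
    }
```

When an LLM fills in the ratings, validating the 1 to 5 range before summing catches malformed judge output early instead of letting it distort the averages.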

Datasets

EmotionBench (source of scenarios)

428 generated 5-scene stories (derived from EmotionBench)

Benchmarks

EmotionBench