Giving LLM agents a memory 'norm' can sometimes make their emotions more human-like — but results are mixed

Overview

Decision Snapshot: Needs Validation

The architecture is simple and implementable, but evidence is preliminary (428 short stories, single scoring LLM) and lacks human PANAS validation, so treat results as exploratory.

Citations: 1

Evidence Strength: 0.50

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 60%

Authors

Ciaran Regan, Nanami Iwahashi, Shogo Tanaka, Mizuki Oka


Why It Matters For Business

Adding an explicit memory-summary step can make agent responses more context-aware and slightly more willing to register negative emotions, which matters for chatbots, virtual characters, and user-state tracking but needs human validation before deployment.

Who Should Care

Product Manager, ML Engineer, Founder, Data Scientist

Summary TLDR

The authors add a simple episodic-memory step to generative agents: for each new text experience, the agent summarizes relevant past memories into a 'norm', compares the new input to that norm, and then has its affect scored with PANAS (administered by GPT-3.5-Turbo). They convert 428 EmotionBench scenarios into five-scene stories and run agents with and without the norm. Context sometimes improves emotional alignment (notably by increasing negative affect), but effects are small and inconsistent. Results are limited by the use of GPT-3.5-Turbo for scoring, the short five-scene stories, and the absence of a human PANAS comparison.

Problem Statement

LLMs can guess emotions but lack episode-based context that humans use to appraise events. The paper asks whether adding a memory-derived "norm" (a summary of past episodic memories) helps generative agents form emotions that match human expectations.

Main Contribution

A simple agent architecture that builds a per-event "norm" from past episodic memories, compares the new event to that norm, and uses the comparison when scoring affect.

A dataset of 428 five-scene stories derived from EmotionBench to test how emotions evolve as agents perceive sequential experiences.
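The norm-then-compare loop described in the first contribution can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `NormAgent` class, the `LLM` callable type, and both prompt strings are hypothetical, and the sketch summarizes all stored memories rather than retrieving only the relevant ones.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class NormAgent:
    llm: LLM
    memories: List[str] = field(default_factory=list)

    def build_norm(self) -> str:
        """Summarize past memories into a 'norm' (empty string if none exist)."""
        if not self.memories:
            return ""
        joined = "\n".join(f"- {m}" for m in self.memories)
        # Assumed prompt wording; the paper's actual prompt is not reproduced here.
        return self.llm(f"Summarize what is typical across these experiences:\n{joined}")

    def perceive(self, experience: str, use_norm: bool = True) -> str:
        """Return the text that would be handed to the affect-scoring step."""
        norm = self.build_norm() if use_norm else ""
        if norm:
            # Assumed prompt wording for the norm-versus-experience comparison.
            context = self.llm(
                f"Norm: {norm}\nNew experience: {experience}\n"
                "Describe how the new experience compares to the norm."
            )
        else:
            context = experience
        self.memories.append(experience)
        return context
```

Passing `use_norm=False` gives the no-norm condition, so the same agent class can serve both arms of a comparison.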

Key Findings

Adding the norm increased average negative affect across EmotionBench compared to no-norm agents.

Numbers: +1.6 (overall negative affect increase with norm)

Practical Use: If you want agents to register slightly more negative emotion in negative scenarios, provide a memory-based context; expect a small effect size.

Evidence Ref: Table 6 (Overall row)

Positive affect dropped substantially when agents saw negative scenarios.

Numbers: −18.0 (overall positive affect decrease with norm)

Practical Use: Context tends to reduce positive affect more than it raises negative affect; measure both sides when evaluating emotional alignment.

Evidence Ref: Table 6 (Overall row)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Overall positive affect change (with norm)	−18.0	Default PANAS baseline 42.3 ±1.9	−18.0 vs default	All EmotionBench-derived stories (n=1750 PANAS runs)	Table 6 overall	Table 6
Overall negative affect change (with norm)	+1.6	Default PANAS baseline 22.9 ±2.5	+1.6 vs default	All EmotionBench-derived stories (n=1750 PANAS runs)	Table 6 overall	Table 6
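To put the two deltas on a common footing, the short calculation below derives implied absolute scores and relative changes from the table. It assumes each Delta is simply added to the corresponding default baseline, which the table suggests but does not state outright; the variable names are illustrative.

```python
# Values copied from the Results table. The "implied" scores are an inference,
# assuming each delta is added directly to the default PANAS baseline.
baseline = {"positive": 42.3, "negative": 22.9}
delta = {"positive": -18.0, "negative": 1.6}

# Implied with-norm score per dimension.
implied = {k: round(baseline[k] + delta[k], 1) for k in baseline}

# Change expressed as a percentage of the baseline.
relative = {k: round(100 * delta[k] / baseline[k], 1) for k in baseline}

print(implied)
print(relative)
```

Under that assumption, the implied scores are about 24.3 (positive) and 24.5 (negative), and the relative changes are roughly −42.6% and +7.0%, which shows how much larger the positive-affect shift is than the negative-affect shift.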

What To Try In 7 Days

Implement a simple 'norm' summary: fetch recent logs, summarize them with an LLM, and compare each new input against that summary.

Run PANAS-like affect scoring on a small scenario set with and without the norm to measure change.

Swap the scoring model (e.g., open-source Llama or Mistral) to check judge bias quickly.
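The with/without comparison in the steps above could be organized around a small harness like this one. It is a sketch under stated assumptions: `compare_conditions`, the `Judge` signature, and the `with_norm` callable are hypothetical names, and the judge is expected to return PANAS-style totals under the keys `positive` and `negative`.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical judge signature: takes scenario text, returns PANAS-style totals.
Judge = Callable[[str], Dict[str, float]]


def compare_conditions(
    scenarios: List[str],
    with_norm: Callable[[str], str],
    judge: Judge,
) -> Dict[str, float]:
    """Mean score change (with norm minus without) for each affect dimension."""
    diffs: Dict[str, List[float]] = {"positive": [], "negative": []}
    for text in scenarios:
        plain = judge(text)                  # no-norm condition
        contextual = judge(with_norm(text))  # norm-augmented condition
        for key in diffs:
            diffs[key].append(contextual[key] - plain[key])
    return {key: mean(values) for key, values in diffs.items()}
```

Because the judge is just a parameter, the same scenarios can be re-scored with a different model by passing another callable, which is one way to check how much of the measured change depends on the scoring LLM.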

Agent Features

Memory

Episodic memories stored as nodes

Perception-triggered 'norm' summaries linked one-to-one to memories
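One way to picture this memory layout is an in-memory structure where each memory node carries at most one norm. The sketch below uses plain Python dataclasses rather than a graph database; `MemoryNode`, `MemoryGraph`, and the `derived_from` edge list are hypothetical and only illustrate the one-to-one link.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MemoryNode:
    node_id: int
    text: str
    norm: Optional[str] = None  # at most one norm per memory (one-to-one link)


@dataclass
class MemoryGraph:
    nodes: Dict[int, MemoryNode] = field(default_factory=dict)
    # For each node, the ids of earlier memories its norm was summarized from.
    edges: Dict[int, List[int]] = field(default_factory=dict)

    def add_memory(self, text: str, norm: Optional[str] = None,
                   derived_from: Optional[List[int]] = None) -> int:
        """Store a new memory with its norm and return the new node id."""
        node_id = len(self.nodes)
        self.nodes[node_id] = MemoryNode(node_id, text, norm)
        self.edges[node_id] = list(derived_from or [])
        return node_id

    def recent(self, k: int) -> List[MemoryNode]:
        """Return the k most recently added memories, oldest first."""
        ids = sorted(self.nodes)[-k:] if k > 0 else []
        return [self.nodes[i] for i in ids]
```

A real deployment would replace the dictionaries with graph-database queries, but the shape of the data (memory node, attached norm, links back to source memories) stays the same.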

Tool Use

GPT-4 Turbo for norm and contextual prompts

GPT-3.5-Turbo for PANAS scoring

Graph DB for memory storage

Frameworks

Prompting (few-shot norm extraction)

Chain-of-thought for story generation

Is Agentic

Yes

Architectures

Norm-based episodic memory (per-event summary)

Graph database storage for memories and norms

Contextual understanding via norm vs. new-experience comparison

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/tsukuba-websci/GenerativeAgentsPredictEmotion

Data URLs

https://tsukuba-websci.github.io/GenerativeAgentsPredictEmotion/appendix

EmotionBench (referenced in paper)

Risks & Boundaries

Limitations

PANAS was administered by GPT-3.5-Turbo, which the authors note has a positive interpretation bias.

Tests use five-scene synthetic stories, and no human-administered PANAS scores were collected for comparison.

When Not To Use

When you need validated human-level emotion labels without LLM scoring bias.

For large-scale long-term memory without a proper retrieval weighting mechanism.

Failure Modes

Positive scoring bias from the PANAS judge LLM leads to under-reporting negative affect.

Ambiguous scenes remain open to multiple plausible appraisals and may be misread as positive.

Core Entities

Models

GPT-4 (used for norm/context prompts)

GPT-3.5-Turbo (used to administer PANAS and score affect)

Metrics

PANAS positive affect score

PANAS negative affect score
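For readers unfamiliar with the metric, the standard PANAS questionnaire has 20 adjectives (10 positive, 10 negative), each rated from 1 to 5, so each total ranges from 10 to 50. The sketch below shows only the summing step; `score_panas` is a hypothetical helper, and how the item ratings are elicited from the scoring LLM is not covered here.

```python
from typing import Dict

# The 20 items of the standard PANAS questionnaire.
POSITIVE_ITEMS = ["interested", "excited", "strong", "enthusiastic", "proud",
                  "alert", "inspired", "determined", "attentive", "active"]
NEGATIVE_ITEMS = ["distressed", "upset", "guilty", "scared", "hostile",
                  "irritable", "ashamed", "nervous", "jittery", "afraid"]


def score_panas(ratings: Dict[str, int]) -> Dict[str, int]:
    """Sum 1-5 item ratings into positive and negative totals (10-50 each)."""
    for item in POSITIVE_ITEMS + NEGATIVE_ITEMS:
        value = ratings.get(item)
        # Reject missing items and anything outside the 1-5 rating scale.
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"rating for {item!r} must be an integer from 1 to 5")
    return {
        "positive": sum(ratings[i] for i in POSITIVE_ITEMS),
        "negative": sum(ratings[i] for i in NEGATIVE_ITEMS),
    }
```

Validating every item before summing guards against a judge model that skips an adjective or returns an out-of-range value, which would otherwise silently skew the totals.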

Datasets

EmotionBench (source of scenarios)

428 generated 5-scene stories (derived from EmotionBench)

Benchmarks

EmotionBench

