Giving LLM agents a memory 'norm' can sometimes make their emotions more human-like — but results are mixed

February 6, 2024 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

1

Authors

Ciaran Regan, Nanami Iwahashi, Shogo Tanaka, Mizuki Oka

Links

Abstract / PDF

Why It Matters For Business

Adding an explicit memory-summary step can make agent responses more context-aware and slightly more willing to register negative emotions, which matters for chatbots, virtual characters, and user-state tracking but needs human validation before deployment.

Summary TLDR

The authors add a simple episodic-memory step to generative agents: for each new text experience, the agent summarizes relevant past memories into a 'norm', compares the new input to that norm, and then scores affect with PANAS (administered by GPT-3.5-Turbo). They convert 428 EmotionBench scenarios into five-scene stories and run agents with and without the norm. Context sometimes improves emotional alignment (notably increased negative affect), but effects are small and inconsistent. Results are limited by the use of GPT-3.5-Turbo for scoring, the short five-scene stories, and the lack of a human PANAS comparison.

Problem Statement

LLMs can guess emotions but lack the episodic context that humans use to appraise events. The paper asks whether adding a memory-derived "norm" (a summary of past episodic memories) helps generative agents form emotions that match human expectations.

Main Contribution

A simple agent architecture that builds a per-event "norm" from past episodic memories, compares the new event to that norm, and uses the comparison when scoring affect.

A dataset of 428 five-scene stories derived from EmotionBench to test how emotions evolve as agents perceive sequential experiences.

An empirical comparison of agents with and without the norm using PANAS scores (administered by GPT-3.5-Turbo), showing mixed gains: context can raise negative affect but effects are modest.
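
The architecture described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `NormAgent` class, its method names, and the prompt wording are all hypothetical, and `llm` stands for any function that maps a prompt string to a completion string.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class NormAgent:
    """Illustrative agent that builds a 'norm' from past experiences."""

    llm: LLM
    memories: List[str] = field(default_factory=list)

    def build_norm(self) -> str:
        """Summarize past memories into a norm (empty string if no history)."""
        if not self.memories:
            return ""
        history = "\n".join(f"- {m}" for m in self.memories)
        # Prompt wording is an assumption, not taken from the paper.
        return self.llm(f"Summarize these past experiences:\n{history}")

    def perceive(self, experience: str, use_norm: bool = True) -> str:
        """Store the experience and return the text to hand to an affect scorer."""
        norm = self.build_norm() if use_norm else ""
        if norm:
            # Compare the new experience against the norm.
            context = self.llm(
                f"Norm:\n{norm}\n\nNew experience:\n{experience}\n\n"
                "Describe how the new experience compares to the norm."
            )
        else:
            # No norm available (or norm disabled): score the raw experience.
            context = experience
        self.memories.append(experience)
        return context
```

Calling `perceive(scene, use_norm=False)` gives the no-norm condition, so the same class can serve both arms of a comparison.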

Key Findings

Adding the norm increased average negative affect across EmotionBench compared to no-norm agents.

Numbers: +1.6 (overall negative affect increase with norm)

Positive affect dropped substantially when agents saw negative scenarios.

Numbers: −18.0 (overall positive affect decrease with norm)

Experiment used 428 five-scene stories derived from EmotionBench as inputs.

Numbers: 428 stories generated

The PANAS scoring step used GPT-3.5-Turbo, which showed a positive bias in ambiguous cases.

Results

Overall positive affect change (with norm)

Value: −18.0

Baseline: default PANAS baseline 42.3 ± 1.9

Overall negative affect change (with norm)

Value: +1.6

Baseline: default PANAS baseline 22.9 ± 2.5

Number of 5-scene stories tested

Value: 428

Who Should Care

What To Try In 7 Days

Implement a simple 'norm' summary: fetch recent logs, summarize with LLM, and compare new inputs.

Run PANAS-like affect scoring on a small scenario set with and without the norm to measure change.

Swap the scoring model (e.g., open-source Llama or Mistral) to check judge bias quickly.
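
The with/without comparison and the judge-swap check in the steps above could be measured with a small harness like the one below. It is a sketch under stated assumptions: `affect_shift`, `judge_gap`, and the `Scorer` signature are hypothetical names, and each scorer is assumed to return a (positive affect, negative affect) pair for a piece of text.

```python
from statistics import mean
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical scorer signature: text -> (positive_affect, negative_affect).
Scorer = Callable[[str], Tuple[float, float]]


def affect_shift(
    with_norm: Sequence[str], without_norm: Sequence[str], scorer: Scorer
) -> Dict[str, float]:
    """Mean paired difference (with norm minus without norm) in both scores."""
    if not with_norm or len(with_norm) != len(without_norm):
        raise ValueError("need two non-empty, equally long lists of texts")
    pa_diffs, na_diffs = [], []
    for treated, control in zip(with_norm, without_norm):
        pa_t, na_t = scorer(treated)
        pa_c, na_c = scorer(control)
        pa_diffs.append(pa_t - pa_c)
        na_diffs.append(na_t - na_c)
    return {"positive_shift": mean(pa_diffs), "negative_shift": mean(na_diffs)}


def judge_gap(
    texts: Sequence[str], judge_a: Scorer, judge_b: Scorer
) -> Dict[str, float]:
    """Mean score difference (judge_a minus judge_b) on the same texts."""
    if not texts:
        raise ValueError("need at least one text")
    pa_gaps, na_gaps = [], []
    for text in texts:
        pa_a, na_a = judge_a(text)
        pa_b, na_b = judge_b(text)
        pa_gaps.append(pa_a - pa_b)
        na_gaps.append(na_a - na_b)
    return {"positive_gap": mean(pa_gaps), "negative_gap": mean(na_gaps)}
```

A consistently positive `positive_gap` between two judges on the same texts would be one rough signal of the positive scoring bias noted for the PANAS judge.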

Agent Features

Memory

  • Episodic memories stored as nodes
  • Perception-triggered 'norm' summaries linked one-to-one to memories

Tool Use

  • GPT-4 Turbo for norm and contextual prompts
  • GPT-3.5-Turbo for PANAS scoring
  • Graph DB for memory storage

Frameworks

  • Prompting (few-shot norm extraction)
  • Chain-of-thought for story generation

Is Agentic

true

Architectures

  • Norm-based episodic memory (per-event summary)
  • Graph database storage for memories and norms
  • Contextual understanding via norm vs. new-experience comparison
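
One way to picture the storage layout listed above is an in-memory stand-in for the graph database, with each memory node linked to at most one norm node. The class and field names here (`MemoryGraph`, `MemoryNode`, `NormNode`, `norm_of`) are hypothetical, and a real deployment would use an actual graph database rather than dictionaries.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class MemoryNode:
    node_id: int
    text: str


@dataclass(frozen=True)
class NormNode:
    node_id: int
    summary: str


class MemoryGraph:
    """Toy graph store: memory nodes, norm nodes, and one-to-one links."""

    def __init__(self) -> None:
        self.memories: Dict[int, MemoryNode] = {}
        self.norms: Dict[int, NormNode] = {}
        self.norm_of: Dict[int, int] = {}  # memory id -> norm id
        self._next_id = 0

    def _new_id(self) -> int:
        self._next_id += 1
        return self._next_id

    def add_memory(self, text: str) -> MemoryNode:
        node = MemoryNode(self._new_id(), text)
        self.memories[node.node_id] = node
        return node

    def attach_norm(self, memory_id: int, summary: str) -> NormNode:
        """Link a new norm to a memory, enforcing the one-to-one constraint."""
        if memory_id not in self.memories:
            raise KeyError(f"unknown memory id {memory_id}")
        if memory_id in self.norm_of:
            raise ValueError(f"memory {memory_id} already has a norm")
        norm = NormNode(self._new_id(), summary)
        self.norms[norm.node_id] = norm
        self.norm_of[memory_id] = norm.node_id
        return norm

    def norm_for(self, memory_id: int) -> Optional[NormNode]:
        """Return the norm linked to a memory, or None if there is none."""
        norm_id = self.norm_of.get(memory_id)
        return self.norms.get(norm_id) if norm_id is not None else None
```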

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • PANAS was administered by GPT-3.5-Turbo, which the authors note has a positive interpretation bias.
  • Tests use five-scene synthetic stories, with no human-administered PANAS for each scene to compare against.
  • No large-scale memory retriever was used; the experiments avoid scaling by limiting stories to five parts.
  • Results are mixed and effect sizes are small; context doesn't reliably disambiguate ambiguous scenes.

When Not To Use

  • When you need validated human-level emotion labels without LLM scoring bias.
  • For large-scale long-term memory without a proper retrieval weighting mechanism.
  • When ambiguous inputs must be disambiguated without human-in-the-loop checks.

Failure Modes

  • Positive scoring bias from the PANAS judge LLM leads to under-reporting negative affect.
  • Ambiguous scenes remain open to multiple plausible appraisals and may be misread as positive.
  • Scaling to many memories without a retriever may swamp the norm or add noise.

Core Entities

Models

  • GPT-4 (used for norm/context prompts)
  • GPT-3.5-Turbo (used to administer PANAS and score affect)

Metrics

  • PANAS positive affect score
  • PANAS negative affect score
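
For readers unfamiliar with the metric, the standard 20-item PANAS rates each item from 1 to 5 and sums ten positive and ten negative items, so each subscale ranges from 10 to 50. The sketch below computes both scores from a dictionary of item ratings. The function name `panas_scores` is hypothetical, and how the paper prompted GPT-3.5-Turbo to produce the ratings is not shown here.

```python
from typing import Dict, Tuple

# Item lists of the standard 20-item PANAS.
POSITIVE_ITEMS = (
    "interested", "excited", "strong", "enthusiastic", "proud",
    "alert", "inspired", "determined", "attentive", "active",
)
NEGATIVE_ITEMS = (
    "distressed", "upset", "guilty", "scared", "hostile",
    "irritable", "ashamed", "nervous", "jittery", "afraid",
)


def panas_scores(ratings: Dict[str, int]) -> Tuple[int, int]:
    """Return (positive_affect, negative_affect), each between 10 and 50."""
    all_items = POSITIVE_ITEMS + NEGATIVE_ITEMS
    missing = [item for item in all_items if item not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    for item in all_items:
        if not 1 <= ratings[item] <= 5:
            raise ValueError(f"rating for {item!r} must be between 1 and 5")
    positive = sum(ratings[item] for item in POSITIVE_ITEMS)
    negative = sum(ratings[item] for item in NEGATIVE_ITEMS)
    return positive, negative
```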

Datasets

  • EmotionBench (source of scenarios)
  • 428 generated 5-scene stories (derived from EmotionBench)

Benchmarks

  • EmotionBench