Overview
Production Readiness
0.3
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
1
Why It Matters For Business
Adding an explicit memory-summary step can make agent responses more context-aware and slightly more willing to register negative emotions. That matters for chatbots, virtual characters, and user-state tracking, but the approach needs human validation before deployment.
Summary TLDR
The authors add a simple episodic-memory step to generative agents: for each new text experience, the agent summarizes relevant past memories into a 'norm', compares the new input to that norm, then scores affect with PANAS (administered by GPT-3.5-Turbo). They convert 428 EmotionBench scenarios into five-scene stories and run agents with and without the norm. Context sometimes improves emotional alignment (notably by increasing negative affect), but effects are small and inconsistent. Results are limited by the use of GPT-3.5-Turbo for scoring, the five-scene story length, and the absence of a human PANAS comparison.
Problem Statement
LLMs can guess emotions but lack episode-based context that humans use to appraise events. The paper asks whether adding a memory-derived "norm" (a summary of past episodic memories) helps generative agents form emotions that match human expectations.
Main Contribution
A simple agent architecture that builds a per-event "norm" from past episodic memories, compares the new event to that norm, and uses the comparison when scoring affect.
A dataset of 428 five-scene stories derived from EmotionBench to test how emotions evolve as agents perceive sequential experiences.
An empirical comparison of agents with and without the norm using PANAS scores (administered by GPT-3.5-Turbo), showing mixed gains: context can raise negative affect but effects are modest.
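The three steps above (build a norm, compare the new event to it, score affect) can be sketched as a small agent loop. This is a minimal illustration, not the authors' implementation: the `LLM` callable type, the `NormAgent` class, the prompt wording, and the choice to skip the norm when no memories exist yet are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class NormAgent:
    llm: LLM                 # model used for the norm and the comparison
    judge: LLM               # model used to produce the affect rating
    use_norm: bool = True    # False gives the no-norm baseline
    memories: List[str] = field(default_factory=list)

    def perceive(self, event: str) -> str:
        """Process one scene and return the judge's raw affect rating."""
        context = ""
        if self.use_norm and self.memories:
            # Step 1: summarize past episodic memories into a "norm".
            norm = self.llm(
                "Summarize what is normal for this agent:\n"
                + "\n".join(self.memories)
            )
            # Step 2: compare the new event against that norm.
            context = self.llm(
                f"Norm: {norm}\nNew event: {event}\n"
                "How does the event differ from the norm?"
            )
        # Step 3: score affect, using the comparison as context when available.
        rating = self.judge(
            f"Context: {context}\nEvent: {event}\nRate affect using PANAS."
        )
        self.memories.append(event)  # the event becomes an episodic memory
        return rating
```

Passing the models in as plain callables keeps the sketch independent of any particular API client and makes it easy to run the with-norm and no-norm conditions side by side.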
Key Findings
Adding the norm increased average negative affect across EmotionBench compared to no-norm agents.
Positive affect dropped substantially when agents saw negative scenarios.
Experiment used 428 five-scene stories derived from EmotionBench as inputs.
The PANAS scoring step used GPT-3.5-Turbo, which showed a positive bias in ambiguous cases.
Results
Overall positive affect change (with norm)
Overall negative affect change (with norm)
Number of 5-scene stories tested
428
Who Should Care
What To Try In 7 Days
Implement a simple 'norm' summary: fetch recent logs, summarize them with an LLM, and compare new inputs against the summary.
Run PANAS-like affect scoring on a small scenario set with and without the norm to measure change.
Swap the scoring model (e.g., open-source Llama or Mistral) to check judge bias quickly.
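One way to measure the with-norm versus no-norm change is to average the per-story score differences. The sketch below assumes a hypothetical `Scorer` callable that takes one story's scenes and returns a dictionary with `positive` and `negative` scores; how those scores are produced (which judge model, which prompt) is left open, so the same harness can be rerun with a different judge to check for bias.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

# Hypothetical scorer: takes the scenes of one story and returns
# {"positive": float, "negative": float}.
Scorer = Callable[[Sequence[str]], Dict[str, float]]


def affect_shift(
    stories: List[Sequence[str]],
    with_norm: Scorer,
    without_norm: Scorer,
) -> Dict[str, float]:
    """Mean (with norm minus without norm) difference per affect dimension."""
    diffs: Dict[str, List[float]] = {"positive": [], "negative": []}
    for story in stories:
        scored_with = with_norm(story)
        scored_without = without_norm(story)
        for key in diffs:
            diffs[key].append(scored_with[key] - scored_without[key])
    # statistics.mean raises StatisticsError if `stories` is empty.
    return {key: mean(values) for key, values in diffs.items()}
```

A positive value under `negative` would mean the norm condition registered more negative affect on average for that scenario set.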
Agent Features
Memory
- Episodic memories stored as nodes
- Perception-triggered 'norm' summaries linked one-to-one to memories
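The one-to-one link between memories and norms can be illustrated with a small in-memory structure. This is a sketch under stated assumptions: the class and field names (`MemoryNode`, `NormNode`, `MemoryGraph`) are invented for the example, and a plain dictionary stands in for the graph database.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MemoryNode:
    node_id: int
    text: str                      # the perceived experience
    norm_id: Optional[int] = None  # one-to-one link to its norm, if any


@dataclass
class NormNode:
    node_id: int
    summary: str
    memory_id: int                 # back-reference to the triggering memory


class MemoryGraph:
    """In-memory stand-in for a graph database (illustrative only)."""

    def __init__(self) -> None:
        self.memories: Dict[int, MemoryNode] = {}
        self.norms: Dict[int, NormNode] = {}
        self._next_id = 0

    def _new_id(self) -> int:
        self._next_id += 1
        return self._next_id

    def add_memory(self, text: str) -> MemoryNode:
        node = MemoryNode(self._new_id(), text)
        self.memories[node.node_id] = node
        return node

    def attach_norm(self, memory_id: int, summary: str) -> NormNode:
        memory = self.memories[memory_id]
        if memory.norm_id is not None:
            raise ValueError("memory already has a norm (one-to-one link)")
        norm = NormNode(self._new_id(), summary, memory_id)
        self.norms[norm.node_id] = norm
        memory.norm_id = norm.node_id
        return norm

    def earlier_than(self, memory_id: int) -> List[MemoryNode]:
        """Memories stored before the given one, oldest first."""
        return [m for i, m in sorted(self.memories.items()) if i < memory_id]
```

Because ids increase with insertion order, `earlier_than` returns exactly the memories a norm for a new perception could draw on.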
Tool Use
- GPT-4 Turbo for norm and contextual prompts
- GPT-3.5-Turbo for PANAS scoring
- Graph DB for memory storage
Frameworks
- Prompting (few-shot norm extraction)
- Chain-of-thought for story generation
Is Agentic
true
Architectures
- Norm-based episodic memory (per-event summary)
- Graph database storage for memories and norms
- Contextual understanding via norm vs. new-experience comparison
Reproducibility
Data Urls
- https://tsukuba-websci.github.io/GenerativeAgentsPredictEmotion/appendix
- EmotionBench (referenced in paper)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- PANAS was administered by GPT-3.5-Turbo, which the authors note has a positive interpretation bias.
- Tests use five-scene synthetic stories, and no human-administered PANAS scores were collected for comparison.
- No large-scale memory retriever was used; the experiments avoid the scaling question by limiting stories to five parts.
- Results are mixed and effect sizes are small; context doesn't reliably disambiguate ambiguous scenes.
When Not To Use
- When you need validated human-level emotion labels without LLM scoring bias.
- For large-scale long-term memory without a proper retrieval weighting mechanism.
- When ambiguous inputs must be disambiguated without human-in-the-loop checks.
Failure Modes
- Positive scoring bias from the PANAS judge LLM leads to under-reporting negative affect.
- Ambiguous scenes remain open to multiple plausible appraisals and may be misread as positive.
- Scaling to many memories without a retriever may swamp the norm or add noise.
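To make the scaling concern concrete, the sketch below shows one simple form a retrieval weighting step could take before the norm is built. It is not part of the paper: the scoring rule (word-overlap similarity multiplied by an exponential recency decay), the function name `retrieve`, and the `half_life` parameter are all assumptions chosen for illustration.

```python
import re
from typing import List, Set, Tuple


def _tokens(text: str) -> Set[str]:
    """Lowercased word tokens of a text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def retrieve(
    memories: List[str],
    query: str,
    k: int = 3,
    half_life: float = 5.0,
) -> List[str]:
    """Return up to k memories ranked by word overlap times recency decay.

    `memories` is ordered oldest first. A memory's weight halves for every
    `half_life` positions it sits behind the newest memory.
    """
    query_tokens = _tokens(query)
    newest = len(memories) - 1
    scored: List[Tuple[float, int]] = []
    for index, memory in enumerate(memories):
        memory_tokens = _tokens(memory)
        union = query_tokens | memory_tokens
        # Jaccard similarity between the query and the memory.
        overlap = len(query_tokens & memory_tokens) / len(union) if union else 0.0
        recency = 0.5 ** ((newest - index) / half_life)
        scored.append((overlap * recency, index))
    top = sorted(scored, reverse=True)[:k]
    # Drop memories with no overlap so they cannot add noise to the norm.
    return [memories[i] for score, i in top if score > 0]
```

Only the retrieved memories would then be passed to the norm summary, which bounds the prompt size regardless of how many memories the agent has stored.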
Core Entities
Models
- GPT-4 (used for norm/context prompts)
- GPT-3.5-Turbo (used to administer PANAS and score affect)
Metrics
- PANAS positive affect score
- PANAS negative affect score
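For reference, the standard 20-item PANAS rates each item from 1 to 5 and sums ten positive and ten negative items separately, so each score ranges from 10 to 50. The sketch below computes both sums from a dictionary of item ratings. It assumes the standard item list and scale; whether the paper's prompts used exactly this form is not stated here, and the function name `panas_scores` is illustrative.

```python
from typing import Dict, Tuple

POSITIVE_ITEMS = (
    "interested", "excited", "strong", "enthusiastic", "proud",
    "alert", "inspired", "determined", "attentive", "active",
)
NEGATIVE_ITEMS = (
    "distressed", "upset", "guilty", "scared", "hostile",
    "irritable", "ashamed", "nervous", "jittery", "afraid",
)


def panas_scores(ratings: Dict[str, int]) -> Tuple[int, int]:
    """Return (positive affect, negative affect) from 1-5 item ratings."""
    for item in POSITIVE_ITEMS + NEGATIVE_ITEMS:
        if item not in ratings:
            raise KeyError(f"missing PANAS item: {item}")
        if not 1 <= ratings[item] <= 5:
            raise ValueError(f"rating for {item} must be between 1 and 5")
    positive = sum(ratings[item] for item in POSITIVE_ITEMS)
    negative = sum(ratings[item] for item in NEGATIVE_ITEMS)
    return positive, negative
```

Validating the judge's output this way also catches malformed or incomplete ratings before they reach the averages.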
Datasets
- EmotionBench (source of scenarios)
- 428 generated 5-scene stories (derived from EmotionBench)
Benchmarks
- EmotionBench

