Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.4
Citation Count
57
Why It Matters For Business
A very low-cost prompt change (appending one short emotional sentence) can raise automated and human-rated output quality, including truthfulness and responsibility. This is useful for chat assistants, content generation, and QA systems where marginal gains matter.
Summary TLDR
The authors introduce EmotionPrompt: a simple method that appends short, psychology-inspired emotional phrases to existing prompts. Across 45 automatic tasks (Instruction Induction, BIG-Bench, TruthfulQA) and a 106-person human study on GPT-4, EmotionPrompt raised deterministic benchmark scores (8% relative on Instruction Induction; up to 115% relative on a curated BIG-Bench subset) and improved human-rated generative outputs by ~10.9% on average for performance, truthfulness, and responsibility. Effects vary by model, prompt, temperature, and stimulus type.
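The gains above are relative, not absolute, which matters when reading a figure like 115%. The sketch below shows the arithmetic on hypothetical scores (the numbers are illustrative and are not taken from the paper); `relative_gain` is a helper name introduced here.

```python
def relative_gain(baseline: float, treated: float) -> float:
    """Relative improvement of `treated` over `baseline`, as a fraction."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (treated - baseline) / baseline


# Hypothetical scores for illustration only (not results from the paper).
# A low baseline turns a modest absolute gain into a large relative one.
print(f"{relative_gain(0.20, 0.43):.0%}")  # 115%
print(f"{relative_gain(0.50, 0.54):.0%}")  # 8%
```

In this toy case a 115% relative gain corresponds to 23 absolute points on a weak baseline, while an 8% relative gain corresponds to 4 absolute points on a stronger one.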
Problem Statement
Can large language models detect and benefit from brief emotional cues in prompts? The paper asks whether adding small, psychology-based emotional stimuli to prompts can change LLM outputs and improve performance, truthfulness, or responsibility across automatic benchmarks and human-evaluated generative tasks.
Main Contribution
Propose EmotionPrompt: append 11 short, psychology-inspired emotional phrases to prompts to nudge LLM responses.
Measure effects across 6 LLMs on 45 deterministic and generative tasks (Instruction Induction, BIG-Bench, TruthfulQA) plus a 106-person human evaluation on GPT-4.
Analyze why EmotionPrompt helps via input-attention gradients, ablations on stimulus combinations, model size, and temperature.
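The core mechanism, appending a short emotional phrase to an existing prompt, can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `emotion_prompt`, the dictionary keys, and the choice of phrases are assumptions, and the full list of 11 stimuli is not reproduced here.

```python
# Illustrative stimuli only; consult the paper for the actual list of 11 phrases.
EMOTIONAL_STIMULI = {
    "career": "This is very important to my career.",
    "sure": "You'd better be sure.",
}


def emotion_prompt(base_prompt: str, stimulus: str) -> str:
    """Append one emotional stimulus to an existing prompt."""
    return f"{base_prompt.rstrip()} {stimulus.strip()}"


print(emotion_prompt(
    "Determine whether a movie review is positive or negative.",
    EMOTIONAL_STIMULI["career"],
))
# Determine whether a movie review is positive or negative. This is very important to my career.
```

Because the original prompt is left untouched and the stimulus is only appended, the change is easy to toggle per request and to compare against a baseline.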
Key Findings
EmotionPrompt raised average deterministic benchmark scores on Instruction Induction.
Large relative gains were measured on a curated BIG-Bench subset.
Human-rated generative quality improved with emotional stimuli.
Truthfulness and informativeness on TruthfulQA improved on the evaluated models.
Effect size depends on model size, training, temperature, and stimulus choice.
EmotionPrompt can make outputs more assertive and sometimes less cautious.
Results
Instruction Induction relative improvement
8% (relative)
BIG-Bench relative improvement
up to 115% (relative, curated subset)
Human study average gain
~10.9% (GPT-4, 106 participants)
TruthfulQA truthfulness / informativeness
Stimulus variability
Who Should Care
What To Try In 7 Days
A/B test 3–5 short emotional phrases appended to your prompts on a dev set.
Measure changes with automated metrics (accuracy, TruthfulQA) and a small human panel for judgment.
Tune sampling temperature and pick the best stimulus per task; prefer few-shot + EmotionPrompt if possible.
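The steps above could be wired into a small A/B harness along the following lines. This is a sketch under stated assumptions: `model` is any callable that maps a prompt string to an answer string (a stand-in for your own LLM client), scoring is exact-match accuracy, and the names `accuracy`, `ab_test`, and `stub_model` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (prompt, expected answer)


def accuracy(model: Callable[[str], str], dev_set: List[Example], suffix: str = "") -> float:
    """Exact-match accuracy of `model` on `dev_set`, with `suffix` appended to each prompt."""
    if not dev_set:
        raise ValueError("dev_set must not be empty")
    correct = 0
    for prompt, expected in dev_set:
        full_prompt = f"{prompt} {suffix}".strip()
        if model(full_prompt).strip().lower() == expected.strip().lower():
            correct += 1
    return correct / len(dev_set)


def ab_test(model: Callable[[str], str], dev_set: List[Example], stimuli: List[str]) -> Dict[str, float]:
    """Score the unmodified prompts plus one variant per stimulus."""
    results = {"<baseline>": accuracy(model, dev_set)}
    for stimulus in stimuli:
        results[stimulus] = accuracy(model, dev_set, stimulus)
    return results


def stub_model(prompt: str) -> str:
    """Placeholder for a real LLM call; it ignores any stimulus, so all variants tie."""
    return "4" if prompt.startswith("What is 2 + 2?") else "unknown"


dev = [("What is 2 + 2?", "4"), ("What is the capital of France?", "Paris")]
scores = ab_test(stub_model, dev, ["This is very important to my career."])
best = max(scores, key=scores.get)  # ties resolve to the first key, here the baseline
print(scores, best)
```

To cover the temperature step, one option is to run `ab_test` once per temperature setting of your client and compare the resulting score tables. Exact match suits short deterministic answers; generative tasks would still need the human panel mentioned above.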
Reproducibility
Data Urls
- Instruction Induction (public)
- BIG-Bench (public subset)
- TruthfulQA (public)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Effects are task- and stimulus-dependent; best stimulus differs by benchmark and task.
- Human study population skewed (mostly students); external validity may be limited.
- Some failure cases show increased assertiveness or shorter outputs; not always desirable.
- Proprietary models (ChatGPT, GPT-4) limit reproducibility and full inspection of mechanisms.
When Not To Use
- High-stakes or safety-critical outputs where cautious, hedged language is required.
- When the baseline model already performs very well and any gains are negligible.
- When strict, verifiable factual conservatism is required without risk of added assertiveness.
Failure Modes
- Produces more definitive and overconfident phrasing in some cases (Tables 19–20).
- May shorten responses or drop nuanced hedging, reducing acceptability for some audiences.
- Combining many stimuli can plateau or reduce gains if a single stimulus already works well.
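One rough way to watch for the first two failure modes is to compare hedging and length between a baseline output and an EmotionPrompt output. The heuristic below is an assumption introduced here, not a method from the paper: the hedge-term list is ad hoc, and `hedge_count` and `drift_report` are hypothetical helper names.

```python
import re

# Ad hoc list of hedging terms; extend or replace it for your domain.
HEDGE_TERMS = ("may", "might", "could", "possibly", "likely", "uncertain", "it depends")


def hedge_count(text: str) -> int:
    """Count whole-word occurrences of hedging terms in `text`."""
    lowered = text.lower()
    return sum(
        len(re.findall(rf"\b{re.escape(term)}\b", lowered))
        for term in HEDGE_TERMS
    )


def drift_report(baseline: str, treated: str) -> dict:
    """Negative deltas mean the treated output hedges less or is shorter."""
    return {
        "hedge_delta": hedge_count(treated) - hedge_count(baseline),
        "length_delta_words": len(treated.split()) - len(baseline.split()),
    }


print(drift_report("This might work, but it could fail.", "This works."))
# {'hedge_delta': -2, 'length_delta_words': -5}
```

A consistently negative `hedge_delta` across a dev set is only a signal worth reviewing by hand; keyword counts cannot tell appropriate confidence from overconfidence.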
Core Entities
Models
- Flan-T5-Large
- Vicuna
- Llama 2
- BLOOM
- ChatGPT (gpt-3.5-turbo)
- GPT-4
Metrics
- Accuracy
- normalized preferred metric (BIG-Bench)
- truthfulness (% True)
- informativeness (% Info)
- human-rated performance/truthfulness/responsibility (1–5)
Datasets
- Instruction Induction
- BIG-Bench (curated subset)
- TruthfulQA
- CValues
Benchmarks
- Instruction Induction
- BIG-Bench
- TruthfulQA

