Overview
The method is simple to try and low cost. Evidence includes automated benchmarks and a 106-person human study, but results vary by model, task, and stimulus, so validate per deployment.
Citations: 57
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 4/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
Appending one short emotional sentence to a prompt is a very low-cost change that can raise automated scores and human-perceived output quality, reduce hallucinations, and improve the rated responsibility of outputs. This makes it useful for chat assistants, content generation, and QA systems where marginal gains matter.
Summary TLDR
The authors introduce EmotionPrompt: a simple method that appends short, psychology-inspired emotional phrases to existing prompts. They evaluate it on 45 automatic benchmark tasks (24 from Instruction Induction and 21 from BIG-Bench), on TruthfulQA, and in a 106-person human study on GPT-4. EmotionPrompt raised deterministic benchmark scores (8% relative on Instruction Induction; up to 115% relative on a curated BIG-Bench subset) and improved human-rated generative outputs by ~10.9% on average across performance, truthfulness, and responsibility. Effects vary by model, prompt, temperature, and stimulus type.
Problem Statement
Can large language models detect and benefit from brief emotional cues in prompts? The paper asks whether adding small, psychology-based emotional stimuli to prompts can change LLM outputs and improve performance, truthfulness, or responsibility across automatic benchmarks and human-evaluated generative tasks.
Main Contribution
Propose EmotionPrompt: a set of 11 short, psychology-inspired emotional phrases that are appended to prompts to nudge LLM responses.
Measure effects across 6 LLMs on 45 deterministic and generative tasks (Instruction Induction, BIG-Bench, TruthfulQA) plus a 106-person human evaluation on GPT-4.
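
The core mechanic, appending an emotional sentence to an existing prompt, can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the helper name `with_emotion` is hypothetical, and the phrases in `EXAMPLE_STIMULI` are illustrative placeholders, not necessarily the paper's exact list of 11 stimuli.

```python
# Illustrative stimuli only: placeholders in the spirit of the method,
# not necessarily the paper's exact wording.
EXAMPLE_STIMULI = [
    "This is very important to my career.",
    "Take pride in your work and give it your best.",
]


def with_emotion(prompt: str, stimulus: str) -> str:
    """Append one emotional stimulus sentence to an existing prompt."""
    base = prompt.rstrip()
    if not base:
        return stimulus.strip()
    # Close the original prompt with punctuation so the stimulus
    # reads as a separate sentence.
    if base[-1] not in ".?!:":
        base += "."
    return f"{base} {stimulus.strip()}"


# One prompt variant per stimulus, ready to be sent to a model.
variants = [
    with_emotion("Determine whether a movie review is positive or negative", s)
    for s in EXAMPLE_STIMULI
]
```

Keeping the original prompt untouched and adding the stimulus as a suffix makes it easy to compare each variant against the unmodified prompt.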
Key Findings
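EmotionPrompt raised average deterministic benchmark scores on Instruction Induction by 8% relative to the original prompts (mean across models, Table 1).
A relative gain of up to 115% was reported on a curated BIG-Bench subset (21 tasks, zero-shot; Abstract and Table 1).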
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Instruction Induction relative improvement | 8.00% relative | original prompts | 8.00% | Instruction Induction (24 tasks) | Mean across models, Table 1 | Table 1; Section 2.2 |
| BIG-Bench relative improvement | 115% relative (reported) | original zero-shot prompts | 115% | BIG-Bench (21 tasks, zero-shot) | Reported relative improvement in Abstract and Table 1; +Ours (max) rows | Abstract; Table 1 |
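
When reading the relative figures above, it helps to keep the arithmetic in mind. The sketch below assumes "relative improvement" means the change divided by the baseline score, which is the usual convention but is an assumption here. The function name and all scores are hypothetical and chosen only to show the calculation.

```python
def relative_improvement(baseline: float, treated: float) -> float:
    """Return the change of `treated` over `baseline`, in percent of the baseline."""
    if baseline == 0:
        raise ValueError("relative improvement is undefined for a zero baseline")
    return (treated - baseline) / baseline * 100.0


# Hypothetical scores, not taken from the paper.
print(round(relative_improvement(50.0, 54.0), 2))  # 8.0
print(round(relative_improvement(10.0, 21.5), 2))  # 115.0
```

The second call shows why a relative number alone is hard to interpret: a large percentage can correspond to a modest absolute change when the baseline score is low, so report absolute scores alongside relative ones in your own evaluation.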
What To Try In 7 Days
A/B test 3–5 short emotional phrases appended to your prompts on a dev set.
Measure changes with automated metrics (accuracy, TruthfulQA) and a small human panel for judgment.
Tune sampling temperature and pick the best stimulus per task; prefer few-shot + EmotionPrompt if possible.
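
The steps above can be sketched as a small A/B harness. This is a minimal sketch under stated assumptions: `generate` stands for whatever function calls your model (hypothetical here, replaced by a stub), exact-match accuracy is used as the automated metric, and the stimulus text is an illustrative placeholder. Keep temperature and all other sampling settings identical across arms so that the suffix is the only difference.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (prompt, expected answer)


def accuracy(generate: Callable[[str], str], dev_set: List[Example], suffix: str = "") -> float:
    """Exact-match accuracy when `suffix` is appended to every prompt."""
    if not dev_set:
        raise ValueError("dev_set must not be empty")
    hits = 0
    for prompt, expected in dev_set:
        full_prompt = f"{prompt} {suffix}".strip()
        if generate(full_prompt).strip().lower() == expected.strip().lower():
            hits += 1
    return hits / len(dev_set)


def ab_test(generate: Callable[[str], str], dev_set: List[Example], stimuli: List[str]) -> Dict[str, float]:
    """Score the unmodified prompts ("baseline") and each stimulus variant."""
    scores = {"baseline": accuracy(generate, dev_set)}
    for stimulus in stimuli:
        scores[stimulus] = accuracy(generate, dev_set, stimulus)
    return scores


def stub_generate(prompt: str) -> str:
    """Placeholder for a real model call; it always answers "4"."""
    return "4"


# Toy dev set, for illustration only.
dev_set = [("What is 2 + 2?", "4"), ("What is 1 + 3?", "4"), ("What is 2 + 3?", "5")]
scores = ab_test(stub_generate, dev_set, ["This is very important to my career."])
best_arm = max(scores, key=scores.get)  # ties resolve to the earliest arm
```

To cover the temperature step, one option is to repeat `ab_test` once per temperature setting and compare the best arm at each. With a small dev set, differences of a few examples are within noise, so treat the winner as a candidate for the human panel rather than a final answer.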
Risks & Boundaries
Limitations
Effects are task- and stimulus-dependent; best stimulus differs by benchmark and task.
The human study population was skewed (mostly students), so external validity may be limited.
When Not To Use
High-stakes or safety-critical outputs where cautious, hedged language is required.
When the baseline model already performs very well and any gains would be negligible.
Failure Modes
Produces more definitive and overconfident phrasing in some cases (Tables 19–20).
May shorten responses or drop nuanced hedging, reducing acceptability for some audiences.
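
One way to watch for these failure modes in your own A/B runs is a crude lexical check that compares the length and hedging of paired outputs. The sketch below is a heuristic under stated assumptions, not a method from the paper: the marker list, the `min_ratio` threshold, and both function names are hypothetical and should be tuned for your domain.

```python
import re

# Hypothetical marker list; extend or replace it for your domain and language.
HEDGE_MARKERS = ("may", "might", "could", "possibly", "likely", "it depends", "not certain")


def hedge_rate(text: str) -> float:
    """Hedging markers per 100 words (0.0 for empty text)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    normalized = " ".join(words)
    hits = sum(
        len(re.findall(rf"\b{re.escape(marker)}\b", normalized))
        for marker in HEDGE_MARKERS
    )
    return 100.0 * hits / len(words)


def flag_regression(baseline_out: str, emotion_out: str, min_ratio: float = 0.5) -> bool:
    """True if the EmotionPrompt output is much shorter or hedges much less."""
    shorter = len(emotion_out.split()) < min_ratio * len(baseline_out.split())
    less_hedged = hedge_rate(emotion_out) < min_ratio * hedge_rate(baseline_out)
    return shorter or less_hedged
```

Flagged pairs are candidates for human review, not verdicts: a shorter or more direct answer can also be the better one, depending on the audience.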

