Appending short emotional phrases to prompts measurably improves LLM outputs

July 14, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.4

Citation Count

57

Authors

Cheng Li, Jindong Wang, Yixuan Zhang, Kaijie Zhu, Wenxin Hou, Jianxun Lian, Fang Luo, Qiang Yang, Xing Xie

Links

Abstract / PDF

Why It Matters For Business

A very low-cost prompt change (appending one short emotional sentence) can raise automated and human-rated output quality, including truthfulness and responsibility. This is useful for chat assistants, content generation, and QA systems where marginal gains matter.

Summary TLDR

The authors introduce EmotionPrompt: a simple method that appends short, psychology-inspired emotional phrases to existing prompts. Across 45 automatic tasks (Instruction Induction, BIG-Bench, TruthfulQA) and a 106-person human study on GPT-4, EmotionPrompt raised deterministic benchmark scores (8% relative on Instruction Induction; up to 115% relative on a curated BIG-Bench subset) and improved human-rated generative outputs by ~10.9% on average for performance, truthfulness, and responsibility. Effects vary by model, prompt, temperature, and stimulus type.

Problem Statement

Can large language models detect and benefit from brief emotional cues in prompts? The paper asks whether adding small, psychology-based emotional stimuli to prompts can change LLM outputs and improve performance, truthfulness, or responsibility across automatic benchmarks and human-evaluated generative tasks.

Main Contribution

Propose EmotionPrompt: a set of 11 short, psychology-inspired emotional phrases, each of which can be appended to an existing prompt to nudge LLM responses.

Measure effects across 6 LLMs on 45 deterministic and generative tasks (Instruction Induction, BIG-Bench, TruthfulQA) plus a 106-person human evaluation on GPT-4.

Analyze why EmotionPrompt helps via input-attention gradients, ablations on stimulus combinations, model size, and temperature.
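The core mechanic is plain string concatenation, so a minimal sketch may help make it concrete. The helper name `emotion_prompt` and the `STIMULI` dictionary are hypothetical, and the phrase strings are illustrative placeholders rather than a verified copy of the paper's stimulus list.

```python
# Minimal sketch: append one emotional stimulus to an existing prompt.
# `emotion_prompt` and `STIMULI` are hypothetical names for illustration;
# the phrases are placeholders, not the paper's verified stimulus list.

STIMULI = {
    "career": "This is very important to my career.",
    "sure": "You'd better be sure.",
}


def emotion_prompt(prompt: str, stimulus: str) -> str:
    """Return the original prompt followed by a single emotional phrase."""
    base = prompt.rstrip()
    extra = stimulus.strip()
    # Leave the prompt untouched when no stimulus is supplied (baseline case).
    return f"{base} {extra}" if extra else base


if __name__ == "__main__":
    original = "Determine whether the movie review is positive or negative."
    print(emotion_prompt(original, STIMULI["career"]))
```

Keeping the baseline path (empty stimulus) in the same function makes it easy to run the vanilla and emotional variants through identical code when comparing them.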

Key Findings

EmotionPrompt raised average deterministic benchmark scores on Instruction Induction.

Numbers: 8.00% relative improvement on Instruction Induction (Table 1)

Large relative gains were measured on a curated BIG-Bench subset.

Numbers: 115% relative improvement reported on curated BIG-Bench (average across stimuli/max selection)

Human-rated generative quality improved with emotional stimuli.

Numbers: 10.9% average improvement across performance, truthfulness, responsibility (106 participants)

Truthfulness and informativeness on TruthfulQA improved on the evaluated models.

Numbers: Avg +19% truthfulness and +12% informativeness on TruthfulQA (ChatGPT, Vicuna-13b, Flan-T5-Large)

Effect size depends on model size, training, temperature, and stimulus choice.

Numbers: Relative gains are larger for some models (e.g., Vicuna 9.58 vs Flan-T5-Large 0.28) and grow with temperature (Section 3.4)

EmotionPrompt can make outputs more assertive and sometimes less cautious.

Numbers: Two documented failure cases where EmotionPrompt used more definitive language or shorter outputs (Tables 19–20)

Results

Instruction Induction relative improvement

Value: 8.00% relative

Baseline: original prompts

BIG-Bench relative improvement

Value: 115% relative (reported)

Baseline: original zero-shot prompts

Human study average gain

Value: 10.9% average improvement

Baseline: vanilla prompts with GPT-4

TruthfulQA truthfulness / informativeness

Value: +19% truthfulness, +12% informativeness (avg)

Baseline: original prompts / Zero-shot-CoT

Stimulus variability

Value: Different stimuli perform best per benchmark (EP02 best on Instruction Induction; EP06 best on BIG-Bench)
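Because the headline figures above are relative rather than absolute gains, a small worked sketch of the arithmetic may help when reading them. The function name is hypothetical, and the scores in the example are made-up inputs chosen only to reproduce the percentages; they are not the paper's raw benchmark scores.

```python
# Sketch of relative improvement: the gain expressed as a share of the baseline.
# All scores below are hypothetical inputs, not values taken from the paper.


def relative_improvement(baseline: float, treated: float) -> float:
    """Return the percentage change of `treated` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("relative improvement is undefined for a zero baseline")
    return (treated - baseline) / baseline * 100.0


if __name__ == "__main__":
    # A 4-point absolute gain on a 50-point baseline is about 8% relative.
    print(round(relative_improvement(50.0, 54.0), 2))
    # A low baseline inflates relative gains: 20 -> 43 is about 115% relative.
    print(round(relative_improvement(20.0, 43.0), 2))
```

The second case shows why a large relative figure can coexist with a modest absolute score: when the baseline is low, even a moderate absolute gain produces a three-digit relative number.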

Who Should Care

What To Try In 7 Days

A/B test 3–5 short emotional phrases appended to your prompts on a dev set.

Measure changes with automated metrics (accuracy, TruthfulQA) and a small human panel for judgment.

Tune sampling temperature and pick the best stimulus per task; prefer few-shot + EmotionPrompt if possible.
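The steps above can be sketched as a small A/B harness. This is an illustration under stated assumptions, not the authors' evaluation code: `model` stands for any callable that maps a prompt string to a response string (for example a thin wrapper around whatever LLM client is already in use), scoring is case-insensitive exact match, and the function names are hypothetical.

```python
# Sketch of an A/B test over emotional stimuli on a small dev set.
# Assumptions: `model` is any prompt -> response callable supplied by the
# caller, and accuracy is exact match after trimming and lowercasing.
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (prompt, expected answer)


def accuracy(model: Callable[[str], str], examples: List[Example],
             stimulus: str = "") -> float:
    """Score one prompt variant; an empty stimulus is the baseline."""
    if not examples:
        raise ValueError("dev set must not be empty")
    correct = 0
    for prompt, expected in examples:
        full_prompt = f"{prompt} {stimulus}".strip()
        if model(full_prompt).strip().lower() == expected.strip().lower():
            correct += 1
    return correct / len(examples)


def ab_test(model: Callable[[str], str], examples: List[Example],
            stimuli: Dict[str, str]) -> Dict[str, float]:
    """Return accuracy for the baseline and for each named stimulus."""
    scores = {"baseline": accuracy(model, examples)}
    for name, phrase in stimuli.items():
        scores[name] = accuracy(model, examples, phrase)
    return scores


def best_variant(scores: Dict[str, float]) -> str:
    """Pick the variant name with the highest score."""
    return max(scores, key=scores.get)
```

Running `ab_test` once per task, and once per sampling temperature if the wrapped model exposes one, gives a per-task table from which `best_variant` can pick the stimulus to ship. A small human panel would still be needed for the generative qualities that exact match cannot capture.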

Reproducibility

Data Urls

  • Instruction Induction (public)
  • BIG-Bench (public subset)
  • TruthfulQA (public)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Effects are task- and stimulus-dependent; best stimulus differs by benchmark and task.
  • Human study population skewed (mostly students); external validity may be limited.
  • Some failure cases show increased assertiveness or shorter outputs; not always desirable.
  • Proprietary models (ChatGPT, GPT-4) limit reproducibility and full inspection of mechanisms.

When Not To Use

  • High-stakes or safety-critical outputs where cautious, hedged language is required.
  • When baseline model already has very high performance and gains are negligible.
  • When strict, verifiable factual conservatism is required without risk of added assertiveness.

Failure Modes

  • Produces more definitive and overconfident phrasing in some cases (Tables 19–20).
  • May shorten responses or drop nuanced hedging, reducing acceptability for some audiences.
  • Combining many stimuli can plateau or reduce gains if single stimulus already works well.
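One way to watch for the first two failure modes is a crude regression check that compares a baseline response with its EmotionPrompt counterpart. The sketch below is a heuristic only: the `HEDGES` word list, the 0.5 length threshold, and the function names are all assumptions chosen for illustration, and none of them come from the paper.

```python
# Heuristic sketch: flag responses that lost hedging or became much shorter.
# HEDGES, the 0.5 threshold, and the function names are illustrative
# assumptions, not part of the paper's methodology.
import re
from typing import List

HEDGES = ("may", "might", "could", "possibly", "likely", "it depends")


def hedge_count(text: str) -> int:
    """Count whole-word occurrences of the hedge phrases in `text`."""
    lowered = text.lower()
    return sum(len(re.findall(rf"\b{re.escape(h)}\b", lowered)) for h in HEDGES)


def regression_flags(baseline: str, treated: str,
                     min_length_ratio: float = 0.5) -> List[str]:
    """Return warning labels for the treated response relative to baseline."""
    flags = []
    if hedge_count(treated) < hedge_count(baseline):
        flags.append("fewer_hedges")
    base_words = len(baseline.split())
    if base_words and len(treated.split()) / base_words < min_length_ratio:
        flags.append("much_shorter")
    return flags
```

Flagged pairs are candidates for human review rather than automatic rejection, since a shorter or more direct answer is not always a worse one.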

Core Entities

Models

  • Flan-T5-Large
  • Vicuna
  • Llama 2
  • BLOOM
  • ChatGPT (gpt-3.5-turbo)
  • GPT-4

Metrics

  • Accuracy
  • normalized preferred metric (BIG-Bench)
  • truthfulness (% True)
  • informativeness (% Info)
  • human-rated performance/truthfulness/responsibility (1–5)

Datasets

  • Instruction Induction
  • BIG-Bench (curated subset)
  • TruthfulQA
  • CValues

Benchmarks

  • Instruction Induction
  • BIG-Bench
  • TruthfulQA