Appending short emotional phrases to prompts measurably improves LLM outputs

Overview

Decision Snapshot: Needs Validation

The method is simple and cheap to try. Evidence includes automated benchmarks and a sizable human study, but results vary by model, task, and stimulus, so validate per deployment.

Citations: 57

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 50%

Authors

Cheng Li, Jindong Wang, Yixuan Zhang, Kaijie Zhu, Wenxin Hou, Jianxun Lian, Fang Luo, Qiang Yang, Xing Xie

Links

Abstract / PDF / Data

Why It Matters For Business

A very low-cost prompt change (adding one short emotional sentence) can raise automated and human-perceived output quality, reduce hallucinations, and improve responsibility. This is useful for chat assistants, content generation, and QA systems where marginal gains matter.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

The authors introduce EmotionPrompt: a simple method that appends short, psychology-inspired emotional phrases to existing prompts. Across 45 automatic tasks (Instruction Induction, BIG-Bench, TruthfulQA) and a 106-person human study on GPT-4, EmotionPrompt raised deterministic benchmark scores (8% relative on Instruction Induction; up to 115% relative on a curated BIG-Bench subset) and improved human-rated generative outputs by ~10.9% on average for performance, truthfulness, and responsibility. Effects vary by model, prompt, temperature, and stimulus type.

Problem Statement

Can large language models detect and benefit from brief emotional cues in prompts? The paper asks whether adding small, psychology-based emotional stimuli to prompts can change LLM outputs and improve performance, truthfulness, or responsibility across automatic benchmarks and human-evaluated generative tasks.

Main Contribution

Propose EmotionPrompt: append one of 11 short, psychology-inspired emotional phrases to an existing prompt to nudge LLM responses.

Measure effects across 6 LLMs on 45 deterministic and generative tasks (Instruction Induction, BIG-Bench, TruthfulQA) plus a 106-person human evaluation on GPT-4.
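The append step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code (the summary lists code as unavailable): the function name `apply_emotion_prompt`, the example task instruction, and the stimulus sentence are all assumptions chosen for illustration, and the stimulus is not confirmed here as one of the paper's 11 phrases.

```python
def apply_emotion_prompt(prompt: str, stimulus: str) -> str:
    """Append one short emotional stimulus sentence to an existing prompt."""
    base = prompt.rstrip()
    # Join with a single space so the stimulus reads as a trailing sentence.
    return f"{base} {stimulus.strip()}"


# Hypothetical stimulus, used only for illustration.
EXAMPLE_STIMULUS = "This is very important to my career."

augmented = apply_emotion_prompt(
    "Determine whether the movie review is positive or negative.",
    EXAMPLE_STIMULUS,
)
print(augmented)
```

The original prompt is left untouched apart from the appended sentence, which is what makes the method easy to A/B test against an unmodified baseline.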

Key Findings

EmotionPrompt raised average deterministic benchmark scores on Instruction Induction.

Numbers: 8.00% relative improvement on Instruction Induction (Table 1)

Practical Use: Add short emotional phrases to prompts to get small but consistent accuracy gains on instruction-following tasks without changing model or data.

Evidence Ref: Abstract; Table 1

Large relative gains were measured on a curated BIG-Bench subset.

Numbers: 115% relative improvement reported on curated BIG-Bench (average across stimuli/max selection)

Practical Use: For challenging or low-baseline multiple-choice tasks, testing EmotionPrompt (and picking the best stimulus) can yield large proportional gains; expect variance across tasks.

Evidence Ref: Abstract; Table 1 (BIG-Bench rows)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Instruction Induction relative improvement	8.00% relative	original prompts	8.00%	Instruction Induction (24 tasks)	Mean across models, Table 1	Table 1; Section 2.2
BIG-Bench relative improvement	115% relative (reported)	original zero-shot prompts	115%	BIG-Bench (21 tasks, zero-shot)	Reported relative improvement in Abstract and Table 1; +Ours (max) rows	Abstract; Table 1
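
The "relative" figures in the table are presumably computed against the original-prompt score in the usual way. The summary does not spell out the formula, so the definition below is an assumption, and the symbols $s_{\text{base}}$ (score with the original prompt) and $s_{\text{EP}}$ (score with EmotionPrompt) are introduced here only for illustration.

```latex
\[
\Delta_{\text{rel}} \;=\; \frac{s_{\text{EP}} - s_{\text{base}}}{s_{\text{base}}} \times 100\%
\]
% Hypothetical numbers, not taken from the paper:
% s_base = 0.50, s_EP = 0.54  =>  (0.54 - 0.50) / 0.50 * 100% = 8%
```

Under this definition, a 115% relative improvement means $s_{\text{EP}} = 2.15\, s_{\text{base}}$, so a low baseline can produce a large relative gain from a modest absolute change.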

What To Try In 7 Days

A/B test 3–5 short emotional phrases appended to your prompts on a dev set.

Measure changes with automated metrics (accuracy, TruthfulQA) and a small human panel for judgment.

Tune sampling temperature and pick the best stimulus per task; prefer few-shot + EmotionPrompt if possible.
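
The A/B test suggested above could be organised roughly as follows. This is a sketch under stated assumptions: `pick_best_stimulus`, `toy_model`, the prompt template, the dev set, and the stimulus sentences are all hypothetical, and `toy_model` is a deterministic stand-in for a real LLM call that ignores the stimulus entirely, so it demonstrates the harness rather than any effect.

```python
from typing import Callable, Dict, List, Tuple


def pick_best_stimulus(
    model: Callable[[str], str],
    instruction: str,
    stimuli: List[str],
    dev_set: List[Tuple[str, str]],
) -> Tuple[str, Dict[str, float]]:
    """Score the baseline and each stimulus by exact-match accuracy on a dev set."""
    scores: Dict[str, float] = {}
    # The empty string stands for the unmodified baseline prompt.
    for stimulus in [""] + stimuli:
        full_instruction = f"{instruction} {stimulus}".strip()
        correct = sum(
            model(f"{full_instruction}\nInput: {x}\nOutput:").strip() == y
            for x, y in dev_set
        )
        scores[stimulus or "<baseline>"] = correct / len(dev_set)
    # max() returns the first maximal key, so ties favour the baseline.
    best = max(scores, key=scores.get)
    return best, scores


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: keyword rule on the input line only."""
    text = prompt.split("Input: ", 1)[1].split("\n", 1)[0]
    return "positive" if "good" in text else "negative"


dev_set = [("a good film", "positive"), ("a dull film", "negative")]
stimuli = ["This is very important to my career.", "Are you sure?"]
best, scores = pick_best_stimulus(
    toy_model, "Classify the sentiment as positive or negative.", stimuli, dev_set
)
print(best, scores)
```

Keeping the baseline in the same loop and breaking ties in its favour means a stimulus is adopted only when it actually beats the unmodified prompt on the dev set. Temperature sweeps and human ratings would sit on top of this loop and are not shown.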

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

Instruction Induction (public), BIG-Bench (public subset), TruthfulQA (public)

Risks & Boundaries

Limitations

Effects are task- and stimulus-dependent; best stimulus differs by benchmark and task.

The human study population is skewed (mostly students); external validity may be limited.

When Not To Use

High-stakes or safety-critical outputs where cautious, hedged language is required.

When the baseline model already performs very well and gains are negligible.

Failure Modes

Produces more definitive and overconfident phrasing in some cases (Tables 19–20).

May shorten responses or drop nuanced hedging, reducing acceptability for some audiences.

Core Entities

Models

Flan-T5-Large, Vicuna, Llama 2, BLOOM, ChatGPT (gpt-3.5-turbo), GPT-4

Metrics

Accuracy, normalized preferred metric (BIG-Bench), truthfulness (% True), informativeness (% Info), human-rated performance/truthfulness/responsibility (1–5)

Datasets

Instruction Induction, BIG-Bench (curated subset), TruthfulQA, CValues

Benchmarks

Instruction Induction, BIG-Bench, TruthfulQA

