Appending short emotional phrases to prompts measurably improves LLM outputs

July 14, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The method is simple and cheap to try. Evidence includes automated benchmarks and a sizable human study, but results vary by model, task, and stimulus, so validate per deployment.

Citations: 57

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 50%

Authors

Cheng Li, Jindong Wang, Yixuan Zhang, Kaijie Zhu, Wenxin Hou, Jianxun Lian, Fang Luo, Qiang Yang, Xing Xie

Links

Abstract / PDF / Data

Why It Matters For Business

A very low-cost prompt change (adding one short emotional sentence) can raise automated and human-perceived output quality, reduce hallucinations, and improve responsibility. This is useful for chat assistants, content generation, and QA systems where marginal gains matter.

Who Should Care

Summary TLDR

The authors introduce EmotionPrompt: a simple method that appends short, psychology-inspired emotional phrases to existing prompts. Across 45 automatic tasks (Instruction Induction, BIG-Bench, TruthfulQA) and a 106-person human study on GPT-4, EmotionPrompt raised deterministic benchmark scores (8% relative on Instruction Induction; up to 115% relative on a curated BIG-Bench subset) and improved human-rated generative outputs by ~10.9% on average for performance, truthfulness, and responsibility. Effects vary by model, prompt, temperature, and stimulus type.

Problem Statement

Can large language models detect and benefit from brief emotional cues in prompts? The paper asks whether adding small, psychology-based emotional stimuli to prompts can change LLM outputs and improve performance, truthfulness, or responsibility across automatic benchmarks and human-evaluated generative tasks.

Main Contribution

Propose EmotionPrompt: append one of 11 short, psychology-inspired emotional phrases to an existing prompt to nudge LLM responses.

Measure effects across 6 LLMs on 45 deterministic and generative tasks (Instruction Induction, BIG-Bench, TruthfulQA) plus a 106-person human evaluation on GPT-4.
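The appending step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `emotion_prompt` and the phrases in `EMOTIONAL_STIMULI` are assumptions introduced here, and the phrases are illustrative placeholders that should not be read as the paper's exact list of 11 stimuli.

```python
# Illustrative placeholder phrases (assumption); not the paper's exact stimuli list.
EMOTIONAL_STIMULI = [
    "This is very important to my career.",
    "Take pride in your work and give it your best.",
]


def emotion_prompt(prompt: str, stimulus: str) -> str:
    """Append one emotional stimulus to an existing prompt."""
    base = prompt.rstrip()
    if not base:
        return stimulus
    # Close the original instruction with a period if it lacks end punctuation.
    if base[-1] not in ".?!:":
        base += "."
    return f"{base} {stimulus}"


if __name__ == "__main__":
    original = "Determine whether the movie review is positive or negative"
    print(emotion_prompt(original, EMOTIONAL_STIMULI[0]))
```

The original prompt is left untouched apart from the appended sentence, so the same helper can wrap zero-shot or few-shot prompts without other changes.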

Key Findings

EmotionPrompt raised average deterministic benchmark scores on Instruction Induction.

Numbers: 8.00% relative improvement on Instruction Induction (Table 1)

Practical Use: Add short emotional phrases to prompts to get small but consistent accuracy gains on instruction-following tasks without changing model or data.

Evidence Ref: Abstract; Table 1

Large relative gains were measured on a curated BIG-Bench subset.

Numbers: 115% relative improvement reported on curated BIG-Bench (average across stimuli/max selection)

Practical Use: For challenging or low-baseline multiple-choice tasks, testing EmotionPrompt (and picking the best stimulus) can yield large proportional gains; expect variance across tasks.

Evidence Ref: Abstract; Table 1 (BIG-Bench rows)

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Instruction Induction relative improvement | 8.00% relative | original prompts | 8.00% | Instruction Induction (24 tasks) | Mean across models, Table 1 | Table 1; Section 2.2 |
| BIG-Bench relative improvement | 115% relative (reported) | original zero-shot prompts | 115% | BIG-Bench (21 tasks, zero-shot) | Reported relative improvement in Abstract and Table 1; +Ours (max) rows | Abstract; Table 1 |
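Both headline figures are relative improvements, that is, the change divided by the baseline score. The sketch below shows the arithmetic; the scores are hypothetical values chosen only to reproduce the reported magnitudes and are not taken from the paper's tables.

```python
def relative_improvement(baseline: float, treated: float) -> float:
    """Return the change of `treated` over `baseline` as a percentage of baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (treated - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Hypothetical scores (assumption): a mid-range baseline and a low baseline.
    print(round(relative_improvement(0.50, 0.54), 2))   # 8.0
    print(round(relative_improvement(0.10, 0.215), 2))  # 115.0
```

The second case gains only 11.5 absolute points yet reports 115% relative, which is why low-baseline tasks can show very large proportional gains.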

What To Try In 7 Days

A/B test 3–5 short emotional phrases appended to your prompts on a dev set.

Measure changes with automated metrics (accuracy, TruthfulQA) and a small human panel for judgment.

Tune sampling temperature and pick the best stimulus per task; prefer few-shot + EmotionPrompt if possible.
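The A/B test above could be organized roughly as follows. This is a sketch under stated assumptions: `accuracy`, `best_stimulus`, and `toy_model` are hypothetical names introduced here, exact-match scoring stands in for whatever metric your task uses, and `toy_model` is a stub that exists only to exercise the harness (replace it with a real model call). Temperature tuning and few-shot prompting are omitted for brevity.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (prompt, expected answer)


def accuracy(model: Callable[[str], str], examples: List[Example], suffix: str = "") -> float:
    """Exact-match accuracy of `model` on `examples`, with an optional appended suffix."""
    correct = 0
    for prompt, expected in examples:
        full_prompt = f"{prompt} {suffix}".strip()
        if model(full_prompt).strip().lower() == expected.strip().lower():
            correct += 1
    return correct / len(examples)


def best_stimulus(
    model: Callable[[str], str], examples: List[Example], stimuli: List[str]
) -> Tuple[str, Dict[str, float]]:
    """Score the baseline (empty suffix) and each stimulus; return the best suffix and all scores."""
    scores = {"": accuracy(model, examples)}
    for stimulus in stimuli:
        scores[stimulus] = accuracy(model, examples, stimulus)
    best = max(scores, key=scores.get)
    return best, scores


def toy_model(prompt: str) -> str:
    """Stub model (assumption): answers 'positive' only when the word 'important' appears."""
    return "positive" if "important" in prompt else "negative"


if __name__ == "__main__":
    dev_set = [("Review: great film.", "positive"), ("Review: loved it.", "positive")]
    stimuli = ["This is very important to my career.", "Are you sure?"]
    best, scores = best_stimulus(toy_model, dev_set, stimuli)
    print(best, scores)
```

Because the winning stimulus is selected on the dev set, its score there is optimistic; confirm the gain on a held-out set before deploying.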

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Instruction Induction (public), BIG-Bench (public subset), TruthfulQA (public)

Risks & Boundaries

Limitations

Effects are task- and stimulus-dependent; best stimulus differs by benchmark and task.

Human study population skewed (mostly students); external validity may be limited.

When Not To Use

High-stakes or safety-critical outputs where cautious, hedged language is required.

When baseline model already has very high performance and gains are negligible.

Failure Modes

Produces more definitive and overconfident phrasing in some cases (Tables 19–20).

May shorten responses or drop nuanced hedging, reducing acceptability for some audiences.
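One rough way to watch for these failure modes is to compare response length and hedging frequency between baseline and EmotionPrompt outputs. The sketch below is a crude heuristic, not a method from the paper: the `HEDGE_TERMS` word list and the function names are assumptions chosen for illustration, and a word count cannot capture nuance the way a human review can.

```python
import re
from typing import Dict

# Hypothetical hedge-word list (assumption); extend it for your domain.
HEDGE_TERMS = frozenset(
    {"may", "might", "could", "possibly", "likely", "appears", "suggests", "uncertain"}
)


def hedge_rate(text: str) -> float:
    """Fraction of words in `text` that are hedge terms."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(word in HEDGE_TERMS for word in words) / len(words)


def compare_outputs(baseline_out: str, emotion_out: str) -> Dict[str, float]:
    """Report how length and hedging changed after adding an emotional stimulus."""
    baseline_len = max(len(baseline_out.split()), 1)
    return {
        "length_ratio": len(emotion_out.split()) / baseline_len,
        "hedge_rate_delta": hedge_rate(emotion_out) - hedge_rate(baseline_out),
    }


if __name__ == "__main__":
    print(compare_outputs("This may possibly work.", "This works."))
```

A length ratio well below 1 or a clearly negative hedge-rate delta on a sample of responses is a signal to route those outputs to human review.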

Core Entities

Models

Flan-T5-Large, Vicuna, Llama 2, BLOOM, ChatGPT (gpt-3.5-turbo), GPT-4

Metrics

Accuracy; normalized preferred metric (BIG-Bench); truthfulness (% True); informativeness (% Info); human-rated performance/truthfulness/responsibility (1–5)

Datasets

Instruction Induction, BIG-Bench (curated subset), TruthfulQA, CValues

Benchmarks

Instruction Induction, BIG-Bench, TruthfulQA