Make LLMs more creative by running multi‑round role‑played discussions instead of single prompts

May 10, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.35

Citation Count

6

Authors

Li-Chun Lu, Shou-Jen Chen, Tsung-Min Pai, Chan-Hung Yu, Hung-yi Lee, Shao-Hua Sun

Why It Matters For Business

A structured multi‑agent, role‑played discussion can produce noticeably more original and detailed ideas than single prompts. That makes it useful for ideation, product concepts, and creative marketing at modest engineering cost.

Summary TLDR

The authors introduce LLM Discussion: a three‑phase multi‑agent framework (initiation, discussion, convergence) that assigns distinct roles to LLMs. Evaluated on four creativity tests (AUT, Instances, Similarities, Scientific) with both LLM and human scoring, the method raises originality and elaboration scores versus single‑agent and debate baselines. Best settings: four agents, five rounds, role prompts + 3‑phase flow. The code is released on GitHub.

Problem Statement

LLMs often give safe, homogeneous answers on open‑ended creative tasks. Can structured multi‑agent discussion plus role‑play push models to generate more novel, detailed ideas?

Main Contribution

LLM Discussion: a 3‑phase (initiation, discussion, convergence) multi‑LLM procedure that forces agents to build on each other's ideas.

Role‑play mechanism: automatically generated role prompts (e.g., Futurist, Environmentalist) to diversify agent viewpoints.

Benchmarked creativity (AUT, Instances, Similarities, Scientific) with LLM and human evaluation; ablations on rounds, agent count, prompts and role use.

Empirical finding: role + 3‑phase discussion improves Originality and Elaboration vs single agent and existing multi‑LLM baselines.
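The three phases and the role prompts listed above can be sketched as a simple loop. This is a minimal illustration and not the authors' released code: `run_discussion`, `call_llm`, the phase instructions, and the prompt wording are hypothetical names and text chosen for this example. It also assumes that each agent makes one call per round and sees only the other agents' replies from the previous round.

```python
from typing import Callable, Dict, List

# Hypothetical signature: takes a prompt string, returns the model's reply.
LLMFn = Callable[[str], str]


def run_discussion(task: str, roles: Dict[str, str], call_llm: LLMFn,
                   rounds: int = 5) -> Dict[str, List[str]]:
    """Run initiation, discussion, and convergence rounds.

    Returns every agent's replies, one per round, keyed by role name.
    """
    if rounds < 2:
        raise ValueError("need at least an initiation and a convergence round")
    history: Dict[str, List[str]] = {name: [] for name in roles}
    for r in range(rounds):
        if r == 0:
            phase = "Initiation: propose your own ideas."
        elif r == rounds - 1:
            phase = "Convergence: give your final list of ideas."
        else:
            phase = "Discussion: build on the other agents' ideas."
        # Snapshot first, so all agents in a round see the same previous round.
        previous = {n: replies[-1] for n, replies in history.items() if replies}
        for name, persona in roles.items():
            # Role declaration: each quoted reply is tagged with its speaker.
            others = "\n".join(
                f"[{n}] {text}" for n, text in previous.items() if n != name
            )
            prompt = (
                f"You are {name}: {persona}\n"
                f"Task: {task}\n{phase}\n"
                f"Other agents said:\n{others or '(nothing yet)'}"
            )
            history[name].append(call_llm(prompt))
    return history
```

Passing the model call in as a function keeps the loop independent of any particular API client, so it can be exercised with a stub before any paid calls are made.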

Key Findings

LLM Discussion increases originality on AUT compared to single‑agent baseline

Numbers: Originality mean 4.44 vs 3.47 (LLM eval, AUT, Table 2)

LLM Discussion increases elaboration on AUT compared to single‑agent baseline

Numbers: Elaboration mean 4.22 vs 3.08 (LLM eval, AUT, Table 2)

Human evaluations align with LLM evaluators for originality

Numbers: Kendall's τ = 0.5213 (LLM vs human average, Table 4)

Best configuration found: four agents and five rounds

Numbers: Four agents + five rounds gave top overall performance in ablations (Section 4.5; Figures 5–6)

Both role‑play and 3‑phase discussion add value; combining them performs best

Numbers: Role w/o discussion and discussion w/o role both improve over baselines; combined method outperforms both on most tasks.

Results

AUT Originality (LLM evaluation)

Value: LLM Discussion mean 4.44 vs Single Agent 3.47

Baseline: Single Agent (zero-shot)

AUT Elaboration (LLM evaluation)

Value: LLM Discussion mean 4.22 vs Single Agent 3.08

Baseline: Single Agent (zero-shot)

Human evaluation (AUT) Originality

Value: LLM Discussion mean 3.84 vs Single Agent 2.50

Baseline: Single Agent (human eval)

LLM–Human agreement (Originality)

Value: Kendall's τ = 0.5213

Best config (agents & rounds)

Value: 4 agents, 5 discussion rounds

Baseline: varied ablations
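The LLM–human agreement figure above is a rank correlation. For readers who want to run the same kind of check on their own ratings, here is a small pure-Python sketch. The source does not say which variant of Kendall's τ was used; this example assumes τ-b, the tie-corrected form, because rating scales produce many ties. The function name `kendall_tau_b` is chosen for this example.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b between two equal-length score lists (O(n^2) pairs)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from every term
        elif dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1  # both lists order this pair the same way
        else:
            discordant += 1  # the lists disagree on this pair
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    if denom == 0:
        raise ValueError("tau is undefined when one sequence is constant")
    return (concordant - discordant) / denom
```

Call it with the LLM scores and the averaged human scores for the same responses, in the same order. A value of 1 means identical rankings, -1 means reversed rankings, and values near 0 mean no rank agreement.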

What To Try In 7 Days

Prototype a 4‑agent discussion (different role prompts) using gpt‑3.5/gpt‑4 on 5 creative prompts.

Use 5 rounds: initiation, 3 discussion rounds, convergence; compare originality/elaboration vs single prompts.

Automate role generation (few role types) and keep roles fixed per run to diversify outputs.
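For the comparison step in the checklist above, a small helper can report the gap in mean score per metric between the two conditions. This is only a sketch: `compare_conditions` and the `Ratings` layout are hypothetical, and the scores themselves are assumed to come from whatever scorer (LLM or human) the prototype uses.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical layout: metric name -> one score per generated response.
Ratings = Dict[str, List[float]]


def compare_conditions(single: Ratings, discussion: Ratings) -> Dict[str, float]:
    """Return mean(discussion) - mean(single) for each metric in both inputs."""
    shared = sorted(set(single) & set(discussion))
    return {
        metric: round(mean(discussion[metric]) - mean(single[metric]), 2)
        for metric in shared
    }
```

As a sanity check, feeding in the reported AUT means from the LLM evaluation as one-element lists (3.47 and 3.08 for the single agent, 4.44 and 4.22 for LLM Discussion) gives gaps of 0.97 for originality and 1.14 for elaboration.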

Agent Features

Memory

  • short-term conversational context passed between rounds
  • no long-term retrieval memory reported

Planning

  • 3-phase planning (initiation, discussion, convergence)
  • multi-round iterative updates

Tool Use

  • role prompts for persona simulation
  • GPT-4 used to auto‑generate role descriptions

Frameworks

  • LLM Discussion

Is Agentic

true

Architectures

  • multi-agent LLM discussion

Collaboration

  • agents read other agents' previous outputs and build on them
  • role declaration to make speaker identity explicit

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Higher API cost and latency from running multiple agents and rounds.
  • Creativity gains depend on prompt/role quality; role set here is not claimed optimal.
  • Evaluation of 'creativity' remains partly subjective; human and LLM scores diverge on elaboration.
  • Augmented tasks were generated by GPT‑4; potential benchmark bias from synthetic tasks.
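The cost limitation can be made concrete with a rough count of API calls. The sketch below rests on two assumptions that are not stated in the source: every agent makes exactly one call per round, and in each round after the first it reads one reply from each of the other agents. `discussion_cost` is a hypothetical helper name.

```python
from typing import Dict


def discussion_cost(agents: int, rounds: int) -> Dict[str, int]:
    """Estimate call volume for a multi-agent discussion run.

    Assumes one call per agent per round, and that each agent reads the
    other agents' previous-round replies in every round after the first.
    """
    if agents < 1 or rounds < 1:
        raise ValueError("agents and rounds must both be at least 1")
    return {
        "calls": agents * rounds,
        "peer_replies_read": agents * (agents - 1) * (rounds - 1),
    }
```

Under these assumptions the best configuration reported here (4 agents, 5 rounds) needs 20 calls where a single prompt needs 1, and agents read 48 peer replies in total, which is where the extra input tokens and latency come from.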

When Not To Use

  • When the task is closed‑ended or factual—LLM Debate or verifier pipelines are better.
  • When low latency or low cost is required (multi‑agent rounds increase cost).
  • When role diversity cannot be meaningfully specified or validated.

Failure Modes

  • Agents converge to repetitive or verbose ideas if roles or prompts are weak.
  • High temperature produces nonsense; length can inflate human elaboration scores.
  • Role bias can push ideas to unrealistic extremes (e.g., millionaire role suggesting impractical solutions).
  • LLM evaluators and human annotators may weigh concision differently, producing scorer mismatch.

Core Entities

Models

  • gpt-3.5-turbo-0125
  • GPT-4

Metrics

  • Originality
  • Elaboration
  • Fluency
  • Flexibility

Datasets

  • AUT (Alternative Uses Test)
  • INSTANCES (Instances Test)
  • SIMILARITIES (Similarities Test)
  • SCIENTIFIC (Scientific Creativity Test)
  • Augmented task sets generated with GPT-4 (30 tasks per benchmark)

Benchmarks

  • Wallach-Kogan Creativity Tests (AUT, INSTANCES, SIMILARITIES)
  • Scientific Creativity Test (Hu & Adey, 2002)