Make two LLMs argue, judge their claims, and tune debate tone to reduce bias and hallucination

January 19, 2024 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

6

Authors

Edward Y. Chang

Links

Abstract / PDF

Why It Matters For Business

SocraSynth turns LLM outputs from single-shot answers into cross-checked, debate-driven recommendations. The aim is to reduce obvious bias and yield richer, testable proposals, which is useful for policy, diagnostics, and decision support.

Summary TLDR

SocraSynth is a multi-agent platform that runs controlled debates between two LLMs, uses other LLMs as judges (CRIT) to score argument quality, and adjusts a 'contentiousness' knob to move from adversarial to collaborative outputs. The paper reports three case studies (policy debate, medical symptom checking on a 4,921-record Kaggle dataset, and an analysis of contentiousness), shows judges prefer debate outputs to Q&A, and presents practical prompts and pseudocode. The system aims to reduce bias and hallucination by forcing cross-checks and iterative refinement rather than adding external logic components.

Problem Statement

LLMs are powerful but prone to bias, hallucination, and weak reasoning. The paper asks: can structured multi-LLM debates plus an LLM-based evaluation score (CRIT) and a tunable 'contentiousness' parameter produce more reliable, richer outputs than single-model Q&A?

Main Contribution

SocraSynth: a practical multi-LLM debate platform combining a human moderator, two opposing LLM agents, and multiple LLM judges.

Conditional statistics mechanism: each debating LLM conditions responses on the opponent's stance to surface diverse viewpoints.

Contentiousness modulation: a numeric knob (0–1) that steers debate tone from adversarial (high values) to conciliatory (low values), with a scheduling rule that divides the current level by 1.2 each round.

CRIT evaluation: an LLM-driven scoring template that rates argument reasonableness (1–10) and can be applied recursively to supporting sources.
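
The scheduling rule above (divide by 1.2 each round) can be sketched in a few lines. This is an illustrative sketch only: the function name `contentiousness_schedule` and its parameters are assumptions made here, not identifiers from the paper.

```python
def contentiousness_schedule(start: float, rounds: int, decay: float = 1.2) -> list[float]:
    """Return the contentiousness level for each round.

    Assumption: the level is divided by `decay` after every round,
    following the "divide by 1.2 each round" rule in the summary.
    """
    if not 0.0 <= start <= 1.0:
        raise ValueError("start must lie in [0, 1]")
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    if decay <= 1.0:
        raise ValueError("decay must be greater than 1 so the level shrinks")

    levels = []
    level = start
    for _ in range(rounds):
        levels.append(level)
        level /= decay  # soften the tone for the next round
    return levels


levels = contentiousness_schedule(0.9, 4)
print([round(x, 3) for x in levels])  # [0.9, 0.75, 0.625, 0.521]
```

Starting at 0.9, this schedule takes four rounds to drop to roughly 0.5, so reaching a clearly conciliatory level (around 0.3) would need about seven rounds or a manual reset by the moderator.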

Key Findings

Multi-agent debates scored higher than single-model Q&A on judged information quality.

Numbers: Tables 5 & 6. GPT-4 judge totals A=39 vs B=32 (Table 5); role-swapped totals remain comparable

Tuning contentiousness moves outputs from adversarial to cooperative and changes content emphasis.

Numbers: Contentiousness tested at {0.9, 0.7, 0.5, 0.3, 0.0}; scheduling: divide by 1.2 per round

Debate can improve medical triage: LLMs converged from different diagnoses to a specific recommendation and suggested confirmatory tests.

Numbers: Case: Kaggle dataset (4,921 records); example where Bard conceded to GPT-4 after two rounds

Results

Judge total scores (example, Table 5, GPT-4)

Value: Agent A total 39 vs Agent B total 32

Judge total scores (role-swapped, Table 6, combined judges)

Value: Totals roughly tie or favor Agent A across configurations (examples: 38 vs 38, 36 vs 39, see table)

Medical-case convergence

Value: Bard concedes to GPT-4 after two rounds; agents jointly recommend specific tests

Who Should Care

What To Try In 7 Days

Run a two-LLM debate (opposing prompts) on a product policy question and compare outputs to single-shot answers.

Add a small panel of LLM judges (e.g., GPT-4, GPT-3.5, text-davinci-003) to score reasonableness using a CRIT-like checklist.

Experiment with contentiousness: start at 0.9 for adversarial exploration, lower to ~0.3 to produce a joint proposal.
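
The three steps above can be prototyped before wiring in any real model. The following is a minimal sketch under stated assumptions: `run_debate`, `judge_totals`, `make_stub_agent`, and `stub_judge` are hypothetical names introduced here, the agents and judges are plain callables standing in for LLM API calls, and the length-based judge is a placeholder, not the CRIT scoring template. The turn order (Agent A first, then Agent B) is also an assumption.

```python
from typing import Callable

# Hypothetical signature: (topic, opponent's last argument, contentiousness) -> argument
Agent = Callable[[str, str, float], str]
# Hypothetical signature: argument text -> integer score in 1..10
Judge = Callable[[str], int]


def run_debate(topic: str, agent_a: Agent, agent_b: Agent,
               rounds: int = 3, start: float = 0.9, decay: float = 1.2) -> list[dict]:
    """Alternate A and B for `rounds` rounds, lowering contentiousness each round."""
    transcript = []
    level = start
    last_b = ""  # A opens the debate with no opponent argument to rebut
    for round_no in range(1, rounds + 1):
        arg_a = agent_a(topic, last_b, level)
        arg_b = agent_b(topic, arg_a, level)  # B conditions on A's latest argument
        transcript.append({"round": round_no, "contentiousness": level,
                           "A": arg_a, "B": arg_b})
        last_b = arg_b
        level /= decay  # schedule from the summary: divide by 1.2 per round
    return transcript


def judge_totals(transcript: list[dict], judges: list[Judge]) -> dict[str, int]:
    """Sum every judge's score for every argument, per side."""
    totals = {"A": 0, "B": 0}
    for turn in transcript:
        for side in totals:
            totals[side] += sum(judge(turn[side]) for judge in judges)
    return totals


def make_stub_agent(name: str) -> Agent:
    """Placeholder agent; replace the body with a call to a real LLM."""
    def agent(topic: str, opponent_argument: str, contentiousness: float) -> str:
        return (f"[{name} @ {contentiousness:.2f}] position on {topic!r}; "
                f"rebutting {len(opponent_argument)} chars")
    return agent


def stub_judge(argument: str) -> int:
    """Placeholder judge: clamps a length-based number into 1..10 (not CRIT)."""
    return max(1, min(10, len(argument) // 10))


transcript = run_debate("Should feature X ship by default?",
                        make_stub_agent("A"), make_stub_agent("B"))
totals = judge_totals(transcript, [stub_judge, stub_judge])
print(totals)
```

Keeping agents and judges as plain callables makes it straightforward to swap the stubs for real model calls and to run the same topic through a single-shot baseline for comparison.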

Agent Features

Memory

  • Context refinement across debate rounds (iterative exchange)

Planning

  • Debate rounds with iterative refutation
  • Moderator-driven topic decomposition and agenda

Tool Use

  • LLMs as proponent/opponent agents
  • LLMs as judges (CRIT)
  • Contentiousness parameter for tone control

Frameworks

  • SocraSynth
  • CRIT

Is Agentic

true

Architectures

  • Multi-LLM agent ensemble (no new model weights)

Collaboration

  • Human moderator + two LLM agents + LLM judges
  • Joint proposal drafting once contentiousness has been reduced

Optimization Features

Token Efficiency

  • Implicit: iterative rounds grow the context, and no explicit compression is described

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • No public code or reproducible pipeline provided in paper.
  • Evidence comes from three case studies rather than large controlled benchmarks.
  • CRIT judges are LLMs and may carry correlated biases with debating agents.
  • No quantitative ablation isolating contentiousness from other variables.

When Not To Use

  • High-stakes settings requiring formal verification without human oversight.
  • Tasks demanding provable factual accuracy rather than reasonableness.
  • Environments where access to multiple LLMs is cost-prohibitive.

Failure Modes

  • Both agents converge on the same false premise, making mutual checking ineffective.
  • Judge panel shares biases with agents, causing blind spots.
  • Overfitting to debate style instead of factual grounding.
  • Contentiousness tuning produces emotionally charged or misleading language.

Core Entities

Models

  • GPT-4
  • GPT-3.5
  • text-davinci-003
  • Bard

Metrics

  • CRIT score (1-10)
  • Contentiousness level (0.0-1.0)

Datasets

  • Kaggle disease-symptom dataset (4,921 records)

Context Entities

Models

  • GPT-4
  • GPT-3.5
  • text-davinci-003
  • Bard

Metrics

  • Judge totals (per-topic scores, Tables 5–6)

Datasets

  • Kaggle disease-symptom dataset