Make two LLMs argue, judge their claims, and tune debate tone to reduce bias and hallucination

January 19, 2024 | 7 min

Overview

Decision Snapshot: Needs Validation

The idea is practical and moderately validated by case studies, but lacks large-scale benchmarks, public code, and formal ablation studies.

Citations: 6

Evidence Strength: 0.60

Confidence: 0.60

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 1/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Edward Y. Chang


Why It Matters For Business

SocraSynth turns LLM outputs from single-shot answers into cross-checked, debate-driven recommendations. The approach aims to reduce obvious bias and yield richer, testable proposals, which is useful for policy, diagnostics, and decision support.

Who Should Care

Summary TLDR

SocraSynth is a multi-agent platform that runs controlled debates between two LLMs, uses other LLMs as judges (CRIT) to score argument quality, and adjusts a 'contentiousness' knob to move from adversarial to collaborative outputs. The paper reports three case studies (policy debate, medical symptom checking on a 4,921-record Kaggle dataset, and an analysis of contentiousness), shows judges prefer debate outputs to Q&A, and presents practical prompts and pseudocode. The system aims to reduce bias and hallucination by forcing cross-checks and iterative refinement rather than adding external logic components.

Problem Statement

LLMs are powerful but prone to bias, hallucination, and weak reasoning. The paper asks: can structured multi-LLM debates plus an LLM-based evaluation score (CRIT) and a tunable 'contentiousness' parameter produce more reliable, richer outputs than single-model Q&A?

Main Contribution

SocraSynth: a practical multi-LLM debate platform combining a human moderator, two opposing LLM agents, and multiple LLM judges.

Conditional statistics mechanism: each debating LLM conditions responses on the opponent's stance to surface diverse viewpoints.
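The conditioning step described above can be sketched as a simple loop in which each agent's prompt includes the opponent's most recent argument. This is a minimal sketch, not the paper's implementation: the `LLM` callable type, `build_prompt`, `run_debate`, the prompt wording, and the "for"/"against" stance labels are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical type: any function mapping a prompt string to a reply string.
# In practice this would wrap a real LLM API call.
LLM = Callable[[str], str]


def build_prompt(topic: str, stance: str, opponent_last: str,
                 contentiousness: float) -> str:
    """Compose a prompt conditioned on the agent's stance and, when
    available, the opponent's last argument (assumed prompt wording)."""
    rebuttal = ""
    if opponent_last:
        rebuttal = f"Opponent's last argument:\n{opponent_last}\n"
    return (
        f"Topic: {topic}\n"
        f"Your stance: {stance}\n"
        f"Contentiousness (0.0 cooperative, 1.0 adversarial): "
        f"{contentiousness:.2f}\n"
        f"{rebuttal}"
        "Give your strongest arguments and respond to any points raised."
    )


def run_debate(topic: str, agent_a: LLM, agent_b: LLM,
               rounds: int = 3,
               contentiousness: float = 0.9) -> List[Tuple[str, str]]:
    """Alternate turns between two agents; each turn sees the other's last reply."""
    transcript: List[Tuple[str, str]] = []
    last_a, last_b = "", ""
    for _ in range(rounds):
        last_a = agent_a(build_prompt(topic, "for", last_b, contentiousness))
        transcript.append(("A", last_a))
        last_b = agent_b(build_prompt(topic, "against", last_a, contentiousness))
        transcript.append(("B", last_b))
    return transcript
```

Passing the agents in as plain callables keeps the loop independent of any particular model provider, so the same sketch can be exercised with stub functions before real models are connected.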

Key Findings

Multi-agent debates scored higher than single-model Q&A on judged information quality.

Numbers: Tables 5 and 6: GPT-4 judge totals A=39 vs B=32 (Table 5); role-swapped totals remain comparable

Practical Use: Use a two-agent debate plus LLM judges to get deeper, more balanced answers than a single LLM Q&A when you need richer justification.

Evidence Ref: Tables 5 and 6; Section 3.1.4

Tuning contentiousness moves outputs from adversarial to cooperative and changes content emphasis.

Numbers: Contentiousness tested at {0.9, 0.7, 0.5, 0.3, 0.0}; scheduling: divide by 1.2 per round

Practical Use: Adjust contentiousness (start high to surface trade-offs, lower to get consensus) to control tone and breadth of perspectives.

Evidence Ref: Table 1; SocraSynth algorithm in Sections 2.1.3 and 3.3
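The "divide by 1.2 per round" schedule can be worked through numerically. The sketch below assumes a starting level of 0.9 (the highest tested value) and five rounds; neither the starting point nor the round count is confirmed by this summary, and the function name `contentiousness_schedule` is hypothetical.

```python
from typing import List


def contentiousness_schedule(start: float = 0.9, divisor: float = 1.2,
                             rounds: int = 5) -> List[float]:
    """Return the contentiousness level used in each round, dividing
    the level by `divisor` after every round."""
    if divisor <= 1.0:
        raise ValueError("divisor must be > 1.0 for the level to decrease")
    levels: List[float] = []
    level = start
    for _ in range(rounds):
        levels.append(round(level, 3))  # rounded for display only
        level /= divisor
    return levels


print(contentiousness_schedule())  # [0.9, 0.75, 0.625, 0.521, 0.434]
```

Under these assumptions the level decays geometrically and never reaches 0.0, so the fully cooperative setting listed among the tested values would have to be set explicitly rather than reached by the schedule.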

Results

Result 1
Metric: Judge total scores (example, Table 5, GPT-4)
Value: Agent A total 39 vs Agent B total 32
Baseline: not given
Delta: A +7
Split / Dataset: Policy debate (Section 3.1)
Evidence: Table 5: GPT-4 columns; Agent A wins in that configuration
Evidence Ref: Table 5

Result 2
Metric: Judge total scores (role-swapped, Table 6, combined judges)
Value: Totals roughly tie or favor Agent A across configurations (examples: 38 vs 38, 36 vs 39, see table)
Baseline: not given
Delta: not given
Split / Dataset: Policy debate role-reversal (Section 3.1.3)
Evidence: Table 6 shows mixed totals but judges still find debate informative
Evidence Ref: Table 6

What To Try In 7 Days

Run a two-LLM debate (opposing prompts) on a product policy question and compare outputs to single-shot answers.

Add a small panel of LLM judges (e.g., GPT-4, GPT-3.5, text-davinci-003) to score reasonableness using a CRIT-like checklist.

Experiment with contentiousness: start at 0.9 for adversarial exploration, lower to ~0.3 to produce a joint proposal.
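The judge-panel step above can be prototyped before any model is connected. The sketch below is an assumption-laden illustration, not the paper's CRIT procedure: `Judge`, `panel_scores`, and `compare_outputs` are hypothetical names, each judge is assumed to return an integer on the 1-10 scale, and averaging across judges is one possible aggregation choice.

```python
from statistics import mean
from typing import Callable, Dict

# Hypothetical type: a judge takes a text and returns a 1-10 score.
# A real judge would wrap an LLM call with a CRIT-like checklist prompt.
Judge = Callable[[str], int]


def panel_scores(text: str, judges: Dict[str, Judge]) -> Dict[str, int]:
    """Collect one score per judge, rejecting out-of-range values."""
    scores: Dict[str, int] = {}
    for name, judge in judges.items():
        score = judge(text)
        if not 1 <= score <= 10:
            raise ValueError(f"{name} returned out-of-range score {score}")
        scores[name] = score
    return scores


def compare_outputs(debate_output: str, single_shot_output: str,
                    judges: Dict[str, Judge]) -> Dict[str, float]:
    """Compare mean panel scores for a debate output and a single-shot answer."""
    debate_mean = mean(panel_scores(debate_output, judges).values())
    single_mean = mean(panel_scores(single_shot_output, judges).values())
    return {
        "debate_mean": debate_mean,
        "single_mean": single_mean,
        "delta": debate_mean - single_mean,
    }
```

Keeping the per-judge scores in a dictionary makes it easy to inspect disagreement between judges, which matters because a panel that shares biases with the agents can hide blind spots.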

Agent Features

Memory
Context refinement across debate rounds (iterative exchange)
Planning
Debate rounds with iterative refutation; moderator-driven topic decomposition and agenda
Tool Use
LLMs as proponent/opponent agents; LLMs as judges (CRIT); contentiousness parameter for tone control
Frameworks
SocraSynth, CRIT
Is Agentic

Yes

Architectures
Multi-LLM agent ensemble (no new model weights)
Collaboration
Human moderator + two LLM agents + LLM judges; joint proposal drafting after contention is reduced

Optimization Features

Token Efficiency
Implicit: iterative rounds increase context but no explicit compression

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

No public code or reproducible pipeline provided in paper.

Evidence comes from three case studies rather than large controlled benchmarks.

When Not To Use

High-stakes settings requiring formal verification without human oversight.

Tasks demanding provable factual accuracy rather than reasonableness.

Failure Modes

Both agents converge on the same false premise, making mutual checking ineffective.

Judge panel shares biases with agents, causing blind spots.

Core Entities

Models

GPT-4, GPT-3.5, text-davinci-003, Bard

Metrics

CRIT score (1-10); contentiousness level (0.0-1.0)

Datasets

Kaggle disease-symptom dataset (4,921 records)

Context Entities

Models

GPT-4, GPT-3.5, text-davinci-003, Bard

Metrics

Judge totals (per-topic scores, Tables 5–6)

Datasets

Kaggle disease-symptom dataset