Make two LLMs argue, judge their claims, and tune debate tone to reduce bias and hallucination

Overview

Decision Snapshot: Needs Validation

The idea is practical and moderately validated by case studies, but lacks large-scale benchmarks, public code, and formal ablation studies.

Citations: 6

Evidence Strength: 0.60

Confidence: 0.60

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 1/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Edward Y. Chang


Why It Matters For Business

SocraSynth turns LLM outputs from single-shot answers into cross-checked, debate-driven recommendations. This reduces obvious bias and yields richer, testable proposals, which is useful for policy, diagnostics, and decision support.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Founder, Engineering Lead

Summary TLDR

SocraSynth is a multi-agent platform that runs controlled debates between two LLMs, uses other LLMs as judges (CRIT) to score argument quality, and adjusts a 'contentiousness' knob to move from adversarial to collaborative outputs. The paper reports three case studies (policy debate, medical symptom checking on a 4,921-record Kaggle dataset, and an analysis of contentiousness), shows judges prefer debate outputs to Q&A, and presents practical prompts and pseudocode. The system aims to reduce bias and hallucination by forcing cross-checks and iterative refinement rather than adding external logic components.

Problem Statement

LLMs are powerful but prone to bias, hallucination, and weak reasoning. The paper asks: can structured multi-LLM debates plus an LLM-based evaluation score (CRIT) and a tunable 'contentiousness' parameter produce more reliable, richer outputs than single-model Q&A?

Main Contribution

SocraSynth: a practical multi-LLM debate platform combining a human moderator, two opposing LLM agents, and multiple LLM judges.

Conditional statistics mechanism: each debating LLM conditions responses on the opponent's stance to surface diverse viewpoints.
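The two contributions above can be pictured as a simple turn-taking loop. The following is a minimal sketch only, since the paper provides no public code: `Agent`, `run_debate`, and `stub_agent` are hypothetical names, and the stub stands in for a real LLM call. Each agent receives the opponent's latest argument, which is how conditioning on the opposing stance is approximated here.

```python
from typing import Callable, List, Tuple

# Hypothetical agent signature (an assumption, not the paper's API):
# (topic, stance, opponent_last_argument, contentiousness) -> argument text
Agent = Callable[[str, str, str, float], str]


def run_debate(
    topic: str,
    agent_a: Agent,
    agent_b: Agent,
    rounds: int = 3,
    contentiousness: float = 0.9,
) -> List[Tuple[str, str]]:
    """Alternate turns between two agents and return the transcript."""
    transcript: List[Tuple[str, str]] = []
    last_a, last_b = "", ""
    for _ in range(rounds):
        # Agent A argues "for", conditioned on B's previous argument.
        last_a = agent_a(topic, "for", last_b, contentiousness)
        transcript.append(("A", last_a))
        # Agent B argues "against", conditioned on A's fresh argument.
        last_b = agent_b(topic, "against", last_a, contentiousness)
        transcript.append(("B", last_b))
    return transcript


def stub_agent(topic: str, stance: str, opponent_last: str, contentiousness: float) -> str:
    """Placeholder for an LLM call; echoes its inputs so the loop can be tested."""
    rebuttal = f" rebutting '{opponent_last[:20]}'" if opponent_last else ""
    return f"[{stance} @ {contentiousness:.1f}] on {topic}{rebuttal}"


transcript = run_debate("data retention policy", stub_agent, stub_agent, rounds=2)
for speaker, text in transcript:
    print(speaker, text)
```

In a real setup the stub would be replaced by calls to two separately prompted models, and a human moderator would choose the topic and decide when to stop.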

Key Findings

Multi-agent debates scored higher than single-model Q&A on judged information quality.

Numbers: Tables 5 and 6: GPT-4 judge totals A=39 vs B=32 (Table 5); role-swapped totals remain comparable (Table 6)

Practical Use: Use a two-agent debate plus LLM judges to get deeper, more balanced answers than a single LLM Q&A when you need richer justification.

Evidence Ref: Tables 5 and 6; Section 3.1.4

Tuning contentiousness moves outputs from adversarial to cooperative and changes content emphasis.

Numbers: Contentiousness tested at {0.9, 0.7, 0.5, 0.3, 0.0}; scheduling: divide by 1.2 per round

Practical Use: Adjust contentiousness (start high to surface trade-offs, lower to get consensus) to control tone and breadth of perspectives.

Evidence Ref: Table 1; SocraSynth algorithm in Sections 2.1.3 and 3.3
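To make the "divide by 1.2 per round" schedule concrete, here is a small sketch of the arithmetic. The function name `contentiousness_schedule` and the choice of five rounds are assumptions for illustration. Note that this decay does not reproduce the tested levels {0.9, 0.7, 0.5, 0.3, 0.0} exactly; how the schedule and the tested levels relate is not spelled out in this summary.

```python
from typing import List


def contentiousness_schedule(start: float = 0.9, factor: float = 1.2, rounds: int = 5) -> List[float]:
    """Return one contentiousness level per round, dividing by `factor` each round."""
    levels = []
    level = start
    for _ in range(rounds):
        levels.append(level)
        level /= factor  # multiplicative decay toward a more cooperative tone
    return levels


for round_no, level in enumerate(contentiousness_schedule(), start=1):
    print(f"round {round_no}: contentiousness = {level:.3f}")
```

Starting from 0.9, the second round runs at 0.9 / 1.2 = 0.75, and each later round is about 17% lower than the one before it.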

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Judge total scores (example, Table 5, GPT-4)	Agent A total 39 vs Agent B total 32	—	A +7	Policy debate (Section 3.1)	Table 5: GPT-4 columns; Agent A wins in that configuration	Table 5
Judge total scores (role-swapped, Table 6, combined judges)	Totals roughly tie or favor Agent A across configurations (examples: 38 vs 38, 36 vs 39, see table)	—	—	Policy debate role-reversal (Section 3.1.3)	Table 6 shows mixed totals but judges still find debate informative	Table 6

What To Try In 7 Days

Run a two-LLM debate (opposing prompts) on a product policy question and compare outputs to single-shot answers.

Add a small panel of LLM judges (e.g., GPT-4, GPT-3.5, text-davinci-003) to score reasonableness using a CRIT-like checklist.

Experiment with contentiousness: start at 0.9 for adversarial exploration, lower to ~0.3 to produce a joint proposal.
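For the judge-panel step, a sketch of how per-topic scores might be totalled and compared across judges is shown below. Everything here is an assumption for illustration: the function names `panel_totals` and `panel_average` are hypothetical, the judge labels are placeholders, and the scores are made-up values on a 1-10 scale, not figures from the paper.

```python
from statistics import mean
from typing import Dict, List

# scores[judge][agent] = list of per-topic scores on a 1-10 scale (placeholder data)
Scores = Dict[str, Dict[str, List[int]]]


def panel_totals(scores: Scores) -> Dict[str, Dict[str, int]]:
    """Sum each agent's per-topic scores separately for every judge."""
    return {
        judge: {agent: sum(vals) for agent, vals in per_agent.items()}
        for judge, per_agent in scores.items()
    }


def panel_average(scores: Scores) -> Dict[str, float]:
    """Average each agent's total across all judges on the panel."""
    totals = panel_totals(scores)
    agents = sorted({agent for per_agent in totals.values() for agent in per_agent})
    return {agent: mean(t[agent] for t in totals.values()) for agent in agents}


# Made-up example scores from two hypothetical judges over three topics.
example: Scores = {
    "judge_1": {"A": [8, 7, 9], "B": [6, 7, 6]},
    "judge_2": {"A": [7, 7, 8], "B": [7, 6, 7]},
}

print(panel_totals(example))
averages = panel_average(example)
print(averages, "-> higher average:", max(averages, key=averages.get))
```

Keeping per-judge totals alongside the panel average makes it easier to spot a single judge that disagrees with the rest, which is one way to check for the shared-bias risk.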

Agent Features

Memory

Context refinement across debate rounds (iterative exchange)

Planning

Debate rounds with iterative refutation

Moderator-driven topic decomposition and agenda

Tool Use

LLMs as proponent/opponent agents

LLMs as judges (CRIT)

Contentiousness parameter for tone control

Frameworks

SocraSynth, CRIT

Is Agentic

Yes

Architectures

Multi-LLM agent ensemble (no new model weights)

Collaboration

Human moderator + two LLM agents + LLM judges

Joint proposal drafting after contention reduced

Optimization Features

Token Efficiency

Implicit only: iterative rounds grow the context, and no explicit compression is described.

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://www.kaggle.com/datasets/itachi9604/disease-symptom-description-dataset

Risks & Boundaries

Limitations

No public code or reproducible pipeline provided in paper.

Evidence comes from three case studies rather than large controlled benchmarks.

When Not To Use

High-stakes settings requiring formal verification without human oversight.

Tasks demanding provable factual accuracy rather than reasonableness.

Failure Modes

Both agents converge on the same false premise, making mutual checking ineffective.

Judge panel shares biases with agents, causing blind spots.

Core Entities

Models

GPT-4, GPT-3.5, text-davinci-003, Bard

Metrics

CRIT score (1-10)

Contentiousness level (0.0-1.0)

Datasets

Kaggle disease-symptom dataset (4,921 records)

Context Entities

Models

GPT-4, GPT-3.5, text-davinci-003, Bard

Metrics

Judge totals (per-topic scores, Tables 5–6)

Datasets

Kaggle disease-symptom dataset

