Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
6
Why It Matters For Business
SocraSynth turns single-shot LLM answers into cross-checked, debate-driven recommendations. This reduces one-sided bias and yields richer, testable proposals for policy, diagnostics, and decision support.
Summary TLDR
SocraSynth is a multi-agent platform that runs controlled debates between two LLMs, uses other LLMs as judges (CRIT) to score argument quality, and adjusts a 'contentiousness' knob to move from adversarial to collaborative outputs. The paper reports three case studies (policy debate, medical symptom checking on a 4,921-record Kaggle dataset, and an analysis of contentiousness), shows judges prefer debate outputs to Q&A, and presents practical prompts and pseudocode. The system aims to reduce bias and hallucination by forcing cross-checks and iterative refinement rather than adding external logic components.
Problem Statement
LLMs are powerful but prone to bias, hallucination, and weak reasoning. The paper asks: can structured multi-LLM debates plus an LLM-based evaluation score (CRIT) and a tunable 'contentiousness' parameter produce more reliable, richer outputs than single-model Q&A?
Main Contribution
SocraSynth: a practical multi-LLM debate platform combining a human moderator, two opposing LLM agents, and multiple LLM judges.
Conditional statistics mechanism: each debating LLM conditions responses on the opponent's stance to surface diverse viewpoints.
Contentiousness modulation: a numeric knob (0–1, higher is more adversarial) that steers debate tone from adversarial to conciliatory, with a scheduling rule that divides the value by 1.2 each round.
CRIT evaluation: an LLM-driven scoring template that rates argument reasonableness (1–10) and can be applied recursively to supporting sources.
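The scheduling rule for contentiousness can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function name `contentiousness_schedule` and its parameters are assumptions, and only the 0–1 range and the divide-by-1.2 rule come from the summary above.

```python
def contentiousness_schedule(start: float, rounds: int, divisor: float = 1.2) -> list[float]:
    """Return one contentiousness level per debate round.

    Hypothetical helper: the level begins at `start` and is divided by
    `divisor` after every round, so the debate cools down over time.
    """
    if not 0.0 <= start <= 1.0:
        raise ValueError("start must lie in [0, 1]")
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    if divisor <= 1.0:
        raise ValueError("divisor must be greater than 1 for the level to decay")

    levels = []
    level = start
    for _ in range(rounds):
        levels.append(level)
        level /= divisor  # the divide-by-1.2 rule described above
    return levels
```

Under this rule, a debate that starts at 0.9 is at roughly 0.30 after six divisions, so about seven rounds separate an adversarial opening from a conciliatory close.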
Key Findings
Multi-agent debates scored higher than single-model Q&A on judged information quality.
Tuning contentiousness moves outputs from adversarial to cooperative and changes content emphasis.
Debate can improve medical triage: LLMs converged from different diagnoses to a specific recommendation and suggested confirmatory tests.
Results
Judge total scores (example, Table 5, GPT-4)
Judge total scores (role-swapped, Table 6, combined judges)
Medical-case convergence
Who Should Care
What To Try In 7 Days
Run a two-LLM debate (opposing prompts) on a product policy question and compare outputs to single-shot answers.
Add a small panel of LLM judges (e.g., GPT-4, GPT-3.5, text-davinci-003) to score reasonableness using a CRIT-like checklist.
Experiment with contentiousness: start at 0.9 for adversarial exploration, lower to ~0.3 to produce a joint proposal.
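The three steps above could be prototyped with a small harness before wiring in real models. The sketch below is an assumption-laden outline, not SocraSynth itself: `Agent`, `Judge`, `run_debate`, `panel_score`, and `stub_agent` are hypothetical names, and the agents and judges are plain Python callables standing in for LLM API calls.

```python
from statistics import mean
from typing import Callable

# Hypothetical interfaces: an agent maps (topic, opponent's last argument,
# contentiousness) to its next argument; a judge maps text to a 1-10 score.
Agent = Callable[[str, str, float], str]
Judge = Callable[[str], float]


def run_debate(topic: str, agent_a: Agent, agent_b: Agent,
               rounds: int = 3, start: float = 0.9,
               divisor: float = 1.2) -> list[dict]:
    """Alternate two agents for `rounds` rounds, cooling contentiousness each round."""
    transcript = []
    level = start
    last_b = ""  # agent A opens with nothing to rebut
    for round_no in range(1, rounds + 1):
        arg_a = agent_a(topic, last_b, level)
        arg_b = agent_b(topic, arg_a, level)  # B conditions on A's argument
        transcript.append({"round": round_no, "contentiousness": level,
                           "a": arg_a, "b": arg_b})
        last_b = arg_b
        level /= divisor
    return transcript


def panel_score(text: str, judges: list[Judge]) -> float:
    """Average the scores of a judge panel, enforcing the 1-10 scale."""
    scores = [judge(text) for judge in judges]
    for score in scores:
        if not 1 <= score <= 10:
            raise ValueError(f"judge score {score} is outside 1-10")
    return mean(scores)


def stub_agent(stance: str) -> Agent:
    """Build a placeholder agent; replace the body with a real LLM call."""
    def agent(topic: str, opponent_last: str, level: float) -> str:
        reply = f"[{stance} @ {level:.2f}] on {topic}"
        if opponent_last:
            reply += f" | rebutting: {opponent_last[:30]}"
        return reply
    return agent
```

To compare against a single-shot baseline, score the baseline answer and the final debate round with the same `panel_score` call, so both pass through an identical judge panel.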
Agent Features
Memory
- Context refinement across debate rounds (iterative exchange)
Planning
- Debate rounds with iterative refutation
- Moderator-driven topic decomposition and agenda
Tool Use
- LLMs as proponent/opponent agents
- LLMs as judges (CRIT)
- Contentiousness parameter for tone control
Frameworks
- SocraSynth
- CRIT
Is Agentic
true
Architectures
- Multi-LLM agent ensemble (no new model weights)
Collaboration
- Human moderator + two LLM agents + LLM judges
- Joint proposal drafting after contention reduced
Optimization Features
Token Efficiency
- Implicit only: iterative rounds grow the context, and no explicit compression is described
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- No public code or reproducible pipeline provided in paper.
- Evidence comes from three case studies rather than large controlled benchmarks.
- CRIT judges are LLMs and may carry correlated biases with debating agents.
- No quantitative ablation isolating contentiousness from other variables.
When Not To Use
- High-stakes settings requiring formal verification without human oversight.
- Tasks demanding provable factual accuracy rather than reasonableness.
- Environments where access to multiple LLMs is cost-prohibitive.
Failure Modes
- Both agents converge on the same false premise, making mutual checking ineffective.
- Judge panel shares biases with agents, causing blind spots.
- Overfitting to debate style instead of factual grounding.
- Contentiousness tuning produces emotionally charged or misleading language.
Core Entities
Models
- GPT-4
- GPT-3.5
- text-davinci-003
- Bard
Metrics
- CRIT score (1-10)
- Contentiousness level (0.0-1.0)
Datasets
- Kaggle disease-symptom dataset (4,921 records)
Context Entities
Models
- GPT-4
- GPT-3.5
- text-davinci-003
- Bard
Metrics
- Judge totals (per-topic scores, Tables 5–6)
Datasets
- Kaggle disease-symptom dataset

