Small groups of LLM agents that debate early beat naïve scaling; round consistency and 3×3 setups save tokens

October 3, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.55

Cost Impact Score

0.65

Citation Count

9

Authors

Jintian Zhang, Xin Xu, Ningyu Zhang, Ruibo Liu, Bryan Hooi, Shumin Deng

Why It Matters For Business

Multi-agent LLM setups can raise reasoning accuracy without just scaling model size; using small teams (3 agents) and debate-first protocols often gives better answers while controlling API token costs.

Summary TLDR

This paper runs controlled experiments where small "machine societies" of LLM agents (easy-going or overconfident) collaborate via two thinking patterns—debate and reflection—over multiple rounds. Key practical findings: (1) strategies that start with or emphasize debate tend to improve accuracy on reasoning benchmarks; (2) keeping all agents in a round using the same thinking pattern raises reliability; (3) a 3-agent, 3-round setup is a pragmatic performance/cost sweet spot. The authors show LLMs display human-like group effects (conformity, consensus) and release code and data.

Problem Statement

Can multiple LLM instances working as agents show useful, human-like collaborative behaviors? If so, which multi-agent strategies and society settings (agent traits, thinking patterns, rounds, agent count) improve reasoning accuracy and token efficiency?

Main Contribution

A testbed that composes small LLM societies with agent "traits" (easy-going vs overconfident) and two thinking modes (debate vs reflection).

Systematic experiments on MMLU, MATH, and a Chess Move Validity task, comparing eight 3-round strategies and scaling agent count/rounds.

Empirical and statistical analysis showing debate-dominant strategies work best for many reasoning tasks, and that agents show conformity and consensus phenomena parallel to social psychology.
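
The strategy labels used later in this summary (for example `p0p0p1`) can be enumerated mechanically. The sketch below assumes that `p0` denotes a debate round and `p1` a reflection round; that mapping is inferred from the numbers quoted in this summary, and the helper names are hypothetical rather than taken from the authors' code.

```python
from itertools import product

# Assumption: p0 = debate, p1 = reflection (inferred from this summary, not confirmed here).
PATTERNS = {"p0": "debate", "p1": "reflection"}


def enumerate_strategies(rounds=3):
    """Return every ordering of thinking patterns over the given number of rounds."""
    return ["".join(combo) for combo in product(PATTERNS, repeat=rounds)]


def describe(strategy):
    """Split a label such as 'p0p0p1' into per-round pattern names plus simple counts."""
    codes = [strategy[i:i + 2] for i in range(0, len(strategy), 2)]
    names = [PATTERNS[code] for code in codes]
    return {
        "rounds": names,
        "debate_initial": names[0] == "debate",
        "debate_rounds": names.count("debate"),
    }


strategies = enumerate_strategies()
for label in strategies:
    print(label, describe(label))
```

With two patterns and three rounds this yields 2^3 = 8 labels, which matches the eight 3-round strategies the paper compares.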

Key Findings

Debate-initial or debate-dominant strategies give higher accuracy on reasoning benchmarks.

Numbers: MMLU: p0p0p1 = 65.2 vs p1p0p0 = 34.4 (S4 example)

Keeping the same thinking pattern for all agents in a round improves outcomes; mixing patterns harms performance.

Numbers: ANOVA for mixed-vs-uniform patterns: p ≤ 0.001 on Chess Move Validity for p0p0p1 (Table 13)

Three agents and three collaboration rounds offer a good trade-off between accuracy and token cost.

Numbers: Authors report 3-agent setups outperform many larger groups and significance tests on agent count have p≈0.000 (Table 11

Conformity and consensus effects are common; they can help or hurt performance and grow with more rounds.

Numbers: Conformity becomes more frequent over rounds and shows model-dependent effects (beneficial on ChatGPT/Qwen, harmful on L

Results

Accuracy

Value: 70.4 ± 4.3 (S3, p0p0p0)

Baseline: single-agent self-consistency (varies)

Accuracy

Value: 65.2 vs 34.4

Baseline: p0p0p1 vs p1p0p0 in S4

Token cost (aggregate per-strategy)

Value: p0p0p0 cost ≈ 4364 tokens (All societies) vs p1p1p1 cost ≈ 1976

Significance (same-vs-mixed thinking patterns)

Value: ANOVA p ≤ 0.001 (Chess Move Validity, p0p0p1)
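
To put the two token-cost figures above on a common footing, the short calculation below computes their ratio and difference. It assumes that p0p0p0 is the all-debate strategy and p1p1p1 the all-reflection strategy, and it uses only the approximate costs quoted in this section.

```python
# Approximate token costs quoted in the Results section.
cost_all_debate = 4364      # p0p0p0 (assumed: three debate rounds)
cost_all_reflection = 1976  # p1p1p1 (assumed: three reflection rounds)

ratio = cost_all_debate / cost_all_reflection
extra_tokens = cost_all_debate - cost_all_reflection

print(f"all-debate costs about {ratio:.2f}x all-reflection (+{extra_tokens} tokens)")
```

On these figures, an all-debate run uses a little over twice the tokens of an all-reflection run, so any accuracy gain from debate should be weighed against roughly double the spend.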

Who Should Care

What To Try In 7 Days

Run a 3-agent pipeline (all-debate) on a held-out task and compare accuracy + token cost to a single model.

Enforce uniform thinking per round (all agents debate or all reflect) and measure variance over 5 trials.

Log per-round answer changes to detect harmful conformity and add a light verifier step if consensus locks on incorrect answers.
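
The three steps above can be prototyped without committing to a particular API. The following is a minimal sketch, not the authors' implementation: the agent signature, the initial independent-answer round, and the rule that debating agents see peers' answers while reflecting agents see only their own are all assumptions made for illustration. The stub agents stand in for real LLM calls.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def run_society(question, agents, strategy=("debate", "debate", "debate")):
    """Run one uniform thinking pattern per round and keep every round's answers."""
    # Round 0 (assumed): each agent answers independently.
    answers = [agent(question, "initial", [], "") for agent in agents]
    history = [list(answers)]
    for pattern in strategy:
        previous, answers = answers, []
        for i, agent in enumerate(agents):
            # Assumption: debate exposes peers' answers, reflection only the agent's own.
            peers = [a for j, a in enumerate(previous) if j != i] if pattern == "debate" else []
            answers.append(agent(question, pattern, peers, previous[i]))
        history.append(list(answers))
    return majority_vote(answers), history


def conformity_events(history):
    """Count answer changes that move an agent onto the previous round's majority."""
    events = 0
    for prev, curr in zip(history, history[1:]):
        majority = majority_vote(prev)
        events += sum(1 for p, c in zip(prev, curr) if p != c and c == majority)
    return events


# Hypothetical stub agents; replace these with real model calls.
def stubborn(answer):
    return lambda question, pattern, peers, own: own or answer


def easy_going(answer):
    def agent(question, pattern, peers, own):
        if not own:
            return answer
        return majority_vote(peers) if peers else own
    return agent


agents = [stubborn("B"), stubborn("B"), easy_going("A")]
final, history = run_society("toy question", agents)
print(final, history, conformity_events(history))
```

Comparing `history` against gold answers shows whether a conformity event moved an agent toward or away from the correct answer; the latter is the case where a verifier step is worth adding.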

Agent Features

Memory

  • Short-term: keep prior-round dialog/history for next round

Planning

  • Multi-round collaboration (3–10 rounds)

Frameworks

  • Society of Mind (SoM)

Is Agentic

true

Architectures

  • Chat-based LLMs (GPT-3.5/ChatGPT)
  • LLaMA2-chat family
  • Qwen-72B
  • Mixtral 8x7B

Collaboration

  • Debate (horizontal argumentation)
  • Reflection (self-check / self-refine)
  • Majority-vote aggregation

Optimization Features

Token Efficiency

  • 3-agent, 3-round recommendation to reduce tokens

Reproducibility

Data Urls

  • Datasets referenced are public: MMLU, MATH, BIG-Bench chess subset

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Did not mix different LLMs as agents; all agents share the same backbone in each experiment.
  • Prompted traits (easy-going/overconfident) are simple role plays and may be muted by model alignment.
  • Experiments limited to three datasets and sampled subsets (50 examples each).
  • Manual/rule-based answer matching can miss edge-case equivalences.

When Not To Use

  • For tasks requiring creative open-ended generation where majority voting is poor.
  • If API token budget is severely constrained and debate rounds become too costly without clear gains.
  • When agents must be heterogeneous models (cross-model mixing not explored).

Failure Modes

  • Conformity locking on an incorrect consensus (groupthink effect).
  • Reflection-heavy pipelines increase unstable, self-contradictory answers (hallucination risk).
  • Long collaboration rounds raise token cost while offering diminishing returns.

Core Entities

Models

  • gpt-3.5-turbo-1106 (ChatGPT)
  • LLaMA2-13B-chat
  • LLaMA2-70B-chat
  • Qwen-72B
  • Mixtral-8x7B

Metrics

  • Accuracy
  • Token cost (Cost)
  • WIN-TIE (W-T)
  • Consensus clusters (unique answers)
  • ANOVA p-values
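
As a rough illustration of the consensus-cluster metric, the sketch below counts the distinct answers given by the agents in each round. The normalization (trimming and lowercasing) and the toy answer history are assumptions for the example; the paper's exact answer-matching rules may differ.

```python
def consensus_clusters(round_answers):
    """Number of distinct answers among the agents in one round (1 means full consensus)."""
    return len({answer.strip().lower() for answer in round_answers})


# Hypothetical per-round answers for a 3-agent society.
history = [["A", "B", "C"], ["A", "B", "B"], ["B", "B", "B"]]
clusters_per_round = [consensus_clusters(answers) for answers in history]
print(clusters_per_round)
```

A falling count across rounds indicates that the agents are converging, though it does not show whether they converge on the correct answer.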

Datasets

  • MMLU (50 sampled)
  • MATH (50 sampled, levels 3-5)
  • Chess Move Validity (BIG-bench subset, 50 sampled)

Benchmarks

  • BIG-Bench (chess state tracking subset)