Overview
Production Readiness
0.6
Novelty Score
0.55
Cost Impact Score
0.65
Citation Count
9
Why It Matters For Business
Multi-agent LLM setups can raise reasoning accuracy without simply scaling model size. Small teams (3 agents) running debate-first protocols often give better answers while keeping API token costs under control.
Summary TLDR
This paper runs controlled experiments where small "machine societies" of LLM agents (easy-going or overconfident) collaborate via two thinking patterns—debate and reflection—over multiple rounds. Key practical findings: (1) strategies that start with or emphasize debate tend to improve accuracy on reasoning benchmarks; (2) keeping all agents in a round using the same thinking pattern raises reliability; (3) a 3-agent, 3-round setup is a pragmatic performance/cost sweet spot. The authors show LLMs display human-like group effects (conformity, consensus) and release code and data.
Problem Statement
Can multiple LLM instances working as agents show useful, human-like collaborative behaviors? If so, which multi-agent strategies and society settings (agent traits, thinking patterns, rounds, agent count) improve reasoning accuracy and token efficiency?
Main Contribution
A testbed that composes small LLM societies with agent "traits" (easy-going vs overconfident) and two thinking modes (debate vs reflection).
Systematic experiments on MMLU, MATH, and a Chess Move Validity task, comparing eight 3-round strategies and scaling agent count/rounds.
Empirical and statistical analysis showing debate-dominant strategies work best for many reasoning tasks, and that agents show conformity and consensus phenomena parallel to social psychology.
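The eight 3-round strategies follow from choosing one of the two thinking patterns in each round (2^3 = 8). The sketch below enumerates them and tags which ones are debate-initial or debate-dominant. The labels "debate" and "reflection" and the helper names are illustrative assumptions, not the notation used in the paper or its released code.

```python
from itertools import product

# Illustrative labels for the two thinking patterns (assumed naming).
PATTERNS = ("debate", "reflection")


def enumerate_strategies(rounds=3):
    """Return every per-round assignment of a thinking pattern."""
    return list(product(PATTERNS, repeat=rounds))


def is_debate_initial(strategy):
    """True if the first collaboration round uses debate."""
    return strategy[0] == "debate"


def is_debate_dominant(strategy):
    """True if debate is used in more than half of the rounds."""
    return strategy.count("debate") > len(strategy) / 2


strategies = enumerate_strategies()
for s in strategies:
    tags = []
    if is_debate_initial(s):
        tags.append("debate-initial")
    if is_debate_dominant(s):
        tags.append("debate-dominant")
    print(" -> ".join(s), tags)
```

Each strategy applies one pattern to all agents in a round, which matches the uniform-pattern setting the findings favor.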
Key Findings
Debate-initial or debate-dominant strategies give higher accuracy on reasoning benchmarks.
Keeping the same thinking pattern for all agents in a round improves outcomes; mixing patterns harms performance.
Three agents and three collaboration rounds offer a good trade-off between accuracy and token cost.
Conformity and consensus effects are common; they can help or hurt performance and grow with more rounds.
Results
Accuracy
Token cost (aggregate per-strategy)
Significance (same-vs-mixed thinking patterns)
Who Should Care
What To Try In 7 Days
Run a 3-agent pipeline (all-debate) on a held-out task and compare accuracy + token cost to a single model.
Enforce uniform thinking per round (all agents debate or all reflect) and measure variance over 5 trials.
Log per-round answer changes to detect harmful conformity and add a light verifier step if consensus locks on incorrect answers.
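The logging step in the last item can be sketched as follows. This is a minimal sketch, assuming answers are already normalized to comparable strings and that `history[r][a]` holds agent `a`'s answer in round `r`. The function names and the trigger rule for the verifier are hypothetical choices, not something the paper prescribes.

```python
from collections import Counter


def conformity_events(history):
    """Find agents that switched to the previous round's majority answer.

    history[r][a] is the answer of agent a in round r (0-indexed).
    Returns a list of (round, agent, old_answer, new_answer) tuples.
    """
    events = []
    for r in range(1, len(history)):
        prev, curr = history[r - 1], history[r]
        majority, count = Counter(prev).most_common(1)[0]
        if count <= len(prev) / 2:
            continue  # no strict majority to conform to
        for agent, (old, new) in enumerate(zip(prev, curr)):
            if old != new and new == majority:
                events.append((r, agent, old, new))
    return events


def needs_verifier(history):
    """Assumed rule: verify when the final answer is unanimous and at
    least one agent reached it by conforming rather than independently."""
    final = history[-1]
    return len(set(final)) == 1 and bool(conformity_events(history))


history = [
    ["A", "B", "B"],  # round 0: agent 0 disagrees
    ["B", "B", "B"],  # round 1: agent 0 conforms
    ["B", "B", "B"],  # round 2: consensus holds
]
print(conformity_events(history))
print(needs_verifier(history))
```

A flagged consensus is not necessarily wrong; the flag only marks cases where an independent check is cheap insurance against groupthink.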
Agent Features
Memory
- Short-term: keep prior-round dialog/history for next round
Planning
- Multi-round collaboration (3–10 rounds)
Frameworks
- Society of Mind (SoM)
Is Agentic
true
Architectures
- Chat-based LLMs (GPT-3.5/ChatGPT)
- LLaMA2-chat family
- Qwen-72B
- Mixtral 8x7B
Collaboration
- Debate (horizontal argumentation)
- Reflection (self-check / self-refine)
- Majority-vote aggregation
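Majority-vote aggregation over the agents' final answers can be sketched as below. The tie handling (returning `None` so the caller can fall back to another rule) is an assumption made for illustration; the summary does not say how ties are resolved.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer, or None on an empty list or a tie
    for first place (assumed tie policy, not taken from the paper)."""
    counts = Counter(answers).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


print(majority_vote(["A", "B", "A"]))
print(majority_vote(["A", "B", "C"]))
```

With three agents, a tie can only occur when all three answers differ, which is one practical reason an odd, small team is convenient.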
Optimization Features
Token Efficiency
- 3-agent, 3-round recommendation to reduce tokens
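To see why agent count and round count drive cost, a back-of-the-envelope estimate helps. The model below is an assumption for illustration only, not the paper's token accounting: every agent is called once per round, and from the second round on each agent also reads its peers' previous answers. The default token sizes are placeholders.

```python
def estimate_tokens(n_agents, n_rounds, question_tokens=200, answer_tokens=150):
    """Rough total token count under a simplified debate cost model.

    Assumptions (hypothetical): fixed question and answer lengths, and
    only the immediately preceding round's peer answers are re-read.
    """
    total = 0
    for r in range(n_rounds):
        # Round 0 has no peer answers to read yet.
        peer_context = 0 if r == 0 else (n_agents - 1) * answer_tokens
        per_agent = question_tokens + peer_context + answer_tokens
        total += n_agents * per_agent
    return total


print(estimate_tokens(1, 1))  # single model, single call
print(estimate_tokens(3, 3))  # 3 agents, 3 rounds
print(estimate_tokens(4, 3))  # adding one more agent
```

Under this model the peer-context term grows with the square of the agent count, so adding agents costs more than adding the same number of calls to a single model.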
Reproducibility
Code Urls
Data Urls
- Datasets referenced are public: MMLU, MATH, BIG-bench chess subset
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Did not mix different LLMs as agents; all agents share the same backbone in each experiment.
- Prompted traits (easy-going/overconfident) are simple role plays and may be muted by model alignment.
- Experiments limited to three datasets and sampled subsets (50 examples each).
- Manual/rule-based answer matching can miss edge-case equivalences.
When Not To Use
- For tasks requiring creative, open-ended generation, where majority voting works poorly.
- If API token budget is severely constrained and debate rounds become too costly without clear gains.
- When agents must be heterogeneous models (cross-model mixing not explored).
Failure Modes
- Conformity locking on an incorrect consensus (groupthink effect).
- Reflection-heavy pipelines produce more unstable, self-contradictory answers (hallucination risk).
- Long collaboration rounds raise token cost while offering diminishing returns.
Core Entities
Models
- gpt-3.5-turbo-1106 (ChatGPT)
- LLaMA2-13B-chat
- LLaMA2-70B-chat
- Qwen-72B
- Mixtral-8x7B
Metrics
- Accuracy
- Token cost (Cost)
- WIN-TIE (W-T)
- Consensus clusters (unique answers)
- ANOVA p-values
Datasets
- MMLU (50 sampled)
- MATH (50 sampled, levels 3-5)
- Chess Move Validity (BIG-bench subset, 50 sampled)
Benchmarks
- BIG-bench (chess state tracking subset)

