Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
3
Why It Matters For Business
If you run multi-agent LLM systems, giving more weight to confident agents reduces the chance that the group converges on a confidently wrong answer. Attention scaling pays off most when you can measure confidence reliably.
Summary TLDR
DebUnc adds uncertainty estimates to multi-agent LLM debates and communicates them either by inserting confidence into prompts or by scaling Transformer attention toward more confident agents. On open-source models and standard benchmarks, scaling attention (Attention-All) gave the largest gains, especially when the uncertainty metric was reliable. The method needs access to model internals (token probabilities and attention), which in practice means open models, and it benefits most if you invest in better uncertainty measures.
Problem Statement
Multi-agent debates help LLMs correct mistakes, but confidently wrong agents can steer the group to incorrect consensus. Raw model outputs are poor proxies for confidence. The paper asks: can we (1) quantify agent confidence with uncertainty metrics and (2) communicate that confidence so debates become more accurate?
Main Contribution
DebUnc: a debate pipeline that measures each agent's uncertainty each round and shares it with peers.
A practical attention-scaling mechanism that increases the attention weight on tokens from more confident agents during generation.
An empirical comparison of two communication methods (confidence-in-prompt vs attention-scaling) across LLMs, benchmarks, and uncertainty metrics.
An analysis showing that debate gains grow as uncertainty metrics become more accurate, together with a public code release.
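The attention-scaling idea can be illustrated on a single row of attention weights. The following is a minimal sketch, not the DebUnc implementation: the function name `scale_attention`, its arguments, and the choice to renormalize only the agent-owned tokens so that their total attention mass is preserved are all assumptions made for illustration. In a real model the same operation would have to be applied to tensors inside the attention layers.

```python
def scale_attention(weights, owners, confidence):
    """Rescale one row of post-softmax attention weights (illustrative only).

    weights:    attention weights over the context tokens (sum to 1)
    owners:     for each token, the id of the agent that wrote it, or None
                for tokens that belong to no agent (e.g. the question)
    confidence: dict mapping agent id -> positive confidence score
                (hypothetical; any monotone function of uncertainty works)
    """
    if len(weights) != len(owners):
        raise ValueError("weights and owners must have the same length")

    # Multiply each agent-owned token's weight by that agent's confidence.
    scaled = [
        w * confidence[o] if o is not None else w
        for w, o in zip(weights, owners)
    ]

    # Renormalize only the agent-owned tokens so that their total attention
    # mass is unchanged; tokens with no owner keep their original weight.
    mass_before = sum(w for w, o in zip(weights, owners) if o is not None)
    mass_after = sum(s for s, o in zip(scaled, owners) if o is not None)
    if mass_after == 0:
        return list(weights)
    factor = mass_before / mass_after
    return [s * factor if o is not None else s for s, o in zip(scaled, owners)]


# One question token followed by one token from each of two agents.
row = scale_attention(
    weights=[0.2, 0.4, 0.4],
    owners=[None, "a", "b"],
    confidence={"a": 3.0, "b": 1.0},
)
print(row)  # agent "a" now receives three times the attention of agent "b"
```

Because the renormalization is restricted to agent-owned tokens, the row still sums to 1 and the attention paid to the question itself is untouched.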
Key Findings
Attention scaling (Attention-All) improves final debate accuracy when uncertainty estimates are good.
Mean Token Entropy is slightly better and cheaper than TokenSAR for identifying questionable answers.
Accuracy gains from uncertainty communication scale with the AUROC of the uncertainty metric, and attention-based methods benefit most.
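To make the AUROC finding concrete, here is a small sketch of how the AUROC of an uncertainty metric could be computed against known answers. It treats "answer is wrong" as the positive class, so a good metric assigns higher uncertainty to wrong answers. The function name `auroc` and the pairwise formulation are illustrative choices, not taken from the paper's code.

```python
def auroc(uncertainties, is_wrong):
    """Probability that a randomly chosen wrong answer has higher uncertainty
    than a randomly chosen correct one. Ties count as half a win."""
    wrong = [u for u, w in zip(uncertainties, is_wrong) if w]
    right = [u for u, w in zip(uncertainties, is_wrong) if not w]
    if not wrong or not right:
        raise ValueError("need at least one wrong and one correct answer")

    wins = 0.0
    for u_wrong in wrong:
        for u_right in right:
            if u_wrong > u_right:
                wins += 1.0
            elif u_wrong == u_right:
                wins += 0.5
    return wins / (len(wrong) * len(right))


# Wrong answers carry strictly higher uncertainty: perfect separation.
print(auroc([0.9, 0.8, 0.2, 0.1], [True, True, False, False]))  # 1.0
```

A value near 0.5 means the metric cannot tell wrong answers from correct ones, and a value near 1.0 means it ranks them almost perfectly.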
Results
Accuracy
Uncertainty AUROC (Mean Token Entropy, avg)
Who Should Care
What To Try In 7 Days
Run a 3-agent, 3-round debate on a small, high-value prompt set and compare standard debate vs attention-scaling.
Compute mean token entropy for your model outputs as a low-cost uncertainty baseline and measure AUROC vs known answers.
If using open models, prototype attention-scaling on the previous-round responses and track accuracy and failure cases.
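A first experiment along these lines could start from a debate loop like the sketch below. It assumes each agent is a callable that takes the question and its peers' previous-round responses and returns an `(answer, uncertainty)` pair. That signature, the names `run_debate` and `majority_answer`, and the majority-vote aggregation are hypothetical scaffolding rather than the DebUnc code.

```python
from collections import Counter


def run_debate(question, agents, rounds=3):
    """Run a multi-round debate and return the final (answer, uncertainty) pairs.

    Each agent is a callable: agent(question, peer_responses) -> (answer, uncertainty).
    In the first round agents see no peer responses.
    """
    responses = [agent(question, []) for agent in agents]
    for _ in range(rounds - 1):
        # Every agent sees the other agents' responses from the previous round.
        responses = [
            agent(question, [r for j, r in enumerate(responses) if j != i])
            for i, agent in enumerate(agents)
        ]
    return responses


def majority_answer(responses):
    """Pick the most common final answer across agents."""
    counts = Counter(answer for answer, _ in responses)
    return counts.most_common(1)[0][0]


# Stub agents standing in for real model calls.
def confident_agent(question, peers):
    return ("4", 0.1)


def unsure_agent(question, peers):
    if not peers:
        return ("5", 0.9)
    # Adopt the answer of the least uncertain peer.
    return (min(peers, key=lambda r: r[1])[0], 0.5)


final = run_debate("What is 2 + 2?", [confident_agent, unsure_agent, unsure_agent])
print(majority_answer(final))
```

Swapping the stub agents for real model calls, with and without uncertainty shared between rounds, gives the comparison described above.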
Agent Features
Memory
- short-term chat history (previous-round responses used)
Planning
- multi-round debate (3 rounds)
Tool Use
- none (no external tools used in experiments)
Frameworks
- DebUnc
Is Agentic
true
Architectures
- Transformer decoder with modified attention scaling
Collaboration
- multi-agent debate with shared responses and confidence values
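As an illustration of sharing responses together with confidence values, the sketch below assembles the text an agent might receive at the start of a round. The prompt wording, the 1 to 10 confidence scale, and the function name `build_peer_prompt` are assumptions for illustration and are not the prompts used in the paper.

```python
def build_peer_prompt(question, peer_responses):
    """Format peers' previous-round answers with their confidence values.

    peer_responses: list of (answer, confidence) pairs, where confidence is
    assumed here to be an integer on a 1 to 10 scale.
    """
    lines = [f"Question: {question}", "", "Responses from other agents:"]
    for idx, (answer, conf) in enumerate(peer_responses, start=1):
        lines.append(f"Agent {idx} (confidence {conf}/10): {answer}")
    lines.append("")
    lines.append(
        "Taking these responses and their confidence levels into account, "
        "give your updated answer."
    )
    return "\n".join(lines)


print(build_peer_prompt("What is 2 + 2?", [("4", 8), ("5", 3)]))
```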
Optimization Features
Token Efficiency
- uses single-generation token-probability metrics (entropy) to save compute vs sampling
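Mean token entropy needs only the per-token probability distributions from a single generation. A minimal sketch follows, assuming those distributions are already available as plain lists of probabilities. How they are extracted from a particular model is not shown, and the function name `mean_token_entropy` is illustrative.

```python
import math


def mean_token_entropy(token_distributions):
    """Average Shannon entropy (in nats) over the generated tokens.

    token_distributions: one probability distribution per generated token,
    each a list of probabilities over the vocabulary that sums to 1.
    """
    if not token_distributions:
        raise ValueError("need at least one token distribution")

    total = 0.0
    for dist in token_distributions:
        # Zero-probability entries contribute nothing to the entropy.
        total += -sum(p * math.log(p) for p in dist if p > 0)
    return total / len(token_distributions)


# One fully certain token and one coin-flip token.
print(mean_token_entropy([[1.0, 0.0], [0.5, 0.5]]))  # log(2) / 2, about 0.347
```

Higher values indicate that the model spread probability across many candidate tokens while generating, which is read as higher uncertainty.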
Inference Optimization
- attention rescaling during generation
Reproducibility
Code Urls
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires open-source models or model access that exposes token probabilities and allows attention changes.
- Attention scaling is sensitive to the order of agents' responses; token ordering can leak information.
- Oracle metric is unrealistic in practice; real gains depend on the quality of the uncertainty estimator.
When Not To Use
- With closed-source models that hide token probabilities or internals
- In ultra-low-latency systems where the extra computation for uncertainty estimation is unacceptable
- For single-turn tasks with no iterative debate value
Failure Modes
- Group consensus on a confidently incorrect answer if the uncertainty metric misranks answers
- Attention leakage where earlier tokens overly influence later agents due to prompt order
- Small or no improvement when uncertainty AUROC is near random (≈0.5)
Core Entities
Models
- Mistral-7B-Instruct-v0.2
- Llama-3-8B-Instruct
Metrics
- Mean Token Entropy
- TokenSAR
- Oracle (simulated)
- AUROC
- Accuracy
Datasets
- MMLU
- GSM8k
- TruthfulQA
- Arithmetic (synthetic)
Benchmarks
- MMLU
- GSM8k
- TruthfulQA
- Arithmetic

