DebUnc: use uncertainty estimates to steer multi-agent debates by scaling attention to confident agents

July 8, 2024 · 6 min

Overview

Decision Snapshot: Needs Validation

The idea is practical and tested on open models and benchmarks. It needs model-level access and good uncertainty metrics to deliver reliable gains.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Luke Yoffe, Alfonso Amayuelas, William Yang Wang

Links

Abstract / PDF / Code

Why It Matters For Business

If you run LLM agent systems, prioritizing more confident agents reduces the chance of the group converging on confidently wrong answers; attention-scaling gives the biggest payoff when you can measure confidence reliably.

Who Should Care

Summary TLDR

DebUnc adds uncertainty estimates to multi-agent LLM debates and communicates them either by inserting confidence into prompts or by scaling Transformer attention toward more confident agents. On open-source models and standard benchmarks, scaling attention (Attention-All) gave the largest gains, especially when the uncertainty metric was reliable. The method requires model-level access (open models) and benefits most from investment in better uncertainty measures.

Problem Statement

Multi-agent debates help LLMs correct mistakes, but confidently wrong agents can steer the group to incorrect consensus. Raw model outputs are poor proxies for confidence. The paper asks: can we (1) quantify agent confidence with uncertainty metrics and (2) communicate that confidence so debates become more accurate?

Main Contribution

DebUnc: a debate pipeline that measures each agent's uncertainty each round and shares it with peers.

A practical attention-scaling mechanism that increases token weights for more confident agents during generation.
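
One way to picture the attention-scaling idea is as a reweighting of attention that has already been normalized: the weights on tokens from each agent's previous response are multiplied by a confidence-derived factor, then rescaled. The NumPy sketch below is an illustration only. The function name, the `(start, end)` span layout, the use of inverse uncertainty as the factor, and the choice to preserve the total attention mass on agent tokens are all assumptions made here, not details confirmed by this summary.

```python
import numpy as np

def scale_attention(attn_row, agent_spans, uncertainties, eps=1e-8):
    """Reweight one row of post-softmax attention toward confident agents.

    attn_row:      1-D attention weights over context tokens (sums to 1).
    agent_spans:   (start, end) token index ranges, one per agent response.
    uncertainties: one positive value per agent (lower = more confident).
    """
    attn = np.asarray(attn_row, dtype=float).copy()
    # Assumption: the confidence factor is the inverse of the uncertainty.
    weights = 1.0 / (np.asarray(uncertainties, dtype=float) + eps)

    # Mark agent tokens and record their total attention mass.
    mask = np.zeros_like(attn, dtype=bool)
    for start, end in agent_spans:
        mask[start:end] = True
    mass_before = attn[mask].sum()

    # Scale each agent's tokens by that agent's confidence factor.
    for (start, end), w in zip(agent_spans, weights):
        attn[start:end] *= w

    # Rescale agent tokens so their total mass is unchanged;
    # tokens outside the agent spans are left untouched.
    mass_after = attn[mask].sum()
    if mass_after > 0:
        attn[mask] *= mass_before / mass_after
    return attn

# Token 0 is the question; tokens 1-2 belong to a confident agent,
# tokens 3-4 to an uncertain one.
row = scale_attention([0.2, 0.2, 0.2, 0.2, 0.2], [(1, 3), (3, 5)], [0.5, 2.0])
```

In this toy case the confident agent's tokens end up with roughly four times the weight of the uncertain agent's tokens, while the row still sums to 1. In a real model this step would have to be applied inside every attention head during generation, which is why model-level access is needed.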

Key Findings

Attention scaling (Attention-All) improves final debate accuracy when uncertainty estimates are good.

Numbers: Mistral avg accuracy 0.67 (Attention-All, Oracle) vs 0.53 (standard) → +0.14

Practical Use: If you can estimate confidence reliably, apply attention-scaling to prioritize confident agent tokens. It can raise aggregate accuracy by ~10–15 percentage points on tested benchmarks; note that the +0.14 figure above was obtained with the simulated Oracle metric.

Evidence Ref: Table 1

Mean Token Entropy is slightly better and cheaper than TokenSAR for identifying questionable answers.

Numbers: AUROC 0.627 (Mean Token Entropy) vs 0.617 (TokenSAR), averaged across tests

Practical Use: Start with mean token entropy for uncertainty because it is inexpensive and gives near-best discrimination; consider costly alternatives only if you need modest extra gains.

Evidence Ref: Section 5.2; Table 4
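
As a rough illustration of why mean token entropy is cheap, the sketch below computes it from the next-token probability distributions a model already produces during a single generation. The function name, the input layout (one row per generated token), and the use of natural logarithms are assumptions made for this example.

```python
import numpy as np

def mean_token_entropy(token_probs, eps=1e-12):
    """Average Shannon entropy (in nats) over the generated tokens.

    token_probs: 2-D array of shape (num_tokens, vocab_size); each row is
                 the model's next-token probability distribution at that step.
    """
    p = np.asarray(token_probs, dtype=float)
    # eps avoids log(0) for tokens with zero probability.
    per_token = -(p * np.log(p + eps)).sum(axis=1)
    return float(per_token.mean())

# Toy 4-word vocabulary: a peaked distribution vs a uniform one.
confident = [[0.97, 0.01, 0.01, 0.01]] * 3
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 3
low = mean_token_entropy(confident)
high = mean_token_entropy(uncertain)
```

Here `high` equals ln 4 (the maximum for four outcomes) and `low` is much smaller, so a higher value signals a less confident answer. No extra sampling is required, which is the cost advantage noted above.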

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 0.53 (standard debate) | - | - | Average over MMLU-0/5, GSM8k, TruthfulQA, Arithmetic (sampled 100/q) | Table 1 standard row | Table 1
Accuracy | 0.67 | 0.53 (standard) | +0.14 | Average over the same benchmarks | Table 1 Oracle + Attn-All | Table 1

What To Try In 7 Days

Run a 3-agent, 3-round debate on a small, high-value prompt set and compare standard debate vs attention-scaling.

Compute mean token entropy for your model outputs as a low-cost uncertainty baseline and measure AUROC vs known answers.

If using open models, prototype attention-scaling on the previous-round responses and track accuracy and failure cases.
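
To measure AUROC against known answers, one option is the pairwise definition: the probability that a randomly chosen incorrect answer receives a higher uncertainty than a randomly chosen correct one. The plain-Python sketch below assumes "incorrect" is the positive class and counts ties as half; both are conventions chosen for this example and may differ from the paper's evaluation code.

```python
def auroc(uncertainties, is_incorrect):
    """AUROC of an uncertainty score for detecting incorrect answers.

    uncertainties: one uncertainty value per answer (higher = less confident).
    is_incorrect:  one boolean per answer, True if the answer was wrong.
    """
    pos = [u for u, y in zip(uncertainties, is_incorrect) if y]
    neg = [u for u, y in zip(uncertainties, is_incorrect) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")

    # Count (incorrect, correct) pairs ranked in the right order; ties count half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Two wrong answers (uncertainty 0.9, 0.2) and two right ones (0.1, 0.8).
score = auroc([0.9, 0.1, 0.8, 0.2], [True, False, False, True])  # 0.75
```

A value of 0.5 means the metric is no better than chance at flagging wrong answers, and 1.0 means every wrong answer is ranked as more uncertain than every right one. The quadratic loop is fine for a small prompt set; a rank-based implementation scales better.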

Agent Features

Memory: short-term chat history (previous-round responses used)
Planning: multi-round debate (3 rounds)
Tool Use: none (no external tools used in experiments)
Frameworks: DebUnc
Is Agentic: Yes
Architectures: Transformer decoder with modified attention scaling
Collaboration: multi-agent debate with shared responses and confidence values
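
The debate structure described here (several rounds, each agent seeing the others' previous-round answers along with a confidence value) can be sketched as a short loop. Everything named below is hypothetical: agents are modeled as callables returning `(answer, confidence)`, the prompt wording is invented for illustration, and majority voting as the final aggregation step is an assumption.

```python
from collections import Counter

def build_prompt(question, peer_answers):
    """Build a prompt that shares peers' previous answers and confidences.

    peer_answers: list of (answer, confidence) pairs from the previous round.
    """
    lines = [f"Question: {question}"]
    if peer_answers:
        lines.append("Other agents answered in the previous round:")
        for i, (answer, conf) in enumerate(peer_answers, start=1):
            lines.append(f"- Agent {i}: {answer} (confidence: {conf:.2f})")
        lines.append("Considering these answers and confidences, give your updated answer.")
    return "\n".join(lines)

def run_debate(question, agents, rounds=3):
    """Run a multi-round debate and return the majority answer.

    agents: callables mapping a prompt string to (answer, confidence).
    """
    # Round 1: every agent answers independently.
    responses = [agent(build_prompt(question, [])) for agent in agents]

    # Later rounds: each agent sees only the other agents' previous responses.
    for _ in range(rounds - 1):
        updated = []
        for i, agent in enumerate(agents):
            peers = [r for j, r in enumerate(responses) if j != i]
            updated.append(agent(build_prompt(question, peers)))
        responses = updated

    votes = Counter(answer for answer, _ in responses)
    return votes.most_common(1)[0][0]

# Stub agents that always return a fixed answer and confidence.
def stub(answer, conf):
    return lambda prompt: (answer, conf)

final = run_debate("2 + 2 = ?", [stub("4", 0.9), stub("4", 0.8), stub("5", 0.3)])
```

This covers only the prompt-based way of communicating confidence. The attention-scaling variant would keep the same loop but pass the confidences into the model's attention computation instead of the prompt text.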

Optimization Features

Token Efficiency: uses single-generation token-probability metrics (entropy) to save compute vs sampling
Inference Optimization: attention rescaling during generation

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Requires open-source models or model access that exposes token probabilities and allows attention changes.

Attention scaling is sensitive to the order of agents' responses; token ordering can leak information.

When Not To Use

With closed-source models that hide token probabilities or internals

In ultra-low-latency systems where extra computation for uncertainty is unacceptable

Failure Modes

Group consensus on a confidently incorrect answer if the uncertainty metric misranks answers

Attention leakage where earlier tokens overly influence later agents due to prompt order

Core Entities

Models

Mistral-7B-Instruct-v0.2, Llama-3-8B-Instruct

Metrics

Mean Token Entropy, TokenSAR, Oracle (simulated), AUROC, Accuracy

Datasets

MMLU, GSM8k, TruthfulQA, Arithmetic (synthetic)

Benchmarks

MMLU, GSM8k, TruthfulQA, Arithmetic