DebUnc: use uncertainty estimates to steer multi-agent debates by scaling attention to confident agents

July 8, 2024 · 6 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

3

Authors

Luke Yoffe, Alfonso Amayuelas, William Yang Wang

Links

Abstract / PDF

Why It Matters For Business

If you run LLM agent systems, weighting agents by measured uncertainty, rather than by how confident their text sounds, reduces the chance that the group converges on a confidently wrong answer. Attention scaling gives the biggest payoff when you can measure confidence reliably.

Summary TLDR

DebUnc adds uncertainty estimates to multi-agent LLM debates and communicates them either by inserting confidence into prompts or by scaling Transformer attention toward more confident agents. On open-source models and standard benchmarks, scaling attention (Attention-All) gave the largest gains, especially when the uncertainty metric is reliable. The method needs model-level access (open models) and benefits most if you invest in better uncertainty measures.
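The confidence-in-prompt variant can be pictured as a small prompt builder. This is a minimal sketch, not the authors' template: the function name `build_round_prompt`, the wording of the instructions, and the 1-10 confidence scale are all assumptions made for illustration.

```python
def build_round_prompt(question, peer_responses, confidences=None):
    """Assemble a debate-round prompt that optionally shows peer confidence.

    peer_responses: list of answer strings from the other agents.
    confidences: optional list of integers on an assumed 1-10 scale,
                 aligned with peer_responses.
    """
    lines = [f"Question: {question}", "", "Responses from other agents:"]
    for i, response in enumerate(peer_responses, start=1):
        header = f"Agent {i}"
        if confidences is not None:
            # Hypothetical format; the exact phrasing is an assumption.
            header += f" (confidence: {confidences[i - 1]}/10)"
        lines.append(f"{header}: {response}")
    lines.append("")
    lines.append(
        "Considering these responses and their confidence levels, "
        "give your updated answer."
    )
    return "\n".join(lines)


prompt = build_round_prompt(
    "What is 17 * 6?",
    ["The answer is 102.", "The answer is 112."],
    confidences=[8, 3],
)
```

Passing `confidences=None` yields a standard debate prompt, which makes it easy to compare the two conditions on the same questions.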

Problem Statement

Multi-agent debates help LLMs correct mistakes, but confidently wrong agents can steer the group to incorrect consensus. Raw model outputs are poor proxies for confidence. The paper asks: can we (1) quantify agent confidence with uncertainty metrics and (2) communicate that confidence so debates become more accurate?

Main Contribution

DebUnc: a debate pipeline that measures each agent's uncertainty each round and shares it with peers.

A practical attention-scaling mechanism that increases the attention weight on tokens from more confident agents during generation.

An empirical comparison of two communication methods (confidence-in-prompt vs attention-scaling) across LLMs, benchmarks, and uncertainty metrics.

An analysis showing that debate gains grow as uncertainty metrics become more accurate, plus a public code release.
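To make the attention-scaling idea concrete, here is a minimal sketch on a single query's post-softmax attention weights. It assumes each token is tagged with the agent that wrote it, that each agent has a positive confidence multiplier (for example, something derived from inverse uncertainty), and that weights are renormalized afterwards. These choices and the name `scale_attention` are illustrative assumptions, not the paper's confirmed implementation.

```python
def scale_attention(weights, token_agent, agent_conf):
    """Re-weight one row of attention toward confident agents.

    weights: post-softmax attention weights for one query (sum to 1).
    token_agent: for each token, the id of the agent that wrote it,
                 or None for tokens outside any agent response.
    agent_conf: dict mapping agent id -> positive confidence multiplier
                (assumed to be precomputed from an uncertainty metric).
    """
    scaled = [
        w * agent_conf[agent] if agent is not None else w
        for w, agent in zip(weights, token_agent)
    ]
    total = sum(scaled)
    # Renormalize so the row is still a probability distribution.
    return [s / total for s in scaled]


row = scale_attention(
    weights=[0.25, 0.25, 0.25, 0.25],
    token_agent=[None, "agent_a", "agent_b", "agent_b"],
    agent_conf={"agent_a": 3.0, "agent_b": 1.0},
)
```

In a real model this would have to be applied inside every attention layer during generation, which is why the approach needs access to model internals.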

Key Findings

Attention scaling (Attention-All) improves final debate accuracy when uncertainty estimates are good.

Numbers: Mistral, avg accuracy 0.67 (Attention-All, Oracle) vs 0.53 (standard) → +0.14

Mean Token Entropy is slightly better and cheaper than TokenSAR for identifying questionable answers.

Numbers: AUROC, Entropy 0.627 vs TokenSAR 0.617 (avg across tests)
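Mean token entropy is cheap because it needs only the probabilities from a single generation. The sketch below assumes you already have, for each generated token, the probability distribution the model assigned over candidate tokens. How you obtain it (full vocabulary or a truncated top-k list) is left open and affects the absolute values.

```python
import math


def mean_token_entropy(token_distributions):
    """Average Shannon entropy (in nats) over the generated tokens.

    token_distributions: one list of probabilities per generated token.
    Higher values suggest the model was less certain while generating.
    """
    if not token_distributions:
        raise ValueError("need at least one token distribution")
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)


# Two tokens: one a coin flip between two options, one fully certain.
example_entropy = mean_token_entropy([[0.5, 0.5], [1.0]])
```

The example averages an entropy of ln 2 for the first token with 0 for the second, giving ln 2 / 2.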

Accuracy gains from uncertainty communication scale with the AUROC of the uncertainty metric, and attention-based methods benefit most.

Numbers: Attention-All slope 0.59 vs Prompt slope 0.17 (accuracy increase vs AUROC)
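AUROC here can be read as the probability that a randomly chosen incorrect answer receives a higher uncertainty score than a randomly chosen correct one, so 0.5 is chance and 1.0 is perfect separation. A minimal pairwise sketch follows; the function name and the convention of treating "incorrect" as the positive class are assumptions of this example.

```python
def uncertainty_auroc(uncertainties, is_incorrect):
    """Pairwise AUROC for 'high uncertainty predicts an incorrect answer'.

    uncertainties: one uncertainty score per answer.
    is_incorrect: matching booleans, True when the answer was wrong.
    Ties between a wrong and a right answer count as half a win.
    """
    wrong = [u for u, bad in zip(uncertainties, is_incorrect) if bad]
    right = [u for u, bad in zip(uncertainties, is_incorrect) if not bad]
    if not wrong or not right:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for w in wrong:
        for r in right:
            if w > r:
                wins += 1.0
            elif w == r:
                wins += 0.5
    return wins / (len(wrong) * len(right))


perfect = uncertainty_auroc([0.9, 0.8, 0.2, 0.1], [True, True, False, False])
chance = uncertainty_auroc([0.5, 0.5], [True, False])
```

The pairwise loop is quadratic, which is fine for a small evaluation set; a rank-based implementation would scale better.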

Results

Accuracy

Value: 0.53 (standard debate)

Accuracy

Value: 0.67

Baseline: 0.53 (standard)

Accuracy

Value: 0.63 (standard debate)

Accuracy

Value: 0.73

Baseline: 0.63 (standard)

Uncertainty AUROC (Mean Token Entropy, avg)

Value: 0.627

Baseline: TokenSAR 0.617

What To Try In 7 Days

Run a 3-agent, 3-round debate on a small, high-value prompt set and compare standard debate vs attention-scaling.

Compute mean token entropy for your model outputs as a low-cost uncertainty baseline and measure AUROC vs known answers.

If using open models, prototype attention-scaling on the previous-round responses and track accuracy and failure cases.
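A baseline harness for the first experiment can be quite small. The sketch below assumes each agent is a callable taking the question and the other agents' previous-round answers, and that the final answer is chosen by majority vote. Both are assumptions of this example, and the stub agents stand in for real model calls.

```python
from collections import Counter


def run_debate(question, agents, rounds=3):
    """Run a simple multi-round debate and return (final_answer, answers).

    agents: callables of the form agent(question, peer_answers) -> str.
    In round 1 no peer answers are shown; later rounds show each agent
    the other agents' answers from the previous round.
    """
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds - 1):
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    # Majority vote over the last round (an assumed aggregation rule).
    final_answer = Counter(answers).most_common(1)[0][0]
    return final_answer, answers


# Stub agents for illustration only; replace with real LLM calls.
def stubborn(question, peer_answers):
    return "4"


def follower(question, peer_answers):
    if not peer_answers:
        return "5"
    return Counter(peer_answers).most_common(1)[0][0]


final, last_round = run_debate("What is 2 + 2?", [stubborn, stubborn, follower])
```

Swapping in an uncertainty-aware prompt or attention scaling inside each agent call, while keeping this loop fixed, gives a like-for-like comparison against standard debate.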

Agent Features

Memory

  • short-term chat history (previous-round responses used)

Planning

  • multi-round debate (3 rounds)

Tool Use

  • none (no external tools used in experiments)

Frameworks

  • DebUnc

Is Agentic

true

Architectures

  • Transformer decoder with modified attention scaling

Collaboration

  • multi-agent debate with shared responses and confidence values

Optimization Features

Token Efficiency

  • uses single-generation token-probability metrics (entropy) to save compute vs sampling

Inference Optimization

  • attention rescaling during generation

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires open-source models or model access that exposes token probabilities and allows attention changes.
  • Attention scaling is sensitive to the order of agents' responses; token ordering can leak information.
  • Oracle metric is unrealistic in practice; real gains depend on the quality of the uncertainty estimator.

When Not To Use

  • With closed-source models that hide token probabilities or internals
  • In ultra low-latency systems where extra computation for uncertainty is unacceptable
  • For single-turn tasks with no iterative debate value

Failure Modes

  • Group consensus on a confidently incorrect answer if the uncertainty metric misranks answers
  • Attention leakage where earlier tokens overly influence later agents due to prompt order
  • Small or no improvement when uncertainty AUROC is near random (≈0.5)

Core Entities

Models

  • Mistral-7B-Instruct-v0.2
  • Llama-3-8B-Instruct

Metrics

  • Mean Token Entropy
  • TokenSAR
  • Oracle (simulated)
  • AUROC
  • Accuracy

Datasets

  • MMLU
  • GSM8k
  • TruthfulQA
  • Arithmetic (synthetic)

Benchmarks

  • MMLU
  • GSM8k
  • TruthfulQA
  • Arithmetic