Confuse the judge: a black-box method that labels LLM evaluations as high or low uncertainty

October 15, 2024 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

0

Authors

Nico Wagner, Michael Desmond, Rahul Nair, Zahra Ashktorab, Elizabeth M. Daly, Qian Pan, Martín Santillán Cooper, James M. Johnson, Werner Geyer

Why It Matters For Business

You can flag LLM-generated ratings that are likely correct and route uncertain ones to humans, reducing verification cost and raising trust in automated evaluations.

Summary TLDR

The paper introduces a black-box, confusion-based method to estimate uncertainty when using LLMs as evaluators. For each possible rating option, the judge LLM is prompted to justify that option. Each of these biased justifications is then paired with every option, and the resulting token probabilities fill a confusion matrix. If one option keeps a high probability across all biased assessments, the method marks the evaluation as low uncertainty. Across five datasets and three instruct models, low-uncertainty labels correlate strongly with higher accuracy, but the method needs O(n^2) inference calls for n rating options and is compute-heavy for large models.
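
The matrix-building step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate_assessment` and `option_probability` are hypothetical callables standing in for calls to the judge LLM, the row/column orientation is a choice made here, and the toy stand-ins return fixed values so the sketch runs without a model.

```python
from typing import Callable, List


def build_confusion_matrix(
    options: List[str],
    generate_assessment: Callable[[str], str],
    option_probability: Callable[[str, str], float],
) -> List[List[float]]:
    """Row i: probability of each option given the assessment biased toward option i."""
    # One generation call per option: ask the judge to justify that option.
    assessments = [generate_assessment(option) for option in options]
    # One scoring call per (assessment, option) pair.
    return [
        [option_probability(assessment, option) for option in options]
        for assessment in assessments
    ]


# Hypothetical stand-ins for the judge LLM (assumptions for illustration only).
def fake_assessment(option: str) -> str:
    return f"Argument in favour of '{option}'"


def fake_probability(assessment: str, option: str) -> float:
    # Pretend the judge prefers "yes" no matter how the assessment is biased.
    return 0.9 if option == "yes" else 0.1


matrix = build_confusion_matrix(["yes", "no"], fake_assessment, fake_probability)
print(matrix)
```

In a real setup, `option_probability` would read the token probability the model assigns to the option text; how that is obtained depends on the inference API and is not specified in this summary.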

Problem Statement

LLM-as-a-Judge evaluations can disagree with humans and lack a reliable uncertainty signal. Practitioners need a practical, black-box way to flag which LLM-generated ratings are likely correct so human effort can be focused where needed.

Main Contribution

A black-box 'confusion-based' uncertainty method that uses biased verbal assessments and token probabilities to build a confusion matrix.

A binary uncertainty labeling rule (low/high) based on mean token probabilities and a threshold α.

Empirical evaluation across five public datasets and three instruct LLMs showing low-uncertainty labels often align with higher accuracy.

Practical analysis of threshold effects and discussion of compute and generalization trade-offs.
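
A binary labeling rule of this kind could look like the sketch below. It assumes a matrix whose rows are biased assessments and whose columns are options, averages each column, and compares the largest mean against α. The exact aggregation and the use of `>=` are assumptions; the summary only states that the rule is based on mean token probabilities and a threshold α.

```python
from statistics import mean
from typing import List, Tuple


def label_uncertainty(matrix: List[List[float]], alpha: float) -> Tuple[str, int]:
    """Return ("low" or "high", index of the option with the largest mean probability)."""
    n_options = len(matrix[0])
    # Mean probability of each option across all biased assessments (column means).
    column_means = [mean(row[j] for row in matrix) for j in range(n_options)]
    best = max(range(n_options), key=column_means.__getitem__)
    label = "low" if column_means[best] >= alpha else "high"
    return label, best


# Illustrative matrices (made-up numbers, not results from the paper).
consistent = [[0.9, 0.1], [0.8, 0.2]]  # option 0 wins under both assessments
confused = [[0.9, 0.1], [0.2, 0.8]]    # the winner flips with the bias

print(label_uncertainty(consistent, alpha=0.7))
print(label_uncertainty(confused, alpha=0.7))
```

The second matrix shows the case the method is designed to catch: the judge follows whichever assessment it is shown, so no option stays dominant and the evaluation is labeled high uncertainty.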

Key Findings

Low-uncertainty labels mark more accurate LLM evaluations.

Numbers: TruthfulQA, Llama-3-70B: Low 0.91 vs Baseline 0.78 (Table 1)

Proportion of low-uncertainty labels depends on model size and type.

Numbers: Llama-3-70B labels >15% low vs <5% for smaller models on several datasets (Figure 8)

Small sets of low-uncertainty labels can be extremely reliable.

Numbers: Feedback Collection: Low-uncertainty accuracy = 1.0 across models (Table 1)

The method has quadratic inference cost, and the threshold trades accuracy against the number of low-uncertainty labels.

Numbers: Complexity O(n^2) calls; threshold grid shows a parabolic accuracy vs proportion tradeoff (Sections 3.1 and 4; Figure 6)
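
To get a feel for the quadratic cost, the sketch below counts inference calls per evaluated item. It assumes one generation call per option plus one scoring call per (assessment, option) pair; the summary only states O(n^2), so the exact count is an assumption.

```python
def inference_calls(n_options: int) -> int:
    """Assumed call count: n assessment generations + n * n scoring calls."""
    return n_options + n_options ** 2


for n in (2, 3, 5):
    print(n, inference_calls(n))
```

Under this assumption a binary criterion costs 6 calls per item and a 5-point scale costs 30, which is why the number of options matters as much as model size.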

Results

Accuracy: Value 0.91, Baseline 0.78

Accuracy: Value 1.0, Baseline 0.4

Proportion labeled low uncertainty: Value >15%, Baseline <5%

Who Should Care

What To Try In 7 Days

Run the confusion-based procedure on a small sample of your evaluation tasks to measure low-uncertainty accuracy.

Tune the threshold α to balance how many auto-accepted items you want versus required precision.

If inference cost is high, try consolidating assessments into fewer prompts before scaling.
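
The first two steps can be combined into a small threshold sweep, sketched below. It assumes you have already recorded, for each sampled item, the top mean token probability and whether the judge's rating matched the human label; the function name and the sample data are hypothetical.

```python
from typing import List, Optional, Tuple


def sweep_threshold(
    records: List[Tuple[float, bool]], alphas: List[float]
) -> List[Tuple[float, float, Optional[float]]]:
    """For each alpha, return (alpha, proportion labeled low, accuracy among low)."""
    results = []
    for alpha in alphas:
        # Items whose top mean probability clears the threshold count as low uncertainty.
        accepted = [correct for confidence, correct in records if confidence >= alpha]
        proportion = len(accepted) / len(records)
        accuracy = sum(accepted) / len(accepted) if accepted else None
        results.append((alpha, proportion, accuracy))
    return results


# Made-up sample: (top mean token probability, judge agreed with human label).
sample = [(0.95, True), (0.90, True), (0.80, False), (0.60, True), (0.55, False)]
table = sweep_threshold(sample, [0.5, 0.85, 0.99])
for alpha, proportion, accuracy in table:
    print(alpha, proportion, accuracy)
```

Raising α shrinks the auto-accepted set while typically raising its accuracy; at some point nothing clears the threshold, which the sketch reports as `None` rather than a misleading accuracy.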

Optimization Features

System Optimization

  • threshold tuning to trade off precision and volume

Inference Optimization

  • consolidate assessments into a single prompt (proposed)

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Quadratic inference cost (n^2) makes the method expensive for many options or large models.
  • Performance depends on instruct-style models; may drop for models not fine-tuned for evaluation.
  • Threshold α must be tuned per use case; wrong choice changes the quantity/quality tradeoff.
  • Current binary label uses limited matrix information; richer scoring is suggested but not implemented.

When Not To Use

  • When you need fast, low-cost batch evaluation at large scale with many options.
  • When your model cannot be prompted reliably as an instruct-style judge.
  • When you lack compute to run n^2 inference calls per criterion.

Failure Modes

  • Overconfidence: matrix may show a single high row for a confidently wrong option.
  • Sparse/confused matrices when option count grows, reducing label usefulness.
  • Prompt sensitivity: poor assessment prompts can misrepresent model beliefs.

Core Entities

Models

  • Mixtral-8x7B-Instruct-v01
  • Llama-3-8B-Instruct
  • Llama-3-70B-Instruct

Metrics

  • Accuracy
  • proportion_low_uncertainty

Datasets

  • TruthfulQA
  • Reliance Study
  • Summarization CNN/DM
  • Feedback Collection
  • FeedbackQA