BSDETECTOR: add a confidence score to any black-box LLM to flag bad answers and pick safer outputs

August 30, 2023 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.55

Citation Count

7

Authors

Jiuhai Chen, Jonas Mueller

Links

Abstract / PDF

Why It Matters For Business

BSDETECTOR gives a practical confidence score for any API-only LLM so teams can detect risky outputs, route uncertain cases to humans or alternative models, and improve downstream metrics without retraining models.

Summary TLDR

BSDETECTOR is a black-box method that estimates a numeric confidence for any LLM answer by (1) sampling diverse alternative outputs and measuring semantic contradiction with an NLI model (Observed Consistency) and (2) asking the LLM to self-evaluate via multiple-choice prompts (Self-reflection). Combining these yields confidence scores that better flag incorrect answers (AUROC gains across math, commonsense, and trivia) and let you (a) pick the safest answer among sampled outputs, and (b) improve reliability of automated LLM-based evaluation by routing or dropping low-confidence cases. The method costs extra API calls but requires no model fine-tuning.
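As a rough illustration of how the two signals could be combined, here is a minimal sketch. It assumes the NLI model has already produced one contradiction probability per sampled output and that each self-reflection reply has been reduced to a letter. The letter-to-score mapping, the weight `beta`, and all function names are illustrative assumptions, not values or APIs taken from this summary.

```python
from statistics import mean

# Assumed mapping from a multiple-choice reply to a number:
# A = "correct", B = "incorrect", C = "not sure".
CHOICE_SCORE = {"A": 1.0, "B": 0.0, "C": 0.5}


def observed_consistency(contradiction_probs):
    """Mean agreement across samples: 1 minus each NLI contradiction probability."""
    return mean([1.0 - p for p in contradiction_probs])


def self_reflection(choices):
    """Mean of the numeric scores assigned to the self-evaluation replies."""
    return mean([CHOICE_SCORE[c] for c in choices])


def bsdetector_confidence(contradiction_probs, choices, beta=0.7):
    """Weighted blend of the two signals; beta is an assumed weight in [0, 1]."""
    o = observed_consistency(contradiction_probs)
    s = self_reflection(choices)
    return beta * o + (1.0 - beta) * s
```

A call such as `bsdetector_confidence([0.1, 0.3], ["A", "C"], beta=0.5)` blends an observed consistency of 0.8 with a self-reflection score of 0.75. In practice the weight would be tuned on held-out data.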

Problem Statement

Large black-box LLMs often produce plausible but incorrect answers (hallucinate) and lack accessible token probabilities or training data. Practitioners need a way to know when an LLM's specific output is trustworthy without retraining or internal model access.

Main Contribution

BSDETECTOR: a black-box confidence estimator combining observed consistency (NLI-based contradiction checks across sampled outputs) and LLM self-reflection (multiple-choice confidence prompts).

Empirical results showing BSDETECTOR produces substantially better uncertainty scores (AUROC) than baselines on math, commonsense, and trivia QA using Text-Davinci-003 and GPT-3.5 Turbo.

Two practical applications: (a) choose the most confident answer among multiple samples to increase accuracy, and (b) improve automated LLM-based evaluation by routing low-confidence evaluations to humans or dropping them.

Key Findings

BSDETECTOR yields higher AUROC for flagging wrong answers than baselines across multiple QA datasets

Numbers: Text-Davinci-003 AUROC: GSM8K 0.867, CSQA 0.743, SVAMP 0.936, TriviaQA 0.828

On GPT-3.5 Turbo, BSDETECTOR reaches AUROC above 0.92 on the math benchmarks (GSM8K, SVAMP) and about 0.82 on TriviaQA

Numbers: GPT-3.5 Turbo AUROC: GSM8K 0.951, CSQA 0.769, SVAMP 0.927, TriviaQA 0.817
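For readers who want to compute the same kind of number on their own data: AUROC here can be read as the probability that a randomly chosen correct answer receives a higher confidence than a randomly chosen incorrect one. The sketch below computes it by direct pairwise comparison. The function name and the toy inputs are hypothetical, and the quadratic loop is only sensible for small evaluation sets.

```python
def auroc(confidences, is_correct):
    """Fraction of (correct, incorrect) pairs ranked in the right order; ties count half."""
    pos = [c for c, ok in zip(confidences, is_correct) if ok]
    neg = [c for c, ok in zip(confidences, is_correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: confidence scores and whether each answer was actually right.
toy_conf = [0.9, 0.2, 0.8, 0.3]
toy_correct = [True, False, False, True]
toy_auroc = auroc(toy_conf, toy_correct)
```

A value of 0.5 means the score is no better than chance at separating right from wrong answers, and 1.0 means perfect separation.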

Selecting the sampled answer with highest BSDETECTOR score improves accuracy vs. the single reference answer

Numbers: Example: GPT-3.5 Turbo SVAMP accuracy rose 75.3% → 82.0% (+6.7pp)
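The selection step itself is small once every candidate has a confidence score. The sketch below assumes the candidates (for example the temperature-0 reference answer plus the sampled alternatives) have already been scored. The function name and the example scores are hypothetical.

```python
def pick_most_confident(candidates, confidence_of):
    """Return the candidate with the highest confidence (first one wins on ties)."""
    if not candidates:
        raise ValueError("no candidate answers to choose from")
    return max(candidates, key=confidence_of)


# Hypothetical candidate answers mapped to made-up confidence scores.
scores = {"12": 0.41, "15": 0.83, "18": 0.37}
best = pick_most_confident(list(scores), scores.get)
```

Each candidate needs its own confidence estimate, so this selection multiplies the number of scoring calls by the number of candidates.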

Using BSDETECTOR to triage LLM-based evaluations reduces evaluation error

Numbers: Dropping bottom 20% low-confidence GPT-4 evaluations improved matching to human averages (Figures 4–5)
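A minimal sketch of this kind of triage, assuming each automated evaluation already carries a confidence score: sort by confidence, set aside the lowest fraction for human review (or drop it), and keep the rest. The function name, the default fraction, and the toy data are illustrative assumptions.

```python
import math


def triage(items, confidences, drop_fraction=0.2):
    """Split items into (kept, set_aside), setting aside the lowest-confidence fraction."""
    if len(items) != len(confidences):
        raise ValueError("items and confidences must have the same length")
    order = sorted(range(len(items)), key=lambda i: confidences[i])
    n_low = math.floor(len(items) * drop_fraction)
    low = set(order[:n_low])
    kept = [items[i] for i in range(len(items)) if i not in low]
    set_aside = [items[i] for i in order[:n_low]]
    return kept, set_aside


# Toy example: five evaluations with made-up confidence scores.
kept, set_aside = triage(["a", "b", "c", "d", "e"], [0.9, 0.1, 0.5, 0.7, 0.3])
```

Here `kept` preserves the original order, and `set_aside` lists the lowest-confidence items first so reviewers can start with the riskiest ones.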

More sampled outputs modestly improve confidence quality; CoT prompting has a larger effect

Numbers: Ablation: 5→10 outputs increased AUROC (e.g., GSM8K 0.951→0.961); removing CoT drops AUROC (GSM8K 0.951→0.837)

Results

AUROC (Text-Davinci-003, BSDETECTOR)

Value: GSM8K 0.867, CSQA 0.743, SVAMP 0.936, TriviaQA 0.828

Baseline: Likelihood / Temperature sampling / Self-reflection

AUROC (GPT-3.5 Turbo, BSDETECTOR)

Value: GSM8K 0.951, CSQA 0.769, SVAMP 0.927, TriviaQA 0.817

Baseline: Temperature sampling / Self-reflection

Accuracy

Value: Examples: GPT-3.5 Turbo SVAMP 75.3% → 82.0%; Text-Davinci-003 GSM8K 47.5% → 69.4% (dataset-specific)

Baseline: Reference single answer (temp=0)

Accuracy / MSE (GPT-4 as evaluator)

Value: GPT-4 evaluator accuracy on TriviaQA 83.67%; summary evaluator MSE ~0.707

Baseline: Human evaluation

Who Should Care

What To Try In 7 Days

Implement the BSDETECTOR pipeline: sample about 5 answers with CoT prompts and run an NLI contradiction check of each sampled answer against the reference output.

Add a two-question multiple-choice self-reflection prompt and average the scores into the BSDETECTOR confidence.

Use confidence to route low-scoring responses to human review or return 'I don't know' and measure reduction in critical errors.
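The three steps above can be prototyped before committing to a specific LLM client or NLI model. The sketch below takes the model calls as injected functions, so `sample_answer`, `contradiction_prob`, and `ask_llm` are hypothetical stand-ins for whatever API and NLI model you use. The prompt wording and the routing threshold are also assumptions, not text taken from the paper.

```python
# Hypothetical two-question self-reflection prompts; the wording is illustrative only.
REFLECTION_PROMPTS = [
    "Question: {question}\nProposed answer: {answer}\n"
    "Is the proposed answer (A) correct, (B) incorrect, or (C) are you not sure? "
    "Reply with one letter.",
    "Question: {question}\nProposed answer: {answer}\n"
    "Check the answer again. Is it (A) correct, (B) incorrect, or (C) are you not sure? "
    "Reply with one letter.",
]


def collect_signals(question, reference, sample_answer, contradiction_prob, ask_llm, k=5):
    """Gather the raw inputs for a confidence score for one reference answer."""
    # Step 1: sample k alternative answers and compare each to the reference.
    samples = [sample_answer(question) for _ in range(k)]
    contradictions = [contradiction_prob(reference, s) for s in samples]
    # Step 2: ask the LLM to grade the reference answer via multiple choice.
    replies = [
        ask_llm(p.format(question=question, answer=reference))
        for p in REFLECTION_PROMPTS
    ]
    choices = [r.strip().upper()[:1] for r in replies]
    return contradictions, choices


def route(confidence, threshold=0.5):
    """Step 3: send low-confidence answers to a human instead of returning them."""
    return "answer" if confidence >= threshold else "human_review"
```

Passing the model calls in as arguments keeps the sketch testable with simple stubs. How the collected signals are turned into a single score, and where the threshold sits, should be decided on a labeled validation set.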

Reproducibility

Data Urls

  • GSM8K
  • SVAMP
  • CommonsenseQA
  • TriviaQA
  • Summarize-from-feedback

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires extra API calls (sampling + self-reflection), increasing latency and cost.
  • Observed Consistency relies on an external NLI model, which can misjudge short or single-token answers.
  • Self-reflection prompts can be overconfident if asked as continuous scores; authors use discrete multiple-choice to mitigate this.
  • Open-domain datasets (TriviaQA) need manual validation because gold answers may not cover all valid responses.

When Not To Use

  • Tight latency or budget constraints where extra sampling is infeasible.
  • Tasks with extremely short, single-token answers that break NLI contradiction checks.
  • Systems that cannot tolerate any additional API calls or prompting complexity.

Failure Modes

  • If the LLM consistently produces the same wrong answer, observed consistency stays high and the error can go unflagged.
  • NLI model misclassifies contradictions, producing misleading confidence scores.
  • Self-reflection can be swayed by an answer that merely sounds plausible to the LLM, so it may fail to detect factual errors.
  • Dropping low-confidence evaluations can bias downstream metrics if low-confidence items are not missing at random.

Core Entities

Models

  • Text-Davinci-003
  • GPT-3.5 Turbo
  • GPT-4
  • DeBERTa-large (NLI)

Metrics

  • AUROC
  • Accuracy
  • MSE

Datasets

  • GSM8K
  • SVAMP
  • CommonsenseQA (CSQA)
  • TriviaQA
  • Summarize-from-feedback

Benchmarks

  • math word problems
  • commonsense QA
  • open-domain trivia
  • summary quality