BSDETECTOR: add a confidence score to any black-box LLM to flag bad answers and pick safer outputs

August 30, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The idea is simple and tested across multiple public QA datasets; it needs extra API calls and depends on NLI quality, but the empirical evidence is solid for QA-style tasks.

Citations: 7

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 55%

Production readiness: 70%

Novelty: 50%

Authors

Jiuhai Chen, Jonas Mueller

Why It Matters For Business

BSDETECTOR gives a practical confidence score for any API-only LLM so teams can detect risky outputs, route uncertain cases to humans or alternative models, and improve downstream metrics without retraining models.

Summary TLDR

BSDETECTOR is a black-box method that estimates a numeric confidence for any LLM answer by (1) sampling diverse alternative outputs and measuring semantic contradiction with an NLI model (Observed Consistency) and (2) asking the LLM to self-evaluate via multiple-choice prompts (Self-reflection). Combining these yields confidence scores that better flag incorrect answers (AUROC gains across math, commonsense, and trivia) and let you (a) pick the safest answer among sampled outputs, and (b) improve reliability of automated LLM-based evaluation by routing or dropping low-confidence cases. The method costs extra API calls but requires no model fine-tuning.
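The two-part score described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `contradiction_prob` is a hypothetical callable standing in for an NLI model, the mapping of self-reflection choices to numbers (`CHOICE_SCORES`) is assumed for illustration, and the mixing weight `beta` is a hypothetical parameter that this summary does not specify.

```python
from typing import Callable, Dict, Sequence

# Assumed mapping from multiple-choice self-reflection answers to scores.
CHOICE_SCORES: Dict[str, float] = {"correct": 1.0, "not sure": 0.5, "incorrect": 0.0}


def observed_consistency(
    reference: str,
    samples: Sequence[str],
    contradiction_prob: Callable[[str, str], float],
) -> float:
    """Mean of (1 - contradiction probability) between the reference and each sample."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    scores = [1.0 - contradiction_prob(reference, s) for s in samples]
    return sum(scores) / len(scores)


def self_reflection(choices: Sequence[str]) -> float:
    """Mean score over the LLM's multiple-choice self-evaluations."""
    if not choices:
        raise ValueError("need at least one self-reflection answer")
    return sum(CHOICE_SCORES[c] for c in choices) / len(choices)


def combined_confidence(consistency: float, reflection: float, beta: float = 0.5) -> float:
    """Weighted average of the two signals; beta is a hypothetical weight."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be in [0, 1]")
    return beta * consistency + (1.0 - beta) * reflection


def toy_contradiction_prob(a: str, b: str) -> float:
    """Toy stand-in for an NLI model: differing strings count as contradictions."""
    return 0.0 if a.strip().lower() == b.strip().lower() else 1.0


# Made-up example: four of five sampled answers agree with the reference.
consistency = observed_consistency("42", ["42", "42", "41", "42", "42"], toy_contradiction_prob)
reflection = self_reflection(["correct", "not sure"])
confidence = combined_confidence(consistency, reflection)
print(f"consistency={consistency:.2f} reflection={reflection:.2f} confidence={confidence:.3f}")
```

In a real pipeline the toy comparison would be replaced by calls to an actual NLI model, and `beta` would be tuned on held-out data.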

Problem Statement

Large black-box LLMs often produce plausible but incorrect answers (hallucinate) and lack accessible token probabilities or training data. Practitioners need a way to know when an LLM's specific output is trustworthy without retraining or internal model access.

Main Contribution

BSDETECTOR: a black-box confidence estimator combining observed consistency (NLI-based contradiction checks across sampled outputs) and LLM self-reflection (multiple-choice confidence prompts).

Empirical results showing BSDETECTOR produces substantially better uncertainty scores (AUROC) than baselines on math, commonsense, and trivia QA using Text-Davinci-003 and GPT-3.5 Turbo.

Key Findings

BSDETECTOR yields higher AUROC for flagging wrong answers than baselines across multiple QA datasets

Numbers: Text-Davinci-003 AUROC: GSM8K 0.867, CSQA 0.743, SVAMP 0.936, TriviaQA 0.828

Practical Use: Use BSDETECTOR to better detect incorrect LLM outputs on QA tasks, reducing blind trust in single answers.

Evidence Ref: Table 1 (Section 6.1)

On GPT-3.5 Turbo, BSDETECTOR reaches AUROC above 0.9 on the math benchmarks (GSM8K, SVAMP) and above 0.8 on TriviaQA

Numbers: GPT-3.5 Turbo AUROC: GSM8K 0.951, CSQA 0.769, SVAMP 0.927, TriviaQA 0.817

Practical Use: BSDETECTOR works even when token probabilities are unavailable; apply it to API-only models to flag risky outputs.

Evidence Ref: Table 1 (Section 6.1)

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| AUROC (Text-Davinci-003, BSDETECTOR) | GSM8K 0.867, CSQA 0.743, SVAMP 0.936, TriviaQA 0.828 | Likelihood / Temperature sampling / Self-reflection | Substantial AUROC gains vs baselines (see Table 1) | GSM8K, CSQA, SVAMP, TriviaQA | Table 1 shows BSDETECTOR outperforms baselines across datasets | Table 1 |
| AUROC (GPT-3.5 Turbo, BSDETECTOR) | GSM8K 0.951, CSQA 0.769, SVAMP 0.927, TriviaQA 0.817 | Temperature sampling / Self-reflection | Large improvements vs temperature sampling on same tasks | GSM8K, CSQA, SVAMP, TriviaQA | Table 1 shows higher AUROC for BSDETECTOR | Table 1 |
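For readers less familiar with the metric: AUROC here can be read as the probability that a randomly chosen correct answer receives a higher confidence score than a randomly chosen incorrect one (0.5 is chance, 1.0 is perfect ranking). The sketch below computes it by direct pairwise comparison. The confidences and correctness labels are made-up toy values, not numbers from the paper.

```python
from typing import Sequence


def auroc(confidences: Sequence[float], is_correct: Sequence[bool]) -> float:
    """Fraction of (correct, incorrect) pairs ranked in the right order; ties count half."""
    if len(confidences) != len(is_correct):
        raise ValueError("inputs must have the same length")
    pos = [c for c, ok in zip(confidences, is_correct) if ok]
    neg = [c for c, ok in zip(confidences, is_correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: one correct answer (0.4) is ranked below one incorrect answer (0.7).
toy_confidences = [0.9, 0.8, 0.7, 0.4, 0.3]
toy_correct = [True, True, False, True, False]
toy_auroc = auroc(toy_confidences, toy_correct)
print(f"AUROC = {toy_auroc:.3f}")
```

The pairwise loop is quadratic in the number of answers, which is fine for an evaluation set of a few thousand items; a rank-based implementation from a statistics library would be the usual choice at larger scale.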

What To Try In 7 Days

Implement the BSDETECTOR pipeline: sample ~5 answers with chain-of-thought (CoT) prompts and run an NLI contradiction check of each sample against the reference output.

Add a two-question multiple-choice self-reflection prompt and average the scores into the BSDETECTOR confidence.

Use confidence to route low-scoring responses to human review or return 'I don't know' and measure reduction in critical errors.
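The routing step, together with picking the safest answer among sampled outputs, might look like the sketch below. It assumes confidence scores have already been computed for each candidate. The `ScoredAnswer` type, the function names, and the `0.6` threshold are hypothetical choices for illustration; a real threshold would be tuned on labeled data.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ScoredAnswer:
    text: str
    confidence: float  # assumed to be in [0, 1]


def pick_most_confident(candidates: Sequence[ScoredAnswer]) -> ScoredAnswer:
    """Return the candidate with the highest confidence score."""
    if not candidates:
        raise ValueError("need at least one candidate answer")
    return max(candidates, key=lambda a: a.confidence)


def route(
    answer: ScoredAnswer,
    threshold: float = 0.6,  # hypothetical cutoff
    fallback: str = "I don't know",
) -> Tuple[str, str]:
    """Return (destination, text): serve the answer or send it to human review."""
    if answer.confidence >= threshold:
        return ("answer", answer.text)
    return ("human_review", fallback)


# Made-up candidates for a single question.
candidates = [
    ScoredAnswer("18", 0.82),
    ScoredAnswer("20", 0.35),
    ScoredAnswer("18 apples", 0.78),
]
best = pick_most_confident(candidates)
print(route(best))
print(route(ScoredAnswer("7", 0.2)))
```

Logging the destination for each request makes it straightforward to measure how many critical errors the threshold catches versus how much traffic it sends to reviewers.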

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

GSM8K, SVAMP, CommonsenseQA, TriviaQA, Summarize-from-feedback

Risks & Boundaries

Limitations

Requires extra API calls (sampling + self-reflection), increasing latency and cost.

Observed Consistency relies on an external NLI model, which can misjudge short or single-token answers.

When Not To Use

Tight latency or budget constraints where extra sampling is infeasible.

Tasks with extremely short, single-token answers that break NLI contradiction checks.

Failure Modes

If the LLM consistently produces the same wrong answer, observed consistency stays high and the error goes undetected.

NLI model misclassifies contradictions, producing misleading confidence scores.
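The first failure mode is easy to reproduce in miniature. In the toy sketch below, exact string agreement stands in for the NLI contradiction check (an assumption made to keep the example self-contained), and the question and answers are invented for illustration.

```python
from typing import Sequence


def toy_consistency(reference: str, samples: Sequence[str]) -> float:
    """Fraction of samples that agree with the reference (toy stand-in for NLI)."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    ref = reference.strip().lower()
    agree = sum(1 for s in samples if s.strip().lower() == ref)
    return agree / len(samples)


# Invented scenario: "What is the capital of Australia?" (true answer: Canberra).
ground_truth = "Canberra"
reference_answer = "Sydney"
sampled_answers = ["Sydney", "Sydney", "sydney", "Sydney", "Sydney"]

consistency_score = toy_consistency(reference_answer, sampled_answers)
answer_is_correct = reference_answer == ground_truth

# Every sample agrees, so consistency is maximal even though the answer is wrong.
print(f"consistency={consistency_score:.2f} correct={answer_is_correct}")
```

Agreement among samples measures how stable the model's answer is, not whether it is true, which is why a second signal such as self-reflection, or a human spot check, remains useful for systematic errors.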

Core Entities

Models

Text-Davinci-003, GPT-3.5 Turbo, GPT-4, DeBERTa-large (NLI)

Metrics

AUROC, Accuracy, MSE

Datasets

GSM8K, SVAMP, CommonsenseQA (CSQA), TriviaQA, Summarize-from-feedback

Benchmarks

math word problems, commonsense QA, open-domain trivia, summary quality