Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.55
Citation Count
7
Why It Matters For Business
BSDETECTOR gives a practical confidence score for any API-only LLM so teams can detect risky outputs, route uncertain cases to humans or alternative models, and improve downstream metrics without retraining models.
Summary TLDR
BSDETECTOR is a black-box method that estimates a numeric confidence for any LLM answer by (1) sampling diverse alternative outputs and measuring semantic contradiction with an NLI model (Observed Consistency) and (2) asking the LLM to self-evaluate via multiple-choice prompts (Self-reflection). Combining these yields confidence scores that better flag incorrect answers (AUROC gains across math, commonsense, and trivia) and let you (a) pick the safest answer among sampled outputs, and (b) improve reliability of automated LLM-based evaluation by routing or dropping low-confidence cases. The method costs extra API calls but requires no model fine-tuning.
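The way the two signals combine can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the contradiction probabilities are assumed to come from an NLI model that has already been run, the mapping of multiple-choice answers to numbers (A = 1.0, B = 0.0, C = 0.5) is hypothetical, and the weight `beta` is an illustrative parameter rather than a value taken from the paper.

```python
from statistics import mean

# Hypothetical mapping from multiple-choice self-reflection answers to scores.
# Assumed options: A = "correct", B = "incorrect", C = "not sure".
REFLECTION_SCORES = {"A": 1.0, "B": 0.0, "C": 0.5}


def observed_consistency(contradiction_probs):
    """Average agreement between the reference answer and the sampled answers.

    Each entry is the NLI model's probability that one sampled answer
    contradicts the reference answer, so 1 - p is used as the agreement.
    """
    if not contradiction_probs:
        raise ValueError("need at least one sampled answer")
    return mean(1.0 - p for p in contradiction_probs)


def self_reflection(choices):
    """Average the numeric scores of the LLM's multiple-choice self-ratings."""
    if not choices:
        raise ValueError("need at least one self-reflection answer")
    return mean(REFLECTION_SCORES[c] for c in choices)


def confidence(contradiction_probs, choices, beta=0.5):
    """Blend both signals; beta (an assumed weight) is the share given to consistency."""
    o = observed_consistency(contradiction_probs)
    s = self_reflection(choices)
    return beta * o + (1.0 - beta) * s
```

For example, `confidence([0.1, 0.2, 0.0, 0.3, 0.4], ["A", "C"])` blends a consistency of 0.8 with a self-reflection score of 0.75.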
Problem Statement
Large black-box LLMs often produce plausible but incorrect answers (hallucinations), and they expose neither token probabilities nor training data. Practitioners need a way to know when a specific LLM output is trustworthy without retraining or internal model access.
Main Contribution
BSDETECTOR: a black-box confidence estimator combining observed consistency (NLI-based contradiction checks across sampled outputs) and LLM self-reflection (multiple-choice confidence prompts).
Empirical results showing BSDETECTOR produces substantially better uncertainty scores (AUROC) than baselines on math, commonsense, and trivia QA using Text-Davinci-003 and GPT-3.5 Turbo.
Two practical applications: (a) choose the most confident answer among multiple samples to increase accuracy, and (b) improve automated LLM-based evaluation by routing low-confidence evaluations to humans or dropping them.
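Application (a) amounts to scoring every candidate answer and keeping the top one. A minimal sketch, assuming a hypothetical `score_fn` that returns a confidence in [0, 1] for a given answer (for instance, a BSDETECTOR-style score computed elsewhere):

```python
def pick_most_confident(candidates, score_fn):
    """Return (answer, score) for the candidate with the highest confidence.

    candidates: the reference answer followed by the sampled alternatives.
    score_fn:   hypothetical callable mapping an answer to a confidence in [0, 1].
    """
    if not candidates:
        raise ValueError("no candidate answers")
    scored = [(answer, score_fn(answer)) for answer in candidates]
    # max() returns the first maximal item, so listing the reference answer
    # first means it wins whenever scores are tied.
    return max(scored, key=lambda pair: pair[1])


# Toy usage with made-up scores standing in for real confidence estimates.
toy_scores = {"42": 0.4, "41": 0.9, "40": 0.9}
best_answer, best_score = pick_most_confident(["42", "41", "40"], toy_scores.get)
```

Each candidate needs its own confidence estimate, so the cost of this selection step grows with the number of sampled answers.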
Key Findings
BSDETECTOR yields higher AUROC for flagging wrong answers than baselines across multiple QA datasets
On GPT-3.5 Turbo BSDETECTOR achieved very strong AUROC on math/trivia benchmarks
Selecting the sampled answer with highest BSDETECTOR score improves accuracy vs. the single reference answer
Using BSDETECTOR to triage LLM-based evaluations reduces evaluation error
More sampled outputs and CoT prompting modestly improve confidence quality
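AUROC, the metric behind these findings, can be read as the probability that a randomly chosen correct answer receives a higher confidence than a randomly chosen incorrect one. The sketch below computes it directly from that definition in plain Python; it is an illustration for small evaluation sets, and the function name and inputs are assumptions rather than anything from the paper.

```python
def auroc(confidences, is_correct):
    """Probability that a correct answer outranks an incorrect one (ties count 0.5)."""
    pos = [c for c, ok in zip(confidences, is_correct) if ok]
    neg = [c for c, ok in zip(confidences, is_correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    # Compare every (correct, incorrect) pair; quadratic, fine for small sets.
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A score of 0.5 means the confidence is no better than chance at separating right from wrong answers; 1.0 means every correct answer is ranked above every incorrect one.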
Results
AUROC (Text-Davinci-003, BSDETECTOR)
AUROC (GPT-3.5 Turbo, BSDETECTOR)
Accuracy
Accuracy
Who Should Care
What To Try In 7 Days
Implement the BSDETECTOR pipeline: sample ~5 answers with CoT prompts and run an NLI contradiction check of each sample against the reference output.
Add a two-question multiple-choice self-reflection prompt and average the scores into the BSDETECTOR confidence.
Use the confidence to route low-scoring responses to human review or return 'I don't know', then measure the reduction in critical errors.
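The three steps above can be wired together as a small triage function. This is a sketch under stated assumptions: `sample_fn`, `contradiction_fn`, and `reflect_fn` are hypothetical callables you would back with your own LLM client and NLI model, the equal weighting of the two signals is a simplification, and the `threshold` of 0.7 is a placeholder to tune on your own data.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    answer: str
    confidence: float
    route: str  # "auto" or "human_review"


def triage(question, reference, sample_fn, contradiction_fn, reflect_fn,
           n_samples=5, threshold=0.7, fallback="I don't know"):
    """Score the reference answer and decide whether to return it or escalate.

    sample_fn(question, n)        -> list of n alternative answers (hypothetical)
    contradiction_fn(ref, sample) -> NLI contradiction probability in [0, 1] (hypothetical)
    reflect_fn(question, answer)  -> list of self-reflection ratings in [0, 1] (hypothetical)
    """
    samples = sample_fn(question, n_samples)
    if not samples:
        raise ValueError("sample_fn returned no answers")
    # Step 1: observed consistency = mean agreement with the reference answer.
    agreement = [1.0 - contradiction_fn(reference, s) for s in samples]
    consistency = sum(agreement) / len(agreement)
    # Step 2: self-reflection = mean of the multiple-choice ratings.
    ratings = reflect_fn(question, reference)
    if not ratings:
        raise ValueError("reflect_fn returned no ratings")
    reflection = sum(ratings) / len(ratings)
    # Assumed equal weighting of the two signals.
    score = (consistency + reflection) / 2.0
    # Step 3: route on the confidence score.
    if score >= threshold:
        return Decision(reference, score, "auto")
    return Decision(fallback, score, "human_review")
```

Logging every `Decision` alongside the eventual human verdict gives the data needed to measure how many critical errors the routing actually catches.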
Reproducibility
Data Urls
- GSM8K
- SVAMP
- CommonsenseQA
- TriviaQA
- Summarize-from-feedback
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires extra API calls (sampling + self-reflection), increasing latency and cost.
- Observed Consistency relies on an external NLI model, which can misjudge short or single-token answers.
- Self-reflection prompts can yield overconfident ratings when the LLM is asked for a continuous score; the authors use discrete multiple-choice options to mitigate this.
- Open-domain datasets (TriviaQA) need manual validation because gold answers may not cover all valid responses.
When Not To Use
- Tight latency or budget constraints where extra sampling is infeasible.
- Tasks with extremely short, single-token answers that break NLI contradiction checks.
- Systems that cannot tolerate any additional API calls or prompting complexity.
Failure Modes
- If the LLM consistently produces the same wrong answer, observed consistency stays high and the error goes undetected.
- NLI model misclassifies contradictions, producing misleading confidence scores.
- During self-reflection the LLM may find its own answer superficially persuasive and fail to detect factual errors.
- Dropping low-confidence evaluations can bias downstream metrics if low-confidence items are not missing at random.
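The last failure mode is easy to check for before trusting a filtered metric. The sketch below is an illustrative helper (its name and inputs are assumptions, not part of the paper) that reports how much of the data survives a confidence threshold and how far the metric's mean moves once low-confidence items are dropped.

```python
def coverage_and_shift(values, confidences, threshold):
    """Compare a metric's mean over all items with its mean over kept items.

    values:      per-item metric values (e.g. 1 for a pass, 0 for a fail).
    confidences: per-item confidence scores in [0, 1].
    Returns (coverage, full_mean, kept_mean).
    """
    if not values or len(values) != len(confidences):
        raise ValueError("values and confidences must be non-empty and aligned")
    kept = [v for v, c in zip(values, confidences) if c >= threshold]
    if not kept:
        raise ValueError("threshold drops every item")
    coverage = len(kept) / len(values)
    full_mean = sum(values) / len(values)
    kept_mean = sum(kept) / len(kept)
    return coverage, full_mean, kept_mean


# Toy data in which the low-confidence items are exactly the failing ones.
cov, full, kept = coverage_and_shift(
    [1, 1, 0, 0, 1, 0], [0.9, 0.8, 0.2, 0.3, 0.7, 0.4], threshold=0.5
)
```

In this toy data, half of the items are dropped and the metric's mean rises from 0.5 to 1.0; a large gap between `full_mean` and `kept_mean` is a sign that the dropped items are not missing at random.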
Core Entities
Models
- Text-Davinci-003
- GPT-3.5 Turbo
- GPT-4
- DeBERTa-large (NLI)
Metrics
- AUROC
- Accuracy
- MSE
Datasets
- GSM8K
- SVAMP
- CommonsenseQA (CSQA)
- TriviaQA
- Summarize-from-feedback
Benchmarks
- math word problems
- commonsense QA
- open-domain trivia
- summary quality

