Overview
The method is simple and has been tested on several public QA datasets. It requires extra API calls and depends on the quality of the NLI model, but the empirical evidence is solid for QA-style tasks.
Citations: 7
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 55%
Production readiness: 70%
Novelty: 50%
Why It Matters For Business
BSDETECTOR gives a practical confidence score for any API-only LLM so teams can detect risky outputs, route uncertain cases to humans or alternative models, and improve downstream metrics without retraining models.
Summary TLDR
BSDETECTOR is a black-box method that estimates a numeric confidence for any LLM answer by (1) sampling diverse alternative outputs and measuring semantic contradiction with an NLI model (Observed Consistency) and (2) asking the LLM to self-evaluate via multiple-choice prompts (Self-reflection). Combining these yields confidence scores that better flag incorrect answers (AUROC gains across math, commonsense, and trivia) and let you (a) pick the safest answer among sampled outputs, and (b) improve reliability of automated LLM-based evaluation by routing or dropping low-confidence cases. The method costs extra API calls but requires no model fine-tuning.
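The two signals described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `ContradictionFn` interface, the A/B/C score mapping, the `weight` parameter (0.5 gives a plain average), and the `toy_contradiction` stand-in are all assumptions made for the example. A real setup would replace `toy_contradiction` with an NLI model that returns a contradiction probability.

```python
from typing import Callable, Sequence

# Hypothetical interface: returns P(contradiction) in [0, 1] for a pair of texts.
ContradictionFn = Callable[[str, str], float]

# Assumed mapping from multiple-choice self-reflection replies to scores:
# A = "correct", B = "incorrect", C = "not sure".
REFLECTION_SCORES = {"A": 1.0, "B": 0.0, "C": 0.5}


def observed_consistency(
    answer: str, samples: Sequence[str], p_contradiction: ContradictionFn
) -> float:
    """Mean agreement (1 - contradiction probability) between answer and samples."""
    if not samples:
        raise ValueError("need at least one sampled output")
    return sum(1.0 - p_contradiction(answer, s) for s in samples) / len(samples)


def self_reflection(choices: Sequence[str]) -> float:
    """Average score over the LLM's multiple-choice self-evaluation replies."""
    if not choices:
        raise ValueError("need at least one self-reflection reply")
    return sum(REFLECTION_SCORES[c] for c in choices) / len(choices)


def combined_confidence(
    consistency: float, reflection: float, weight: float = 0.5
) -> float:
    """Blend the two signals; weight=0.5 is a plain average (an assumption)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * consistency + (1.0 - weight) * reflection


def toy_contradiction(a: str, b: str) -> float:
    """Stand-in for a real NLI model: 0.0 if normalized texts match, else 1.0."""
    return 0.0 if a.strip().lower() == b.strip().lower() else 1.0
```

For example, an answer of `"42"` compared against samples `["42", "42", "41", "42"]` with the toy function gives an observed consistency of 0.75, since one of the four samples disagrees.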
Problem Statement
Large black-box LLMs often produce plausible but incorrect answers (hallucinations) and expose neither token probabilities nor training data. Practitioners need a way to judge whether a specific LLM output is trustworthy without retraining the model or accessing its internals.
Main Contribution
BSDETECTOR: a black-box confidence estimator combining observed consistency (NLI-based contradiction checks across sampled outputs) and LLM self-reflection (multiple-choice confidence prompts).
Empirical results showing BSDETECTOR produces substantially better uncertainty scores (AUROC) than baselines on math, commonsense, and trivia QA using Text-Davinci-003 and GPT-3.5 Turbo.
Key Findings
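BSDETECTOR yields higher AUROC for flagging wrong answers than the baselines (likelihood, temperature sampling, self-reflection alone) across GSM8K, CSQA, SVAMP, and TriviaQA (Table 1).
On GPT-3.5 Turbo, BSDETECTOR reached an AUROC of 0.951 on GSM8K, 0.927 on SVAMP, and 0.817 on TriviaQA (Table 1).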
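For readers less familiar with the metric, AUROC here can be read as the probability that a randomly chosen correct answer receives a higher confidence score than a randomly chosen incorrect one. The sketch below computes it by direct pairwise comparison, counting ties as half. The function name and the toy inputs in the usage note are illustrative only and are not data from the paper.

```python
from typing import Sequence


def auroc(confidences: Sequence[float], is_correct: Sequence[bool]) -> float:
    """Pairwise AUROC: P(conf of a correct answer > conf of an incorrect one)."""
    if len(confidences) != len(is_correct):
        raise ValueError("inputs must have the same length")
    correct = [c for c, ok in zip(confidences, is_correct) if ok]
    incorrect = [c for c, ok in zip(confidences, is_correct) if not ok]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for pos in correct:
        for neg in incorrect:
            if pos > neg:
                wins += 1.0  # correct answer ranked above the incorrect one
            elif pos == neg:
                wins += 0.5  # ties count as half
    return wins / (len(correct) * len(incorrect))
```

With made-up scores `[0.9, 0.8, 0.4, 0.3]` and correctness labels `[True, False, True, False]`, three of the four correct/incorrect pairs are ordered properly, so the function returns 0.75. A score of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic, so a library routine is preferable on large evaluation sets.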
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| AUROC (Text-Davinci-003, BSDETECTOR) | GSM8K 0.867, CSQA 0.743, SVAMP 0.936, TriviaQA 0.828 | Likelihood / Temperature sampling / Self-reflection | Substantial AUROC gains vs baselines (see Table 1) | GSM8K, CSQA, SVAMP, TriviaQA | Table 1 shows BSDETECTOR outperforms baselines across datasets | Table 1 |
| AUROC (GPT-3.5 Turbo, BSDETECTOR) | GSM8K 0.951, CSQA 0.769, SVAMP 0.927, TriviaQA 0.817 | Temperature sampling / Self-reflection | Large improvements vs temperature sampling on same tasks | GSM8K, CSQA, SVAMP, TriviaQA | Table 1 shows higher AUROC for BSDETECTOR | Table 1 |
What To Try In 7 Days
Implement the BSDETECTOR pipeline: sample about 5 answers with CoT prompts and run an NLI contradiction check of each sample against the reference output.
Add a two-question multiple-choice self-reflection prompt and average the scores into the BSDETECTOR confidence.
Use confidence to route low-scoring responses to human review or return 'I don't know' and measure reduction in critical errors.
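The routing step in the list above could look roughly like the sketch below, which assumes each candidate answer already carries a confidence score in [0, 1]. The `ScoredAnswer` type, the function names, the 0.7 threshold, and the route labels are hypothetical choices for illustration. A suitable threshold would need to be tuned on held-out data for the task at hand.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ScoredAnswer:
    """A candidate answer with its confidence score (hypothetical container)."""
    text: str
    confidence: float


def pick_safest(candidates: Sequence[ScoredAnswer]) -> ScoredAnswer:
    """Return the sampled answer with the highest confidence score."""
    if not candidates:
        raise ValueError("need at least one candidate answer")
    return max(candidates, key=lambda a: a.confidence)


def route(
    answer: ScoredAnswer,
    threshold: float = 0.7,  # assumed value; tune per task
    fallback: str = "I don't know",
) -> Tuple[str, str]:
    """Return (route, text): serve confident answers, escalate the rest."""
    if answer.confidence >= threshold:
        return ("auto", answer.text)
    return ("human_review", fallback)
```

A typical call would be `route(pick_safest(candidates))`. Logging which answers were escalated, and how many of those were in fact wrong, gives the reduction in critical errors that the last step asks you to measure.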
Risks & Boundaries
Limitations
Requires extra API calls (sampling + self-reflection), increasing latency and cost.
Observed Consistency relies on an external NLI model, which can misjudge short or single-token answers.
When Not To Use
Tight latency or budget constraints where extra sampling is infeasible.
Tasks with extremely short, single-token answers that break NLI contradiction checks.
Failure Modes
If the LLM consistently produces the same wrong answer, observed consistency can stay high and the error goes undetected.
NLI model misclassifies contradictions, producing misleading confidence scores.

