Retrieve both a claim and its negation from multiple sources, aggregate the evidence, and use LLM log-probs to expose cross-source disagreement.

February 21, 2026 · 7 min

Overview

Decision Snapshot: Needs Validation

The approach is practical and reproducible: it combines standard retrieval tools with LLM zero-shot scoring and adds a useful confidence signal, but it needs calibration and time-aware retrieval before high-stakes use.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 40%

Production readiness: 50%

Novelty: 50%

Authors

Md Badsha Biswas, Ozlem Uzuner

Why It Matters For Business

Aggregating evidence from multiple sources and retrieving negated queries expands coverage and surfaces disagreements, improving zero-shot claim checks and making automated decisions more transparent.

Summary TLDR

This paper builds an open-domain claim verification pipeline that (1) generates a claim's explicit negation, (2) retrieves sentence-level evidence for both forms from Wikipedia, PubMed, and Google, (3) deduplicates and merges per-source sentences into a single evidence set, and (4) asks zero-shot LLMs to verify the claim. Negated retrieval and multi-source aggregation give consistent zero-shot gains (typical +2–10% accuracy, +2–8% macro F1 on evaluated datasets). The system also reports per-source label log-probabilities so users can see when sources disagree. Code is available.
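
Steps (1) and (2) of the pipeline can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_negate`, `make_toy_retriever`, and `dual_perspective_retrieve` are hypothetical names, the negation is a naive template (a real system would likely prompt an LLM), and the retrievers are in-memory stand-ins for Wikipedia, PubMed, and Google search.

```python
from typing import Callable, Dict, List

Retriever = Callable[[str], List[str]]


def toy_negate(claim: str) -> str:
    # Placeholder negation (assumption): a fixed template, not an LLM call.
    return "It is not true that " + claim[:1].lower() + claim[1:]


def make_toy_retriever(corpus: List[str], k: int = 2) -> Retriever:
    # Placeholder retriever (assumption): rank sentences by shared lowercase words.
    def retrieve(query: str) -> List[str]:
        query_words = set(query.lower().split())
        scored = [(len(query_words & set(s.lower().split())), s) for s in corpus]
        ranked = sorted(scored, key=lambda pair: -pair[0])
        return [s for score, s in ranked[:k] if score > 0]

    return retrieve


def dual_perspective_retrieve(
    claim: str,
    negate: Callable[[str], str],
    retrievers: Dict[str, Retriever],
) -> Dict[str, Dict[str, List[str]]]:
    # Query every source with both the original claim and its negation.
    queries = {"original": claim, "negated": negate(claim)}
    return {
        source: {form: retrieve(query) for form, query in queries.items()}
        for source, retrieve in retrievers.items()
    }


retrievers = {
    "wikipedia": make_toy_retriever(
        ["Vitamin C is an essential nutrient.", "Paris is in France."]
    ),
    "pubmed": make_toy_retriever(
        ["Trials show vitamin C does not prevent colds."]
    ),
}
evidence = dual_perspective_retrieve("Vitamin C prevents colds", toy_negate, retrievers)
```

Passing the negation function and the retrievers in as callables keeps the sketch independent of any particular search API, so each stub can be swapped for a real client without touching the retrieval loop.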

Problem Statement

Most automated fact-checkers rely on a single knowledge source and only retrieve evidence that supports the claim. That narrows coverage and hides disagreements between sources. We need a practical method that finds both supporting and contradicting evidence across multiple sources and shows when sources disagree.

Main Contribution

A dual-perspective retrieval pipeline that generates a claim's explicit negation and retrieves evidence for both the original and negated claim.

A multi-source aggregation method that deduplicates and ranks sentence-level evidence from Wikipedia, PubMed, and Google into a single evidence set per claim.
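
The aggregation step can be illustrated with a short sketch. The specifics here are assumptions rather than the paper's method: duplicates are detected by lowercasing and stripping punctuation, and sentences are ranked by word overlap with the claim. The names `normalize` and `merge_evidence` are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def normalize(sentence: str) -> str:
    # Lowercase, drop punctuation, collapse whitespace (assumed dedup key).
    return " ".join(re.sub(r"[^a-z0-9\s]", "", sentence.lower()).split())


def merge_evidence(
    claim: str, per_source: Dict[str, List[str]], top_k: int = 5
) -> List[Tuple[str, List[str]]]:
    # Map each normalized sentence to (first-seen text, sources that returned it).
    merged: Dict[str, Tuple[str, List[str]]] = {}
    for source, sentences in per_source.items():
        for sentence in sentences:
            key = normalize(sentence)
            if not key:
                continue
            _, sources = merged.setdefault(key, (sentence, []))
            if source not in sources:
                sources.append(source)

    claim_words = set(normalize(claim).split())

    def overlap(item: Tuple[str, Tuple[str, List[str]]]) -> int:
        # Assumed ranking signal: number of words shared with the claim.
        return len(claim_words & set(item[0].split()))

    ranked = sorted(merged.items(), key=overlap, reverse=True)
    return [(text, sources) for _, (text, sources) in ranked[:top_k]]


merged = merge_evidence(
    "Vitamin C prevents colds",
    {
        "wikipedia": ["Vitamin C is an essential nutrient.", "Paris is in France."],
        "google": ["vitamin c is an essential nutrient", "Vitamin C does not prevent colds."],
    },
    top_k=2,
)
```

Keeping the list of contributing sources next to each sentence preserves provenance after the merge, which is what later allows per-source comparison.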

Key Findings

Retrieving both the claim and its negation (dual-perspective) improves zero-shot verification.

Numbers: +2–10% accuracy; +2–8% macro F1 (typical gains across datasets)

Practical Use: Add negated-query retrieval to increase the chance of finding refuting evidence and modestly boost verification performance without model finetuning.

Evidence Ref: Results section; Tables 1–2

Aggregating Wikipedia, PubMed, and Google often outperforms any single source.

Numbers: Example: SciFact, Llama 70B, merged 0.610 vs. Wikipedia 0.430 (+41.9% relative)

Practical Use: Merge diverse sources to widen coverage; expect meaningful gains on domain-mixed benchmarks, especially when a single source is weak.

Evidence Ref: Table 3 (SciFact; Llama 70B)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 0.610 (SciFact, Llama 70B, merged W+P+G) | 0.430 (Wikipedia only) | +0.180 (≈ +41.9% rel. vs. Wikipedia) | SciFact | Table 3, merged vs. per-source numbers | Table 3
Accuracy | 0.607 (SciFact, Llama 70B, Google, original + negated) | 0.550 (original only) | +0.057 (+10.4% rel.) | SciFact (Llama 70B + Google) | Table 1, original vs. original + negated | Table 1
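
The absolute and relative deltas reported above follow from simple arithmetic on the accuracy values. The short check below reproduces them; the helper name `deltas` is hypothetical.

```python
def deltas(value: float, baseline: float) -> tuple:
    # Absolute gain in accuracy points and gain relative to the baseline, in percent.
    absolute = value - baseline
    relative_pct = 100 * absolute / baseline
    return round(absolute, 3), round(relative_pct, 1)


merged_vs_wikipedia = deltas(0.610, 0.430)   # merged sources vs. Wikipedia only
negated_vs_original = deltas(0.607, 0.550)   # original + negated vs. original only
```

Note that the relative figures are measured against the baseline, so a weak single-source baseline makes the same absolute gain look larger.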

What To Try In 7 Days

Add an explicit negated-query stage: generate a simple negation for each claim and run retrieval for both forms.

Pull sentence-level evidence from at least two diverse sources (e.g., Wikipedia + web search) and deduplicate before calling an LLM.

Log and visualize per-source label log-probs to flag claims with source disagreement for human review.
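
The third item can be prototyped without any model call once per-source label log-probabilities are logged. The sketch below is one possible way to flag disagreement, not the paper's procedure: the label names `SUPPORTS` and `REFUTES`, the example values, and the functions `normalize_logprobs` and `source_disagreement` are all assumptions for illustration.

```python
import math
from typing import Dict


def normalize_logprobs(logprobs: Dict[str, float]) -> Dict[str, float]:
    # Softmax over the candidate labels so the probabilities sum to 1.
    peak = max(logprobs.values())
    exps = {label: math.exp(value - peak) for label, value in logprobs.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def source_disagreement(per_source: Dict[str, Dict[str, float]]) -> dict:
    # Compare the most likely label from each source; flag the claim if they differ.
    probs = {source: normalize_logprobs(lp) for source, lp in per_source.items()}
    top_labels = {source: max(p, key=p.get) for source, p in probs.items()}
    return {
        "probs": probs,
        "top_labels": top_labels,
        "disagree": len(set(top_labels.values())) > 1,
    }


# Made-up log-probabilities for illustration only.
report = source_disagreement(
    {
        "wikipedia": {"SUPPORTS": -0.2, "REFUTES": -1.8},
        "pubmed": {"SUPPORTS": -2.3, "REFUTES": -0.1},
    }
)
```

Claims where `report["disagree"]` is true are natural candidates for the human-review queue; a stricter rule could also compare the probability gap between sources rather than only the top label.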

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Context window limits can truncate multi-document evidence and harm verification.

No time-aware retrieval: outdated evidence can mislead verification of time-sensitive claims.

When Not To Use

For high-stakes decisions without calibration and human review.

When evidence requires long multi-document chains exceeding context windows.

Failure Modes

LLM hallucination or label mis-mapping despite relevant evidence.

Outdated or misleading web evidence yields incorrect veracity labels.

Core Entities

Models

Llama 3.3 70B, Llama 3.1 405B, Mistral-Large, Qwen 2.5, Phi-4

Metrics

Accuracy, Precision, Recall, macro F1, label log-probability

Datasets

SciFact, PubHealth, Averitec, LIAR

Benchmarks

SciFact, PubHealth, Averitec, LIAR