Retrieve evidence for both a claim and its negation from multiple sources, aggregate it, and use LLM log-probabilities to expose cross-source disagreement.

February 21, 2026 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.5

Cost Impact Score

0.4

Citation Count

0

Authors

Md Badsha Biswas, Ozlem Uzuner

Links

Abstract / PDF

Why It Matters For Business

Aggregating evidence from multiple sources and retrieving negated queries expands coverage and surfaces disagreements, improving zero-shot claim checks and making automated decisions more transparent.

Summary TLDR

This paper builds an open-domain claim verification pipeline that (1) generates a claim's explicit negation, (2) retrieves sentence-level evidence for both forms from Wikipedia, PubMed, and Google, (3) deduplicates and merges per-source sentences into a single evidence set, and (4) asks zero-shot LLMs to verify the claim. Negated retrieval and multi-source aggregation give consistent zero-shot gains (typical +2–10% accuracy, +2–8% macro F1 on evaluated datasets). The system also reports per-source label log-probabilities so users can see when sources disagree. Code is available.
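Steps (1) and (2) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `negate_claim` is a hypothetical template stand-in for whatever negation generator the paper uses, `retrieve` is a toy word-overlap ranker over in-memory sentences rather than a real Wikipedia, PubMed, or Google client, and the example sentences are made up.

```python
import re
from typing import Dict, List


def tokens(text: str) -> set:
    """Lowercased alphanumeric word set (toy tokenizer, an assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def negate_claim(claim: str) -> str:
    """Hypothetical template negation; a real system would likely use an LLM."""
    return "It is not true that " + claim


def retrieve(query: str, corpus: List[str], k: int = 3) -> List[str]:
    """Rank sentences by word overlap with the query and keep the top k."""
    q = tokens(query)
    scored = [(len(q & tokens(s)), s) for s in corpus]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: -pair[0])  # stable: ties keep corpus order
    return [s for _, s in scored[:k]]


def dual_retrieve(claim: str, sources: Dict[str, List[str]], k: int = 3):
    """Retrieve evidence for the claim and for its negation from every source."""
    negated = negate_claim(claim)
    return {
        name: {
            "original": retrieve(claim, corpus, k),
            "negated": retrieve(negated, corpus, k),
        }
        for name, corpus in sources.items()
    }


if __name__ == "__main__":
    # Made-up sentences standing in for retrieved source text.
    sources = {
        "wikipedia": ["Vitamin C prevents scurvy", "Paris is in France"],
        "pubmed": ["Vitamin C does not cure colds in adults", "Zinc shortens colds"],
    }
    for name, hits in dual_retrieve("Vitamin C cures colds", sources, k=2).items():
        print(name, hits)
```

Keeping the two result lists separate per source makes it easy to see which sentences only surfaced because of the negated query.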

Problem Statement

Most automated fact-checkers rely on a single knowledge source and only retrieve evidence that supports the claim. That narrows coverage and hides disagreements between sources. We need a practical method that finds both supporting and contradicting evidence across multiple sources and shows when sources disagree.

Main Contribution

A dual-perspective retrieval pipeline that generates a claim's explicit negation and retrieves evidence for both the original and negated claim.

A multi-source aggregation method that deduplicates and ranks sentence-level evidence from Wikipedia, PubMed, and Google into a single evidence set per claim.

A transparency mechanism using per-source label log-probabilities (LLM logprobs) to quantify and visualize inter-source agreement and uncertainty.

Zero-shot evaluation of five LLMs across four benchmarks showing consistent, practical gains from negation retrieval and source aggregation.
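The aggregation contribution (deduplicate, then rank into one evidence set) might look like the sketch below. The normalization rule and the word-overlap ranking score are assumptions for illustration; the summary does not say how the paper ranks sentences, and `merge_evidence` is a hypothetical name.

```python
import re
from typing import Dict, List, Tuple


def normalize(sentence: str) -> str:
    """Canonical form used for duplicate detection: lowercase words only."""
    return " ".join(re.findall(r"[a-z0-9]+", sentence.lower()))


def merge_evidence(
    claim: str, per_source: Dict[str, List[str]], top_k: int = 5
) -> List[Tuple[str, str]]:
    """Merge per-source sentences into one deduplicated, ranked evidence set.

    Returns (source, sentence) pairs. The first source to contribute a
    sentence keeps the credit; later near-verbatim copies are dropped.
    """
    claim_tokens = set(normalize(claim).split())
    seen = set()
    scored = []
    for source, sentences in per_source.items():
        for sentence in sentences:
            key = normalize(sentence)
            if not key or key in seen:
                continue
            seen.add(key)
            overlap = len(claim_tokens & set(key.split()))  # assumed relevance score
            scored.append((overlap, source, sentence))
    scored.sort(key=lambda item: -item[0])  # stable sort keeps insertion order on ties
    return [(source, sentence) for _, source, sentence in scored[:top_k]]


if __name__ == "__main__":
    evidence = {
        "wikipedia": ["Vitamin C prevents scurvy.", "The sky is blue."],
        "google": ["vitamin c prevents scurvy", "Vitamin C does not cure colds."],
    }
    for source, sentence in merge_evidence("Vitamin C cures colds", evidence, top_k=2):
        print(source, "|", sentence)
```

Keeping the source label on every merged sentence preserves the provenance needed later for per-source scoring.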

Key Findings

Retrieving both the claim and its negation (dual-perspective) improves zero-shot verification.

Numbers: +2–10% accuracy; +2–8% macro F1 (typical gains across datasets)

Aggregating Wikipedia, PubMed, and Google often outperforms any single source.

Numbers: Example: SciFact, Llama 3.3 70B: merged 0.610 vs. Wikipedia only 0.430 (+41.9% relative)

Per-source label log-probabilities (LLM logprobs) correlate with inter-source agreement and reveal uncertainty.

Numbers: KDEs show sharper peaks under unanimity and broader dispersion under disagreement (Averitec figure)
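To make the log-probability finding concrete, the sketch below turns per-source label log-probs into normalized probabilities and a simple agreement score. The label names (`SUPPORTS`, `REFUTES`), the numeric values, and the function names are all made up for illustration; the paper's exact labels and agreement measure are not given in this summary.

```python
import math
from typing import Dict


def normalize_logprobs(label_logprobs: Dict[str, float]) -> Dict[str, float]:
    """Softmax over candidate-label log-probs so they sum to 1."""
    peak = max(label_logprobs.values())  # subtract the max for numerical stability
    exps = {label: math.exp(lp - peak) for label, lp in label_logprobs.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def source_agreement(per_source: Dict[str, Dict[str, float]]) -> dict:
    """Summarize how many sources back the majority label."""
    top = {src: max(lps, key=lps.get) for src, lps in per_source.items()}
    counts: Dict[str, int] = {}
    for label in top.values():
        counts[label] = counts.get(label, 0) + 1
    majority = max(counts, key=counts.get)
    return {
        "top_labels": top,
        "majority": majority,
        "agreement": counts[majority] / len(top),
        "unanimous": len(counts) == 1,
    }


if __name__ == "__main__":
    # Hypothetical log-probs for one claim, scored once per source.
    scores = {
        "wikipedia": {"SUPPORTS": -0.1, "REFUTES": -2.4},
        "pubmed": {"SUPPORTS": -0.3, "REFUTES": -1.4},
        "google": {"SUPPORTS": -1.9, "REFUTES": -0.2},
    }
    print(source_agreement(scores))
    print(normalize_logprobs(scores["google"]))
```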

Results

Accuracy

Value: SciFact, Llama 3.3 70B, merged (Wikipedia + PubMed + Google) 0.610

Baseline: Wikipedia only 0.430

Accuracy

Value: SciFact, Llama 3.3 70B, Google only: 0.550 → 0.607 with original + negated retrieval

Baseline: Original-only retrieval 0.550

Typical relative gains

Value: +2–10% accuracy; +2–8% macro F1 (across datasets and LLMs)

Baseline: Original-only retrieval
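The relative gains for the two SciFact rows can be checked with a few lines of arithmetic. The helper name `relative_gain` is only for illustration; the four accuracy values are the ones reported above.

```python
def relative_gain(value: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (value - baseline) / baseline * 100


# Merged sources vs. Wikipedia only (SciFact).
merged_vs_wikipedia = relative_gain(0.610, 0.430)

# Original + negated retrieval vs. original-only retrieval (SciFact, Google only).
dual_vs_original = relative_gain(0.607, 0.550)

print(f"merged vs. Wikipedia: +{merged_vs_wikipedia:.1f}% relative")
print(f"dual vs. original-only: +{dual_vs_original:.1f}% relative")
```

The first figure reproduces the +41.9% quoted under Key Findings. In absolute terms the two gains are 18.0 and 5.7 accuracy points.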

Who Should Care

What To Try In 7 Days

Add an explicit negated-query stage: generate a simple negation for each claim and run retrieval for both forms.

Pull sentence-level evidence from at least two diverse sources (e.g., Wikipedia + web search) and deduplicate before calling an LLM.

Log and visualize per-source label log-probs to flag claims with source disagreement for human review.
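For the third item, one possible triage rule is sketched below: send a claim to human review when sources pick different top labels, or when any source's top two labels are separated by less than a margin. The rule, the `min_margin` default, the label names, and the log-prob values are all assumptions, not something the paper prescribes.

```python
from typing import Dict, List

SourceScores = Dict[str, Dict[str, float]]  # source -> label -> log-prob


def needs_review(per_source: SourceScores, min_margin: float = 1.0) -> bool:
    """True if sources disagree on the top label or any source is near a tie."""
    top_labels = set()
    for label_logprobs in per_source.values():
        ranked = sorted(label_logprobs.values(), reverse=True)
        top_labels.add(max(label_logprobs, key=label_logprobs.get))
        if len(ranked) > 1 and ranked[0] - ranked[1] < min_margin:
            return True  # low-confidence source
    return len(top_labels) > 1  # cross-source disagreement


def review_queue(claims: Dict[str, SourceScores], min_margin: float = 1.0) -> List[str]:
    """Claims that should be routed to a human, in input order."""
    return [claim for claim, scores in claims.items() if needs_review(scores, min_margin)]


if __name__ == "__main__":
    # Hypothetical scores for three claims.
    claims = {
        "claim A": {
            "wikipedia": {"SUPPORTS": -0.1, "REFUTES": -2.5},
            "google": {"SUPPORTS": -0.2, "REFUTES": -1.8},
        },
        "claim B": {
            "wikipedia": {"SUPPORTS": -0.1, "REFUTES": -2.5},
            "google": {"SUPPORTS": -2.0, "REFUTES": -0.15},
        },
        "claim C": {"wikipedia": {"SUPPORTS": -0.6, "REFUTES": -0.8}},
    }
    print(review_queue(claims))
```

Because confidence magnitudes differ across models, a margin threshold like this would need tuning per model before use.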

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Context window limits can truncate multi-document evidence and harm verification.
  • No time-aware retrieval: outdated evidence can mislead time-sensitive claims.
  • Zero-shot classification struggles on noisy, adversarial, or minority-label cases (e.g., LIAR).
  • LLM confidence magnitudes vary across models; cross-model comparisons need calibration.

When Not To Use

  • For high-stakes decisions without calibration and human review.
  • When evidence requires long multi-document chains exceeding context windows.
  • For time-sensitive claims unless retrieval is time-aware and aligned.

Failure Modes

  • LLM hallucination or label mis-mapping despite relevant evidence.
  • Outdated or misleading web evidence yields incorrect veracity.
  • Truncated context leads to missed supporting/refuting passages.
  • Source bias: dominant source content skews the merged evidence set.
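One possible mitigation for the truncation and source-bias failure modes is to fill the context budget round-robin across sources instead of concatenating them. The sketch below is an assumption, not part of the paper: `pack_evidence` is a hypothetical helper, token cost is approximated by whitespace word count, and each source's sentences are assumed to arrive ranked best first.

```python
from typing import Dict, List, Tuple


def pack_evidence(per_source: Dict[str, List[str]], max_tokens: int) -> List[Tuple[str, str]]:
    """Pick sentences in rounds, one per source per round, within a token budget.

    A sentence that does not fit is skipped, so a shorter sentence from a
    later round can still use the remaining budget.
    """
    queues = {source: list(sentences) for source, sentences in per_source.items()}
    packed: List[Tuple[str, str]] = []
    used = 0
    while any(queues.values()):
        for source, queue in queues.items():
            if not queue:
                continue
            sentence = queue.pop(0)
            cost = len(sentence.split())  # crude token estimate (assumption)
            if used + cost <= max_tokens:
                packed.append((source, sentence))
                used += cost
    return packed


if __name__ == "__main__":
    ranked = {
        "wikipedia": ["a b c", "d e f", "g h i"],
        "google": ["j k"],
    }
    print(pack_evidence(ranked, max_tokens=8))
```

With this ordering, a source that returns many sentences cannot crowd out the top sentence of a smaller source.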

Core Entities

Models

  • Llama 3.3 70B
  • Llama 3.1 405B
  • Mistral-Large
  • Qwen 2.5
  • Phi-4

Metrics

  • Accuracy
  • Precision
  • Recall
  • Macro F1
  • label log-probability
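For reference, the classification metrics above can be computed from scratch as in the sketch below. Macro F1 is the unweighted mean of per-label F1, so minority labels count as much as majority ones. The label names and toy predictions are made up, and the paper's own evaluation code may differ in details such as how empty classes are handled.

```python
from typing import List, Tuple


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true: List[str], y_pred: List[str], label: str) -> Tuple[float, float, float]:
    """One-vs-rest precision, recall, and F1 for a single label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-label F1 over all labels seen in gold or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(precision_recall_f1(y_true, y_pred, label)[2] for label in labels) / len(labels)


if __name__ == "__main__":
    gold = ["SUPPORTS", "SUPPORTS", "SUPPORTS", "REFUTES"]
    pred = ["SUPPORTS", "SUPPORTS", "REFUTES", "REFUTES"]
    print(accuracy(gold, pred), macro_f1(gold, pred))
```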

Datasets

  • SciFact
  • PubHealth
  • Averitec
  • LIAR

Benchmarks

  • SciFact
  • PubHealth
  • Averitec
  • LIAR