Overview
Production Readiness
0.5
Novelty Score
0.5
Cost Impact Score
0.4
Citation Count
0
Why It Matters For Business
Aggregating evidence from multiple sources and retrieving evidence for each claim's negation expands coverage and surfaces disagreements between sources, which improves zero-shot claim verification and makes automated decisions more transparent.
Summary TLDR
This paper builds an open-domain claim verification pipeline that (1) generates a claim's explicit negation, (2) retrieves sentence-level evidence for both forms from Wikipedia, PubMed, and Google, (3) deduplicates and merges per-source sentences into a single evidence set, and (4) asks zero-shot LLMs to verify the claim against that set. Negated retrieval and multi-source aggregation give consistent zero-shot gains, typically +2–10% accuracy and +2–8% macro F1 on the evaluated datasets. The system also reports per-source label log-probabilities so users can see when sources disagree. Code is available.
Problem Statement
Most automated fact-checkers rely on a single knowledge source and only retrieve evidence that supports the claim. That narrows coverage and hides disagreements between sources. We need a practical method that finds both supporting and contradicting evidence across multiple sources and shows when sources disagree.
Main Contribution
A dual-perspective retrieval pipeline that generates a claim's explicit negation and retrieves evidence for both the original and negated claim.
A multi-source aggregation method that deduplicates and ranks sentence-level evidence from Wikipedia, PubMed, and Google into a single evidence set per claim.
A transparency mechanism using per-source label log-probabilities (LLM logprobs) to quantify and visualize inter-source agreement and uncertainty.
Zero-shot evaluation of five LLMs across four benchmarks showing consistent, practical gains from negation retrieval and source aggregation.
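The deduplicate-and-rank step can be illustrated with a minimal sketch. It assumes each source returns (sentence, relevance score) pairs and that duplicates are detected by normalized exact match; the summary does not specify how the paper scores or matches sentences, so `normalize`, `merge_evidence`, the scores, and the sample sentences are all hypothetical.

```python
import re
from typing import Dict, List, Tuple

def normalize(sentence: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for duplicate detection."""
    cleaned = re.sub(r"[^\w\s]", "", sentence.lower())
    return " ".join(cleaned.split())

def merge_evidence(
    per_source: Dict[str, List[Tuple[str, float]]], top_k: int = 5
) -> List[Tuple[str, str, float]]:
    """Merge per-source (sentence, score) lists into one ranked, deduplicated list.

    Returns (source, sentence, score) tuples. When two sentences normalize to
    the same string, only the higher-scoring copy is kept.
    """
    best: Dict[str, Tuple[str, str, float]] = {}
    for source, hits in per_source.items():
        for sentence, score in hits:
            key = normalize(sentence)
            if key not in best or score > best[key][2]:
                best[key] = (source, sentence, score)
    ranked = sorted(best.values(), key=lambda item: item[2], reverse=True)
    return ranked[:top_k]

# Hypothetical retrieval results; the scores are illustrative only.
hits = {
    "wikipedia": [("Aspirin reduces fever.", 0.81), ("Aspirin is an NSAID.", 0.64)],
    "pubmed": [("aspirin reduces fever", 0.77), ("Aspirin may cause bleeding.", 0.70)],
}
merged = merge_evidence(hits, top_k=3)
for source, sentence, score in merged:
    print(f"{score:.2f}  [{source}]  {sentence}")
```

Exact matching after normalization only catches near-verbatim duplicates; a real system would probably need fuzzy or embedding-based matching to collapse paraphrases.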
Key Findings
Retrieving both the claim and its negation (dual-perspective) improves zero-shot verification.
Aggregating Wikipedia, PubMed, and Google often outperforms any single source.
Per-source label log-probabilities (LLM logprobs) correlate with inter-source agreement and reveal uncertainty.
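One way to turn per-source label log-probabilities into an agreement signal is sketched below. The label names, the renormalization over labels, and the majority-share agreement measure are assumptions for illustration, not the paper's stated method.

```python
import math
from typing import Dict, Tuple

def to_probs(logprobs: Dict[str, float]) -> Dict[str, float]:
    """Renormalize label log-probabilities into a distribution over the labels."""
    peak = max(logprobs.values())  # subtract the max for numerical stability
    exps = {label: math.exp(lp - peak) for label, lp in logprobs.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}

def source_agreement(
    per_source_logprobs: Dict[str, Dict[str, float]]
) -> Tuple[Dict[str, str], float]:
    """Return each source's top label and the share of sources backing the majority label."""
    top: Dict[str, str] = {}
    for source, logprobs in per_source_logprobs.items():
        probs = to_probs(logprobs)
        top[source] = max(probs, key=probs.get)
    counts: Dict[str, int] = {}
    for label in top.values():
        counts[label] = counts.get(label, 0) + 1
    agreement = max(counts.values()) / len(top)
    return top, agreement

# Hypothetical log-probabilities for one claim, one entry per evidence source.
example = {
    "wikipedia": {"SUPPORTED": -0.2, "REFUTED": -2.0, "NOT ENOUGH INFO": -3.0},
    "pubmed": {"SUPPORTED": -0.4, "REFUTED": -1.5, "NOT ENOUGH INFO": -2.5},
    "google": {"SUPPORTED": -2.2, "REFUTED": -0.3, "NOT ENOUGH INFO": -2.8},
}
top_labels, agreement = source_agreement(example)
print(top_labels, round(agreement, 2))
```

A low agreement value marks a claim whose sources point to different labels, which is the case the summary suggests surfacing to users.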
Results
Accuracy
Typical relative gains
Who Should Care
What To Try In 7 Days
Add an explicit negated-query stage: generate a simple negation for each claim and run retrieval for both forms.
Pull sentence-level evidence from at least two diverse sources (e.g., Wikipedia + web search) and deduplicate before calling an LLM.
Log and visualize per-source label log-probs to flag claims with source disagreement for human review.
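The first step above, a negated-query stage, can be prototyped in a few lines. This is a rough sketch: the rule-based `negate` helper is a deliberately naive stand-in (an LLM prompt would likely give better negations), and `toy_retriever` with its in-memory corpus is a hypothetical placeholder for real Wikipedia or web search calls.

```python
import re
from typing import Callable, Dict, List

AUXILIARIES = ("is", "are", "was", "were", "can", "will", "does", "do", "has", "have")
_AUX_PATTERN = re.compile(r"\b(" + "|".join(AUXILIARIES) + r")\b", flags=re.IGNORECASE)

def negate(claim: str) -> str:
    """Naive negation: add 'not' after the first auxiliary verb, else wrap the claim."""
    match = _AUX_PATTERN.search(claim)
    if match:
        return claim[:match.end()] + " not" + claim[match.end():]
    return "It is not true that " + claim

def toy_retriever(corpus: List[str], top_k: int = 1) -> Callable[[str], List[str]]:
    """Build a keyword-overlap retriever over an in-memory corpus (placeholder only)."""
    def search(query: str) -> List[str]:
        terms = set(re.findall(r"\w+", query.lower()))
        overlap = lambda s: len(terms & set(re.findall(r"\w+", s.lower())))
        return sorted(corpus, key=overlap, reverse=True)[:top_k]
    return search

def retrieve_both(
    claim: str, retrievers: Dict[str, Callable[[str], List[str]]]
) -> Dict[str, Dict[str, List[str]]]:
    """Run every retriever on the claim and on its negation."""
    queries = {"original": claim, "negated": negate(claim)}
    return {
        form: {name: search(query) for name, search in retrievers.items()}
        for form, query in queries.items()
    }

corpus = [
    "Vitamin C is effective against colds in some trials.",
    "Vitamin C is not effective against colds according to a review.",
    "Paris is the capital of France.",
]
results = retrieve_both(
    "Vitamin C is effective against colds.", {"toy": toy_retriever(corpus)}
)
print(results)
```

With this toy corpus the original query surfaces the supporting sentence and the negated query surfaces the contradicting one, which is the coverage effect the negation stage is meant to provide.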
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Context window limits can truncate multi-document evidence and harm verification.
- No time-aware retrieval: outdated evidence can mislead the verification of time-sensitive claims.
- Zero-shot classification struggles on noisy, adversarial, or minority-label cases (e.g., LIAR).
- LLM confidence magnitudes vary across models; cross-model comparisons need calibration.
When Not To Use
- For high-stakes decisions without calibration and human review.
- When evidence requires long multi-document chains exceeding context windows.
- For time-sensitive claims unless retrieval is time-aware and aligned.
Failure Modes
- LLM hallucination or label mis-mapping despite relevant evidence.
- Outdated or misleading web evidence yields incorrect veracity labels.
- Truncated context leads to missed supporting/refuting passages.
- Source bias: dominant source content skews the merged evidence set.
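Two of the failure modes above, truncated context and a dominant source, can be partly mitigated when packing the prompt. The sketch below is one possible approach, not something the paper describes: it picks sentences round-robin across sources under a word budget. `pack_evidence` is a hypothetical helper, and word count is only a rough proxy for model tokens.

```python
from itertools import zip_longest
from typing import Dict, List, Tuple

def pack_evidence(per_source: Dict[str, List[str]], max_words: int) -> List[Tuple[str, str]]:
    """Pick sentences round-robin across sources until the word budget is spent.

    Each per-source list is assumed to be ranked best-first already.
    """
    packed: List[Tuple[str, str]] = []
    used = 0
    sources = list(per_source.keys())
    # Each tier holds the i-th ranked sentence of every source (None if exhausted).
    for tier in zip_longest(*per_source.values()):
        for source, sentence in zip(sources, tier):
            if sentence is None:
                continue
            cost = len(sentence.split())
            if used + cost > max_words:
                continue  # skip what does not fit; a shorter sentence may still fit
            packed.append((source, sentence))
            used += cost
    return packed

# Hypothetical ranked evidence per source.
ranked = {
    "wikipedia": ["Aspirin reduces fever.", "Aspirin was first synthesized in 1897."],
    "google": ["Aspirin lowers fever.", "Aspirin can irritate the stomach lining."],
}
selected = pack_evidence(ranked, max_words=6)
print(selected)
```

Taking one sentence per source per round keeps any single source from filling the budget before the others are represented.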
Core Entities
Models
- Llama 3.3 70B
- Llama 3.1 405B
- Mistral-Large
- Qwen 2.5
- Phi-4
Metrics
- Accuracy
- Precision
- Recall
- Macro F1
- Label log-probability
Datasets
- SciFact
- PubHealth
- AVeriTeC
- LIAR
Benchmarks
- SciFact
- PubHealth
- AVeriTeC
- LIAR

