Retrieve both a claim and its negation from multiple sources, aggregate the evidence, and use LLM log-probs to expose cross-source disagreement.

Overview

Decision Snapshot: Needs Validation

The approach is practical and reproducible: it combines standard retrieval tools with LLM zero-shot scoring and adds a useful confidence signal, but it needs calibration and better time-aware retrieval before high-stakes use.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 40%

Production readiness: 50%

Novelty: 50%

Authors

Md Badsha Biswas, Ozlem Uzuner

Links

Abstract / PDF / Code

Why It Matters For Business

Aggregating evidence from multiple sources and retrieving negated queries expands coverage and surfaces disagreements, improving zero-shot claim checks and making automated decisions more transparent.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist

Summary TLDR

This paper builds an open-domain claim verification pipeline that (1) generates a claim's explicit negation, (2) retrieves sentence-level evidence for both forms from Wikipedia, PubMed, and Google, (3) deduplicates and merges per-source sentences into a single evidence set, and (4) asks zero-shot LLMs to verify the claim. Negated retrieval and multi-source aggregation give consistent zero-shot gains (typical +2–10% accuracy, +2–8% macro F1 on evaluated datasets). The system also reports per-source label log-probabilities so users can see when sources disagree. Code is available.
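The retrieval half of the pipeline can be sketched as follows. This is a minimal illustration, not the authors' code: the `Retriever` type, `retrieve_dual`, and the stub retrievers are hypothetical names, and the negation is passed in as a plain string because the summary does not say how it is generated.

```python
from typing import Callable, Dict, List

# Hypothetical type: a retriever maps a query string to evidence sentences.
Retriever = Callable[[str], List[str]]


def retrieve_dual(
    claim: str, negation: str, retrievers: Dict[str, Retriever]
) -> Dict[str, List[str]]:
    """Query every source with both the claim and its negation."""
    evidence: Dict[str, List[str]] = {}
    for source, retrieve in retrievers.items():
        sentences: List[str] = []
        for query in (claim, negation):
            sentences.extend(retrieve(query))
        evidence[source] = sentences
    return evidence


# Stub retrievers standing in for real Wikipedia / PubMed clients (assumptions).
def fake_wikipedia(query: str) -> List[str]:
    return [f"wiki sentence about: {query}"]


def fake_pubmed(query: str) -> List[str]:
    return [f"pubmed sentence about: {query}"]


per_source = retrieve_dual(
    "Vitamin C cures the common cold.",
    "Vitamin C does not cure the common cold.",
    {"wikipedia": fake_wikipedia, "pubmed": fake_pubmed},
)
```

Keeping the results keyed by source preserves the information needed later to score each source separately.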

Problem Statement

Most automated fact-checkers rely on a single knowledge source and only retrieve evidence that supports the claim. That narrows coverage and hides disagreements between sources. We need a practical method that finds both supporting and contradicting evidence across multiple sources and shows when sources disagree.

Main Contribution

A dual-perspective retrieval pipeline that generates a claim's explicit negation and retrieves evidence for both the original and negated claim.

A multi-source aggregation method that deduplicates and ranks sentence-level evidence from Wikipedia, PubMed, and Google into a single evidence set per claim.
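One way to picture the aggregation step is the sketch below. It is an assumption-laden illustration: the summary does not specify how sentences are matched or ranked, so exact-match deduplication on normalized text and a Jaccard word-overlap score are swapped in as simple stand-ins, and `normalize`, `overlap_score`, and `merge_evidence` are hypothetical names.

```python
import re
from typing import Dict, List, Tuple


def normalize(sentence: str) -> str:
    """Lowercase and drop punctuation so near-identical sentences compare equal."""
    return " ".join(re.findall(r"[a-z0-9]+", sentence.lower()))


def overlap_score(claim: str, sentence: str) -> float:
    """Jaccard overlap of word sets; a stand-in for the paper's actual ranker."""
    a, b = set(normalize(claim).split()), set(normalize(sentence).split())
    return len(a & b) / len(a | b) if a | b else 0.0


def merge_evidence(
    claim: str, per_source: Dict[str, List[str]], top_k: int = 5
) -> List[Tuple[str, str]]:
    """Deduplicate sentences across sources, then keep the top_k by overlap."""
    seen = set()
    merged: List[Tuple[str, str]] = []
    for source, sentences in per_source.items():
        for sentence in sentences:
            key = normalize(sentence)
            if key and key not in seen:
                seen.add(key)
                merged.append((source, sentence))
    merged.sort(key=lambda item: overlap_score(claim, item[1]), reverse=True)
    return merged[:top_k]


claim = "Aspirin reduces fever."
per_source = {
    "wikipedia": ["Aspirin reduces fever.", "Paris is in France."],
    "google": ["aspirin reduces fever!", "Aspirin can reduce fever in adults."],
}
merged = merge_evidence(claim, per_source)
```

Here the Google sentence that differs only in case and punctuation is dropped as a duplicate, and the unrelated sentence sinks to the end of the merged set.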

Key Findings

Retrieving both the claim and its negation (dual-perspective) improves zero-shot verification.

Numbers: +2–10% accuracy; +2–8% macro F1 (typical gains across datasets)

Practical Use: Add negated-query retrieval to increase the chance of finding refuting evidence and modestly boost verification performance without model fine-tuning.

Evidence Ref: Results section; Tables 1–2

Aggregating Wikipedia, PubMed, and Google often outperforms any single source.

Numbers: Example: SciFact, Llama 70B: merged 0.610 vs. Wikipedia 0.430 (+41.9% relative)

Practical Use: Merge diverse sources to widen coverage; expect meaningful gains on domain-mixed benchmarks, especially when a single source is weak.

Evidence Ref: Table 3 (SciFact; Llama 70B)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	0.610 (merged Wikipedia + PubMed + Google)	0.430 (Wikipedia only)	+0.180 (≈ +41.9% relative)	SciFact (Llama 70B)	Merged vs. per-source numbers	Table 3
Accuracy	0.607 (original + negated queries)	0.550 (original query only)	+0.057 (≈ +10.4% relative)	SciFact (Llama 70B, Google only)	Original vs. original + negated	Table 1
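The absolute and relative deltas in the table follow directly from the reported accuracies. The short check below recomputes them; the helper name `deltas` is ours, and the inputs are the values quoted above.

```python
def deltas(value: float, baseline: float) -> tuple:
    """Return (absolute gain, relative gain in percent) over the baseline."""
    absolute = value - baseline
    relative_pct = 100.0 * absolute / baseline
    return round(absolute, 3), round(relative_pct, 1)


# Merged sources vs. Wikipedia only (Table 3 values quoted above).
merged_vs_wikipedia = deltas(0.610, 0.430)

# Original + negated queries vs. original only (Table 1 values quoted above).
negated_vs_original = deltas(0.607, 0.550)

print(merged_vs_wikipedia)   # (0.18, 41.9)
print(negated_vs_original)   # (0.057, 10.4)
```

Note that the relative figures are gains over the baseline accuracy, not percentage-point differences: the merged setup adds 18 points, which is about 41.9% of the 0.430 baseline.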

What To Try In 7 Days

Add an explicit negated-query stage: generate a simple negation for each claim and run retrieval for both forms.

Pull sentence-level evidence from at least two diverse sources (e.g., Wikipedia + web search) and deduplicate before calling an LLM.

Log and visualize per-source label log-probs to flag claims with source disagreement for human review.
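For the last item, a disagreement flag could look like the sketch below. It assumes you already have one log-probability per label for each source; the label names, the 0.6 confidence threshold, and the function names are illustrative assumptions rather than details from the paper.

```python
import math
from typing import Dict, Tuple


def to_probs(logprobs: Dict[str, float]) -> Dict[str, float]:
    """Renormalize label log-probs into a distribution (softmax over labels)."""
    peak = max(logprobs.values())
    exp = {label: math.exp(lp - peak) for label, lp in logprobs.items()}
    total = sum(exp.values())
    return {label: value / total for label, value in exp.items()}


def flag_disagreement(
    per_source_logprobs: Dict[str, Dict[str, float]],
    min_confidence: float = 0.6,  # assumed threshold; tune on your own data
) -> dict:
    """Flag a claim when sources pick different labels or any source is unsure."""
    top: Dict[str, Tuple[str, float]] = {}
    for source, logprobs in per_source_logprobs.items():
        probs = to_probs(logprobs)
        label = max(probs, key=probs.get)
        top[source] = (label, probs[label])
    labels = {label for label, _ in top.values()}
    low_confidence = any(p < min_confidence for _, p in top.values())
    return {"top_labels": top, "needs_review": len(labels) > 1 or low_confidence}


# Hypothetical log-probs: Wikipedia leans SUPPORTED, PubMed leans REFUTED.
scores = {
    "wikipedia": {"SUPPORTED": -0.2, "REFUTED": -1.8},
    "pubmed": {"SUPPORTED": -2.0, "REFUTED": -0.15},
}
result = flag_disagreement(scores)
```

In this example the two sources are each fairly confident but choose opposite labels, so the claim is routed to human review.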

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://anonymous.4open.science/r/Automated-FactVerification-system-0BF7/

Risks & Boundaries

Limitations

Context window limits can truncate multi-document evidence and harm verification.

No time-aware retrieval: outdated evidence can mislead time-sensitive claims.

When Not To Use

For high-stakes decisions without calibration and human review.

When evidence requires long multi-document chains exceeding context windows.

Failure Modes

LLM hallucination or label mis-mapping despite relevant evidence.

Outdated or misleading web evidence yields incorrect veracity.

Core Entities

Models

Llama 3.3 70B, Llama 3.1 405B, Mistral-Large, Qwen 2.5, Phi-4

Metrics

Accuracy, Precision, Recall, macro F1, label log-probability

Datasets

SciFact, PubHealth, Averitec, LIAR

Benchmarks

SciFact, PubHealth, Averitec, LIAR

