WikiContradict: 253 human-curated Wikipedia contradiction QA pairs to test LLMs under RAG

Overview

Decision Snapshot: Needs Validation

The benchmark is ready for evaluation use; results reliably show weakness in two-passage conflict handling but are limited to English Wikipedia text and need judge calibration before high-stakes deployment.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/6

Reproducibility

Status: Partial assets available

Open source: Partial

License: MIT

At A Glance

Cost impact: 30%

Production readiness: 60%

Novelty: 45%

Authors

Yufang Hou, Alessandra Pascale, Javier Carnerero-Cano, Tigran Tchrakian, Radu Marinescu, Elizabeth Daly, Inkit Padhi, Prasanna Sattigeri

Links

Abstract / PDF / Data

Why It Matters For Business

If your product uses RAG or multiple information sources, models can ignore conflicting documents or prefer their internal knowledge. Explicit conflict detection and instructions to surface disagreement improve answers and reduce wrong-but-plausible outputs.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

The authors release WikiContradict: 253 human-annotated QA instances built from Wikipedia editor 'inconsistent' tags. Each instance pairs two real Wikipedia passages that contradict each other with a question that has two valid answers. They benchmark several closed and open LLMs under RAG (single-passage and two-passage) and run a human evaluation (≈1,200 judged samples, >3,500 judgments reported) showing that models typically ignore one passage or favor their internal knowledge. Prompting models to pay explicit attention to contradictions can help (e.g., Llama-3-70b-instruct rose from 10.4% to 43.8% correct). They also build WikiContradictEval, a few-shot LLM judge that reaches ~0.80 F-score vs. human labels, to scale evaluation.

Problem Statement

LLMs augmented with retrieved documents (RAG) can face multiple equally credible sources that disagree. Existing conflict datasets are synthetic or focus on the model's memory vs. context. We lack a compact, human-verified benchmark that measures how models handle real-world inter-context contradictions (two credible passages that imply different answers).

Main Contribution

WikiContradict: a curated benchmark of 253 QA instances from Wikipedia editorial 'inconsistent' tags; each instance includes two contradictory passages and two source-specific answers.

Human evaluation protocol and five prompt templates to test internal knowledge, RAG with one passage, and RAG with two contradictory passages; collected ≈1,200 human-judged samples for analysis.
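The instance structure described above can be sketched as a small record type. This is only an illustration: the field names are hypothetical and may not match the column names in the released data, and the example content is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContradictionInstance:
    """One benchmark item: a question plus two passages that disagree.

    Field names are illustrative assumptions, not the dataset's real schema.
    """
    question: str
    passage_1: str
    passage_2: str
    answer_1: str  # answer supported by passage_1
    answer_2: str  # answer supported by passage_2

    def valid_answers(self) -> tuple[str, str]:
        # Both answers count as valid because each is backed by one source.
        return (self.answer_1, self.answer_2)


# Made-up content, for illustration only (not taken from the dataset).
example = ContradictionInstance(
    question="In what year was the bridge completed?",
    passage_1="The bridge was completed in 1890.",
    passage_2="Construction of the bridge finished in 1892.",
    answer_1="1890",
    answer_2="1892",
)
```

Keeping both source-specific answers on the record makes it straightforward to check later whether a model response reflects one source, both, or neither.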

Key Findings

Models struggle to reflect contradictory context when given two conflicting passages.

Numbers: Correct rates on two-passage prompts are often ≤ 44% (Llama-3-70b-instruct improved to 43.8%).

Practical Use: Don't assume RAG will make LLM answers robust to conflicting sources; add explicit instructions or conflict-handling steps in production pipelines.

Evidence Ref: Human eval, Section 4; Table 2; prompt 4 vs 5

Explicitly prompting models to consider contradictions can substantially increase correct answers.

Numbers: Llama-3-70b-instruct: 10.4% → 43.8% correct (prompt 4 → prompt 5).

Practical Use: Add a short prompt that asks the model to list or acknowledge conflicting answers when you pass multiple sources.

Evidence Ref: Section 4; Table 2; prompt template comparison
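A minimal sketch of the two prompt styles being compared: a plain two-passage prompt and one with an explicit conflict instruction. The wording below is a hypothetical stand-in, not the paper's actual prompt templates, and the function and parameter names are assumptions for illustration.

```python
def build_two_passage_prompt(question: str, passage_1: str, passage_2: str,
                             flag_conflicts: bool = False) -> str:
    """Build a RAG prompt from two passages, optionally asking for conflicts.

    The wording is a hypothetical stand-in, not the paper's prompt templates.
    """
    parts = [
        "Answer the question using only the context below.",
        f"Context 1: {passage_1}",
        f"Context 2: {passage_2}",
    ]
    if flag_conflicts:
        # The extra instruction is the only difference between the two styles.
        parts.append(
            "The contexts may contradict each other. If they do, say so and "
            "give the answer supported by each context."
        )
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n".join(parts)
```

For scale, the reported gain for Llama-3-70b-instruct is 43.8 - 10.4 = 33.4 percentage points between the two prompt conditions, so the instruction is worth testing even though the resulting rate stays below 50%.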

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Dataset size	253 instances	—	—	—	Section 3.3; Table 1	Table 1
Implicit contradictions	36%	—	—	WikiContradict	Section 3.3; Table 1	Table 1

What To Try In 7 Days

Run simple two-passage tests from WikiContradict to measure how your models treat conflicts.

Add a short prompt that asks the model to list differing answers or state there is a conflict before answering.

Use an LLM judge (few-shot) to scale evaluation and flag models that reconcile contradictions without warning.
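Before relying on an LLM judge, it helps to measure its agreement with a small set of human labels. The sketch below computes precision, recall, F1, accuracy, and Cohen's kappa for binary labels in plain Python. Treating the labels as binary (True = response judged correct) is a simplifying assumption; the paper's labeling scheme may be finer-grained.

```python
def agreement_metrics(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compare binary judge labels to human labels (True = judged correct)."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(human)
    tp = sum(h and j for h, j in zip(human, judge))
    fp = sum((not h) and j for h, j in zip(human, judge))
    fn = sum(h and (not j) for h, j in zip(human, judge))
    tn = n - tp - fp - fn

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / n

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)
    p_no = ((fp + tn) / n) * ((fn + tn) / n)
    p_chance = p_yes + p_no
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance < 1 else 0.0

    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "kappa": kappa}
```

Reporting kappa alongside F1 is useful because F1 alone can look high when one label dominates the sample.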

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: MIT

Data URLs

https://ibm.biz/wikicontradict

Risks & Boundaries

Limitations

English-only dataset; no multilingual coverage.

Built from Wikipedia maintenance tags, which may bias the dataset toward certain contradiction types.

When Not To Use

Evaluating multilingual or non-English conflict handling.

Measuring multimodal contradiction resolution (images + text).

Failure Modes

Models ignore one of the contradictory passages and answer from a single source.

Models prefer their internal parametric knowledge over provided context ('stubborn' behavior).
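As a rough first-pass screen for these failure modes, a response can be checked for which of the two source-specific answers it mentions. This is a crude substring heuristic and an assumption of this sketch, not the paper's evaluation method; it misses paraphrases and cannot tell a refusal from an answer drawn from internal knowledge.

```python
def classify_response(response: str, answer_1: str, answer_2: str) -> str:
    """Label a response by which source-specific answers it mentions.

    Crude case-insensitive substring matching; illustrative only.
    """
    text = response.casefold()
    has_1 = answer_1.casefold() in text
    has_2 = answer_2.casefold() in text
    if has_1 and has_2:
        return "both_answers"   # response reflects both passages
    if has_1 or has_2:
        return "single_source"  # one passage appears to be ignored
    return "neither"            # possibly internal knowledge or a refusal
```

Responses labeled "single_source" or "neither" are good candidates for human or LLM-judge review rather than being counted as failures automatically.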

Core Entities

Models

Mistral-7b-instruct, Mixtral-8x7b-instruct, Mistral-7b-inst, Llama-2-70b-chat, Llama-2-13b-chat, Llama-3-70b-instruct, Llama-3-8b-instruct, Llama-3-70b-inst, GPT-4-turbo-2024-04-09, GPT-4o-2024-05-13, Flan-ul2

Metrics

F1, Precision, Recall, Accuracy, Cohen's kappa

Datasets

WikiContradict, Wikipedia

Benchmarks

WikiContradict, FreshLLM, TruthfulQA
