Overview
The benchmark is ready for evaluation use; results reliably show weakness in two-passage conflict handling but are limited to English Wikipedia text and need judge calibration before high-stakes deployment.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.88
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/6
Reproducibility
Status: Partial assets available
Open source: Partial
License: MIT
At A Glance
Cost impact: 30%
Production readiness: 60%
Novelty: 45%
Why It Matters For Business
If your product uses RAG or multiple information sources, models can ignore conflicting documents or prefer their internal knowledge. Explicit conflict detection and instructions to attend to contradictions improve answers and reduce wrong-but-plausible outputs.
Who Should Care
Summary TLDR
The authors release WikiContradict: 253 human-annotated QA instances built from Wikipedia editor 'inconsistent' tags. Each instance pairs two real Wikipedia passages that contradict each other with a question that has two valid answers. They benchmark several closed and open LLMs under RAG (single passage and two-passage) and run a human evaluation (≈1,200 judged samples, >3,500 judgments reported) showing that models typically ignore one passage or favor their internal knowledge. Prompting models to pay explicit attention to contradictions can help (e.g., Llama-3-70b-instruct rose from 10.4% to 43.8% correct). They also build WikiContradictEval, a few-shot LLM judge that reaches ~0.80 F-score vs. human labels, to scale evaluation.
Problem Statement
LLMs augmented with retrieved documents (RAG) can face multiple equally credible sources that disagree. Existing conflict datasets are synthetic or focus on the model's memory vs. context. We lack a compact, human-verified benchmark that measures how models handle real-world inter-context contradictions (two credible passages that imply different answers).
Main Contribution
WikiContradict: a curated benchmark of 253 QA instances from Wikipedia editorial 'inconsistent' tags; each instance includes two contradictory passages and two source-specific answers.
Human evaluation protocol and five prompt templates to test internal knowledge, RAG with one passage, and RAG with two contradictory passages; collected ≈1,200 human-judged samples for analysis.
Key Findings
Models struggle to reflect contradictory context when given two conflicting passages.
Explicitly prompting models to consider contradictions can substantially increase correct answers.
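To make the size of the prompting effect concrete, the short sketch below works through the one explicit before/after pair quoted in the summary (Llama-3-70b-instruct, 10.4% to 43.8% correct). The helper name `improvement` is hypothetical, and the relative ratio is derived here rather than reported by the authors.

```python
def improvement(baseline_pct: float, new_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative ratio new/baseline)."""
    delta_pp = round(new_pct - baseline_pct, 1)
    ratio = round(new_pct / baseline_pct, 2)
    return delta_pp, ratio


# Figures quoted in the summary for Llama-3-70b-instruct.
delta_pp, ratio = improvement(10.4, 43.8)
print(f"+{delta_pp} percentage points, {ratio}x the baseline")
```

Reporting the gain in percentage points alongside the ratio avoids overstating an improvement that starts from a low baseline.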
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Dataset size | 253 instances | — | — | — | Section 3.3; Table 1 | Table 1 |
| Implicit contradictions | 36% | — | — | WikiContradict | Section 3.3; Table 1 | Table 1 |
What To Try In 7 Days
Run simple two-passage tests from WikiContradict to measure how your models treat conflicts.
Add a short prompt that asks the model to list differing answers or state there is a conflict before answering.
Use an LLM judge (few-shot) to scale evaluation and flag models that reconcile contradictions without warning.
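The steps above could be prototyped roughly as follows. This is a minimal sketch under stated assumptions: the `ConflictInstance` field names, the instruction wording, and both helper functions are illustrative and are not the paper's five prompt templates or the WikiContradictEval judge. `build_prompt` assembles a two-passage prompt with an optional conflict instruction, and `f1_score` measures how well a judge's boolean verdicts agree with human labels.

```python
from dataclasses import dataclass


@dataclass
class ConflictInstance:
    """One two-passage test case (field names are hypothetical)."""
    question: str
    passage_a: str
    passage_b: str


# Assumed wording for illustration; not taken from the paper's templates.
CONFLICT_INSTRUCTION = (
    "The passages may contradict each other. If they do, say so and "
    "list each differing answer with its source before answering."
)


def build_prompt(inst: ConflictInstance, explicit: bool = True) -> str:
    """Assemble a two-passage prompt, optionally with the conflict instruction."""
    parts = [
        f"Passage 1: {inst.passage_a}",
        f"Passage 2: {inst.passage_b}",
    ]
    if explicit:
        parts.append(CONFLICT_INSTRUCTION)
    parts.append(f"Question: {inst.question}")
    return "\n\n".join(parts)


def f1_score(judge: list[bool], human: list[bool]) -> float:
    """F1 of judge verdicts against human labels (True = response judged correct)."""
    tp = sum(j and h for j, h in zip(judge, human))
    fp = sum(j and not h for j, h in zip(judge, human))
    fn = sum(h and not j for j, h in zip(judge, human))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Running the same instances with `explicit=False` and `explicit=True` gives a before/after comparison for your own model. Checking judge agreement on a small human-labeled subset first is one way to approach the judge calibration that the overview calls for.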
Reproducibility
Data URLs
Risks & Boundaries
Limitations
English-only dataset; no multilingual coverage.
Built from Wikipedia maintenance tags — may bias toward certain contradiction types.
When Not To Use
Evaluating multilingual or non-English conflict handling.
Measuring multimodal contradiction resolution (images + text).
Failure Modes
Models ignore one of the contradictory passages and answer from a single source.
Models prefer their internal parametric knowledge over provided context ('stubborn' behavior).
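A crude first-pass filter for these failure modes is sketched below. It uses plain substring matching on the two source-specific answers, which is a stand-in technique and not the human or LLM-judge evaluation the authors describe. The function name and the three labels are hypothetical.

```python
def classify_response(response: str, answer_a: str, answer_b: str) -> str:
    """Label a response by which of the two source-specific answers it mentions."""
    text = response.lower()
    has_a = answer_a.lower() in text
    has_b = answer_b.lower() in text
    if has_a and has_b:
        return "both_answers"   # the response reflects both passages
    if has_a or has_b:
        return "single_source"  # one passage appears to have been ignored
    return "neither"            # possibly internal knowledge, or a refusal


print(classify_response("Founded in 1901, though another source says 1903.", "1901", "1903"))
```

Substring matching misses paraphrased answers and cannot tell whether a response that names both answers actually acknowledges the conflict, so flagged cases still need a human or LLM judge.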

