WikiContradict: 253 human-curated Wikipedia contradiction QA pairs to test LLMs under RAG

June 19, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.45

Cost Impact Score

0.3

Citation Count

1

Authors

Yufang Hou, Alessandra Pascale, Javier Carnerero-Cano, Tigran Tchrakian, Radu Marinescu, Elizabeth Daly, Inkit Padhi, Prasanna Sattigeri

Links

Abstract / PDF

Why It Matters For Business

If your product uses RAG or multiple information sources, models can ignore conflicting documents or prefer their internal knowledge. Detecting conflicts explicitly and instructing the model to acknowledge them improves answers and reduces wrong-but-plausible outputs.

Summary TLDR

The authors release WikiContradict: 253 human-annotated QA instances built from Wikipedia editors' 'inconsistent' tags. Each instance pairs two real Wikipedia passages that contradict each other with a question that has two valid answers. They benchmark several closed and open LLMs under RAG (one passage and two passages) and run a human evaluation (≈1,200 judged samples, >3,500 judgments reported) showing that models typically ignore one passage or favor their internal knowledge. Prompting models to pay explicit attention to contradictions can help (e.g., Llama-3-70b-instruct rose from 10.4% to 43.8% correct). They also build WikiContradictEval, a few-shot LLM judge that reaches ~0.80 F-score against human labels, to scale evaluation.

Problem Statement

LLMs augmented with retrieved documents (RAG) can face multiple equally credible sources that disagree. Existing conflict datasets are synthetic or focus on the model's memory vs. context. We lack a compact, human-verified benchmark that measures how models handle real-world inter-context contradictions (two credible passages that imply different answers).

Main Contribution

WikiContradict: a curated benchmark of 253 QA instances from Wikipedia editorial 'inconsistent' tags; each instance includes two contradictory passages and two source-specific answers.

Human evaluation protocol and five prompt templates to test internal knowledge, RAG with one passage, and RAG with two contradictory passages; collected ≈1,200 human-judged samples for analysis.

WikiContradictEval: a few-shot automatic judge (LLM-based) that achieves ~0.80 F-score against human labels, enabling scaled automatic evaluation.
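The instance layout and the three evaluation settings described above (internal knowledge, one passage, two contradictory passages) can be sketched as a small data structure. This is a minimal illustration, not the released schema: the class name, field names, and setting labels are assumptions, and the example instance is an invented toy, not a row from the dataset.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ContradictInstance:
    """Hypothetical record layout; field names are assumptions."""
    question: str
    passage_1: str
    passage_2: str
    answer_1: str  # answer supported by passage_1
    answer_2: str  # answer supported by passage_2
    kind: str      # "explicit" or "implicit"


def contexts_for(inst: ContradictInstance, setting: str) -> List[str]:
    """Return the passages shown to the model in a given setting."""
    if setting == "internal":      # no context: tests parametric knowledge
        return []
    if setting == "single_1":      # RAG with the first passage only
        return [inst.passage_1]
    if setting == "single_2":      # RAG with the second passage only
        return [inst.passage_2]
    if setting == "both":          # RAG with both contradictory passages
        return [inst.passage_1, inst.passage_2]
    raise ValueError(f"unknown setting: {setting!r}")


# Invented toy instance for illustration only.
example = ContradictInstance(
    question="In what year was the bridge completed?",
    passage_1="The bridge was completed in 1890.",
    passage_2="Construction of the bridge finished in 1892.",
    answer_1="1890",
    answer_2="1892",
    kind="explicit",
)

for setting in ("internal", "single_1", "both"):
    print(setting, len(contexts_for(example, setting)))
```

Keeping both source-specific answers on the record makes it possible to check later whether a response reflects one passage, both, or neither.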

Key Findings

Models struggle to reflect contradictory context when given two conflicting passages.

Numbers: Correct rates on two-passage prompts often ≤ 44% (Llama-3 improved to 43.8%).

Explicitly prompting models to consider contradictions can substantially increase correct answers.

Numbers: Llama-3-70b-instruct: 10.4% → 43.8% correct (prompt 4 → prompt 5).

A judge LLM can approximate human labels with high F-score for two-passage conflict evaluation.

Numbers: Llama-3-70b-instruct judge F1≈0.80; GPT-4 judge F1≈0.825.

Dataset composition: explicit conflicts dominate but a sizeable portion require reasoning.

Numbers: 253 instances total; 64% explicit, 36% implicit contradictions.

Human annotation agreement varied but was acceptable.

Numbers: Cohen’s κ between 0.58 and 0.88 across templates.
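For readers who want to reproduce an agreement figure like the κ range above on their own annotations, here is a minimal sketch of Cohen's kappa for two annotators, computed as (p_o - p_e) / (1 - p_e). The label names and the toy annotations are assumptions made for illustration; they are not data from the paper.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("annotations must be non-empty and aligned")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    p_e = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if p_e == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Toy annotations (invented); label names are assumptions.
annotator_1 = ["correct", "correct", "incorrect", "partial", "correct", "incorrect"]
annotator_2 = ["correct", "partial", "incorrect", "partial", "correct", "correct"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

In this toy case the annotators agree on 4 of 6 items (p_o = 24/36) and chance agreement is p_e = 13/36, so κ = 11/23 ≈ 0.478.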

Results

Dataset size

Value: 253 instances

Implicit contradictions

Value: 36%

Human-evaluated samples

Value: 1,200 judged responses (after adjudication)

Prompt-driven improvement (example)

Value: 10.4% → 43.8% correct

Baseline: prompt 4

Automatic judge F-score

Value: ≈0.80 F1 (Llama-3-70b-instruct judge)

Baseline: human labels

Best judge (closed) F-score

Value: ≈0.825 F1 (GPT-4 judge)

Baseline: human labels
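To validate an LLM judge against human labels in the same spirit as the F-scores above, precision, recall, and F1 can be computed directly from paired labels. This summary does not say how the reported F-score is averaged across classes, so the sketch below assumes the simplest case: binary F1 with "correct" as the positive class. The label lists are invented toy data.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    human: Sequence[str], judge: Sequence[str], positive: str = "correct"
) -> Tuple[float, float, float]:
    """Score judge labels against human labels for one positive class."""
    if len(human) != len(judge):
        raise ValueError("label lists must be aligned")
    pairs = list(zip(human, judge))
    tp = sum(h == positive and j == positive for h, j in pairs)
    fp = sum(h != positive and j == positive for h, j in pairs)
    fn = sum(h == positive and j != positive for h, j in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels (invented): human reference vs. judge predictions.
human_labels = ["correct", "correct", "incorrect", "incorrect",
                "correct", "incorrect", "correct", "incorrect"]
judge_labels = ["correct", "incorrect", "incorrect", "correct",
                "correct", "incorrect", "correct", "correct"]

p, r, f1 = precision_recall_f1(human_labels, judge_labels)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Here the judge has 3 true positives, 2 false positives, and 1 false negative, giving precision 0.6, recall 0.75, and F1 = 2/3. A precision well below recall would match the failure mode noted later, where a judge over-counts responses as correct.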

Who Should Care

What To Try In 7 Days

Run simple two-passage tests from WikiContradict to measure how your models treat conflicts.

Add a short prompt that asks the model to list differing answers or state there is a conflict before answering.

Use an LLM judge (few-shot) to scale evaluation and flag models that reconcile contradictions without warning.
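The first two steps above can be prototyped with a small prompt builder plus a crude string check. This is a sketch under stated assumptions: the instruction wording is hypothetical and is not one of the paper's five templates, the function names are invented, and the substring check is only a first-pass filter, not a replacement for human or LLM-judge evaluation.

```python
# Hypothetical instruction text; not taken from the paper's templates.
CONFLICT_INSTRUCTION = (
    "The passages below may contradict each other. Before answering, "
    "state whether they conflict; if they do, list each differing answer "
    "together with the passage that supports it."
)


def build_two_passage_prompt(
    question: str, passage_1: str, passage_2: str, conflict_aware: bool = True
) -> str:
    """Assemble a two-passage RAG prompt, optionally conflict-aware."""
    parts = []
    if conflict_aware:
        parts.append(CONFLICT_INSTRUCTION)
    parts.append(f"Passage 1: {passage_1}")
    parts.append(f"Passage 2: {passage_2}")
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n\n".join(parts)


def mentions_both_answers(response: str, answer_1: str, answer_2: str) -> bool:
    """Crude check: does the response contain both source-specific answers?"""
    text = response.lower()
    return answer_1.lower() in text and answer_2.lower() in text


# Invented toy inputs for illustration.
prompt = build_two_passage_prompt(
    question="In what year was the bridge completed?",
    passage_1="The bridge was completed in 1890.",
    passage_2="Construction of the bridge finished in 1892.",
)
print(prompt)
print(mentions_both_answers("The sources disagree: 1890 or 1892.", "1890", "1892"))
```

Running the same questions with `conflict_aware=False` gives a baseline, so the effect of the added instruction can be measured on your own model rather than assumed.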

Reproducibility

License

  • MIT

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • English-only dataset; no multilingual coverage.
  • Built from Wikipedia maintenance tags — may bias toward certain contradiction types.
  • Text-only contradictions; no multimodal (image/text) cases included.
  • Judge LLMs can be biased (e.g., over-counts reconciliatory answers for some models).

When Not To Use

  • Evaluating multilingual or non-English conflict handling.
  • Measuring multimodal contradiction resolution (images + text).
  • Replacing careful human review in high-stakes factual verification without judge validation.

Failure Modes

  • Models ignore one of the contradictory passages and answer from a single source.
  • Models prefer their internal parametric knowledge over provided context ('stubborn' behavior).
  • Judge LLMs overestimate correctness when models list both answers without indicating conflict resolution.
  • Implicit contradictions (those that require reasoning to detect) remain especially hard for models.

Core Entities

Models

  • Mistral-7b-instruct
  • Mixtral-8x7b-instruct
  • Llama-2-70b-chat
  • Llama-2-13b-chat
  • Llama-3-70b-instruct
  • Llama-3-8b-instruct
  • GPT-4-turbo-2024-04-09
  • GPT-4o-2024-05-13
  • Flan-ul2

Metrics

  • F1
  • Precision
  • Recall
  • Accuracy
  • Cohen's kappa

Datasets

  • WikiContradict
  • Wikipedia

Benchmarks

  • WikiContradict
  • FreshLLM
  • TruthfulQA