MultiHal: a multilingual, Wikidata-grounded benchmark that uses KG paths to evaluate and reduce LLM hallucinations

May 20, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Ernests Lavrinovics, Russa Biswas, Katja Hose, Johannes Bjerva

Links

Abstract / PDF

Why It Matters For Business

Structured KG evidence can be injected into prompts to measurably reduce hallucinations and improve answer fidelity across languages, lowering risk for information-sensitive products.

Summary TLDR

MultiHal is a multilingual, multi-hop benchmark that links existing hallucination/QA datasets to Wikidata paths. The authors mined ~140k candidate KG paths, filtered them with an LLM-as-a-judge down to 25.9k high-quality paths covering 7,095 unique questions, then translated the questions, answers, and paths into five European languages. Baseline tests show that adding KG paths as in-context knowledge (KG-RAG) raises semantic similarity, NLI entailment, and hallucination-detection scores versus vanilla QA across models and languages. The dataset and code are public.

Problem Statement

Existing hallucination benchmarks are English-centric and text-based and do not use structured knowledge from knowledge graphs (KGs). This limits multilingual factuality evaluation and the testing of KG-based methods for reducing hallucinations in LLM outputs.

Main Contribution

A multilingual, multi-hop benchmark (MultiHal) that links QA/hallucination questions to Wikidata KG paths and translations.

A scalable pipeline: entity linking (Falcon 2.0 + DBpedia/Wikipedia mapping), SPARQL path mining (≤2 hops) and LLM-as-a-judge filtering.

Public release of data and code (CC-BY-4.0) including translations to German, Italian, French, Portuguese and Spanish.

Empirical baselines showing KG paths as in-context knowledge (KG-RAG) improve factuality metrics over vanilla QA.
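
The path-mining step of the pipeline could be sketched roughly as follows. This is a minimal illustration, not the authors' actual queries: the function name `build_path_query`, the variable names `?p1`, `?mid`, `?p2`, and the `limit` default are assumptions made for this example. It only builds the SPARQL text for forward 1-hop and 2-hop paths between a linked subject entity and an answer entity; sending the query to a Wikidata endpoint is left out.

```python
import re


def build_path_query(subject_qid: str, answer_qid: str, hops: int = 1, limit: int = 50) -> str:
    """Build SPARQL text for forward paths of 1 or 2 hops between two Wikidata entities.

    Hypothetical helper for illustration; the paper's own queries may differ.
    """
    for qid in (subject_qid, answer_qid):
        # Reject anything that is not a plain QID, so nothing unexpected is
        # interpolated into the query string.
        if not re.fullmatch(r"Q\d+", qid):
            raise ValueError(f"not a Wikidata QID: {qid!r}")

    prefix = "PREFIX wd: <http://www.wikidata.org/entity/>\n"
    if hops == 1:
        # subject --p1--> answer
        body = f"SELECT ?p1 WHERE {{ wd:{subject_qid} ?p1 wd:{answer_qid} . }}"
    elif hops == 2:
        # subject --p1--> mid --p2--> answer
        body = (
            f"SELECT ?p1 ?mid ?p2 WHERE {{ "
            f"wd:{subject_qid} ?p1 ?mid . ?mid ?p2 wd:{answer_qid} . }}"
        )
    else:
        raise ValueError("only 1 or 2 hops are supported in this sketch")
    return f"{prefix}{body} LIMIT {int(limit)}"


if __name__ == "__main__":
    # Example QIDs only; any linked subject/answer pair would do.
    print(build_path_query("Q42", "Q145", hops=2))
```

Capping the hop count keeps the candidate set manageable; longer paths grow quickly in number and are more likely to be irrelevant, which is why a filtering stage follows mining.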

Key Findings

Dataset scale and multilingual coverage

Numbers: 25,905 KG-path data points; 7,095 unique questions; translations to 5 languages (+English)

KG-RAG boosts factuality vs vanilla QA

Numbers: semantic similarity +0.12–0.36; NLI entailment +0.16–0.36; hallucination detection +0.29–0.42 (KG-RAG vs QA)

LLM-as-a-judge filters paths but is noisy

Numbers: GPT-4o Mini inter-annotator agreement (Cohen's kappa) 0.62; false positives 11%; false negatives 2.78%; path-semantic correlation ρ≈0.485

Entity linking is a bottleneck

Numbers: High noise from Falcon 2.0 necessitated heavy filtering (many low-quality paths rated 1–3)

Results

semantic_similarity (mean dot-product)

Value: varies by model/language; e.g., Gemini Eng QA 0.51 → KG-RAG 0.83

Baseline: vanilla QA (no KG)
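
As a rough illustration of what a "mean dot-product" score computes, the sketch below averages the dot product between paired answer and reference vectors. The toy 2-D vectors stand in for real sentence embeddings, and the L2 normalization step is an assumption of this example (it makes the dot product equal to cosine similarity); the names `l2_normalize` and `mean_dot` are hypothetical.

```python
import math
from typing import Sequence


def l2_normalize(vec: Sequence[float]) -> list[float]:
    """Scale a vector to unit length (assumption: scores use normalized embeddings)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def mean_dot(answers: Sequence[Sequence[float]], references: Sequence[Sequence[float]]) -> float:
    """Average dot product over paired (answer, reference) embedding vectors."""
    if len(answers) != len(references) or not answers:
        raise ValueError("need two non-empty lists of equal length")
    total = 0.0
    for a, r in zip(answers, references):
        a_n, r_n = l2_normalize(a), l2_normalize(r)
        total += sum(x * y for x, y in zip(a_n, r_n))
    return total / len(answers)


if __name__ == "__main__":
    # Toy vectors in place of sentence embeddings.
    print(mean_dot([[1.0, 0.0], [1.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]))
```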

NLI entailment (percent entailment)

Value: entailment rates increased per model (examples: GPT-4o-Mini QA 42.7% → KG-RAG 81.74% on some splits)

Baseline: vanilla QA

Hallucination detection (HHEM-2.1 consistent %)

Value: consistent rates improved (example: GPT-4o-Mini KG-RAG ~89% consistent, versus a lower rate for vanilla QA)

Baseline: vanilla QA

LLM-as-a-judge reliability

Value: Cohen's kappa 0.62; false positives 11%; false negatives 2.78%
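
For readers who want to check judge reliability on their own sample, Cohen's kappa can be computed from two label lists as sketched below. The function name `cohen_kappa` and the toy accept/reject labels are illustrative assumptions, not data from the paper. Kappa is the observed agreement corrected for the agreement expected by chance: kappa = (p_o − p_e) / (1 − p_e).

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)

    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)

    if p_e == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Toy labels: 1 = path judged relevant, 0 = judged irrelevant.
    human = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
    judge = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
    print(round(cohen_kappa(human, judge), 2))
```

In the toy example the raters agree on 8 of 10 items (p_o = 0.8) and chance agreement is 0.5, so kappa is 0.6.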

Who Should Care

What To Try In 7 Days

Run a small KG-RAG pilot: attach 1–2-hop Wikidata paths to common QA prompts and compare outputs vs vanilla QA.

Validate path quality manually on top 200 queries to catch entity-linking errors before scaling.

Use NLI + an open hallucination detector to triangulate improvements instead of relying on one metric.
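
A pilot along these lines might start from a pair of prompt builders like the sketch below, which produces a vanilla QA prompt and a KG-RAG prompt for the same question. The prompt wording, the `->` path rendering, and the names `format_path` and `build_prompts` are assumptions for illustration; the benchmark's own templates may differ. Model calls and scoring are intentionally left out.

```python
from typing import Sequence


def format_path(path: Sequence[str]) -> str:
    """Render a KG path given as alternating labels: entity, predicate, entity, ..."""
    if len(path) < 3 or len(path) % 2 == 0:
        raise ValueError("a path needs an odd number of labels, at least 3")
    return " -> ".join(path)


def build_prompts(question: str, paths: Sequence[Sequence[str]]) -> tuple[str, str]:
    """Return (vanilla_prompt, kg_rag_prompt) for one question."""
    vanilla = f"Answer the question concisely.\n\nQuestion: {question}\nAnswer:"
    facts = "\n".join(f"- {format_path(p)}" for p in paths)
    kg_rag = (
        "Answer the question concisely, using the knowledge graph facts below.\n\n"
        f"Facts:\n{facts}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return vanilla, kg_rag


if __name__ == "__main__":
    question = "In which country was Douglas Adams born?"
    # A 2-hop path written with human-readable labels.
    paths = [["Douglas Adams", "place of birth", "Cambridge", "country", "United Kingdom"]]
    vanilla_prompt, kg_rag_prompt = build_prompts(question, paths)
    print(kg_rag_prompt)
```

Sending both prompts to the same model and scoring the two answers with the same metrics gives a like-for-like comparison per question.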

Agent Features

Tool Use

  • OpenRouter API
  • SPARQL / Wikidata queries
  • Falcon 2.0 entity linking

Reproducibility

License

  • CC-BY-4.0 (data); code open-source (see repo); uses closed-source LLMs for judge

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies primarily on Wikidata — domain coverage gaps (medical, finance) reduce usefulness for specialized fields.
  • Multilingual scope is top European languages only (DE/FR/IT/ES/PT), limited typological diversity.
  • LLM-as-a-judge is closed-source and introduces noise (≈11% false positives).
  • Evaluation metrics are aggregate and do not localize exact hallucinated text spans.
  • KG-RAG injection approach is simple in-prompt conditioning; advanced knowledge encoding not evaluated.

When Not To Use

  • When multi-turn dialogue or summarization tasks are primary (MultiHal is single-turn QA focused).
  • For domain-specific QA (medical, legal) without domain-specific KGs.
  • As the sole ground truth for path quality without human checks when high assurance is required.

Failure Modes

  • Entity linking errors (Falcon 2.0 produced many irrelevant entities), which lead to wrong KG paths.
  • Temporal/indexical questions where Wikidata is outdated or the correct answer changes over time.
  • Suggestive/leading questions that require multi-step logical reasoning beyond short KG paths.
  • Translation quirks: NLLB produced occasional formatting/semicolon separation issues in path labels.

Core Entities

Models

  • Gemini 2.0 Flash
  • openai-gpt-4o-mini
  • Llama-3.3-70b-instruct

Metrics

  • semantic_similarity (sentence-embedding dot)
  • NLI entailment (mDeBERTa-xnli)
  • hallucination_detection (HHEM-2.1)
  • Spearman correlation (path score vs semantic score)
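
The Spearman correlation listed above can be reproduced on a small sample without extra dependencies, as in this sketch: rank both score lists (ties get the average rank) and take the Pearson correlation of the ranks. The function names and the toy scores are illustrative assumptions, not values from the paper.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    path_scores = [5, 4, 4, 2, 1]            # toy judge ratings per path
    semantic_scores = [0.9, 0.7, 0.8, 0.3, 0.4]  # toy similarity scores
    print(round(spearman(path_scores, semantic_scores), 3))
```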

Datasets

  • MultiHal
  • HaluEval
  • HaluBench
  • DefAn
  • SimpleQA
  • TruthfulQA
  • Shroom2024
  • FELM
  • Wikidata

Benchmarks

  • HHEM-2.1
  • MMTE

Context Entities

Models

  • MiniLM-L12-v2 (sentence embeddings)

Datasets

  • HuggingFace MultiHal dataset
  • Wikidata KG