MultiHal: a multilingual, Wikidata-grounded benchmark that uses KG paths to evaluate and reduce LLM hallucinations

Overview

Decision Snapshot: Needs Validation

The dataset and baselines show clear per-model improvements when KG paths are injected; however, entity-linking noise and dependence on an LLM judge reduce out-of-the-box reliability for safety-critical deployment.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

License: CC-BY-4.0 (data); code open-source (see repo); uses closed-source LLMs for judge

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Ernests Lavrinovics, Russa Biswas, Katja Hose, Johannes Bjerva

Why It Matters For Business

Structured KG evidence can be injected into prompts to measurably reduce hallucinations and improve answer fidelity across languages, lowering risk for information-sensitive products.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead, CTO

Summary TLDR

MultiHal is a multilingual, multi-hop benchmark that links existing hallucination/QA datasets to Wikidata paths. The authors mined ~140k candidate KG paths, filtered them with an LLM-as-a-judge down to 25.9k high-quality paths covering 7,095 unique questions, then translated the questions, answers, and paths into five European languages. Baseline tests show that adding KG paths as in-context knowledge (KG-RAG) raises semantic similarity, NLI entailment, and hallucination-detection scores versus vanilla QA across models and languages. The code and data are public.

Problem Statement

Existing hallucination benchmarks are English-centric and text-only; they do not use structured knowledge from knowledge graphs (KGs). This limits multilingual factuality evaluation and the testing of KG-based methods for reducing hallucinations in LLM outputs.

Main Contribution

A multilingual, multi-hop benchmark (MultiHal) that links QA/hallucination questions to Wikidata KG paths and translations.

A scalable pipeline: entity linking (Falcon 2.0 + DBpedia/Wikipedia mapping), SPARQL path mining (≤2 hops) and LLM-as-a-judge filtering.
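
The path-mining step can be illustrated with a short sketch. The snippet below only builds SPARQL query strings for 1-hop and 2-hop paths between a linked subject entity and an answer entity. It is not the authors' query: the helper name `build_path_queries`, the variable naming scheme, the forward-only edge direction, and the `LIMIT` value are all assumptions made for illustration, and sending the queries to an endpoint is left out.

```python
import re

# Wikidata item identifiers look like "Q42".
_QID = re.compile(r"^Q[1-9]\d*$")

PREFIX = "PREFIX wd: <http://www.wikidata.org/entity/>"


def build_path_queries(subject_qid, answer_qid, max_hops=2, limit=50):
    """Return one SPARQL query string per hop count (1..max_hops).

    Hypothetical helper: each query asks for directed paths that start at
    the subject entity and end at the answer entity.
    """
    for qid in (subject_qid, answer_qid):
        if not _QID.match(qid):
            raise ValueError(f"not a Wikidata QID: {qid!r}")

    queries = []
    for hops in range(1, max_hops + 1):
        # Nodes along the path: subject, intermediate variables, answer.
        nodes = (
            [f"wd:{subject_qid}"]
            + [f"?n{i}" for i in range(1, hops)]
            + [f"wd:{answer_qid}"]
        )
        # One triple pattern per hop, each with its own predicate variable.
        triples = [f"{nodes[i]} ?p{i + 1} {nodes[i + 1]} ." for i in range(hops)]
        # Select predicates and intermediate nodes in path order.
        select_vars = []
        for i in range(hops):
            select_vars.append(f"?p{i + 1}")
            if i < hops - 1:
                select_vars.append(f"?n{i + 1}")
        queries.append(
            f"{PREFIX}\n"
            f"SELECT {' '.join(select_vars)} WHERE {{ {' '.join(triples)} }} "
            f"LIMIT {limit}"
        )
    return queries


if __name__ == "__main__":
    # Q42 and Q84 are placeholder QIDs chosen for the example.
    for query in build_path_queries("Q42", "Q84"):
        print(query, end="\n\n")
```

Keeping one query per hop count makes it easy to stop after the 1-hop query when it already returns a path. Candidate paths produced this way would still need the judge-based filtering described above.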

Key Findings

Dataset scale and multilingual coverage

Numbers: 25,905 KG-path data points; 7,095 unique questions; translations into 5 languages (plus English)

Practical Use: You get a ready-made, multilingual KG-grounded testbed to compare knowledge-injection methods without building alignments yourself.

Evidence Ref: Table 1; Section 2.4

KG-RAG boosts factuality vs vanilla QA

Numbers: semantic similarity +0.12–0.36; NLI entailment +0.16–0.36; hallucination detection +0.29–0.42 (KG-RAG vs QA)

Practical Use: Injecting high-quality KG paths into prompts tends to produce objectively more fact-aligned answers across models and languages; try KG-RAG when factuality matters.

Evidence Ref: Abstract; Section 4; Table 10 and aggregated statements

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
semantic_similarity (mean dot-product)	varies by model/language; e.g., Gemini Eng QA 0.51 → KG-RAG 0.83	vanilla QA (no KG)	+0.12 to +0.36 (aggregate range across models and languages)	MultiHal (multilingual aggregate)	Table 10; Figure 3; Abstract	Table 10
NLI entailment (percent entailment)	entailment rates increased per model (examples: GPT-4o-Mini QA 42.7% → KG-RAG 81.74% on some splits)	vanilla QA	+0.16 to +0.36 (entailment increase ranges reported)	MultiHal aggregated	Table 5; Table 14	Table 5
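
As a rough illustration of how the semantic-similarity row is read, the sketch below computes a mean dot product over reference/answer vector pairs and the KG-RAG minus QA delta. The toy two-dimensional vectors stand in for real sentence embeddings, and the helper names `mean_dot` and `delta` are assumptions introduced here, not names from the MultiHal code.

```python
def mean_dot(pairs):
    """Mean dot product over (reference_vector, answer_vector) pairs."""
    if not pairs:
        raise ValueError("need at least one pair")
    total = 0.0
    for ref, ans in pairs:
        if len(ref) != len(ans):
            raise ValueError("vectors must have equal length")
        total += sum(r * a for r, a in zip(ref, ans))
    return total / len(pairs)


def delta(kg_rag_score, qa_score):
    """Improvement of KG-RAG over vanilla QA, rounded to two decimals."""
    return round(kg_rag_score - qa_score, 2)


if __name__ == "__main__":
    # Toy vectors (assumed): one identical pair, one orthogonal pair.
    toy_pairs = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [1.0, 0.0])]
    print(mean_dot(toy_pairs))   # 0.5
    # Gemini English example from the table: QA 0.51, KG-RAG 0.83.
    print(delta(0.83, 0.51))     # 0.32
```

The Gemini English example gives a delta of +0.32, which sits inside the reported aggregate range of +0.12 to +0.36.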

What To Try In 7 Days

Run a small KG-RAG pilot: attach 1–2-hop Wikidata paths to common QA prompts and compare outputs vs vanilla QA.

Validate path quality manually on top 200 queries to catch entity-linking errors before scaling.

Use NLI + an open hallucination detector to triangulate improvements instead of relying on one metric.
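
For the pilot in the first step, a minimal sketch of pairing a vanilla prompt with a KG-RAG prompt might look like the following. The prompt wording, the arrow-separated path rendering, and the helper names `format_path` and `build_prompts` are assumptions for illustration; they are not the prompts used in the MultiHal baselines, and the example question is not drawn from the dataset.

```python
def format_path(path):
    """Render a path, given as (subject, predicate, object) label triples."""
    return " ; ".join(f"{s} -> {p} -> {o}" for s, p, o in path)


def build_prompts(question, paths):
    """Return (vanilla_prompt, kg_rag_prompt) for one question.

    With no paths available, both prompts are the vanilla prompt, so the
    comparison stays well defined for questions without KG coverage.
    """
    vanilla = f"Question: {question}\nAnswer:"
    if not paths:
        return vanilla, vanilla
    facts = "\n".join(f"- {format_path(p)}" for p in paths)
    kg_rag = (
        "Use the knowledge graph facts below when answering.\n"
        f"Facts:\n{facts}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return vanilla, kg_rag


if __name__ == "__main__":
    # Illustrative 1-hop path written with labels instead of Wikidata IDs.
    example_paths = [[("Douglas Adams", "place of birth", "Cambridge")]]
    qa_prompt, kg_prompt = build_prompts(
        "Where was Douglas Adams born?", example_paths
    )
    print(qa_prompt)
    print("---")
    print(kg_prompt)
```

Sending both prompts to the same model and scoring the two answers with the same metrics keeps the comparison limited to the effect of the injected paths.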

Agent Features

Tool Use

OpenRouter API, SPARQL / Wikidata queries, Falcon 2.0 entity linking

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: CC-BY-4.0 (data); code open-source (see repo); uses closed-source LLMs for judge

Code URLs

https://github.com/ernlavr/multihal

Data URLs

https://huggingface.co/datasets/ernlavr/multihal

Risks & Boundaries

Limitations

Relies primarily on Wikidata — domain coverage gaps (medical, finance) reduce usefulness for specialized fields.

Multilingual scope is top European languages only (DE/FR/IT/ES/PT), limited typological diversity.

When Not To Use

When multi-turn dialogue or summarization tasks are primary (MultiHal is single-turn QA focused).

For domain-specific QA (medical, legal) without domain-specific KGs.

Failure Modes

Entity linking errors (Falcon 2.0 produced many irrelevant entities), producing wrong KG paths.

Temporal/indexical questions where Wikidata is outdated or time-dependent answers shift.

Core Entities

Models

Gemini 2.0 Flash, openai-gpt-4o-mini, Llama-3.3-70b-instruct

Metrics

semantic_similarity (sentence-embedding dot), NLI entailment (mDeBERTa-xnli), hallucination_detection (HHEM-2.1), Spearman correlation (path score vs semantic score)

Datasets

MultiHal, HaluEval, HaluBench, DefAn, SimpleQA, TruthfulQA, Shroom2024, FELM, Wikidata

Benchmarks

HHEM-2.1, MMTE

Context Entities

Models

MiniLM-L12-v2 (sentence embeddings)

Datasets

HuggingFace MultiHal dataset, Wikidata KG
