MultiHal: a multilingual, Wikidata-grounded benchmark that uses KG paths to evaluate and reduce LLM hallucinations

May 20, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

The dataset and baseline experiments show clear per-model improvements when KG paths are injected; however, entity-linking noise and dependence on an LLM judge reduce out-of-the-box reliability for safety-critical deployment.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

License: CC-BY-4.0 (data); code open-source (see repo); uses closed-source LLMs for judge

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Ernests Lavrinovics, Russa Biswas, Katja Hose, Johannes Bjerva

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Structured KG evidence can be injected into prompts to measurably reduce hallucinations and improve answer fidelity across languages, lowering risk for information-sensitive products.

Who Should Care

Summary TLDR

MultiHal is a multilingual, multi-hop benchmark that links existing hallucination/QA datasets to Wikidata paths. The authors mined ~140k candidate KG paths, filtered them with an LLM-as-a-judge down to 25.9k high-quality paths covering 7,095 unique questions, then translated the questions, answers, and paths into five European languages. Baseline tests show that adding KG paths as in-context knowledge (KG-RAG) raises semantic similarity, NLI entailment, and hallucination-detection scores versus vanilla QA across models and languages. The dataset and code are public.

Problem Statement

Existing hallucination benchmarks are English-centric and text-based and do not use structured knowledge from knowledge graphs (KGs). This limits multilingual factuality evaluation and the testing of KG-based methods for reducing hallucinations in LLM outputs.

Main Contribution

A multilingual, multi-hop benchmark (MultiHal) that links QA/hallucination questions to Wikidata KG paths and translations.

A scalable pipeline: entity linking (Falcon 2.0 + DBpedia/Wikipedia mapping), SPARQL path mining (≤2 hops) and LLM-as-a-judge filtering.
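The path-mining step can be illustrated with a minimal sketch that only builds the SPARQL query text for 1-hop and 2-hop paths between a linked question entity and an answer entity. The function name `build_path_query`, the `LIMIT 50` cap, the forward-only edge direction, and the placeholder QIDs are assumptions for illustration; the authors' actual queries may differ.

```python
def build_path_query(subject_qid: str, answer_qid: str, hops: int) -> str:
    """Return SPARQL text for paths of exactly `hops` edges (1 or 2).

    Hypothetical helper: it only follows edges in the forward direction
    and relies on the `wd:` prefix predefined by the Wikidata query service.
    """
    if hops not in (1, 2):
        raise ValueError("this sketch only covers 1- and 2-hop paths")
    if hops == 1:
        select = "?p1"
        pattern = f"wd:{subject_qid} ?p1 wd:{answer_qid} ."
    else:
        select = "?p1 ?mid ?p2"
        pattern = f"wd:{subject_qid} ?p1 ?mid . ?mid ?p2 wd:{answer_qid} ."
    return f"SELECT {select} WHERE {{ {pattern} }} LIMIT 50"


# Placeholder QIDs, used only to show the shape of the generated query.
one_hop = build_path_query("Q1", "Q2", hops=1)
two_hop = build_path_query("Q1", "Q2", hops=2)
print(one_hop)
print(two_hop)
```

The returned string could then be sent to a SPARQL endpoint; candidate paths mined this way would still need the judge-based filtering described above.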

Key Findings

Dataset scale and multilingual coverage

Numbers: 25,905 KG-path data points; 7,095 unique questions; translations to 5 languages (+English)

Practical Use: You get a ready-made, multilingual KG-grounded testbed to compare knowledge-injection methods without building alignments yourself.

Evidence Ref: Table 1; Section 2.4

KG-RAG boosts factuality vs vanilla QA

Numbers: semantic similarity +0.12 to +0.36; NLI entailment +0.16 to +0.36; hallucination detection +0.29 to +0.42 (KG-RAG vs QA)

Practical Use: Injecting high-quality KG paths into prompts tends to produce objectively more fact-aligned answers across models and languages; try KG-RAG when factuality matters.

Evidence Ref: Abstract; Section 4; Table 10 and aggregated statements
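One way to picture the difference between vanilla QA and KG-RAG is a sketch of the two prompts side by side. The prompt wording, the `->` path rendering, and the helper names `format_path` and `build_prompts` are assumptions for illustration, not the prompt templates used in the paper.

```python
from typing import List, Tuple

# (subject label, predicate label, object label); an illustrative layout.
Triple = Tuple[str, str, str]


def format_path(path: List[Triple]) -> str:
    """Render a chained path as 'A -> predicate -> B -> predicate -> C'."""
    if not path:
        return ""
    parts = [path[0][0]]
    for _, predicate, obj in path:
        parts += [predicate, obj]
    return " -> ".join(parts)


def build_prompts(question: str, paths: List[List[Triple]]) -> Tuple[str, str]:
    """Return (vanilla QA prompt, KG-RAG prompt) for the same question."""
    vanilla = f"Question: {question}\nAnswer:"
    knowledge = "\n".join(f"- {format_path(p)}" for p in paths)
    kg_rag = (
        "Knowledge graph paths:\n"
        f"{knowledge}\n\n"
        "Using the paths above, answer the question.\n"
        f"Question: {question}\nAnswer:"
    )
    return vanilla, kg_rag


vanilla, kg_rag = build_prompts(
    "Where was Marie Curie born?",
    [[("Marie Curie", "place of birth", "Warsaw")]],
)
print(kg_rag)
```

Sending both prompts to the same model and scoring the two answers against the reference gives the QA versus KG-RAG comparison summarized in this finding.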

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
semantic_similarity (mean dot-product) | varies by model/language; e.g., Gemini Eng QA 0.51 → KG-RAG 0.83 | vanilla QA (no KG) | +0.12 to +0.36 (aggregate range across models and languages) | MultiHal (multilingual aggregate) | Table 10; Figure 3; Abstract | Table 10
NLI entailment (percent entailment) | entailment rates increased per model (e.g., GPT-4o-Mini QA 42.7% → KG-RAG 81.74% on some splits) | vanilla QA | +0.16 to +0.36 (reported range of entailment increases) | MultiHal aggregated | Table 5; Table 14 | Table 5
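As a quick worked check of how a single example relates to the reported range, the sketch below computes the delta for the Gemini English semantic-similarity example quoted above and tests it against the aggregate range of +0.12 to +0.36. The variable names are illustrative; the numbers come from the results listed here.

```python
# Semantic similarity for Gemini on English, as quoted in the results.
qa_score = 0.51       # vanilla QA
kg_rag_score = 0.83   # KG-RAG

# Aggregate delta range reported across models and languages.
range_low, range_high = 0.12, 0.36

delta = round(kg_rag_score - qa_score, 2)
within_reported_range = range_low <= delta <= range_high

print(f"delta = {delta:+.2f}, within reported range: {within_reported_range}")
```

The same subtraction applied to your own pilot scores gives a number that can be compared directly with the reported range.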

What To Try In 7 Days

Run a small KG-RAG pilot: attach 1–2-hop Wikidata paths to common QA prompts and compare outputs vs vanilla QA.

Validate path quality manually on top 200 queries to catch entity-linking errors before scaling.

Use NLI + an open hallucination detector to triangulate improvements instead of relying on one metric.
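The triangulation step in the list above can be sketched as a small comparison: compute the mean change per metric between vanilla QA and KG-RAG and only call the pilot an improvement when every metric moves in the same direction. The function name `triangulate`, the metric keys, and all scores below are hypothetical placeholders, not results from the paper.

```python
from statistics import mean
from typing import Dict, List, Tuple


def triangulate(
    qa_scores: Dict[str, List[float]],
    kg_scores: Dict[str, List[float]],
) -> Tuple[Dict[str, float], bool]:
    """Return per-metric mean deltas and whether all metrics improved."""
    deltas = {
        metric: mean(kg_scores[metric]) - mean(qa_scores[metric])
        for metric in qa_scores
    }
    return deltas, all(d > 0 for d in deltas.values())


# Hypothetical per-question scores from a small pilot.
qa = {
    "nli_entailment": [0.0, 1.0, 0.0, 1.0],
    "hallucination_detector": [0.25, 0.5],
}
kg = {
    "nli_entailment": [1.0, 1.0, 0.0, 1.0],
    "hallucination_detector": [0.75, 0.75],
}

deltas, all_improved = triangulate(qa, kg)
print(deltas, all_improved)
```

Requiring agreement across metrics is a deliberately conservative choice; a disagreement between metrics is a signal to inspect individual answers by hand.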

Agent Features

Tool Use
OpenRouter API, SPARQL / Wikidata queries, Falcon 2.0 entity linking

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: CC-BY-4.0 (data); code open-source (see repo); uses closed-source LLMs for judge

Risks & Boundaries

Limitations

Relies primarily on Wikidata — domain coverage gaps (medical, finance) reduce usefulness for specialized fields.

Multilingual scope covers only major European languages (DE/FR/IT/ES/PT), so typological diversity is limited.

When Not To Use

When multi-turn dialogue or summarization tasks are primary (MultiHal is single-turn QA focused).

For domain-specific QA (medical, legal) without domain-specific KGs.

Failure Modes

Entity-linking errors: Falcon 2.0 produced many irrelevant entities, which leads to wrong KG paths.

Temporal/indexical questions where Wikidata is outdated or time-dependent answers shift.

Core Entities

Models

Gemini 2.0 Flash, openai-gpt-4o-mini, Llama-3.3-70b-instruct

Metrics

semantic_similarity (sentence-embedding dot), NLI entailment (mDeBERTa-xnli), hallucination_detection (HHEM-2.1), Spearman correlation (path score vs semantic score)
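For the Spearman correlation between path score and semantic score, a minimal dependency-free sketch is given below: it ranks both series (averaging ranks for ties) and takes the Pearson correlation of the ranks. The helper names and the sample scores are hypothetical; in practice a library routine such as `scipy.stats.spearmanr` would normally be used.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical path-quality scores and semantic-similarity scores.
path_scores = [1, 2, 3, 4]
semantic_scores = [0.1, 0.4, 0.5, 0.9]
rho = spearman(path_scores, semantic_scores)
print(rho)
```

This sketch assumes both series contain at least two distinct values; a constant series would make the denominator zero.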

Datasets

MultiHal, HaluEval, HaluBench, DefAn, SimpleQA, TruthfulQA, Shroom2024, FELM, Wikidata

Benchmarks

HHEM-2.1, MMTE

Context Entities

Models

MiniLM-L12-v2 (sentence embeddings)

Datasets

HuggingFace MultiHal dataset, Wikidata KG