Overview
The dataset and baseline show clear per-model improvements when KG paths are injected. However, entity-linking noise and dependence on an LLM judge reduce out-of-the-box reliability for safety-critical deployment.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/4
Reproducibility
Status: Code + data available
Open source: Partial
License: CC-BY-4.0 (data); code is open source (see repo); the judge step uses closed-source LLMs
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Structured KG evidence can be injected into prompts to measurably reduce hallucinations and improve answer fidelity across languages, lowering risk for information-sensitive products.
Summary TLDR
MultiHal is a multilingual, multi-hop benchmark that links existing hallucination/QA datasets to Wikidata paths. The authors mined ~140k candidate KG paths, filtered them with an LLM-as-a-judge down to 25.9k high-quality paths covering 7,095 unique questions, then translated the questions, answers, and paths into five European languages. Baseline tests show that adding KG paths as in-context knowledge (KG-RAG) raises semantic similarity, NLI entailment, and hallucination-detection scores versus vanilla QA across models and languages. The dataset and code are public.
Problem Statement
Existing hallucination benchmarks are English-centric and text-based, and they do not use structured knowledge from knowledge graphs (KGs). This limits multilingual factuality evaluation and the testing of KG-based methods for reducing hallucinations in LLM outputs.
Main Contribution
A multilingual, multi-hop benchmark (MultiHal) that links QA/hallucination questions to Wikidata KG paths and translations.
A scalable pipeline: entity linking (Falcon 2.0 + DBpedia/Wikipedia mapping), SPARQL path mining (≤2 hops), and LLM-as-a-judge filtering.
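The path-mining step can be pictured as generating one SPARQL query per hop count between a linked question entity and a linked answer entity. The sketch below is a minimal illustration under stated assumptions, not the authors' queries: the function names, the `?mid`/`?p` variable names, the `LIMIT`, and the restriction to direct ("truthy") Wikidata properties are all hypothetical choices made for this example. It only builds query strings and does not contact any endpoint.

```python
# Hypothetical sketch: build SPARQL queries for paths of at most 2 hops
# between two Wikidata entities. Names and filters are assumptions.
from typing import List

WD_PREFIX = "PREFIX wd: <http://www.wikidata.org/entity/>"
# Direct ("truthy") properties live under this IRI prefix on Wikidata.
WDT_IRI = "http://www.wikidata.org/prop/direct/"


def build_path_query(subject_qid: str, answer_qid: str, hops: int, limit: int = 100) -> str:
    """Return a SPARQL query matching paths of exactly `hops` edges."""
    if hops < 1:
        raise ValueError("hops must be >= 1")
    # Node sequence: subject, intermediate variables, answer.
    nodes = [f"wd:{subject_qid}"]
    nodes += [f"?mid{i}" for i in range(1, hops)]
    nodes.append(f"wd:{answer_qid}")

    triples, filters, preds = [], [], []
    for i in range(hops):
        pred = f"?p{i + 1}"
        preds.append(pred)
        triples.append(f"  {nodes[i]} {pred} {nodes[i + 1]} .")
        # Keep only direct properties, skipping statement/qualifier nodes.
        filters.append(f'  FILTER(STRSTARTS(STR({pred}), "{WDT_IRI}"))')

    select_vars = " ".join(preds + nodes[1:-1])
    body = "\n".join(triples + filters)
    return f"{WD_PREFIX}\nSELECT {select_vars} WHERE {{\n{body}\n}} LIMIT {limit}"


def build_path_queries(subject_qid: str, answer_qid: str, max_hops: int = 2) -> List[str]:
    """One query per hop count from 1 up to max_hops (2 mirrors the ≤2-hop limit)."""
    return [build_path_query(subject_qid, answer_qid, h) for h in range(1, max_hops + 1)]


if __name__ == "__main__":
    # Q42 and Q5 are used here purely as placeholder identifiers.
    for query in build_path_queries("Q42", "Q5"):
        print(query, end="\n\n")
```

The resulting candidate paths would still need the judge-based filtering described above, since a syntactically valid path is not necessarily relevant to the question.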
Key Findings
Dataset scale and multilingual coverage
KG-RAG boosts factuality vs vanilla QA
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Semantic similarity (mean dot-product) | Varies by model and language; e.g., Gemini, English: QA 0.51 → KG-RAG 0.83 | Vanilla QA (no KG) | +0.12 to +0.36 (aggregate range across models and languages) | MultiHal (multilingual aggregate) | Table 10; Figure 3; Abstract | Table 10 |
| NLI entailment (percent entailment) | Entailment rates increased per model; e.g., GPT-4o-Mini: QA 42.7% → KG-RAG 81.74% on some splits | Vanilla QA (no KG) | +0.16 to +0.36 (reported range of entailment increases) | MultiHal (aggregate) | Table 5; Table 14 | Table 5 |
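To make the semantic-similarity row concrete, the sketch below shows one way a mean dot-product score and its delta could be computed. It is an illustration only: the helper names are hypothetical, the toy vectors are invented for the example, and it assumes embeddings are already unit-normalized so that the dot product behaves like cosine similarity. The only figures taken from the table are the Gemini English scores (0.51 and 0.83).

```python
# Hypothetical sketch: mean dot-product similarity and the KG-RAG delta.
from typing import Sequence


def mean_dot_similarity(preds: Sequence[Sequence[float]],
                        refs: Sequence[Sequence[float]]) -> float:
    """Average dot product between paired prediction and reference embeddings."""
    if len(preds) != len(refs) or not preds:
        raise ValueError("need equally sized, non-empty embedding lists")
    dots = [sum(a * b for a, b in zip(p, r)) for p, r in zip(preds, refs)]
    return sum(dots) / len(dots)


def delta(kg_rag_score: float, vanilla_score: float) -> float:
    """Improvement of KG-RAG over vanilla QA on the same metric."""
    return kg_rag_score - vanilla_score


# Toy unit vectors (invented): one exact match, one orthogonal answer.
toy_score = mean_dot_similarity([[1.0, 0.0], [0.0, 1.0]],
                                [[1.0, 0.0], [1.0, 0.0]])

# Delta for the Gemini English example in the table: 0.83 - 0.51.
gemini_en_delta = round(delta(0.83, 0.51), 2)

if __name__ == "__main__":
    print(toy_score)        # 0.5
    print(gemini_en_delta)  # 0.32
```

The computed 0.32 falls inside the +0.12 to +0.36 aggregate range listed in the Delta column.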
What To Try In 7 Days
Run a small KG-RAG pilot: attach 1–2-hop Wikidata paths to common QA prompts and compare outputs vs vanilla QA.
Validate path quality manually on the top 200 queries to catch entity-linking errors before scaling.
Use NLI + an open hallucination detector to triangulate improvements instead of relying on one metric.
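A pilot along the lines of the steps above could start from a pair of prompts per question, one vanilla and one with KG paths prepended, plus a simple check that several metrics agree. The sketch below is an assumption-laden illustration: the prompt wording, the `->` path linearization, and all function names are hypothetical and are not taken from the MultiHal code. Model calls and metric computation are deliberately left out.

```python
# Hypothetical sketch: build vanilla vs KG-RAG prompts and triangulate metrics.
from typing import Dict, List, Sequence, Tuple

Triple = Tuple[str, str, str]  # (subject label, predicate label, object label)


def linearize_path(path: Sequence[Triple]) -> str:
    """Render a 1-2 hop path as readable text, one arrow chain per triple."""
    return " ; ".join(f"{s} -> {p} -> {o}" for s, p, o in path)


def build_prompts(question: str, paths: List[Sequence[Triple]]) -> Tuple[str, str]:
    """Return (vanilla_prompt, kg_rag_prompt) for the same question."""
    vanilla = f"Question: {question}\nAnswer:"
    facts = "\n".join(f"- {linearize_path(p)}" for p in paths)
    kg_rag = (
        "Use the following knowledge-graph facts if they are relevant.\n"
        f"{facts}\n\n{vanilla}"
    )
    return vanilla, kg_rag


def all_metrics_improve(vanilla: Dict[str, float], kg_rag: Dict[str, float]) -> bool:
    """True only if every metric scored for vanilla QA is higher with KG-RAG."""
    return all(kg_rag[name] > score for name, score in vanilla.items())


if __name__ == "__main__":
    path = [("Douglas Adams", "place of birth", "Cambridge")]
    vanilla_prompt, kg_prompt = build_prompts("Where was Douglas Adams born?", [path])
    print(kg_prompt)
```

Requiring agreement across metrics is a conservative choice: a gain that shows up in only one score is treated as unconfirmed rather than as an improvement.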
Risks & Boundaries
Limitations
Relies primarily on Wikidata; coverage gaps in specialized domains (medical, finance) reduce its usefulness there.
Multilingual scope is limited to five major European languages (DE/FR/IT/ES/PT), so typological diversity is low.
When Not To Use
When multi-turn dialogue or summarization tasks are primary (MultiHal is single-turn QA focused).
For domain-specific QA (medical, legal) without domain-specific KGs.
Failure Modes
Entity-linking errors: Falcon 2.0 produced many irrelevant entities, which lead to wrong KG paths.
Temporal or indexical questions, where Wikidata may be outdated or the correct answer changes over time.

