Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
Structured KG evidence can be injected into prompts to measurably reduce hallucinations and improve answer fidelity across languages, lowering risk for information-sensitive products.
Summary TLDR
MultiHal is a multilingual, multi-hop benchmark that links existing hallucination/QA datasets to Wikidata paths. The authors mined ~140k candidate KG paths, filtered them with an LLM-as-a-judge down to 25.9k high-quality paths covering 7,095 unique questions, then translated the question-answer pairs and paths into five European languages. Baseline tests show that adding KG paths as in-context knowledge (KG-RAG) raises semantic similarity, NLI entailment, and hallucination-detection scores versus vanilla QA across models and languages. The dataset and code are public.
Problem Statement
Existing hallucination benchmarks are English-centric and text-based, and they do not use structured knowledge from knowledge graphs (KGs). This limits multilingual factuality evaluation and makes it hard to test KG-based methods for reducing hallucinations in LLM outputs.
Main Contribution
A multilingual, multi-hop benchmark (MultiHal) that links QA/hallucination questions to Wikidata KG paths and translations.
A scalable pipeline: entity linking (Falcon 2.0 + DBpedia/Wikipedia mapping), SPARQL path mining (≤2 hops) and LLM-as-a-judge filtering.
Public release of data and code (CC-BY-4.0) including translations to German, Italian, French, Portuguese and Spanish.
Empirical baselines showing KG paths as in-context knowledge (KG-RAG) improve factuality metrics over vanilla QA.
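The path-mining step can be illustrated with a minimal sketch that builds SPARQL queries for paths of at most two hops between a linked subject entity and an answer entity. This is not the authors' code: the function name `build_path_queries`, the forward-only edge direction, the restriction to direct-claim predicates, and the `LIMIT` value are all assumptions made for illustration, and the sketch assumes the answer is itself a Wikidata entity rather than a literal. The `wd:` prefix is predefined on the public Wikidata Query Service.

```python
# Namespace of Wikidata "truthy" direct-claim predicates (wdt:).
WDT_PREFIX = "http://www.wikidata.org/prop/direct/"


def build_path_queries(subject_qid: str, answer_qid: str,
                       max_hops: int = 2, limit: int = 50) -> list[str]:
    """Return one SPARQL query per hop count, from 1 up to max_hops.

    Hypothetical helper: only forward edges are followed, which is a
    simplification chosen for this sketch.
    """
    if max_hops < 1:
        raise ValueError("max_hops must be at least 1")
    queries = []
    for hops in range(1, max_hops + 1):
        # Node sequence: subject, intermediate variables, answer.
        nodes = ([f"wd:{subject_qid}"]
                 + [f"?n{i}" for i in range(1, hops)]
                 + [f"wd:{answer_qid}"])
        # One triple pattern per hop, each with its own predicate variable.
        triples = [f"{nodes[i]} ?p{i + 1} {nodes[i + 1]} ."
                   for i in range(hops)]
        # Keep only direct-claim predicates to avoid statement nodes.
        filters = [f'FILTER(STRSTARTS(STR(?p{i + 1}), "{WDT_PREFIX}"))'
                   for i in range(hops)]
        select_vars = " ".join([f"?p{i + 1}" for i in range(hops)]
                               + [f"?n{i}" for i in range(1, hops)])
        body = "\n  ".join(triples + filters)
        queries.append(
            f"SELECT {select_vars} WHERE {{\n  {body}\n}} LIMIT {limit}"
        )
    return queries


if __name__ == "__main__":
    # Placeholder QIDs; substitute the output of your entity linker.
    for query in build_path_queries("Q76", "Q13133"):
        print(query, end="\n\n")
```

The queries are only constructed here, not executed; sending them to an endpoint, and handling rate limits and timeouts, is left out of the sketch.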
Key Findings
Dataset scale and multilingual coverage
KG-RAG boosts factuality vs vanilla QA
LLM-as-a-judge filters paths but is noisy
Entity linking is a bottleneck
Results
semantic_similarity (mean dot-product)
NLI entailment (percent entailment)
Hallucination detection (HHEM-2.1 consistent %)
LLM-as-a-judge reliability
Who Should Care
What To Try In 7 Days
Run a small KG-RAG pilot: attach 1–2-hop Wikidata paths to common QA prompts and compare outputs vs vanilla QA.
Validate path quality manually on top 200 queries to catch entity-linking errors before scaling.
Use NLI + an open hallucination detector to triangulate improvements instead of relying on one metric.
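For the pilot described above, one way to set up the comparison is to generate a vanilla prompt and a KG-augmented prompt from the same question. The sketch below is a hypothetical illustration: the prompt wording, the `->` path separator, and the names `format_path` and `build_prompts` are assumptions, not the template used in the MultiHal baselines.

```python
from typing import Sequence


def format_path(labels: Sequence[str]) -> str:
    """Render one KG path (entity and predicate labels) as a single line."""
    return " -> ".join(label.strip() for label in labels)


def build_prompts(question: str,
                  paths: Sequence[Sequence[str]]) -> dict[str, str]:
    """Return a vanilla prompt and a KG-RAG prompt for the same question.

    Hypothetical template: adjust the instruction text to your model.
    """
    vanilla = f"Question: {question}\nAnswer:"
    if not paths:
        # Without evidence, both conditions fall back to the same prompt.
        return {"vanilla": vanilla, "kg_rag": vanilla}
    facts = "\n".join(f"- {format_path(p)}" for p in paths)
    kg_rag = (
        "Use the knowledge graph paths below if they are relevant.\n"
        f"{facts}\n\n{vanilla}"
    )
    return {"vanilla": vanilla, "kg_rag": kg_rag}


if __name__ == "__main__":
    prompts = build_prompts(
        "Who wrote Hamlet?",
        [["Hamlet", "author", "William Shakespeare"]],
    )
    print(prompts["kg_rag"])
```

Keeping the question text identical in both prompts means any difference in the scored outputs can be attributed to the injected paths rather than to prompt wording.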
Agent Features
Tool Use
- OpenRouter API
- SPARQL / Wikidata queries
- Falcon 2.0 entity linking
Reproducibility
License
- CC-BY-4.0 (data); code is open-source (see repo); the LLM-as-a-judge step uses closed-source LLMs
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Relies primarily on Wikidata — domain coverage gaps (medical, finance) reduce usefulness for specialized fields.
- Multilingual scope covers only five major European languages (DE/FR/IT/ES/PT), so typological diversity is limited.
- LLM-as-a-judge is closed-source and introduces noise (≈11% false positives).
- Evaluation metrics are aggregate and do not localize exact hallucinated text spans.
- KG-RAG injection approach is simple in-prompt conditioning; advanced knowledge encoding not evaluated.
When Not To Use
- When multi-turn dialogue or summarization tasks are primary (MultiHal is single-turn QA focused).
- For domain-specific QA (medical, legal) without domain-specific KGs.
- As the sole ground truth for path quality without human checks when high assurance is required.
Failure Modes
- Entity linking errors (Falcon 2.0 produced many irrelevant entities), producing wrong KG paths.
- Temporal or indexical questions, where Wikidata may be outdated or the correct answer changes over time.
- Suggestive/leading questions that require multi-step logical reasoning beyond short KG paths.
- Translation quirks: NLLB occasionally produced formatting errors in path labels, including broken semicolon separation.
Core Entities
Models
- Gemini 2.0 Flash
- openai-gpt-4o-mini
- Llama-3.3-70b-instruct
Metrics
- semantic_similarity (sentence-embedding dot)
- NLI entailment (mDeBERTa-xnli)
- hallucination_detection (HHEM-2.1)
- Spearman correlation (path score vs semantic score)
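Two of the metrics listed above are simple enough to sketch without dependencies: the dot product between two sentence embeddings, and the Spearman correlation between judge path scores and semantic scores. The code below is an illustrative, pure-Python version under stated assumptions: embeddings are taken as plain lists of floats, ties receive average ranks, and the function names are hypothetical. In practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
import math
from typing import Sequence


def dot_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def average_ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as Pearson on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    path_scores = [1, 3, 2, 5, 4]
    semantic_scores = [0.2, 0.5, 0.4, 0.9, 0.7]
    print(spearman(path_scores, semantic_scores))
```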
Datasets
- MultiHal
- HaluEval
- HaluBench
- DefAn
- SimpleQA
- TruthfulQA
- Shroom2024
- FELM
- Wikidata
Benchmarks
- HHEM-2.1
- MMTE
Context Entities
Models
- MiniLM-L12-v2 (sentence embeddings)
Datasets
- HuggingFace MultiHal dataset
- Wikidata KG

