Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
You can get better entity linking for noisy, multilingual historical texts without labeled data by combining a fast retriever with selective LLM calls, cutting inference cost and reducing hallucinations.
Summary TLDR
The paper introduces MHEL-LLaMo, an unsupervised pipeline for historical multilingual entity linking that combines a multilingual bi-encoder (BELA) for fast candidate retrieval with instruction-tuned LLMs for NIL detection and final candidate selection. The system uses BELA's inner-product confidence to skip LLM inference on easy cases and run LLMs only on hard cases. On four historical benchmarks in six European languages, variants of MHEL-LLaMo improve F1 over prior specialized systems without fine-tuning. The code and data are publicly available.
Problem Statement
Historical texts are noisy, multilingual, and contain many entities that are missing from the knowledge base (NIL mentions). Supervised or rule-heavy entity linking (EL) systems need labeled data and do not scale across languages and collections. The paper asks whether an unsupervised ensemble of a bi-encoder and LLMs can deliver robust, low-cost multilingual historical entity linking without fine-tuning.
Main Contribution
MHEL-LLaMo: an unsupervised ensemble that uses BELA for candidate retrieval and instruction-tuned LLMs for NIL decision and candidate selection.
An adaptive threshold on BELA inner-product scores that separates easy from hard mentions, so LLMs are called only on hard cases, which cuts cost and reduces hallucinations.
Evaluation on four historical EL benchmarks (HIPE-2020, NewsEye, AJMC, MHERCL) across six European languages, with code released on GitHub.
Key Findings
Adaptive ensemble with LLMs improves F1 on standard historical EL benchmarks.
Large gains on music periodicals (MHERCL) over larger LLMs used zero-shot.
Bi-encoder confidence correlates with final correctness.
Prompt chaining helps NIL detection on NIL-heavy sets.
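The correlation finding above can be checked on your own data with a point-biserial coefficient, the metric the paper lists. The following is a minimal sketch in plain Python: the function name and the toy scores are illustrative assumptions, not values or code from the paper.

```python
import math

def point_biserial(scores, correct):
    """Point-biserial correlation between a continuous score and a 0/1 outcome."""
    if len(scores) != len(correct):
        raise ValueError("scores and correct must have the same length")
    pos = [s for s, c in zip(scores, correct) if c]
    neg = [s for s, c in zip(scores, correct) if not c]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect prediction")
    n = len(scores)
    mean_all = sum(scores) / n
    # Population standard deviation of all scores.
    std = math.sqrt(sum((s - mean_all) ** 2 for s in scores) / n)
    if std == 0:
        raise ValueError("scores have zero variance")
    mean_pos = sum(pos) / len(pos)
    mean_neg = sum(neg) / len(neg)
    return (mean_pos - mean_neg) / std * math.sqrt(len(pos) * len(neg) / n ** 2)

# Made-up retriever confidences and whether the final link was correct (1) or not (0).
toy_scores = [0.91, 0.85, 0.40, 0.77, 0.35, 0.60]
toy_correct = [1, 1, 0, 1, 0, 1]
r_toy = point_biserial(toy_scores, toy_correct)
print(f"point-biserial r = {r_toy:.3f}")
```

A clearly positive value on a held-out sample suggests the retriever score is a usable triage signal for that corpus; a value near zero suggests it is not.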
Results
HIPE-2020 English F1
MHERCL English F1
NewsEye French F1
NIL detection recall (HIPE-2020 English)
NIL detection recall (NewsEye Swedish)
Who Should Care
What To Try In 7 Days
Run a bi-encoder (BELA) + FAISS index for candidate retrieval on your corpus.
Compute inner-product confidence and set an adaptive threshold to triage easy mentions.
Use an instruction-tuned LLM as a reranker only for low-confidence mentions, with a two-step prompt chain: NIL detection first, then candidate selection.
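The three steps above can be sketched as a small routing function. This is a hedged illustration, not the authors' implementation: a brute-force inner-product search stands in for BELA plus FAISS, the two LLM steps are passed in as plain callables (`is_nil` and `select`, both hypothetical names), and returning the top retrieved candidate for easy mentions is an assumption.

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]

def inner_product(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def retrieve(mention_vec: Vector, entity_vecs: Dict[str, Vector], k: int = 3) -> List[Tuple[str, float]]:
    """Brute-force top-k by inner product (stand-in for a FAISS index)."""
    scored = [(eid, inner_product(mention_vec, vec)) for eid, vec in entity_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def link(
    mention: str,
    mention_vec: Vector,
    entity_vecs: Dict[str, Vector],
    threshold: float,
    is_nil: Callable[[str, List[str]], bool],   # hypothetical LLM step 1: NIL decision
    select: Callable[[str, List[str]], str],    # hypothetical LLM step 2: candidate selection
    k: int = 3,
) -> Optional[str]:
    """Return an entity id, or None for NIL."""
    candidates = retrieve(mention_vec, entity_vecs, k)
    if not candidates:
        return None
    top_id, top_score = candidates[0]
    if top_score >= threshold:
        return top_id  # easy mention: trust the retriever, no LLM call (assumption)
    ids = [eid for eid, _ in candidates]
    if is_nil(mention, ids):
        return None
    return select(mention, ids)

# Toy data: made-up 2-d embeddings and stub "LLM" functions that record their calls.
entity_vecs = {"Q1": [1.0, 0.0], "Q2": [0.0, 1.0], "Q3": [0.7, 0.7]}
calls: List[str] = []

def fake_is_nil(mention: str, ids: List[str]) -> bool:
    calls.append("nil")
    return False

def fake_select(mention: str, ids: List[str]) -> str:
    calls.append("select")
    return ids[1]

easy = link("Paris", [2.0, 0.1], entity_vecs, 1.5, fake_is_nil, fake_select)
calls_after_easy = list(calls)
hard = link("Pari5", [0.5, 0.4], entity_vecs, 1.5, fake_is_nil, fake_select)
```

In practice the two callables would wrap prompts to an instruction-tuned model, and the brute-force search would be replaced by an index over the full knowledge base.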
Optimization Features
Infra Optimization
- FAISS for fast nearest-neighbor retrieval
- Two NVIDIA L40S GPUs used in experiments
System Optimization
- Adaptive threshold reduces redundant LLM runs and hallucinations
- Single-run experiments used ~60 GPU hours total (paper budget)
Inference Optimization
- Call LLMs only for low-confidence (hard) mentions to lower GPU use
- Use BELA inner-product threshold to bypass LLM inference
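This summary does not say how the adaptive threshold is computed. One simple way to reason about the cost trade-off, shown here purely as an assumption, is to pick the threshold from the observed score distribution so that a chosen fraction of mentions is routed to the LLM. The function name and the scores are hypothetical.

```python
def threshold_for_llm_budget(scores, llm_fraction):
    """Pick a threshold so roughly `llm_fraction` of mentions fall below it.

    Mentions with score >= threshold bypass the LLM; the rest are sent to it.
    This is an illustrative heuristic, not the paper's procedure.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if not 0.0 <= llm_fraction <= 1.0:
        raise ValueError("llm_fraction must be in [0, 1]")
    ordered = sorted(scores)
    n_hard = int(llm_fraction * len(ordered))  # rounded down
    if n_hard >= len(ordered):
        return float("inf")  # every mention goes to the LLM
    return ordered[n_hard]

# Made-up retriever confidences for ten mentions.
budget_scores = [0.2, 0.9, 0.5, 0.7, 0.3, 0.8, 0.6, 0.95, 0.4, 0.85]
budget_threshold = threshold_for_llm_budget(budget_scores, 0.3)
llm_calls = sum(1 for s in budget_scores if s < budget_threshold)
print(f"threshold={budget_threshold}, LLM calls={llm_calls} of {len(budget_scores)}")
```

With tied scores the realized fraction can be smaller than requested, since all mentions at the threshold value bypass the LLM.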
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Lower performance on Nordic languages, notably Finnish and Swedish (e.g., NewsEye sv NIL recall 0.184).
- Weaker results on classical commentaries (AJMC) where NIL prevalence is low and entity types differ.
- Dependence on BELA embeddings trained on 2023 Wikipedia; KB evolution can invalidate gold annotations.
- No exploration of parameter-efficient fine-tuning (e.g., LoRA) to reduce LLM cost.
When Not To Use
- When the target language has poor open-source LLM support (e.g., Swedish/Finnish) without fine-tuning.
- When the domain is dominated by long-tail, very obscure entities and you can afford full LLM processing everywhere.
- When strict, certified KB-versioned annotations are required and KB drift is a concern.
Failure Modes
- LLM hallucination if called on easy mentions unnecessarily (mitigated by threshold but not eliminated).
- False NIL predictions or false positives in NIL-low domains (AJMC) when chain prompting adds noise.
- Errors due to OCR noise in mentions leading to wrong candidate retrieval.
- KB drift: gold annotations become inconsistent with current Wikidata entries.
Core Entities
Models
- BELA
- Mistral-Small-24B-Instruct
- Gemma-3-27B-it
- Poro-2-8B-Instruct
- GPT-4o mini
- LLaMa3-70B
- mGENRE
Metrics
- F1
- Precision
- Recall
- point-biserial correlation
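For reference, precision, recall, and F1 for entity linking with NIL can be computed as in the sketch below. It assumes one common convention (micro-averaged over non-NIL links, with `None` standing for NIL); the official benchmark scorers may count NIL and partial matches differently, and the example ids are made up.

```python
def prf1(gold, pred):
    """Micro precision/recall/F1 over non-NIL links; None means NIL."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g is not None and g == p)
    n_pred = sum(1 for p in pred if p is not None)
    n_gold = sum(1 for g in gold if g is not None)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Four mentions: one correct link, one wrong link, one correct NIL, one missed entity.
gold_ids = ["Q1", "Q2", None, "Q4"]
pred_ids = ["Q1", "Q3", None, None]
precision, recall, f1 = prf1(gold_ids, pred_ids)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```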
Datasets
- HIPE-2020
- NewsEye
- AJMC
- MHERCL
Benchmarks
- HIPE-2020
- NewsEye
- AJMC
- MHERCL
Context Entities
Models
- mReFinED
- mGENRE
- MELHISSA
- SBB
- L3i
Datasets
- HIPE-2022 (related)
- KE-MHISTO (MHERCL source refs)

