Use bi-encoder confidence to call an LLM only on hard historical entity links

January 13, 2026 · 7 min

Overview

Decision Snapshot: Needs Validation

The method is practical: public code and datasets, clear hyperparameters, and consistent improvements on multiple benchmarks. Performance varies by language and genre, requiring careful threshold tuning.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Cristian Santini, Marieke Van Erp, Mehwish Alam

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can get better entity linking for noisy, multilingual historical texts without labeled data by combining a fast retriever with selective LLM calls, cutting inference cost and reducing hallucinations.

Who Should Care

Summary TLDR

The paper introduces MHEL-LLaMo, an unsupervised pipeline for historical multilingual entity linking that combines a multilingual bi-encoder (BELA) for fast candidate retrieval with instruction-tuned LLMs for NIL detection and final candidate selection. The system uses BELA's inner-product confidence to skip LLM inference on easy cases and run LLMs only on hard cases. On four historical benchmarks in six European languages, variants of MHEL-LLaMo improve F1 over prior specialized systems without fine-tuning. The code and data are publicly available.

Problem Statement

Historical texts are noisy, multilingual, and contain many entities missing from knowledge bases (NIL). Supervised or rule-heavy EL systems need labeled data and don't scale. The paper asks whether an unsupervised ensemble of a bi-encoder and LLMs can give robust, low-cost multilingual historical entity linking without fine-tuning.

Main Contribution

MHEL-LLaMo: an unsupervised ensemble that uses BELA for candidate retrieval and instruction-tuned LLMs for NIL decision and candidate selection.

An adaptive threshold on BELA inner products to classify easy vs hard mentions and call LLMs only on hard cases to cut cost and reduce hallucinations.
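The confidence gate described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the source does not spell out how the threshold is adapted, so the quantile rule in `pick_threshold`, the `Candidate` type, and the `llm_select` callable are all hypothetical names introduced here.

```python
from typing import Callable, NamedTuple, Optional, Sequence, Tuple


class Candidate(NamedTuple):
    entity_id: str
    score: float  # inner product between mention and entity embeddings


def pick_threshold(top_scores: Sequence[float], easy_fraction: float) -> float:
    """Assumed rule: the best-scoring `easy_fraction` of mentions count as easy."""
    if not top_scores:
        raise ValueError("need at least one score to calibrate a threshold")
    if not 0.0 <= easy_fraction <= 1.0:
        raise ValueError("easy_fraction must be in [0, 1]")
    ranked = sorted(top_scores, reverse=True)
    n_easy = int(len(ranked) * easy_fraction)
    # With no easy mentions, an infinite threshold sends everything to the LLM.
    return ranked[n_easy - 1] if n_easy > 0 else float("inf")


def link_mention(
    candidates: Sequence[Candidate],
    threshold: float,
    llm_select: Callable[[Sequence[Candidate]], Optional[str]],
) -> Tuple[Optional[str], bool]:
    """Return (entity id, or None for NIL; whether the LLM was called)."""
    if not candidates:
        # Assumption: an empty retrieval result is treated as NIL, no LLM call.
        return None, False
    best = max(candidates, key=lambda c: c.score)
    if best.score >= threshold:
        return best.entity_id, False  # easy case: trust the retriever
    return llm_select(candidates), True  # hard case: defer to the LLM
```

Counting how often the second element of the result is `True` gives the share of mentions that actually reach the LLM, which is the quantity the threshold trades off against accuracy.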

Key Findings

Adaptive ensemble with LLMs improves F1 on standard historical EL benchmarks.

Numbers: 0.723 F1 on HIPE-2020 (English, MHEL-LLaMo van chain)

Practical Use: Use a bi-encoder + LLM pipeline and call the LLM only when retrieval confidence is low to boost accuracy on historical news and periodicals.

Evidence Ref: Table 2

Large gains on music periodicals vs zero-shot larger LLMs.

Numbers: MHERCL 0.700 F1 (English) vs GPT-4o mini 0.60 and LLaMA3 0.61

Practical Use: For specialized historical domains, a retrieval-plus-instruct-LLM ensemble can beat larger zero-shot LLMs while staying unsupervised.

Evidence Ref: Table 2, MHERCL rows

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
HIPE-2020 English F1 | 0.723 (MHEL-LLaMo van, chain) | MELHISSA 0.597 | +0.126 | HIPE-2020 (en) | Table 2 shows MHEL-LLaMo van (chain) 0.723 vs MELHISSA 0.597 | Table 2
MHERCL English F1 | 0.700 (MHEL-LLaMo van, single/chain) | GPT-4o mini 0.60; LLaMA 3.3 70B 0.61 | +0.09 to +0.10 (per-language); paper claims ~27% average vs larger models | MHERCL (en) | Table 2 MHERCL rows | Table 2
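The deltas reported above are absolute F1 differences, and they can be checked directly from the listed values. The helper name `f1_delta` is introduced here only for illustration. The "~27% average" figure is the paper's own claim and cannot be recomputed from these two rows.

```python
def f1_delta(system: float, baseline: float) -> float:
    """Absolute F1 difference, rounded to three decimals."""
    return round(system - baseline, 3)


print(f1_delta(0.723, 0.597))  # HIPE-2020 (en) vs MELHISSA -> 0.126
print(f1_delta(0.700, 0.60))   # MHERCL (en) vs GPT-4o mini -> 0.1
print(f1_delta(0.700, 0.61))   # MHERCL (en) vs the 70B LLaMA baseline -> 0.09
```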

What To Try In 7 Days

Run a bi-encoder (BELA) + FAISS index for candidate retrieval on your corpus.

Compute inner-product confidence and set an adaptive threshold to triage easy mentions.

Use an instruction-tuned LLM as a reranker only for low-confidence mentions with a two-step NIL then selection prompt.
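The two-step "NIL, then selection" prompting in the last step could look roughly like the sketch below. The prompt wording, the function names, and the `llm` callable (any function that maps a prompt string to a reply string) are assumptions made for illustration; the paper's actual prompts are not reproduced here.

```python
import re
from typing import Callable, Optional, Sequence, Tuple

# Each candidate is (entity_id, short description); both fields are assumed inputs.
CandidateInfo = Tuple[str, str]


def _numbered(candidates: Sequence[CandidateInfo]) -> str:
    return "\n".join(f"{i}. {desc}" for i, (_, desc) in enumerate(candidates, 1))


def build_nil_prompt(mention: str, context: str, candidates: Sequence[CandidateInfo]) -> str:
    return (
        f"Text: {context}\nMention: {mention}\nCandidates:\n{_numbered(candidates)}\n"
        "Does any candidate refer to the mention? Answer yes or no."
    )


def build_selection_prompt(mention: str, context: str, candidates: Sequence[CandidateInfo]) -> str:
    return (
        f"Text: {context}\nMention: {mention}\nCandidates:\n{_numbered(candidates)}\n"
        "Reply with the number of the best candidate."
    )


def rerank(
    mention: str,
    context: str,
    candidates: Sequence[CandidateInfo],
    llm: Callable[[str], str],
) -> Optional[str]:
    """Return the chosen entity id, or None when the mention is judged NIL."""
    if not candidates:
        return None
    # Step 1: NIL decision.
    if llm(build_nil_prompt(mention, context, candidates)).strip().lower().startswith("no"):
        return None
    # Step 2: candidate selection; an unparseable or out-of-range reply falls back to NIL.
    match = re.search(r"\d+", llm(build_selection_prompt(mention, context, candidates)))
    if match is None:
        return None
    choice = int(match.group())
    return candidates[choice - 1][0] if 1 <= choice <= len(candidates) else None
```

Falling back to NIL on a malformed reply is one conservative design choice among several; retrying the prompt or falling back to the retriever's top candidate would be equally reasonable alternatives.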

Optimization Features

Infra Optimization

FAISS for fast nearest-neighbor retrieval

Two NVIDIA L40S GPUs used in experiments

System Optimization

Adaptive threshold reduces redundant LLM runs and hallucinations

Single-run experiments used ~60 GPU hours total (paper budget)

Inference Optimization

Call LLMs only for low-confidence (hard) mentions to lower GPU use

Use BELA inner-product threshold to bypass LLM inference
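To make the retrieval step concrete, the sketch below is a pure-Python stand-in for inner-product nearest-neighbor search. It computes an exact top-k ranking by brute force, which is only practical for tiny candidate sets; in a real pipeline a FAISS index would take its place. The function names and the dictionary of entity vectors are hypothetical.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def top_k_candidates(
    mention_vec: Sequence[float],
    entity_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k entities with the highest inner product, best first."""
    scored = ((eid, inner_product(mention_vec, vec)) for eid, vec in entity_vecs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])
```

The score of the first returned pair is the retrieval confidence that the threshold is compared against.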

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Lower performance on Nordic languages, notably Finnish and Swedish (e.g., NewsEye sv NIL recall 0.184).

Weaker results on classical commentaries (AJMC) where NIL prevalence is low and entity types differ.

When Not To Use

When the target language has poor open-source LLM support (e.g., Swedish/Finnish) without fine-tuning.

When the domain is dominated by long-tail, very obscure entities and you can afford full LLM processing everywhere.

Failure Modes

LLM hallucination if called on easy mentions unnecessarily (mitigated by threshold but not eliminated).

False NIL predictions or false positives in NIL-low domains (AJMC) when chain prompting adds noise.

Core Entities

Models

BELA, Mistral-Small-24B-Instruct, Gemma-3-27B-it, Poro-2-8B-Instruct, GPT-4o mini, LLaMa3-70B, mGENRE

Metrics

F1, Precision, Recall, point-biserial correlation

Datasets

HIPE-2020, NewsEye, AJMC, MHERCL

Benchmarks

HIPE-2020, NewsEye, AJMC, MHERCL

Context Entities

Models

mReFinED, mGENRE, MELHISSA, SBB, L3i

Datasets

HIPE-2022 (related), KE-MHISTO (MHERCL source refs)