Use bi-encoder confidence to call an LLM only on hard historical entity links

Overview

Decision Snapshot: Needs Validation

The method is practical: public code and datasets, clear hyperparameters, and consistent improvements on multiple benchmarks. Performance varies by language and genre, requiring careful threshold tuning.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Cristian Santini, Marieke Van Erp, Mehwish Alam

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can get better entity linking for noisy, multilingual historical texts without labeled data by combining a fast retriever with selective LLM calls, cutting inference cost and reducing hallucinations.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

The paper introduces MHEL-LLaMo, an unsupervised pipeline for historical multilingual entity linking that combines a multilingual bi-encoder (BELA) for fast candidate retrieval with instruction-tuned LLMs for NIL detection and final candidate selection. The system uses BELA's inner-product confidence to skip LLM inference on easy cases and run LLMs only on hard cases. On four historical benchmarks in six European languages, variants of MHEL-LLaMo improve F1 over prior specialized systems without fine-tuning. The code and data are publicly available.

Problem Statement

Historical texts are noisy, multilingual, and contain many entities missing from knowledge bases (NIL). Supervised or rule-heavy EL systems need labeled data and don't scale. The paper asks whether an unsupervised ensemble of a bi-encoder and LLMs can give robust, low-cost multilingual historical entity linking without fine-tuning.

Main Contribution

MHEL-LLaMo: an unsupervised ensemble that uses BELA for candidate retrieval and instruction-tuned LLMs for NIL decision and candidate selection.

An adaptive threshold on BELA inner products to classify easy vs hard mentions and call LLMs only on hard cases to cut cost and reduce hallucinations.
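The triage idea in the second contribution can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Candidate` class, the `triage` function, the entity ids, and the threshold value are all hypothetical names and numbers chosen for the example, and the scores stand in for BELA inner products.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    entity_id: str  # hypothetical knowledge-base id
    score: float    # stand-in for the retriever's inner-product score


def triage(candidates: List[Candidate], threshold: float) -> Tuple[str, Optional[Candidate]]:
    """Label a mention "easy" when the top retrieval score clears the threshold."""
    if not candidates:
        # Nothing retrieved: treat as hard so the LLM (or a NIL rule) decides.
        return "hard", None
    top = max(candidates, key=lambda c: c.score)
    label = "easy" if top.score >= threshold else "hard"
    return label, top


cands = [Candidate("E1", 12.4), Candidate("E2", 7.1)]
label, top = triage(cands, threshold=10.0)
print(label, top.entity_id)
```

Only mentions labeled "hard" would be passed on to the LLM; "easy" mentions keep the retriever's top candidate.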

Key Findings

Adaptive ensemble with LLMs improves F1 on standard historical EL benchmarks.

Numbers: 0.723 F1 on HIPE-2020 (English, MHEL-LLaMo van, chain)

Practical Use: Use a bi-encoder + LLM pipeline and call the LLM only when retrieval confidence is low to boost accuracy on historical news and periodicals.

Evidence Ref: Table 2

Large gains on music periodicals vs zero-shot larger LLMs.

Numbers: MHERCL: 0.700 F1 (English) vs GPT-4o mini 0.60 and LLaMA3 0.61

Practical Use: For specialized historical domains, a retrieval-plus-instruct-LLM ensemble can beat larger zero-shot LLMs while staying unsupervised.

Evidence Ref: Table 2, MHERCL rows

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
HIPE-2020 English F1	0.723 (MHEL-LLaMo van, chain)	MELHISSA 0.597	+0.126	HIPE-2020 (en)	Table 2 shows MHEL-LLaMo van (chain) 0.723 vs MELHISSA 0.597	Table 2
MHERCL English F1	0.700 (MHEL-LLaMo van, single/chain)	GPT-4o mini 0.60; LLAMA 3.3 70B 0.61	+0.09 to +0.10 (per-language); paper claims ~27% average vs larger models	MHERCL (en)	Table 2 MHERCL rows	Table 2
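The Delta column is plain subtraction of the baseline F1 from the system F1. The short check below recomputes it from the values in the table; the dictionary layout and the `absolute_delta` helper are illustrative names only. It does not attempt to reproduce the paper's ~27% average claim, which depends on results not listed in this table.

```python
# F1 values copied from the Results table above.
results = {
    "HIPE-2020 (en)": (0.723, {"MELHISSA": 0.597}),
    "MHERCL (en)": (0.700, {"GPT-4o mini": 0.60, "LLAMA 3.3 70B": 0.61}),
}


def absolute_delta(system_f1: float, baseline_f1: float) -> float:
    """Absolute F1 gain over a baseline, rounded to three decimals."""
    return round(system_f1 - baseline_f1, 3)


for dataset, (system_f1, baselines) in results.items():
    for name, baseline_f1 in baselines.items():
        print(f"{dataset}: {absolute_delta(system_f1, baseline_f1):+.3f} vs {name}")
```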

What To Try In 7 Days

Run a bi-encoder (BELA) + FAISS index for candidate retrieval on your corpus.

Compute inner-product confidence and set an adaptive threshold to triage easy mentions.

Use an instruction-tuned LLM as a reranker only for low-confidence mentions, with a two-step prompt: NIL detection first, then candidate selection.
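The three steps above can be prototyped without any model in the loop. The sketch below makes several assumptions: a brute-force inner-product search stands in for BELA plus a FAISS index, `llm` is a hypothetical callable that takes a prompt string and returns text, and the prompt wording is illustrative rather than the paper's. Falling back to the top retrieved candidate when the LLM returns an unknown id is a design choice of this example.

```python
from typing import Callable, Dict, List, Optional, Tuple


def top_k(query: List[float], index: Dict[str, List[float]], k: int = 3) -> List[Tuple[str, float]]:
    """Brute-force inner-product search over a small in-memory index."""
    scored = [(eid, sum(q * v for q, v in zip(query, vec))) for eid, vec in index.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]


def link(mention: str, context: str, query: List[float], index: Dict[str, List[float]],
         threshold: float, llm: Callable[[str], str], k: int = 3) -> Optional[str]:
    """Return an entity id, or None for NIL. The LLM is called only on hard mentions."""
    cands = top_k(query, index, k)
    if not cands:
        return None
    if cands[0][1] >= threshold:
        return cands[0][0]  # easy mention: trust the retriever, skip the LLM

    header = f"Context: {context}\nMention: {mention}\nCandidates: "
    header += ", ".join(eid for eid, _ in cands) + "\n"

    # Step 1: NIL detection.
    nil_answer = llm(header + "Is the correct entity missing from the candidates? Answer YES or NO.")
    if nil_answer.strip().upper().startswith("YES"):
        return None

    # Step 2: candidate selection, restricted to the retrieved ids.
    choice = llm(header + "Which candidate id matches the mention? Reply with one id.").strip()
    valid = {eid for eid, _ in cands}
    return choice if choice in valid else cands[0][0]
```

A stub such as `lambda prompt: "NO"` is enough to exercise the control flow before wiring in a real model.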

Optimization Features

Infra Optimization

FAISS for fast nearest-neighbor retrieval

Two NVIDIA L40S GPUs used in experiments

System Optimization

Adaptive threshold reduces redundant LLM runs and hallucinations

Single-run experiments used ~60 GPU hours total (paper budget)

Inference Optimization

Call LLMs only for low-confidence (hard) mentions to lower GPU use

Use BELA inner-product threshold to bypass LLM inference
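This summary does not spell out how the paper sets its adaptive threshold. One simple heuristic, offered here purely as an assumption and not as the authors' procedure, is to pick the threshold from the distribution of top retrieval scores so that a chosen fraction of mentions is routed to the LLM. The function name `threshold_for_budget` and the parameter `llm_fraction` are hypothetical.

```python
from typing import List


def threshold_for_budget(top_scores: List[float], llm_fraction: float) -> float:
    """Pick a threshold so that roughly `llm_fraction` of mentions score below it.

    Mentions with a top score below the returned value count as hard (LLM call);
    mentions at or above it count as easy (LLM bypassed).
    """
    if not top_scores:
        raise ValueError("top_scores must not be empty")
    if not 0.0 <= llm_fraction <= 1.0:
        raise ValueError("llm_fraction must be between 0 and 1")
    ordered = sorted(top_scores)
    n_hard = int(len(ordered) * llm_fraction)
    if n_hard >= len(ordered):
        return float("inf")  # every mention goes to the LLM
    return ordered[n_hard]


scores = [float(i) for i in range(1, 11)]  # toy top-1 scores for ten mentions
t = threshold_for_budget(scores, llm_fraction=0.3)
print(t, sum(s < t for s in scores))
```

With tied scores, fewer mentions than requested may fall below the threshold, so the realized LLM share is worth checking on your own corpus.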

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/sntcristian/MHEL-LLAMO

Data URLs

https://github.com/hipe-eval/HIPE-2022-data

https://github.com/polifonia-project/KE-MHISTO

Risks & Boundaries

Limitations

Lower performance on Nordic languages, notably Finnish and Swedish (e.g., NewsEye sv NIL recall 0.184).

Weaker results on classical commentaries (AJMC) where NIL prevalence is low and entity types differ.

When Not To Use

When the target language has poor open-source LLM support (e.g., Swedish/Finnish) without fine-tuning.

When the domain is dominated by long-tail, very obscure entities and you can afford full LLM processing everywhere.

Failure Modes

LLM hallucination if called on easy mentions unnecessarily (mitigated by threshold but not eliminated).

False NIL predictions or false positives in NIL-low domains (AJMC) when chain prompting adds noise.

Core Entities

Models

BELA, Mistral-Small-24B-Instruct, Gemma-3-27B-it, Poro-2-8B-Instruct, GPT-4o mini, LLaMa3-70B, mGENRE

Metrics

F1, Precision, Recall, point-biserial correlation

Datasets

HIPE-2020, NewsEye, AJMC, MHERCL

Benchmarks

HIPE-2020, NewsEye, AJMC, MHERCL

Context Entities

Models

mReFinED, mGENRE, MELHISSA, SBB, L3i

Datasets

HIPE-2022 (related), KE-MHISTO (MHERCL source refs)
