Use bi-encoder confidence to call an LLM only on hard historical entity links

January 13, 2026 · 7 min read

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Cristian Santini, Marieke Van Erp, Mehwish Alam

Why It Matters For Business

You can get better entity linking for noisy, multilingual historical texts without labeled data by combining a fast retriever with selective LLM calls, cutting inference cost and reducing hallucinations.

Summary TLDR

The paper introduces MHEL-LLaMo, an unsupervised pipeline for historical multilingual entity linking that combines a multilingual bi-encoder (BELA) for fast candidate retrieval with instruction-tuned LLMs for NIL detection and final candidate selection. The system uses BELA's inner-product confidence to skip LLM inference on easy cases and run LLMs only on hard cases. On four historical benchmarks in six European languages, variants of MHEL-LLaMo improve F1 over prior specialized systems without fine-tuning. The code and data are publicly available.

Problem Statement

Historical texts are noisy, multilingual, and contain many entities missing from knowledge bases (NIL). Supervised or rule-heavy EL systems need labeled data and don't scale. The paper asks whether an unsupervised ensemble of a bi-encoder and LLMs can give robust, low-cost multilingual historical entity linking without fine-tuning.

Main Contribution

MHEL-LLaMo: an unsupervised ensemble that uses BELA for candidate retrieval and instruction-tuned LLMs for NIL decision and candidate selection.

An adaptive threshold on BELA inner products that classifies mentions as easy or hard, so LLMs are called only on hard cases, cutting cost and reducing hallucinations.

Evaluation on four historical EL benchmarks (HIPE-2020, NewsEye, AJMC, MHERCL) across six European languages, with code released on GitHub.

Key Findings

Adaptive ensemble with LLMs improves F1 on standard historical EL benchmarks.

Numbers: 0.723 F1 on HIPE-2020 (English, MHEL-LLaMo van, chain)

Large gains on music periodicals vs zero-shot larger LLMs.

Numbers: MHERCL 0.700 F1 (English) vs. GPT-4o mini 0.60 and LLaMA3 0.61

Bi-encoder confidence correlates with final correctness.

Numbers: point-biserial r_pb up to 0.66 (AJMC), generally positive (p < 0.001)

Prompt chaining helps NIL detection on NIL-heavy sets.

Numbers: NIL recall of 0.801 on HIPE-2020 (English) using chain prompts
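
The confidence finding above relies on the point-biserial correlation, which relates a continuous variable (the retriever score) to a binary one (whether the final link was correct). A minimal sketch of that statistic in plain Python follows. The function name and the toy scores are illustrative assumptions, not values or code from the paper.

```python
import math
from statistics import mean, pstdev
from typing import Sequence


def point_biserial(scores: Sequence[float], correct: Sequence[bool]) -> float:
    """Point-biserial correlation between scores and a binary outcome.

    r_pb = (M1 - M0) / s_n * sqrt(n1 * n0 / n^2), where M1 and M0 are the
    mean scores of correct and incorrect links and s_n is the population
    standard deviation of all scores.
    """
    if len(scores) != len(correct):
        raise ValueError("scores and correct must have the same length")
    pos = [s for s, c in zip(scores, correct) if c]
    neg = [s for s, c in zip(scores, correct) if not c]
    n, n1, n0 = len(scores), len(pos), len(neg)
    if n1 == 0 or n0 == 0:
        raise ValueError("need at least one correct and one incorrect link")
    sd = pstdev(scores)
    if sd == 0:
        raise ValueError("scores have zero variance")
    return (mean(pos) - mean(neg)) / sd * math.sqrt(n1 * n0 / (n * n))


# Toy data (made up for illustration): high scores coincide with correct links.
toy_scores = [0.9, 0.8, 0.2, 0.1]
toy_correct = [True, True, False, False]
r_pb = point_biserial(toy_scores, toy_correct)
```

A clearly positive value on your own data would suggest that the retriever score is a usable triage signal. A value near zero would suggest that a score threshold is unlikely to separate easy from hard mentions.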

Results

HIPE-2020 English F1

Value: 0.723 (MHEL-LLaMo van, chain)

Baseline: MELHISSA 0.597

MHERCL English F1

Value: 0.700 (MHEL-LLaMo van, single/chain)

Baseline: GPT-4o mini 0.60; LLaMA 3.3 70B 0.61

NewsEye French F1

Value: 0.662 (MHEL-LLaMo θ, chain)

Baseline: MELHISSA 0.542

NIL detection recall (HIPE-2020 English)

Value: 0.801 (chain prompts)

NIL detection recall (NewsEye Swedish)

Value: 0.184 (low)
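
To make the NIL detection numbers above concrete, here is a minimal sketch of how NIL precision, recall, and F1 can be computed when a NIL link is represented as `None`. This is an illustrative assumption about the bookkeeping, not the official benchmark scorer, and the entity IDs are placeholders.

```python
from typing import Optional, Sequence, Tuple


def nil_scores(
    gold: Sequence[Optional[str]], pred: Sequence[Optional[str]]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the NIL class.

    A mention counts as NIL when its entity ID is None.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g is None and p is None)
    fp = sum(1 for g, p in pairs if g is not None and p is None)
    fn = sum(1 for g, p in pairs if g is None and p is not None)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Placeholder annotations: three gold NILs, two of which are predicted as NIL.
gold_links = [None, None, "E1", "E2", None]
pred_links = [None, "E3", "E1", None, None]
nil_precision, nil_recall, nil_f1 = nil_scores(gold_links, pred_links)
```

Under this definition, a low recall such as the NewsEye Swedish figure means that most mentions with no knowledge-base entry were still linked to some entity.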

What To Try In 7 Days

Run a bi-encoder (BELA) + FAISS index for candidate retrieval on your corpus.

Compute inner-product confidence and set an adaptive threshold to triage easy mentions.

Use an instruction-tuned LLM as a reranker only for low-confidence mentions with a two-step NIL then selection prompt.
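
The three steps above can be sketched as a small routing function. This is a simplified illustration under stated assumptions, not the authors' implementation: candidate retrieval is a brute-force inner-product search in plain Python (a real setup would use BELA embeddings and a FAISS inner-product index such as `IndexFlatIP`), the threshold is a fixed number rather than the paper's adaptive one, mentions above the threshold are assumed to accept the top retrieved candidate, and `llm_is_nil` and `llm_select` are hypothetical callables standing in for the two-step NIL and selection prompts.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def inner_product(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(
    mention_vec: Vector, entity_vecs: Dict[str, Vector], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k entities with the highest inner product to the mention."""
    scored = [(eid, inner_product(mention_vec, vec)) for eid, vec in entity_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def link_mention(
    mention_vec: Vector,
    entity_vecs: Dict[str, Vector],
    threshold: float,
    llm_is_nil: Callable[[List[str]], bool],   # hypothetical: step 1, NIL decision
    llm_select: Callable[[List[str]], str],    # hypothetical: step 2, candidate choice
    k: int = 3,
) -> Tuple[Optional[str], str]:
    """Return (entity ID or None for NIL, which component decided)."""
    candidates = retrieve(mention_vec, entity_vecs, k)
    if not candidates:
        return None, "retriever"
    top_id, top_score = candidates[0]
    if top_score >= threshold:
        # Easy mention: trust the retriever and skip the LLM entirely.
        return top_id, "retriever"
    # Hard mention: ask the LLM whether it is NIL, then which candidate fits.
    candidate_ids = [eid for eid, _ in candidates]
    if llm_is_nil(candidate_ids):
        return None, "llm"
    return llm_select(candidate_ids), "llm"


# Toy index with placeholder entity IDs and 2-d vectors; stub LLM callables.
entities = {"E1": [1.0, 0.0], "E2": [0.0, 1.0]}
easy = link_mention([0.9, 0.1], entities, 0.8, lambda ids: False, lambda ids: ids[0])
hard = link_mention([0.4, 0.6], entities, 0.8, lambda ids: False, lambda ids: ids[0])
```

Returning the deciding component alongside the link makes it easy to log how many mentions bypass the LLM, which is the quantity that drives cost.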

Optimization Features

Infra Optimization

  • FAISS for fast nearest-neighbor retrieval
  • Two NVIDIA L40S GPUs used in experiments

System Optimization

  • Adaptive threshold reduces redundant LLM runs and hallucinations
  • Single-run experiments used ~60 GPU hours total (paper budget)

Inference Optimization

  • Call LLMs only for low-confidence (hard) mentions to lower GPU use
  • Use BELA inner-product threshold to bypass LLM inference
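
As a rough way to estimate the saving from the bypass described above, the share of mentions that would still reach the LLM can be computed for several candidate thresholds. The sketch below assumes that a mention goes to the LLM when its top retriever score falls below the threshold. The function name and the sample scores are illustrative, not taken from the paper.

```python
from typing import Dict, Sequence


def llm_call_rate(top_scores: Sequence[float], threshold: float) -> float:
    """Fraction of mentions whose top score is below the threshold."""
    if not top_scores:
        raise ValueError("top_scores must not be empty")
    hard = sum(1 for s in top_scores if s < threshold)
    return hard / len(top_scores)


def sweep(top_scores: Sequence[float], thresholds: Sequence[float]) -> Dict[float, float]:
    """Map each candidate threshold to its LLM call rate."""
    return {t: llm_call_rate(top_scores, t) for t in thresholds}


# Made-up top-1 scores for four mentions.
sample_scores = [0.2, 0.4, 0.6, 0.8]
rates = sweep(sample_scores, [0.0, 0.5, 0.9])
```

A lower call rate means less GPU time, but it also means more links are accepted without LLM review, so the rate is best read next to accuracy on a held-out sample.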

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Lower performance on Nordic languages, notably Finnish and Swedish (e.g., NewsEye sv NIL recall 0.184).
  • Weaker results on classical commentaries (AJMC) where NIL prevalence is low and entity types differ.
  • Dependence on BELA embeddings trained on 2023 Wikipedia; KB evolution can invalidate gold annotations.
  • No exploration of parameter-efficient fine-tuning (e.g., LoRA) to reduce LLM cost.

When Not To Use

  • When the target language has poor open-source LLM support (e.g., Swedish/Finnish) without fine-tuning.
  • When the domain is dominated by long-tail, very obscure entities and you can afford full LLM processing everywhere.
  • When strict, certified KB-versioned annotations are required and KB drift is a concern.

Failure Modes

  • LLM hallucination if called on easy mentions unnecessarily (mitigated by threshold but not eliminated).
  • False NIL predictions or false positives in NIL-low domains (AJMC) when chain prompting adds noise.
  • Errors due to OCR noise in mentions leading to wrong candidate retrieval.
  • KB drift: gold annotations become inconsistent with current Wikidata entries.

Core Entities

Models

  • BELA
  • Mistral-Small-24B-Instruct
  • Gemma-3-27B-it
  • Poro-2-8B-Instruct
  • GPT-4o mini
  • LLaMA3-70B
  • mGENRE

Metrics

  • F1
  • Precision
  • Recall
  • point-biserial correlation

Datasets

  • HIPE-2020
  • NewsEye
  • AJMC
  • MHERCL

Benchmarks

  • HIPE-2020
  • NewsEye
  • AJMC
  • MHERCL

Context Entities

Models

  • mReFinED
  • mGENRE
  • MELHISSA
  • SBB
  • L3i

Datasets

  • HIPE-2022 (related)
  • KE-MHISTO (MHERCL source refs)