Use LLMs (GPT-4 and local models) to match entities with far less labeled data and better robustness

October 17, 20239 min

Overview

Production Readiness

0.7

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

8

Authors

Ralph Peeters, Aaron Steiner, Christian Bizer

Links

Abstract / PDF

Why It Matters For Business

LLMs can cut labeling needs and handle unseen entities better than fine-tuned PLMs, but costs, latency, and privacy tradeoffs matter; fine-tuning cheaper LLMs locally is a cost-effective alternative.

Summary TLDR

This paper tests large language models (LLMs) for entity matching and compares them to fine-tuned pre-trained language models (PLMs). Key findings: GPT-4 gives strong zero-shot matching (often matching or beating fine-tuned PLMs), LLMs generalize much better to unseen entities, in-context examples and textual rules can help some models but must be tuned per model and dataset, and light fine-tuning (or fine-tuning cheaper LLMs) often closes the gap at much lower cost. The authors also use GPT-4 to produce structured explanations and to auto-discover error classes to help debugging.

Problem Statement

Entity matching decides if two records describe the same real-world item. PLM-based matchers need lots of labeled pairs and tend to fail on unseen entities. The paper asks whether generative LLMs can match entities with less task-specific data and better robustness, and how prompts, demonstrations, rules, and fine-tuning affect this.

Main Contribution

Wide evaluation of zero-shot and few-shot prompts across hosted and open-source LLMs for entity matching.

Empirical finding that no single prompt works for all model/dataset combos; prompt must be tuned.

Comparison showing GPT-4 zero-shot often rivals or outperforms fine-tuned PLMs and that LLMs generalize better to unseen entities.

Analysis of in-context example selection, rule prompting, and fine-tuning (LoRA + 4-bit for local LLMs).

Use of GPT-4 to produce structured explanations and automated discovery/classification of error classes to help engineers debug.

Key Findings

GPT-4 achieves very strong zero-shot matching and often matches or beats fine-tuned PLMs

NumbersGPT-4 average F1 86.80; >=89% F1 on 5 of 6 datasets (zero-shot)

PLM-based matchers lose large accuracy when applied to unseen entities; LLMs are far more robust

NumbersRoBERTa transfer drops 22–61% F1; Ditto drops 36–56% F1 vs same-dataset fine-tune

In-context learning helps for many model/dataset combos but not universally

NumbersIn-context improved ~61% of tested model/dataset combinations; GPT-4 rarely improved, GPT-4o often improved

Fine-tuning local or cheaper LLMs often closes the gap to GPT-4 and can exceed it

NumbersFine-tuning gave +1–26% F1 on datasets; GPT-mini fine-tuned matched/exceeded GPT-4 on 4/6 datasets

Cost and latency vary hugely with prompt style and model; few-shot prompts can massively raise cost

NumbersFew-shot / rule prompts increased tokens by 1.3x–11x; cost up to 470x vs zero-shot GPT-mini in experiments

Results

GPT-4 average zero-shot F1

Value86.80

BaselineRoBERTa/Ditto fine-tuned

GPT-4 zero-shot peak dataset F1

Value>=89% on 5/6 datasets

Baselinefine-tuned PLMs on same datasets

PLM transfer drop (unseen entities)

Value-22% to -61% F1 (RoBERTa), -36% to -56% F1 (Ditto)

Baselinesame-model fine-tuned on target

Fine-tuning improvement range

Value+1% to +26% F1

Baselinemodel zero-shot best

Cost / token blow-up for few-shot/rule prompts

Valueprompt tokens increased 1.3x–11x; cost up to 470x (vs zero-shot GPT-mini)

Baselinezero-shot GPT-mini

Explanation similarity correlation

Value0.73–0.85 Pearson

Baselinestring similarity metrics (Cosine, Generalized Jaccard)

Who Should Care

What To Try In 7 Days

Run GPT-4 zero-shot on a small sample of your matching pairs to estimate baseline accuracy.

If cost is a concern, fine-tune a cheaper hosted model (GPT-mini) or an open Llama model with LoRA on a small labeled set.

Use GPT-4 to generate structured explanations on a sample of errors to discover quick normalization fixes (e.g., venue names, model codes).

Agent Features

Tool Use

  • langchain

Architectures

  • Transformer LLM prompting
  • PLM encoders (RoBERTa)

Optimization Features

Token Efficiency

  • Few-shot and rule prompts increase prompt tokens 1.3x–11x
  • Free-format answers increase completion tokens and runtime

Infra Optimization

  • Run open-source LLMs locally to avoid API costs and privacy issues

Model Optimization

  • LoRA
  • 4-bit quantization for 70B Llama models

System Optimization

  • Local GPU hosting (4x NVIDIA RTX6000) used for open models

Training Optimization

  • LoRA

Inference Optimization

  • force output format to shorten completions (lower tokens and latency)

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Prompt sensitivity: best prompt varies by model and dataset; must be tuned.
  • Hosted LLM costs and token-based billing can be large for few-shot/rule prompts.
  • Some open-source models (e.g., Mixtral) lag in zero-shot performance and may need rules or fine-tuning.
  • Fine-tuning can reduce generalization for some models (noted for some Llama variants).

When Not To Use

  • If you have a large, stable labeled dataset and tight compute/latency constraints, a PLM fine-tuned matcher may be cheaper.
  • If deploying to a low-latency, low-cost edge without GPUs and you cannot host models locally.
  • If you cannot afford API costs and cannot host local GPUs and open models that reach required accuracy.

Failure Modes

  • Over-reliance on title similarity causing false positives in bibliographic data.
  • Model number or attribute mismatches causing false negatives for product matching.
  • Prompt wording changes causing large swings in some models' performance.
  • Automated explanations may misweight attributes; human verification recommended.

Core Entities

Models

  • GPT-4
  • gpt-4o
  • gpt-4o-mini (GPT-mini)
  • Llama-2-70b
  • Llama-3.1-70b
  • Mixtral-8x7B
  • RoBERTa-base
  • Ditto

Metrics

  • F1
  • precision
  • recall
  • Pearson correlation (for explanation similarities)

Datasets

  • WDCProducts
  • Abt-Buy
  • Walmart-Amazon
  • Amazon-Google
  • DBLP-Scholar
  • DBLP-ACM

Benchmarks

  • WDC Products (80% corner cases)
  • DeepMatcher subsets (Abt-Buy, Walmart-Amazon, Amazon-Google, DBLP splits)