Use LLMs (GPT-4 and local models) to match entities with far less labeled data and better robustness

October 17, 20239 min

Overview

Decision SnapshotNeeds Validation

Solid empirical evaluation across multiple datasets and models; results are concrete but cost and latency tradeoffs need evaluation per project.

Citations8

Evidence Strength0.80

Confidence0.85

Risk Signals11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 40%

Authors

Ralph Peeters, Aaron Steiner, Christian Bizer

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLMs can cut labeling needs and handle unseen entities better than fine-tuned PLMs, but costs, latency, and privacy tradeoffs matter; fine-tuning cheaper LLMs locally is a cost-effective alternative.

Who Should Care

Summary TLDR

This paper tests large language models (LLMs) for entity matching and compares them to fine-tuned pre-trained language models (PLMs). Key findings: GPT-4 gives strong zero-shot matching (often matching or beating fine-tuned PLMs), LLMs generalize much better to unseen entities, in-context examples and textual rules can help some models but must be tuned per model and dataset, and light fine-tuning (or fine-tuning cheaper LLMs) often closes the gap at much lower cost. The authors also use GPT-4 to produce structured explanations and to auto-discover error classes to help debugging.

Problem Statement

Entity matching decides if two records describe the same real-world item. PLM-based matchers need lots of labeled pairs and tend to fail on unseen entities. The paper asks whether generative LLMs can match entities with less task-specific data and better robustness, and how prompts, demonstrations, rules, and fine-tuning affect this.

Main Contribution

Wide evaluation of zero-shot and few-shot prompts across hosted and open-source LLMs for entity matching.

Empirical finding that no single prompt works for all model/dataset combos; prompt must be tuned.

Key Findings

GPT-4 achieves very strong zero-shot matching and often matches or beats fine-tuned PLMs

NumbersGPT-4 average F1 86.80; >=89% F1 on 5 of 6 datasets (zero-shot)

Practical UseTry GPT-4 zero-shot before collecting large labeled sets; it can save labeling cost and reach PLM-like accuracy on many product and publication benchmarks.

Evidence RefTable 3 and Table 4

PLM-based matchers lose large accuracy when applied to unseen entities; LLMs are far more robust

NumbersRoBERTa transfer drops 2261% F1; Ditto drops 3656% F1 vs same-dataset fine-tune

Practical UseIf your workload has many unseen entities, prefer LLM-based matching or plan to re-label/finetune frequently for PLMs.

Evidence RefTable 4 (RoBERTa unseen / Ditto unseen rows)

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
GPT-4 average zero-shot F186.80RoBERTa/Ditto fine-tunedcomparable or higher on many datasetsAverage over six datasets (zero-shot)Table 3 shows GPT-4 mean F1 86.80 across promptsTable 3
GPT-4 zero-shot peak dataset F1>=89% on 5/6 datasetsfine-tuned PLMs on same datasetsGPT-4 outperforms PLMs on 3/6; comparable on othersPer-dataset results in Table 4Table 4 per-dataset F1 (e.g., 89.61 on WDC)Table 4

What To Try In 7 Days

Run GPT-4 zero-shot on a small sample of your matching pairs to estimate baseline accuracy.

If cost is a concern, fine-tune a cheaper hosted model (GPT-mini) or an open Llama model with LoRA on a small labeled set.

Use GPT-4 to generate structured explanations on a sample of errors to discover quick normalization fixes (e.g., venue names, model codes).

Agent Features

Tool Use
langchain
Architectures
Transformer LLM promptingPLM encoders (RoBERTa)

Optimization Features

Token Efficiency
Few-shot and rule prompts increase prompt tokens 1.3x–11xFree-format answers increase completion tokens and runtime
Infra Optimization
Run open-source LLMs locally to avoid API costs and privacy issues
Model Optimization
LoRA4-bit quantization for 70B Llama models
System Optimization
Local GPU hosting (4x NVIDIA RTX6000) used for open models
Training Optimization
LoRA
Inference Optimization
force output format to shorten completions (lower tokens and latency)

Reproducibility

Risks & Boundaries

Limitations

Prompt sensitivity: best prompt varies by model and dataset; must be tuned.

Hosted LLM costs and token-based billing can be large for few-shot/rule prompts.

When Not To Use

If you have a large, stable labeled dataset and tight compute/latency constraints, a PLM fine-tuned matcher may be cheaper.

If deploying to a low-latency, low-cost edge without GPUs and you cannot host models locally.

Failure Modes

Over-reliance on title similarity causing false positives in bibliographic data.

Model number or attribute mismatches causing false negatives for product matching.

Core Entities

Models

GPT-4gpt-4ogpt-4o-mini (GPT-mini)Llama-2-70bLlama-3.1-70bMixtral-8x7BRoBERTa-baseDitto

Metrics

F1precisionrecallPearson correlation (for explanation similarities)

Datasets

WDCProductsAbt-BuyWalmart-AmazonAmazon-GoogleDBLP-ScholarDBLP-ACM

Benchmarks

WDC Products (80% corner cases)DeepMatcher subsets (Abt-Buy, Walmart-Amazon, Amazon-Google, DBLP splits)