Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.7
Citation Count
2
Why It Matters For Business
LLMs let you do pairwise entity resolution (ER) without labeled training data and with less engineering effort. Short prompts can cut API costs substantially, but you must combine the LLM with blocking to keep the number of compared pairs, and therefore the cost, under control.
Summary TLDR
This paper tests six prompt designs for using GPT‑3.5 as an unsupervised similarity function for product entity resolution (ER). On two e‑commerce benchmarks (WDC, Amazon‑Google), GPT‑3.5 achieves competitive F1 scores, with many prompts reaching 0.8 or higher. Simpler prompts (single-attribute, title only) often match or beat costlier ones while using ~30–40% fewer tokens. JSON-structured prompts reduced accuracy. Similarity-score prompts can be powerful but are unstable across datasets. The study assumes perfect blocking and highlights scale and cost limits as well as failure modes (model-number errors, hallucinations).
Problem Statement
Can an off‑the‑shelf LLM (GPT‑3.5) serve as a high‑quality, low‑cost unsupervised similarity function for entity resolution, and how do different prompt designs trade off accuracy and token cost? The study focuses on the matching (similarity) step only and evaluates six prompt patterns on two e‑commerce benchmarks.
Main Contribution
Systematic comparison of six prompt patterns for GPT‑3.5 used as an unsupervised ER similarity function.
Quantified cost vs. performance tradeoffs on two public product ER benchmarks (WDC, Amazon‑Google).
Qualitative error analysis highlighting common failure modes (model numbers, technical specs, hallucinations).
Analysis of inter-method disagreement and statistical significance of prompt effects.
Public release of raw data and analysis artifacts for replication.
Key Findings
GPT‑3.5 is viable as an unsupervised ER similarity function on product data.
Simple prompts often give similar or better accuracy at lower cost.
Structured JSON prompts reduced accuracy in these tests.
Similarity-score prompts are powerful but inconsistent across datasets.
Scale and blocking dominate operational cost.
Results
WDC F1 (multi-attr)
WDC F1 (single-attr)
Amazon-Google F1 (multi-attr)
Amazon-Google F1 (single-attr)
Format effect (multi-json vs multi-attr)
Who Should Care
What To Try In 7 Days
Run a pilot: apply single-attribute prompts (title) on a blocked candidate set and measure precision/recall.
Compare single-attr vs multi-attr on your data and log token cost per pair.
If similarity scores help, tune a decision threshold on a small labeled sample before scaling up.
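The threshold-tuning step can be sketched as follows. This is a minimal illustration, not the paper's procedure: it assumes you already have one similarity score per pair from the LLM plus a small hand-labeled sample, and the names `f1_at_threshold`, `best_threshold`, and the sample values are hypothetical.

```python
from typing import List, Tuple


def f1_at_threshold(scores: List[float], labels: List[int], threshold: float) -> float:
    """F1 when every pair with score >= threshold is predicted as a match."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0 or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores: List[float], labels: List[int]) -> Tuple[float, float]:
    """Try each observed score as a cut-off and keep the one with the highest F1.

    Ties are broken in favor of the higher (more conservative) threshold.
    """
    candidates = sorted(set(scores))
    best = max(candidates, key=lambda t: (f1_at_threshold(scores, labels, t), t))
    return best, f1_at_threshold(scores, labels, best)


# Hypothetical LLM similarity scores and hand-assigned labels (1 = same product).
sample_scores = [0.95, 0.90, 0.70, 0.65, 0.40, 0.20]
sample_labels = [1, 1, 1, 0, 0, 0]
threshold, f1 = best_threshold(sample_scores, sample_labels)
```

Because the paper reports that similarity-score prompts behave differently across datasets, a threshold found this way should be treated as specific to the dataset it was tuned on.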
Optimization Features
Token Efficiency
- Single-attr prompt reduced per-pair token cost by ~37% vs multi-attr (WDC example)
- Cost scales roughly with input token count; persona and few-shot prompts add tokens to every pair
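To see why a shorter prompt is cheaper, the sketch below builds a single-attribute and a multi-attribute prompt for the same pair and compares their approximate sizes. The prompt wording, the attribute names, and the four-characters-per-token rule of thumb are all assumptions made for illustration; they are not the paper's prompts, and a real measurement should use the provider's own tokenizer or the usage figures returned by the API.

```python
from typing import Dict, Sequence


def single_attr_prompt(a: Dict[str, str], b: Dict[str, str]) -> str:
    """Hypothetical title-only prompt."""
    return (
        "Do these two product titles refer to the same product? Answer yes or no.\n"
        f"Title A: {a['title']}\n"
        f"Title B: {b['title']}"
    )


def multi_attr_prompt(
    a: Dict[str, str],
    b: Dict[str, str],
    attrs: Sequence[str] = ("title", "brand", "description"),
) -> str:
    """Hypothetical prompt that lists several attributes per record."""
    lines = ["Do these two product records refer to the same product? Answer yes or no."]
    for name, record in (("A", a), ("B", b)):
        for attr in attrs:
            lines.append(f"{attr} {name}: {record.get(attr, '')}")
    return "\n".join(lines)


def rough_token_count(text: str) -> int:
    """Crude proxy: about 4 characters per token. Not a real tokenizer."""
    return max(1, len(text) // 4)


def relative_saving(short_prompt: str, long_prompt: str) -> float:
    """Fraction of (approximate) input tokens saved by the shorter prompt."""
    return 1 - rough_token_count(short_prompt) / rough_token_count(long_prompt)


# Made-up records for illustration only.
rec_a = {"title": "Lenovo ThinkPad T14 Gen 3 14in laptop", "brand": "Lenovo",
         "description": "Business laptop with 16GB RAM and 512GB SSD"}
rec_b = {"title": "ThinkPad T14 G3 notebook 14 inch", "brand": "Lenovo",
         "description": "14 inch notebook, 16 GB memory, 512 GB solid state drive"}

saving = relative_saving(single_attr_prompt(rec_a, rec_b), multi_attr_prompt(rec_a, rec_b))
```

Logging this kind of per-pair figure alongside accuracy makes the cost side of the single-attr versus multi-attr comparison explicit on your own data.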
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Study focuses only on the similarity step and assumes perfect blocking.
- Only GPT‑3.5 was evaluated; other LLMs may behave differently.
- JSON formatting and extra attributes sometimes harmed accuracy.
- Model confuses fine-grained identifiers (model numbers) and can hallucinate explanations.
- Cost estimates do not account for the explosion of candidate pairs (the full cross-product of both record sets) that occurs without blocking.
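The gap between blocked and unblocked pair counts can be illustrated with a small sketch. The paper assumes perfect blocking and does not prescribe a blocking method, so the shared-title-token blocking used here is a deliberately simple stand-in, and the function names and records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


def token_blocks(records: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lowercase title token to the ids of the records containing it."""
    blocks: Dict[str, Set[str]] = defaultdict(set)
    for record_id, title in records.items():
        for token in title.lower().split():
            blocks[token].add(record_id)
    return blocks


def candidate_pairs(left: Dict[str, str], right: Dict[str, str]) -> Set[Tuple[str, str]]:
    """Keep only cross-source pairs that share at least one title token."""
    left_blocks = token_blocks(left)
    right_blocks = token_blocks(right)
    pairs: Set[Tuple[str, str]] = set()
    for token, left_ids in left_blocks.items():
        for left_id in left_ids:
            for right_id in right_blocks.get(token, ()):
                pairs.add((left_id, right_id))
    return pairs


# Made-up records from two hypothetical sources.
source_a = {"a1": "Canon EOS R6 camera body", "a2": "Logitech MX Master 3 mouse"}
source_b = {"g1": "canon eos r6 mirrorless", "g2": "logitech mx master 3",
            "g3": "Sony WH-1000XM4 headphones"}

full_cross_product = len(source_a) * len(source_b)  # every pair would go to the LLM
blocked = candidate_pairs(source_a, source_b)       # only these pairs would be sent
```

The cross-product grows with the product of the two source sizes, so per-pair token costs that look small on a benchmark can become prohibitive on full catalogs unless a blocking step like this (or a stronger one) prunes the candidates first.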
When Not To Use
- When you cannot produce effective blocking and pairs explode in number.
- When exact model-number or technical-spec matching is required.
- If strict, auditable explanations are required and LLM hallucination is unacceptable.
Failure Modes
- Hallucinated or incorrect reasoning about identifiers and specs.
- Prompt-format sensitivity causing inconsistent outputs across prompt patterns.
- Similarity scores that require per-dataset threshold tuning.
- On hard pairs, all prompt patterns can agree and still all be wrong.
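One way to look for the last failure mode is to compare decisions across prompt patterns on a labeled sample and isolate the pairs where every pattern agrees yet the label disagrees. The sketch below assumes one boolean match decision per pattern per pair; the pattern names reuse labels from this summary, while the decisions, labels, and function names are invented for illustration.

```python
from typing import Dict, List


def unanimous_pairs(decisions: Dict[str, List[bool]]) -> List[int]:
    """Indices of pairs on which every prompt pattern gave the same decision."""
    per_method = list(decisions.values())
    n_pairs = len(per_method[0])
    return [i for i in range(n_pairs) if len({m[i] for m in per_method}) == 1]


def unanimous_errors(decisions: Dict[str, List[bool]], truth: List[bool]) -> List[int]:
    """Indices where all patterns agree with each other but not with the label."""
    first_method = next(iter(decisions.values()))
    return [i for i in unanimous_pairs(decisions) if first_method[i] != truth[i]]


# Hypothetical decisions (True = "same product") for four candidate pairs.
decisions = {
    "single_attr":      [True, False, True,  True],
    "multi_attr":       [True, False, False, True],
    "similarity_score": [True, False, True,  True],
}
truth = [True, False, True, False]

agreeing = unanimous_pairs(decisions)
hard_errors = unanimous_errors(decisions, truth)
```

Pairs that land in `hard_errors` are the ones that voting across prompts cannot rescue, so they are good candidates for manual review or rule-based checks on identifiers.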
Core Entities
Models
- GPT-3.5
Metrics
- precision
- recall
- F1 (F‑Measure)
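For reference, these metrics have their standard definitions in the pairwise matching setting, where TP counts pairs correctly predicted as matches, FP counts non-matching pairs predicted as matches, and FN counts true matches that were missed:

```latex
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
```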
Datasets
- WDC (Web Data Commons, computer subset)
- Amazon-Google Products

