Cheap prompts often match expensive ones: GPT‑3.5 can do unsupervised product entity resolution cost‑efficiently

Overview

Decision Snapshot: Needs Validation

The approach is promising for the matching step and can save engineering effort, but needs reliable blocking and dataset-level validation because results and costs vary by prompt and dataset.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 40%

Authors

Navapat Nananukul, Khanin Sisaengsuwanchai, Mayank Kejriwal

Why It Matters For Business

LLMs let you do pairwise entity resolution (ER) without labeled training data and with lower engineering effort. Short prompts can cut API costs substantially, but you must combine the LLM with blocking to keep the number of compared pairs under control.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

This paper tests six prompt designs for using GPT‑3.5 as an unsupervised similarity function for product entity resolution (ER). On two e‑commerce benchmarks (WDC, Amazon‑Google), GPT‑3.5 achieves competitive F1 scores, with many prompts at 0.80 or higher. Simpler prompts (single-attribute, using the title) often match or beat costlier ones while costing roughly 30–40% less in tokens. JSON-structured prompts reduced accuracy. Similarity-score prompts can be powerful but are unstable across datasets. The study assumes perfect blocking and highlights scale and cost limits as well as failure modes (model-number errors, hallucinations).

Problem Statement

Can an off‑the‑shelf LLM (GPT‑3.5) serve as a high‑quality, low‑cost unsupervised similarity function for entity resolution, and how do different prompt designs trade off accuracy and token cost? The study focuses on the matching (similarity) step only and evaluates six prompt patterns on two e‑commerce benchmarks.

Main Contribution

Systematic comparison of six prompt patterns for GPT‑3.5 used as an unsupervised ER similarity function.

Quantified cost vs. performance tradeoffs on two public product ER benchmarks (WDC, Amazon‑Google).

Key Findings

GPT‑3.5 is viable as an unsupervised ER similarity function on product data.

Numbers: Many prompt patterns achieved F1 ≥ 0.80; examples: WDC single-attr F1=0.93, AG (Amazon‑Google) multi-sim F1=0.95

Practical Use: You can use GPT‑3.5 directly for pairwise product matching without training a classifier, especially for medium‑scale problems and when blocking reduces pairs.

Evidence Ref: Table 2; Section 5.1
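As a rough illustration of what "using GPT‑3.5 directly for pairwise matching" could look like, the sketch below builds a title-only yes/no prompt and parses the reply. The prompt wording, the function names, and the `fake_llm` stand-in are all hypothetical and are not taken from the paper; in practice `ask_llm` would wrap whatever chat-completion client you use.

```python
def build_single_attr_prompt(title_a: str, title_b: str) -> str:
    """Build a yes/no matching prompt from titles only (hypothetical wording)."""
    return (
        "Do the following two product titles refer to the same product? "
        "Answer with 'yes' or 'no' only.\n"
        f"Product A: {title_a}\n"
        f"Product B: {title_b}"
    )


def parse_match_answer(answer: str) -> bool:
    """Treat any reply that does not start with 'yes' as a non-match."""
    return answer.strip().lower().startswith("yes")


def match_pairs(pairs, ask_llm):
    """Apply the prompt to each candidate pair; ask_llm is any callable str -> str."""
    return [
        parse_match_answer(ask_llm(build_single_attr_prompt(a, b)))
        for a, b in pairs
    ]


def fake_llm(prompt: str) -> str:
    """Offline stand-in for a real model call: says yes only on identical titles."""
    lines = prompt.splitlines()
    a = lines[-2].split(": ", 1)[1].lower()
    b = lines[-1].split(": ", 1)[1].lower()
    return "Yes" if a == b else "No"


candidate_pairs = [("Dell XPS 13", "dell xps 13"), ("Dell XPS 13", "HP Envy 13")]
decisions = match_pairs(candidate_pairs, fake_llm)
```

Keeping the model call behind a plain callable makes it easy to swap prompt patterns or models and to test the parsing logic without spending tokens.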

Simple prompts often give similar or better accuracy at lower cost.

Numbers: Single-attr vs multi-attr: WDC F1 0.93 vs 0.91; cost $0.59 vs $0.93 (≈37% lower)

Practical Use: Start with a single‑attribute (title) prompt when a high‑signal attribute exists; it cuts token cost substantially with little accuracy loss on these datasets.

Evidence Ref: Table 2; Section 5.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
WDC F1 (multi-attr)	0.91	—	—	WDC	Table 2: multi-attr F1=0.91, cost $0.93	Table 2
WDC F1 (single-attr)	0.93	multi-attr	+0.02	WDC	Table 2: single-attr F1=0.93, cost $0.59	Table 2
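The delta and the cost saving quoted for WDC follow directly from the two table rows. The short check below recomputes them from the reported F1 and cost figures; the variable and function names are illustrative only.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline


# Figures as reported for WDC in the table above.
multi_attr = {"f1": 0.91, "cost_usd": 0.93}
single_attr = {"f1": 0.93, "cost_usd": 0.59}

f1_delta = round(single_attr["f1"] - multi_attr["f1"], 2)
cost_cut = relative_reduction(multi_attr["cost_usd"], single_attr["cost_usd"])

print(f"F1 delta: {f1_delta:+.2f}")        # F1 delta: +0.02
print(f"Cost reduction: {cost_cut:.1%}")   # Cost reduction: 36.6%
```

The 36.6% figure rounds to the roughly 37% saving cited in the findings.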

What To Try In 7 Days

Run a pilot: apply single-attribute prompts (title) on a blocked candidate set and measure precision/recall.

Compare single-attr vs multi-attr on your data and log token cost per pair.

If similarity scores help, tune a decision threshold on a small labeled sample before scaling up.
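The pilot steps above can be sketched in a few lines: compute precision, recall, and F1 for yes/no predictions, and, if you use similarity-score prompts, pick the cut-off that maximizes F1 on a small labeled sample. The scores and labels here are made-up illustration data, and the helper names are hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for boolean labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def best_threshold(scores, y_true, candidates=None):
    """Pick the score cut-off with the highest F1 on a labeled sample."""
    candidates = candidates or sorted(set(scores))
    best_thr, best_f1 = 0.0, -1.0
    for thr in candidates:
        preds = [s >= thr for s in scores]
        f1 = precision_recall_f1(y_true, preds)[2]
        if f1 > best_f1:  # strict '>' keeps the lowest threshold on ties
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1


# Made-up similarity scores and gold labels for six candidate pairs.
scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
labels = [True, True, True, False, True, False]
thr, f1 = best_threshold(scores, labels)
```

Because similarity-score prompts were reported as unstable across datasets, a threshold tuned this way should be re-checked on each new dataset rather than reused.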

Optimization Features

Token Efficiency

Single-attr prompt reduced per-pair token cost by ~37% vs multi-attr (WDC example).

Cost scales roughly with input token count; persona and few-shot additions increase tokens.
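To see how prompt length drives cost before running anything against an API, you can estimate input tokens per pair offline. The sketch below uses whitespace word count as a crude token proxy, which is an assumption (real tokenizers count differently), and `PRICE_PER_1K` is a hypothetical placeholder, not a quoted price. The example prompts are also invented.

```python
def approx_tokens(text: str) -> int:
    """Crude proxy: count whitespace-separated words, not real model tokens."""
    return len(text.split())


def estimate_cost(prompts, price_per_1k_tokens: float) -> float:
    """Estimated input cost for a batch of prompts at a given price per 1k tokens."""
    total_tokens = sum(approx_tokens(p) for p in prompts)
    return total_tokens / 1000 * price_per_1k_tokens


PRICE_PER_1K = 0.001  # hypothetical placeholder in USD; substitute your actual rate

# Invented single-attribute and multi-attribute prompts for the same pair.
single = ["Same product? A: dell xps 13 laptop B: dell xps 13 9310"]
multi = [single[0] + " brand A: dell brand B: dell price A: 999 price B: 1049"]

single_cost = estimate_cost(single, PRICE_PER_1K)
multi_cost = estimate_cost(multi, PRICE_PER_1K)
saving = 1 - single_cost / multi_cost
```

For billing-grade numbers, log the token usage the API reports per request instead of relying on a proxy like this.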

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://drive.google.com/drive/folders/18taqVQ8oJeNunMb6nzqZNZ1EnYy7JCF

Risks & Boundaries

Limitations

Study focuses only on the similarity step and assumes perfect blocking.

Only GPT‑3.5 was evaluated; other LLMs may behave differently.

When Not To Use

When you cannot produce effective blocking and pairs explode in number.

When exact model-number or technical-spec matching is required.
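To make the "pairs explode" concern concrete: comparing every record with every other needs n(n-1)/2 LLM calls. The sketch below counts that and shows one simple way to cut it down, token blocking, where only records sharing a token become candidates. The study itself assumes perfect blocking and does not prescribe a method, so this is a generic illustration with hypothetical names and sample titles.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n: int) -> int:
    """Number of unordered pairs among n records."""
    return n * (n - 1) // 2


def token_blocking(titles, min_len: int = 3):
    """Return index pairs of titles sharing at least one token of length >= min_len."""
    blocks = defaultdict(set)
    for idx, title in enumerate(titles):
        for tok in set(title.lower().split()):
            if len(tok) >= min_len:
                blocks[tok].add(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return sorted(pairs)


titles = [
    "Dell XPS 13 laptop",
    "dell xps 13 9310 notebook",
    "HP Envy 13",
    "Logitech MX Master mouse",
]

candidates = token_blocking(titles)
print(all_pairs_count(len(titles)), len(candidates))  # 6 1
print(all_pairs_count(10_000))                        # 49995000
```

Blocking this coarse can drop true matches that share no token, so measure its recall on labeled data before trusting the downstream LLM scores.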

Failure Modes

Hallucinated or incorrect reasoning about identifiers and specs.

Prompt-format sensitivity causing inconsistent outputs across prompt patterns.

Core Entities

Models

GPT-3.5

Metrics

precision, recall, F1 (F‑Measure)

Datasets

WDC (Web Data Commons, computer subset), Amazon-Google Products
