Cheap prompts often match expensive ones: GPT‑3.5 can do unsupervised product entity resolution cost‑efficiently

October 9, 2023 (7 min)

Overview

Decision Snapshot: Needs Validation

The approach is promising for the matching step and can save engineering effort, but needs reliable blocking and dataset-level validation because results and costs vary by prompt and dataset.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 40%

Authors

Navapat Nananukul, Khanin Sisaengsuwanchai, Mayank Kejriwal

Why It Matters For Business

LLMs enable pairwise entity resolution (ER) without labeled training data and with less engineering effort. Short prompts can cut API token cost by roughly 30–40% on the tested benchmarks, but LLM matching must be paired with blocking to keep the number of compared pairs manageable.

Summary TLDR

This paper tests six prompt designs for using GPT‑3.5 as an unsupervised similarity function for product entity resolution (ER). On two e‑commerce benchmarks (WDC, Amazon‑Google) GPT‑3.5 achieves competitive F1 scores (many prompts 0.8+). Simpler prompts (single-attribute/title) often match or beat costlier ones while costing ~30–40% less in tokens. JSON-structured prompts reduced accuracy. Similarity-score prompts can be powerful but are unstable across datasets. The study assumes perfect blocking and highlights scale/cost limits and failure modes (model-number errors, hallucinations).

Problem Statement

Can an off‑the‑shelf LLM (GPT‑3.5) serve as a high‑quality, low‑cost unsupervised similarity function for entity resolution, and how do different prompt designs trade off accuracy and token cost? The study focuses on the matching (similarity) step only and evaluates six prompt patterns on two e‑commerce benchmarks.

Main Contribution

Systematic comparison of six prompt patterns for GPT‑3.5 used as an unsupervised ER similarity function.

Quantified cost vs. performance tradeoffs on two public product ER benchmarks (WDC, Amazon‑Google).

Key Findings

GPT‑3.5 is viable as an unsupervised ER similarity function on product data.

Numbers: Many prompt patterns achieved F1 ≥ 0.80; examples: WDC single-attr F1=0.93, AG multi-sim F1=0.95

Practical Use: You can use GPT‑3.5 directly for pairwise product matching without training a classifier, especially for medium‑scale problems and when blocking reduces pairs.

Evidence Ref: Table 2; Section 5.1

Simple prompts often give similar or better accuracy at lower cost.

Numbers: Single-attr vs multi-attr: WDC F1 0.93 vs 0.91; cost $0.59 vs $0.93 (≈37% lower)

Practical Use: Start with a single‑attribute (title) prompt when a high‑signal attribute exists; it cuts token cost substantially with little accuracy loss on these datasets.

Evidence Ref: Table 2; Section 5.1
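The difference between the two prompt styles can be sketched as follows. This is a minimal illustration, not the paper's actual prompts: the wording, the function names (`single_attr_prompt`, `multi_attr_prompt`, `parse_answer`), and the record fields are all hypothetical, and no API call is made.

```python
from typing import Mapping

QUESTION = "Do these two product listings refer to the same product? Answer yes or no."

def single_attr_prompt(a: Mapping[str, str], b: Mapping[str, str], attr: str = "title") -> str:
    """Build a short prompt that compares one high-signal attribute only."""
    return f"{QUESTION}\nProduct A {attr}: {a[attr]}\nProduct B {attr}: {b[attr]}"

def multi_attr_prompt(a: Mapping[str, str], b: Mapping[str, str]) -> str:
    """Build a longer prompt that lists every attribute of both records."""
    lines = [QUESTION]
    for name, record in (("A", a), ("B", b)):
        for key in sorted(record):
            lines.append(f"Product {name} {key}: {record[key]}")
    return "\n".join(lines)

def parse_answer(reply: str) -> bool:
    """Interpret a model reply as a match decision (assumes a yes/no answer)."""
    return reply.strip().lower().startswith("yes")

# Hypothetical records, for illustration only.
rec_a = {"title": "Acme 16GB DDR4 RAM", "brand": "Acme", "price": "59.99"}
rec_b = {"title": "Acme DDR4 memory 16 GB", "brand": "Acme", "price": "61.50"}

short_prompt = single_attr_prompt(rec_a, rec_b)
long_prompt = multi_attr_prompt(rec_a, rec_b)
print(len(short_prompt.split()), "vs", len(long_prompt.split()), "words")
```

The multi-attribute prompt is strictly longer for any record with more than one field, which is where the extra token cost comes from.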

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
WDC F1 (multi-attr) | 0.91 | - | - | WDC | Table 2: multi-attr F1=0.91, cost $0.93 | Table 2
WDC F1 (single-attr) | 0.93 | multi-attr | +0.02 | WDC | Table 2: single-attr F1=0.93, cost $0.59 | Table 2
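As a quick check of the arithmetic behind the reported delta and the ≈37% cost figure, the following sketch recomputes both from the WDC numbers quoted above (variable names are illustrative):

```python
# WDC figures as quoted from Table 2 in this summary.
f1_single, f1_multi = 0.93, 0.91
cost_single, cost_multi = 0.59, 0.93  # USD

# Absolute F1 difference, rounded to the table's precision.
f1_delta = round(f1_single - f1_multi, 2)

# Relative cost reduction of single-attr against the multi-attr baseline.
cost_reduction = (cost_multi - cost_single) / cost_multi

print(f"F1 delta: {f1_delta:+.2f}")            # F1 delta: +0.02
print(f"Cost reduction: {cost_reduction:.1%}")  # Cost reduction: 36.6%
```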

What To Try In 7 Days

Run a pilot: apply single-attribute prompts (title) on a blocked candidate set and measure precision/recall.

Compare single-attr vs multi-attr on your data and log token cost per pair.

If similarity scores help, tune a decision threshold on a small labeled sample before scaling up.
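The threshold-tuning step in the last item could look roughly like this. It is a sketch under stated assumptions: the scores and labels are made-up illustrative values, the helper names (`f1_at`, `best_threshold`) are hypothetical, and a pair is predicted as a match when its score is at or above the threshold.

```python
from typing import Sequence, Tuple

def f1_at(scores: Sequence[float], labels: Sequence[bool], threshold: float) -> float:
    """F1 when every pair with score >= threshold is predicted as a match."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores: Sequence[float], labels: Sequence[bool]) -> Tuple[float, float]:
    """Try each observed score as a threshold and keep the one with the highest F1.

    Ties are broken in favour of the higher (stricter) threshold.
    """
    candidates = sorted(set(scores))
    best = max(candidates, key=lambda t: (f1_at(scores, labels, t), t))
    return best, f1_at(scores, labels, best)

# Made-up similarity scores and gold labels for a small labeled sample.
sample_scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
sample_labels = [True, True, False, True, False, False]

threshold, f1 = best_threshold(sample_scores, sample_labels)
print(f"threshold={threshold:.2f}, F1={f1:.3f}")
```

Because the summary notes that similarity-score prompts are unstable across datasets, a threshold tuned this way should be re-checked on each new dataset rather than reused.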

Optimization Features

Token Efficiency
Single-attr prompt reduced per-pair token cost by ~37% vs multi-attr (WDC example).
Cost scales roughly with input token count; persona and few-shot prompts add tokens.
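Since cost tracks input token count, a rough budget can be estimated before any API call. The sketch below assumes a crude whitespace word count as a stand-in for a real tokenizer, and `price_per_1k_tokens` is a placeholder to be filled in from your provider's current pricing; both are assumptions, not figures from the paper.

```python
from typing import Iterable

def approx_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words. Real tokenizers differ."""
    return len(text.split())

def estimate_cost(prompts: Iterable[str], price_per_1k_tokens: float) -> float:
    """Estimated input cost for a batch of prompts at a given per-1k-token price."""
    total_tokens = sum(approx_tokens(p) for p in prompts)
    return total_tokens / 1000 * price_per_1k_tokens

def relative_saving(short_prompts: Iterable[str], long_prompts: Iterable[str]) -> float:
    """Fraction of input tokens saved by the short prompts (independent of price)."""
    short_total = sum(approx_tokens(p) for p in short_prompts)
    long_total = sum(approx_tokens(p) for p in long_prompts)
    return (long_total - short_total) / long_total

# Hypothetical prompts for one candidate pair.
short = ["Same product? A: Acme 16GB DDR4 RAM B: Acme DDR4 memory 16 GB"]
long = [short[0] + " A brand: Acme A price: 59.99 B brand: Acme B price: 61.50"]

print(f"estimated saving: {relative_saving(short, long):.0%}")
```

The relative saving does not depend on the price, so it can be measured on a sample of pairs even before pricing is known.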

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Study focuses only on the similarity step and assumes perfect blocking.

Only GPT‑3.5 was evaluated; other LLMs may behave differently.

When Not To Use

When you cannot produce effective blocking and pairs explode in number.

When exact model-number or technical-spec matching is required.
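To see why blocking matters for the first case, compare the number of all possible pairs with the candidate pairs left after a simple blocking pass. The sketch below uses plain token blocking on titles (records sharing at least one lowercased token become candidates). This is a generic technique chosen for illustration, not the blocking method of the paper, which assumes perfect blocking; the records and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Set, Tuple

def all_pairs_count(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2

def token_blocking(records: Dict[str, str]) -> Set[Tuple[str, str]]:
    """Return candidate pairs of record ids whose titles share at least one token."""
    blocks = defaultdict(set)
    for record_id, title in records.items():
        for token in set(title.lower().split()):
            blocks[token].add(record_id)
    pairs = set()
    for ids in blocks.values():
        # Sorting the ids makes each unordered pair appear in one canonical order.
        for left, right in combinations(sorted(ids), 2):
            pairs.add((left, right))
    return pairs

# Hypothetical product titles keyed by record id.
records = {
    "a1": "Acme laptop 15",
    "a2": "acme notebook 15 inch",
    "b1": "Zeta phone case",
    "b2": "zeta case black",
}

candidates = token_blocking(records)
print(len(candidates), "candidate pairs out of", all_pairs_count(len(records)))
print(all_pairs_count(10_000), "pairs without blocking for 10,000 records")
```

Each candidate pair is one LLM call, so the quadratic growth of unblocked pairs translates directly into token cost.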

Failure Modes

Hallucinated or incorrect reasoning about identifiers and specs.

Prompt-format sensitivity causing inconsistent outputs across prompt patterns.

Core Entities

Models

GPT-3.5

Metrics

precision, recall, F1 (F‑Measure)

Datasets

WDC (Web Data Commons, computer subset)
Amazon-Google Products