Cheap prompts often match expensive ones: GPT‑3.5 can do unsupervised product entity resolution cost‑efficiently

October 9, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.7

Citation Count

2

Authors

Navapat Nananukul, Khanin Sisaengsuwanchai, Mayank Kejriwal

Why It Matters For Business

LLMs let you do pairwise entity resolution (ER) without labeled training data and with less engineering effort. Short prompts can cut API costs by roughly 30–40% in these experiments, but you must pair the LLM with blocking to keep the number of compared pairs under control.

Summary TLDR

This paper tests six prompt designs for using GPT‑3.5 as an unsupervised similarity function for product entity resolution (ER). On two e‑commerce benchmarks (WDC, Amazon‑Google) GPT‑3.5 achieves competitive F1 scores (many prompts 0.8+). Simpler prompts (single-attribute/title) often match or beat costlier ones while costing ~30–40% less in tokens. JSON-structured prompts reduced accuracy. Similarity-score prompts can be powerful but are unstable across datasets. The study assumes perfect blocking and highlights scale/cost limits and failure modes (model-number errors, hallucinations).

Problem Statement

Can an off‑the‑shelf LLM (GPT‑3.5) serve as a high‑quality, low‑cost unsupervised similarity function for entity resolution, and how do different prompt designs trade off accuracy and token cost? The study focuses on the matching (similarity) step only and evaluates six prompt patterns on two e‑commerce benchmarks.
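To make the single-attribute vs. multi-attribute contrast concrete, here is a minimal sketch of how the two prompt styles could be assembled. The prompt wording, the function names, and the sample records are assumptions for illustration; they are not the paper's actual prompts, and no API call is made.

```python
from typing import Mapping

# Hypothetical instruction text; the paper's exact wording is not reproduced here.
QUESTION = "Do these two product listings refer to the same product? Answer 'yes' or 'no'."


def single_attr_prompt(a: Mapping[str, str], b: Mapping[str, str], attr: str = "title") -> str:
    """Build a short prompt that compares one attribute only."""
    return f"{QUESTION}\nProduct A {attr}: {a[attr]}\nProduct B {attr}: {b[attr]}"


def multi_attr_prompt(a: Mapping[str, str], b: Mapping[str, str]) -> str:
    """Build a longer prompt that lists every attribute of both records."""
    lines = [QUESTION]
    for name, record in (("A", a), ("B", b)):
        for key, value in record.items():
            lines.append(f"Product {name} {key}: {value}")
    return "\n".join(lines)


def parse_decision(reply: str) -> bool:
    """Treat a reply as a match only if it starts with 'yes' (case-insensitive)."""
    return reply.strip().lower().startswith("yes")


# Made-up records, for illustration only.
rec_a = {"title": "Kingston 8GB DDR3 1600MHz", "brand": "Kingston", "price": "39.99"}
rec_b = {"title": "Kingston DDR3 8 GB memory module", "brand": "Kingston", "price": "41.50"}

short = single_attr_prompt(rec_a, rec_b)
long = multi_attr_prompt(rec_a, rec_b)
print(len(short), len(long))  # the single-attribute prompt is the shorter one
```

The shorter prompt sends fewer input tokens per pair, which is where the cost difference discussed in the findings comes from.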

Main Contribution

Systematic comparison of six prompt patterns for GPT‑3.5 used as an unsupervised ER similarity function.

Quantified cost vs. performance tradeoffs on two public product ER benchmarks (WDC, Amazon‑Google).

Qualitative error analysis highlighting common failure modes (model numbers, technical specs, hallucinations).

Analysis of inter-method disagreement and statistical significance of prompt effects.

Public release of raw data and analysis artifacts for replication.

Key Findings

GPT‑3.5 is viable as an unsupervised ER similarity function on product data.

Numbers: Many prompt patterns achieved F1 ≥ 0.80. Examples: WDC single-attr F1 = 0.93, Amazon‑Google multi-sim F1 = 0.95.

Simple prompts often give similar or better accuracy at lower cost.

Numbers: Single-attr vs. multi-attr on WDC: F1 0.93 vs. 0.91; cost $0.59 vs. $0.93 (≈37% lower).
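The ≈37% figure follows directly from the two reported WDC costs. A quick check, using only the numbers quoted above (the function name is illustrative):

```python
def relative_saving(baseline_cost: float, cheaper_cost: float) -> float:
    """Fraction of the baseline cost saved by the cheaper option."""
    return (baseline_cost - cheaper_cost) / baseline_cost


# Reported WDC costs: multi-attr $0.93, single-attr $0.59.
saving = relative_saving(0.93, 0.59)
print(f"{saving:.1%}")  # 36.6%
```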

Structured JSON prompts reduced accuracy in these tests.

Numbers: multi-json F1 dropped to 0.81 on WDC (from 0.91 with multi-attr) and to 0.69 on Amazon‑Google (from 0.87).

Similarity-score prompts are powerful but inconsistent across datasets.

Numbers: multi-sim F1 was 0.71 on WDC (a drop from the multi-attr 0.91) but 0.95 on Amazon‑Google (an improvement over the multi-attr 0.87).

Scale and blocking dominate operational cost.

Numbers: The study evaluated ≈12k pairs. The full cross-product would be ≈3M pairs, and even blocking that removes 95% of them leaves ≈150k pairs (10–15× the cost of the evaluated set).
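The scale argument is simple arithmetic on the figures above. A small sketch, assuming cost grows linearly with the number of pairs sent to the model (the function name is illustrative):

```python
def pairs_after_blocking(cross_product_pairs: int, reduction: float) -> int:
    """Candidate pairs left after blocking removes a given fraction of the cross-product."""
    return round(cross_product_pairs * (1.0 - reduction))


evaluated_pairs = 12_000  # approximate size of the evaluated set
remaining = pairs_after_blocking(3_000_000, 0.95)
ratio = remaining / evaluated_pairs

print(remaining)  # 150000
print(ratio)      # 12.5, i.e. inside the quoted 10-15x range
```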

Results

WDC F1 (multi-attr)

Value: 0.91

WDC F1 (single-attr)

Value: 0.93

Baseline: multi-attr

Amazon-Google F1 (multi-attr)

Value: 0.87

Amazon-Google F1 (single-attr)

Value: 0.81

Baseline: multi-attr

Format effect (multi-json vs. multi-attr)

Value: F1 drop of 0.10 (WDC) / 0.18 (Amazon-Google)

Baseline: multi-attr

Who Should Care

What To Try In 7 Days

Run a pilot: apply single-attribute prompts (title) on a blocked candidate set and measure precision/recall.

Compare single-attr vs multi-attr on your data and log token cost per pair.

If similarity scores help, tune a decision threshold on a small labeled sample before scaling up.
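For the threshold-tuning step, one possible approach is to sweep candidate thresholds over a small labeled sample and keep the one with the best F1. The sketch below assumes the model returns a similarity score in [0, 1] per pair; the scores and labels are made up for illustration and are not results from the paper.

```python
from typing import List, Sequence, Tuple


def f1_at_threshold(scores: Sequence[float], labels: Sequence[bool], threshold: float) -> float:
    """F1 when every pair with score >= threshold is predicted as a match."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores: Sequence[float], labels: Sequence[bool]) -> Tuple[float, float]:
    """Try each observed score as a threshold and return (threshold, F1) with the highest F1."""
    results: List[Tuple[float, float]] = [
        (t, f1_at_threshold(scores, labels, t)) for t in sorted(set(scores))
    ]
    return max(results, key=lambda pair: pair[1])


# Made-up labeled sample: similarity scores and gold match labels.
sample_scores = [0.9, 0.8, 0.6, 0.4, 0.2]
sample_labels = [True, True, False, True, False]

threshold, f1 = best_threshold(sample_scores, sample_labels)
print(threshold, round(f1, 3))  # 0.4 0.857
```

Because the multi-sim results differed sharply between WDC and Amazon‑Google, a threshold chosen on one dataset should be re-tuned before it is reused on another.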

Optimization Features

Token Efficiency

  • Single-attr prompt reduced per-pair token cost by ~37% vs multi-attr (WDC example)
  • Cost scales roughly with input token count; persona and few-shot increase tokens

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Study focuses only on the similarity step and assumes perfect blocking.
  • Only GPT‑3.5 was evaluated; other LLMs may behave differently.
  • JSON formatting and extra attributes sometimes harmed accuracy.
  • Model confuses fine-grained identifiers (model numbers) and can hallucinate explanations.
  • Cost estimates cover only the evaluated pairs; they do not account for the pair explosion of a full cross-product without blocking.

When Not To Use

  • When you cannot produce effective blocking and pairs explode in number.
  • When exact model-number or technical-spec matching is required.
  • If strict, auditable explanations are required and LLM hallucination is unacceptable.
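Since the study assumes perfect blocking, any real deployment needs its own blocking step in front of the LLM. The sketch below shows one simple option, token blocking on titles, where two records become a candidate pair only if they share a token. This is an assumed stand-in for illustration, not the blocking method used in the paper, and the records are made up.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Set, Tuple


def token_blocking(titles: Dict[str, str], min_len: int = 3) -> Set[Tuple[str, str]]:
    """Return candidate pairs of record ids whose titles share at least one token."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for rec_id, title in titles.items():
        for token in set(title.lower().split()):
            if len(token) >= min_len:  # skip very short tokens, which are rarely informative
                index[token].add(rec_id)

    candidates: Set[Tuple[str, str]] = set()
    for ids in index.values():
        # Every pair of records sharing this token becomes a candidate.
        for a, b in combinations(sorted(ids), 2):
            candidates.add((a, b))
    return candidates


# Made-up records, for illustration only.
titles = {
    "a1": "Kingston 8GB DDR3 memory",
    "g1": "kingston ddr3 8gb ram module",
    "g2": "Logitech wireless mouse",
}

pairs = token_blocking(titles)
print(pairs)  # {('a1', 'g1')}
```

Only the surviving candidate pairs would then be sent to the LLM. Very common tokens can still produce large blocks, so a production setup would likely need stop-word filtering or a cap on block size.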

Failure Modes

  • Hallucinated or incorrect reasoning about identifiers and specs.
  • Prompt-format sensitivity causing inconsistent outputs across prompt patterns.
  • Similarity scores that require per-dataset threshold tuning.
  • All methods agreeing but all being wrong on hard pairs.

Core Entities

Models

  • GPT-3.5

Metrics

  • precision
  • recall
  • F1 (F‑Measure)
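
For reference, these metrics have their standard definitions for pairwise matching, where TP counts true matching pairs predicted as matches, FP counts non-matching pairs predicted as matches, and FN counts true matching pairs predicted as non-matches. The summary does not spell these out; the notation below is the conventional one.

```latex
\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2\,P\,R}{P + R}
\]
```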

Datasets

  • WDC (Web Data Commons, computer subset)
  • Amazon-Google Products