AnnoLLM: have GPT‑3.5 explain examples, then use those explanations as few‑shot prompts to label data

March 29, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

AnnoLLM is ready for low‑risk, rule‑like labeling workflows and dataset bootstrapping. For hard semantic judgments and high‑stakes labels, validate its output with human audits.

Citations: 34

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Xingwei He, Zhenghao Lin, Yeyun Gong, A-Long Jin, Hang Zhang, Chen Lin, Jian Jiao, Siu Ming Yiu, Nan Duan, Weizhu Chen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can cheaply scale annotation for rule-like labeling tasks by prompting LLMs with self‑generated explanations; this can cut human labeling needs for some tasks and bootstrap retrieval datasets quickly.

Who Should Care

Summary TLDR

AnnoLLM turns GPT‑3.5 into a practical annotator in two steps: it first asks the LLM to explain why each labeled example carries its label, then uses those self‑generated explanations as few‑shot chain‑of‑thought (CoT) prompts to label new data. Of the three tasks tested, AnnoLLM matches or beats crowdsourced annotators on query/keyword relevance (QK) and BoolQ, but lags on WiC, a harder semantic task. The authors also used AnnoLLM to build ConIR, a conversation‑based retrieval dataset, which human checks rate as fluent and moderately relevant.

Problem Statement

Human data labeling is slow and costly. The paper asks whether a modern LLM (GPT‑3.5) can replace crowdsourced annotators when it is given the same guidance a human annotator would get: a task description, label definitions, and explained examples.

Main Contribution

AnnoLLM: a two‑step explain‑then‑annotate pipeline that generates explanations with an LLM and builds few‑shot CoT prompts from them.

Empirical tests on QK, BoolQ, and WiC showing AnnoLLM matches or exceeds crowdsourced accuracy on QK and BoolQ, but not on WiC.
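The explain‑then‑annotate pipeline described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the prompt wording, the `Example` dataclass, and all function names are hypothetical, and `complete` stands in for whatever prompt-to-text LLM client you use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    text: str              # the input to label, e.g. a query/keyword pair
    label: str             # gold label for this demonstration
    explanation: str = ""  # filled in by step 1


def explanation_prompt(task: str, ex: Example) -> str:
    """Step 1: ask the LLM why a labeled example carries its gold label."""
    return (
        f"{task}\n\n"
        f"Input: {ex.text}\n"
        f"Label: {ex.label}\n"
        "Briefly explain why this label is correct."
    )


def few_shot_cot_prompt(task: str, demos: List[Example], new_text: str) -> str:
    """Step 2: build a few-shot CoT prompt from the explained demonstrations."""
    parts = [task, ""]
    for ex in demos:
        parts += [
            f"Input: {ex.text}",
            f"Explanation: {ex.explanation}",
            f"Label: {ex.label}",
            "",
        ]
    # The model is expected to continue with an explanation, then a label.
    parts += [f"Input: {new_text}", "Explanation:"]
    return "\n".join(parts)


def annotate(
    task: str,
    demos: List[Example],
    new_text: str,
    complete: Callable[[str], str],  # hypothetical prompt -> text function
) -> str:
    """Run both steps; fills in missing explanations on `demos` in place."""
    for ex in demos:
        if not ex.explanation:
            ex.explanation = complete(explanation_prompt(task, ex)).strip()
    output = complete(few_shot_cot_prompt(task, demos, new_text))
    # Assumed output format: "<explanation> Label: <answer>"; keep the text
    # after the last "Label:" marker.
    return output.rsplit("Label:", 1)[-1].strip()
```

Keeping the LLM call behind a plain callable makes the prompt construction testable offline and lets you swap models without touching the pipeline logic.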

Key Findings

AnnoLLM outperforms crowdsourced annotators on the QK task.

Numbers: 75.60% (AnnoLLM test) vs 71.5% (crowd)

Practical Use: For query/keyword relevance tasks, use generated explanations + 4‑shot CoT with text‑davinci‑003 to get higher accuracy than a standard crowd pipeline.

Evidence Ref: Table 2

AnnoLLM matches or slightly exceeds human accuracy on BoolQ (yes/no questions).

Numbers: 89.20% (AnnoLLM test) vs 89.0% (crowd)

Practical Use: For short passage yes/no labeling, LLM annotation with CoT can replace human annotators with comparable quality.

Evidence Ref: Table 4

Results

Metric | Value | Baseline | Delta (pp) | Split / Dataset | Evidence | Evidence Ref
Accuracy | 75.60% | 71.5% (crowdsourced) | +4.10 | QK test | AnnoLLM (text-davinci-003 + 4-shot CoT) beats crowd | Table 2
Accuracy | 89.20% | 89.0% (crowdsourced) | +0.20 | BoolQ test | AnnoLLM (8-shot CoT) matches human performance | Table 4

What To Try In 7 Days

Pick a binary labeling task and craft clear task + category definitions.

Use ChatGPT to produce explanations for 10–50 labeled examples, build few‑shot CoT prompts from them, and label a small batch with text‑davinci‑003.

Run a small human audit (100 instances) and compare accuracy and stability across prompt variants.
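A 100‑instance audit gives a noisy accuracy estimate, so it can help to attach an interval before comparing prompt variants. The sketch below uses the Wilson score interval for a binomial proportion; the function names are illustrative assumptions, not part of the paper.

```python
import math
from typing import Sequence, Tuple


def accuracy(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predicted labels that match the human audit labels."""
    if len(pred) != len(gold) or not gold:
        raise ValueError("pred and gold must be non-empty and of equal length")
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def wilson_interval(correct: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= correct <= n:
        raise ValueError("need n > 0 and 0 <= correct <= n")
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

For example, `wilson_interval(89, 100)` returns an interval on the order of ten percentage points wide, so an audit of this size can flag a prompt variant that fails badly but cannot resolve sub‑point differences such as the BoolQ gap reported above.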

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Performance depends on LLM quality and prompt design; not uniformly better than humans.

Underperforms on fine semantic tasks (WiC) where subtle human judgment matters.

When Not To Use

Labeling that requires domain experts or high‑stakes correctness.

Fine‑grained word‑sense or other tasks with weak explicit rules.

Failure Modes

LLM hallucinations in generated explanations lead to wrong labels.

Template sensitivity: plain few‑shot prompts can fail drastically with small wording changes.
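One way to surface template sensitivity before trusting a batch is to label the same items under several prompt templates and check how often the labels agree. The sketch below is an assumption about how such a check could look; `stability_report` and its output keys are hypothetical names.

```python
from collections import Counter
from typing import Dict, List


def stability_report(labels_by_template: Dict[str, List[str]]) -> dict:
    """Compare labels produced for the same items under different templates.

    labels_by_template maps a template name to its labels, one per item,
    with every list in the same item order.
    """
    runs = list(labels_by_template.values())
    if not runs or not runs[0] or len({len(r) for r in runs}) != 1:
        raise ValueError("need at least one non-empty run, all of equal length")
    n_items = len(runs[0])
    unstable: List[int] = []
    majority: List[str] = []
    for i in range(n_items):
        votes = Counter(run[i] for run in runs)
        majority.append(votes.most_common(1)[0][0])
        if len(votes) > 1:  # at least two templates disagree on this item
            unstable.append(i)
    return {
        "unanimous_rate": 1 - len(unstable) / n_items,
        "unstable_items": unstable,
        "majority_labels": majority,
    }
```

Items listed in `unstable_items` are natural candidates for the human audit, since a label that flips with small wording changes is less likely to be reliable.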

Core Entities

Models

text-davinci-003, gpt-3.5-turbo (ChatGPT), PaLM, T5 11B, DPR, PROD

Metrics

Accuracy, MRR@10, Recall@k, Fleiss' kappa
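For reference, the retrieval and agreement metrics listed above can be computed as in the sketch below. These follow the standard textbook definitions rather than the paper's evaluation scripts; in particular, Recall@k is assumed here to be the per‑query fraction of relevant items found in the top k, averaged over queries, and all function names are illustrative.

```python
from typing import Sequence, Set


def mrr_at_k(ranked: Sequence[Sequence[str]], relevant: Sequence[Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant item within the top k."""
    total = 0.0
    for docs, rel in zip(ranked, relevant):
        for rank, doc in enumerate(docs[:k], start=1):
            if doc in rel:
                total += 1.0 / rank
                break  # only the first hit counts
    return total / len(ranked)


def recall_at_k(ranked: Sequence[Sequence[str]], relevant: Sequence[Set[str]], k: int) -> float:
    """Average fraction of each query's relevant items found in the top k."""
    scores = [len(set(docs[:k]) & rel) / len(rel) for docs, rel in zip(ranked, relevant)]
    return sum(scores) / len(scores)


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """counts[i][j] = number of raters who put item i in category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    # Observed agreement: mean over items of pairwise rater agreement.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from the overall category proportions.
    n_cats = len(counts[0])
    p_cats = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    p_e = sum(p * p for p in p_cats)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1 - p_e)
```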

Datasets

QK (query-keyword relevance), BoolQ, WiC, ConIR, MS-MARCO

Benchmarks

SuperGLUE

Context Entities

Models

GPT-3, Gopher, Chinchilla, LLaMA, PaLM 540B, ST-MoE

Datasets

MS-MARCO (used to build ConIR)