AnnoLLM: have GPT‑3.5 explain examples, then use those explanations as few‑shot prompts to label data

Overview

Decision Snapshot: Ready For Pilot

AnnoLLM is ready for low‑risk, rule‑like labeling workflows and dataset bootstrapping; add human audits for hard semantic tasks and high‑stakes labels.

Citations: 34

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Xingwei He, Zhenghao Lin, Yeyun Gong, A-Long Jin, Hang Zhang, Chen Lin, Jian Jiao, Siu Ming Yiu, Nan Duan, Weizhu Chen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can cheaply scale annotation for rule-like labeling tasks by prompting LLMs with self‑generated explanations; this can cut human labeling needs for some tasks and bootstrap retrieval datasets quickly.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

AnnoLLM turns GPT‑3.5 into a practical annotator: an LLM first explains labeled examples, and those self‑generated explanations then serve as few‑shot chain‑of‑thought (CoT) prompts for labeling new data. Across three tasks (query/keyword relevance (QK), BoolQ, and WiC), AnnoLLM matches or beats crowdsourced labels on QK and BoolQ but lags on WiC, a harder semantic task. The authors also used AnnoLLM to build ConIR, a conversation-based retrieval dataset, which human checks rate as fluent and moderately relevant.

Problem Statement

Human data labeling is slow and costly. The paper asks whether modern LLMs (GPT‑3.5) can replace crowdsourced annotators when given the same guidance human annotators receive: a task description, label definitions, and explained examples.

Main Contribution

AnnoLLM: a two‑step explain‑then‑annotate pipeline that generates explanations with an LLM and builds few‑shot CoT prompts from them.

Empirical tests on QK, BoolQ, and WiC showing AnnoLLM matches or exceeds crowdsourced accuracy on QK and BoolQ, but not on WiC.
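
The two steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the prompt wording, the `llm` callable (any function that maps a prompt string to a completion string), and all helper names are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def explain_prompt(task: str, text: str, label: str) -> str:
    """Step 1 prompt: ask the model why a labeled example has its gold label."""
    return (
        f"{task}\n\nInput: {text}\nLabel: {label}\n"
        "Briefly explain why this label is correct."
    )


def build_cot_prompt(task: str, demos: List[Tuple[str, str, str]], new_text: str) -> str:
    """Step 2 prompt: few-shot CoT built from (text, explanation, label) demos."""
    parts = [task]
    for text, explanation, label in demos:
        parts.append(f"Input: {text}\nExplanation: {explanation}\nLabel: {label}")
    # Leave the explanation open so the model reasons before committing to a label.
    parts.append(f"Input: {new_text}\nExplanation:")
    return "\n\n".join(parts)


def parse_label(completion: str) -> Optional[str]:
    """Return the text after the last 'Label:' marker, or None if it is missing."""
    _, marker, tail = completion.rpartition("Label:")
    if not marker:
        return None
    tail = tail.strip()
    return tail.splitlines()[0].strip() if tail else None


def annotate(llm: LLM, task: str, labeled: List[Tuple[str, str]], new_text: str) -> Optional[str]:
    """Explain each labeled example, then label new_text with the resulting CoT prompt."""
    demos = [
        (text, llm(explain_prompt(task, text, label)).strip(), label)
        for text, label in labeled
    ]
    return parse_label(llm(build_cot_prompt(task, demos, new_text)))
```

Passing the model in as a plain callable keeps the sketch independent of any particular API client; in practice the explanations would be generated once and cached, since the same demonstrations are reused for every new instance.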

Key Findings

AnnoLLM outperforms crowdsourced annotators on the QK task.

Numbers: 75.60% (AnnoLLM, test) vs 71.5% (crowd)

Practical Use: For query/keyword relevance tasks, use generated explanations + 4‑shot CoT with text‑davinci‑003 to get higher accuracy than a standard crowd pipeline.

Evidence Ref: Table 2

AnnoLLM matches or slightly exceeds human accuracy on BoolQ (yes/no questions).

Numbers: 89.20% (AnnoLLM, test) vs 89.0% (crowd)

Practical Use: For short passage yes/no labeling, LLM annotation with CoT can replace human annotators with comparable quality.

Evidence Ref: Table 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	75.60%	71.5% (crowdsourced)	+4.10 pp	QK test	AnnoLLM (text-davinci-003 + 4-shot CoT) beats crowd	Table 2
Accuracy	89.20%	89.0% (crowdsourced)	+0.20 pp	BoolQ test	AnnoLLM (8-shot CoT) matches human performance	Table 4
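
The Delta column is an absolute gap in percentage points (AnnoLLM accuracy minus crowdsourced accuracy), not a relative change. A short check of that arithmetic, using only the values from the table; the function name `delta_pp` is illustrative:

```python
# (AnnoLLM accuracy, crowdsourced accuracy) in percent, copied from the table above.
RESULTS = {
    "QK test": (75.60, 71.5),
    "BoolQ test": (89.20, 89.0),
}


def delta_pp(system: float, baseline: float) -> float:
    """Absolute gap in percentage points, rounded to two decimals."""
    return round(system - baseline, 2)


for split, (system, baseline) in RESULTS.items():
    print(f"{split}: {delta_pp(system, baseline):+.2f} pp")
# QK test: +4.10 pp
# BoolQ test: +0.20 pp
```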

What To Try In 7 Days

Pick a binary labeling task and craft clear task + category definitions.

Use ChatGPT to produce explanations for 10–50 labeled examples, build few‑shot CoT prompts from them, and label a small batch with text‑davinci‑003.

Run a small human audit (100 instances) and compare accuracy and stability across prompt variants.
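
The audit step could be scored along these lines. This is a sketch, assuming each prompt variant has already labeled the same audited instances and that the human labels are treated as gold; the function names and the choice of summary statistics are illustrative, not taken from the paper.

```python
from statistics import mean, pstdev
from typing import Dict, List


def accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of instances where the predicted label equals the human label."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def audit(variant_labels: Dict[str, List[str]], gold: List[str]) -> Dict[str, float]:
    """Summarize accuracy across prompt variants; a large std or a low worst
    score signals sensitivity to prompt wording."""
    scores = [accuracy(labels, gold) for labels in variant_labels.values()]
    return {
        "mean": mean(scores),
        "std": pstdev(scores),
        "worst": min(scores),
        "best": max(scores),
    }
```

For example, `audit({"variant_a": labels_a, "variant_b": labels_b}, human_labels)` returns the mean accuracy together with the spread between variants, which covers both halves of the comparison (accuracy and stability).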

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/NLPCode/AnnoLLM

Data URLs

https://github.com/NLPCode/AnnoLLM (ConIR dataset link in repo)

Risks & Boundaries

Limitations

Performance depends on LLM quality and prompt design; not uniformly better than humans.

Underperforms on fine semantic tasks (WiC) where subtle human judgment matters.

When Not To Use

Labeling that requires domain experts or high‑stakes correctness.

Fine‑grained word‑sense tasks, or other tasks without clear explicit rules.

Failure Modes

LLM hallucinations in generated explanations lead to wrong labels.

Template sensitivity: plain few‑shot prompts can fail drastically with small wording changes.

Core Entities

Models

text-davinci-003, gpt3.5-turbo (ChatGPT), PaLM, T5 11B, DPR, PROD

Metrics

Accuracy, MRR@10, Recall@k, Fleiss' kappa

Datasets

QK (query-keyword relevance), BoolQ, WiC, ConIR, MS-MARCO

Benchmarks

SuperGLUE

Context Entities

Models

GPT-3, Gopher, Chinchilla, LLaMA, PaLM 540B, ST-MoE

Datasets

MS-MARCO (used to build ConIR)
