AnnoLLM: have GPT‑3.5 explain examples, then use those explanations as few‑shot prompts to label data

March 29, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

34

Authors

Xingwei He, Zhenghao Lin, Yeyun Gong, A-Long Jin, Hang Zhang, Chen Lin, Jian Jiao, Siu Ming Yiu, Nan Duan, Weizhu Chen

Links

Abstract / PDF

Why It Matters For Business

You can cheaply scale annotation for rule-like labeling tasks by prompting LLMs with self‑generated explanations; this can cut human labeling needs for some tasks and bootstrap retrieval datasets quickly.

Summary TLDR

AnnoLLM turns GPT‑3.5 into a practical annotator in two steps: it first asks the LLM to explain why each labeled example carries its label, then uses those self‑generated explanations as few‑shot chain‑of‑thought (CoT) prompts to label new data. Across three tasks (query‑keyword relevance QK, BoolQ, WiC), AnnoLLM matches or beats crowdsourced labels on QK and BoolQ but lags on the harder word‑sense task WiC. The authors also used AnnoLLM to build ConIR, a conversation‑based retrieval dataset; human evaluation rates it as highly fluent and moderately relevant.

Problem Statement

Human data labeling is slow and costly. The paper asks whether a modern LLM (GPT‑3.5) can replace crowdsourced annotators when it is given the same guidance a human annotator would get: a task description, label definitions, and explained examples.

Main Contribution

AnnoLLM: a two‑step explain‑then‑annotate pipeline that generates explanations with an LLM and builds few‑shot CoT prompts from them.

Empirical tests on QK, BoolQ, and WiC showing AnnoLLM matches or exceeds crowdsourced accuracy on QK and BoolQ, but not on WiC.

Creation of ConIR, a conversation‑based information retrieval dataset produced and filtered with AnnoLLM, with human evaluation of quality.
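The explain‑then‑annotate pipeline from the first contribution can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the names `Demo`, `explanation_prompt`, `annotation_prompt`, `parse_label`, `annotate`, and `fake_complete` are hypothetical, the prompt wording is an assumption rather than the paper's template, and `complete` stands in for whatever LLM client you use.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Demo:
    text: str               # input shown to the annotator
    label: str              # gold label of the demonstration
    explanation: str = ""   # filled in by the explain step


def explanation_prompt(task: str, demo: Demo) -> str:
    """Step 1: ask the model why the gold label fits this example."""
    return (
        f"{task}\n\n"
        f"Input: {demo.text}\n"
        f"Label: {demo.label}\n"
        "Briefly explain why this label is correct."
    )


def annotation_prompt(task: str, demos: List[Demo], new_text: str) -> str:
    """Step 2: few-shot CoT prompt; each demo shows its explanation, then its label."""
    parts = [task, ""]
    for d in demos:
        parts += [f"Input: {d.text}", f"Explanation: {d.explanation}",
                  f"Label: {d.label}", ""]
    parts += [f"Input: {new_text}", "Explanation:"]
    return "\n".join(parts)


def parse_label(completion: str, allowed: List[str]) -> Optional[str]:
    """Read the last 'Label:' line; return None if it is missing or not allowed."""
    for line in reversed(completion.splitlines()):
        if line.strip().lower().startswith("label:"):
            value = line.split(":", 1)[1].strip().lower()
            return value if value in allowed else None
    return None


def annotate(task: str, demos: List[Demo], new_text: str,
             allowed: List[str], complete: Callable[[str], str]) -> Optional[str]:
    # Generate any missing explanations (mutates the demos in place),
    # then label the new input with the few-shot CoT prompt.
    for d in demos:
        if not d.explanation:
            d.explanation = complete(explanation_prompt(task, d)).strip()
    return parse_label(complete(annotation_prompt(task, demos, new_text)), allowed)


def fake_complete(prompt: str) -> str:
    """Stand-in for a real LLM call, used only to exercise the sketch."""
    if prompt.rstrip().endswith("Explanation:"):
        return " The passage states it directly.\nLabel: yes"
    return "The passage says so."


demos = [Demo("Question: is water wet? Passage: Water is wet.", "yes")]
result = annotate("Answer yes or no based on the passage.", demos,
                  "Question: is ice cold? Passage: Ice is cold.",
                  ["yes", "no"], fake_complete)
print(result)
```

Returning `None` for an unparseable completion keeps hallucinated or malformed outputs out of the labeled set, so they can be routed to a human instead.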

Key Findings

AnnoLLM outperforms crowdsourced annotators on the QK task.

Numbers: 75.60% (AnnoLLM test) vs 71.5% (crowd)

AnnoLLM matches or slightly exceeds human accuracy on BoolQ (yes/no questions).

Numbers: 89.20% (AnnoLLM test) vs 89.0% (crowd)

AnnoLLM underperforms humans on WiC, a hard word‑sense task.

Numbers: 69.17% (AnnoLLM test) vs 80.0% (crowd)

ConIR conversations are highly fluent but only moderately relevant to, and consistent with, their paired passages.

Numbers: Fluency 4.99/5, Relevance 2.53/3, Consistency 2.41/3; Fleiss' kappa = 0.55

CoT prompts made from LLM explanations are more stable across template changes than plain few‑shot prompts.

Numbers: with plain few‑shot prompts, accuracy can drop from ~89 to below 80 when the template changes; CoT prompts lose about 4 points less

Results

Accuracy (QK)

Value: 75.60%

Baseline: 71.5% (crowdsourced)

Accuracy (BoolQ)

Value: 89.20%

Baseline: 89.0% (crowdsourced)

Accuracy (WiC)

Value: 69.17%

Baseline: 80.0% (crowdsourced)

MRR@10 (ConIR test)

Value: 19.32

Baseline: DPR zero-shot 7.01, PROD zero-shot 10.61
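For readers unfamiliar with the retrieval metrics, here is a minimal sketch of how MRR@10 and Recall@k are commonly computed. The function names and the toy query and document IDs are hypothetical, and this is not the paper's evaluation script. The functions return values in [0, 1]; the figures above are presumably the same quantity scaled by 100.

```python
from typing import Dict, List, Set


def mrr_at_k(ranked: Dict[str, List[str]],
             relevant: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant document within the top k."""
    if not ranked:
        return 0.0
    total = 0.0
    for qid, docs in ranked.items():
        for rank, doc_id in enumerate(docs[:k], start=1):
            if doc_id in relevant.get(qid, set()):
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked)


def recall_at_k(ranked: Dict[str, List[str]],
                relevant: Dict[str, Set[str]], k: int) -> float:
    """Average fraction of each query's relevant documents found in the top k."""
    scores = []
    for qid, docs in ranked.items():
        gold = relevant.get(qid, set())
        if gold:  # skip queries without any relevant document
            scores.append(len(gold & set(docs[:k])) / len(gold))
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: q1 finds its relevant passage at rank 2, q2 misses entirely.
ranked = {"q1": ["d3", "d1", "d2"], "q2": ["d9", "d8"]}
relevant = {"q1": {"d1"}, "q2": {"d7"}}
print(mrr_at_k(ranked, relevant))        # (1/2 + 0) / 2 = 0.25
print(recall_at_k(ranked, relevant, 2))  # (1 + 0) / 2 = 0.5
```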

Who Should Care

What To Try In 7 Days

Pick a binary labeling task and craft clear task + category definitions.

Use ChatGPT to produce explanations for 10–50 example labels, build few‑shot CoT prompts, and label a small batch with text‑davinci‑003.

Run a small human audit (100 instances) and compare accuracy and stability across prompt variants.
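The audit step can be sketched as below: score each prompt variant against the human‑audited labels, then look at the gap between the best and worst variant as a rough stability signal. The names `accuracy`, `compare_variants`, and `spread`, and the toy labels, are hypothetical; the paper does not prescribe this exact procedure.

```python
from typing import Dict, List


def accuracy(pred: List[str], gold: List[str]) -> float:
    """Fraction of predictions that match the audited labels."""
    if not gold or len(pred) != len(gold):
        raise ValueError("pred and gold must be non-empty and the same length")
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def compare_variants(preds_by_variant: Dict[str, List[str]],
                     gold: List[str]) -> Dict[str, float]:
    """Accuracy of every prompt variant on the same audited sample."""
    return {name: accuracy(p, gold) for name, p in preds_by_variant.items()}


def spread(scores: Dict[str, float]) -> float:
    """Best-minus-worst accuracy; a smaller value means a more stable prompt family."""
    return max(scores.values()) - min(scores.values())


# Toy audit of four instances and two prompt templates.
gold = ["yes", "no", "yes", "no"]
preds = {
    "template_a": ["yes", "no", "yes", "no"],
    "template_b": ["yes", "yes", "yes", "no"],
}
scores = compare_variants(preds, gold)
print(scores)          # {'template_a': 1.0, 'template_b': 0.75}
print(spread(scores))  # 0.25
```

Running the same comparison for plain few‑shot and CoT prompt families shows which one degrades less when the template wording changes.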

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Performance depends on LLM quality and prompt design; not uniformly better than humans.
  • Underperforms on fine semantic tasks (WiC) where subtle human judgment matters.
  • Requires API access and has per‑call costs; not free at scale.
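To make the per‑call cost concrete, a back‑of‑the‑envelope estimate looks like the sketch below. Every number in the example is a placeholder assumption, not a quoted price or a figure from the paper; substitute your provider's current rate and your own measured token counts.

```python
def annotation_cost(n_items: int, prompt_tokens: int, completion_tokens: int,
                    price_per_1k_tokens: float) -> float:
    """Estimated spend for labeling n_items with one LLM call per item.

    Assumes a single flat price per 1,000 tokens for prompt and completion
    alike, which is a simplification; many providers price them separately.
    """
    tokens_per_call = prompt_tokens + completion_tokens
    return n_items * tokens_per_call / 1000 * price_per_1k_tokens


# Placeholder inputs: 10,000 items, a 1,500-token few-shot CoT prompt,
# a 100-token completion, and a hypothetical rate of 0.02 per 1k tokens.
estimate = annotation_cost(10_000, 1_500, 100, 0.02)
print(round(estimate, 2))  # 320.0 in whatever currency the rate uses
```

Few‑shot CoT prompts are long because every demonstration carries an explanation, so prompt tokens usually dominate the estimate.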

When Not To Use

  • Labeling that requires domain experts or high‑stakes correctness.
  • Fine‑grained word‑sense or other tasks with weak explicit rules.
  • When you can cheaply fine‑tune a supervised model with labeled data.

Failure Modes

  • LLM hallucinations in generated explanations lead to wrong labels.
  • Template sensitivity: plain few‑shot prompts can fail drastically with small wording changes.
  • Biases present in the LLM may be reflected in annotations.

Core Entities

Models

  • text-davinci-003
  • gpt-3.5-turbo (ChatGPT)
  • PaLM
  • T5 11B
  • DPR
  • PROD

Metrics

  • Accuracy
  • MRR@10
  • Recall@k
  • Fleiss' kappa
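
Fleiss' kappa, the agreement statistic listed above, can be computed from a table of rating counts. The sketch below follows the standard definition; the function name and the toy tables are illustrative and not taken from the paper.

```python
import math
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """counts[i][j] = number of raters who assigned item i to category j.

    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])

    # Observed agreement: share of agreeing rater pairs per item, averaged.
    per_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
                for row in counts]
    p_bar = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    p_e = sum(p * p for p in p_cat)

    if math.isclose(p_e, 1.0):
        return float("nan")  # undefined when every rating falls in one category
    return (p_bar - p_e) / (1 - p_e)


# Three raters, two categories.
perfect = [[3, 0], [0, 3]]  # all raters agree on every item
split = [[2, 1], [1, 2]]    # raters disagree on every item
print(fleiss_kappa(perfect))  # 1.0
print(fleiss_kappa(split))    # about -0.333
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance, which is the scale on which the reported 0.55 should be read.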

Datasets

  • QK (query-keyword relevance)
  • BoolQ
  • WiC
  • ConIR
  • MS-MARCO

Benchmarks

  • SuperGLUE

Context Entities

Models

  • GPT-3
  • Gopher
  • Chinchilla
  • LLaMA
  • PaLM 540B
  • ST-MoE

Datasets

  • MS-MARCO (used to build ConIR)