Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.7
Citation Count
34
Why It Matters For Business
Prompting an LLM with its own explanations of labeled examples is a cheap way to scale annotation for rule‑like labeling tasks. It can reduce the need for human labeling on some tasks and quickly bootstrap retrieval datasets.
Summary TLDR
AnnoLLM turns GPT‑3.5 into a practical annotator in two steps: it first asks an LLM to explain why each labeled example has its label, then uses those self‑generated explanations as few‑shot chain‑of‑thought (CoT) prompts to label new data. Of the three tasks tested, AnnoLLM matches or beats crowdsourced labels on two (query‑keyword relevance QK and BoolQ) but lags on the harder semantic task (WiC). The authors also used AnnoLLM to build ConIR, a conversation‑based retrieval dataset; human checks rate it as fluent and moderately relevant.
Problem Statement
Human data labeling is slow and costly. The paper asks whether a modern LLM (GPT‑3.5) can replace crowdsourced annotators when it is given the same guidance a human annotator would receive: a task description, label definitions, and explained examples.
Main Contribution
AnnoLLM: a two‑step explain‑then‑annotate pipeline that generates explanations with an LLM and builds few‑shot CoT prompts from them.
Empirical tests on QK, BoolQ, and WiC showing AnnoLLM matches or exceeds crowdsourced accuracy on QK and BoolQ, but not on WiC.
Creation of ConIR, a conversation‑based information retrieval dataset produced and filtered with AnnoLLM, with human evaluation of quality.
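The two-step explain-then-annotate pipeline described above can be sketched as plain prompt construction. This is a minimal illustration, not the authors' actual prompts: the `Demo` class, both function names, and all prompt wording are hypothetical, and the LLM calls themselves are left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    text: str              # the input to be labeled
    label: str             # gold label for this demonstration
    explanation: str = ""  # filled in with the LLM's answer to step 1


def explanation_prompt(task: str, demo: Demo) -> str:
    """Step 1 (hypothetical wording): ask the LLM why the gold label is correct."""
    return (
        f"{task}\n\n"
        f"Input: {demo.text}\n"
        f"Label: {demo.label}\n"
        "Briefly explain why this label is correct."
    )


def annotation_prompt(task: str, demos: List[Demo], new_text: str) -> str:
    """Step 2: few-shot CoT prompt; each demo shows its explanation, then its label."""
    parts = [task, ""]
    for d in demos:
        parts += [
            f"Input: {d.text}",
            f"Explanation: {d.explanation}",
            f"Label: {d.label}",
            "",
        ]
    # The prompt ends at "Explanation:" so the model reasons before it labels.
    parts += [f"Input: {new_text}", "Explanation:"]
    return "\n".join(parts)
```

In use, the response to `explanation_prompt` would be stored in `Demo.explanation` for each demonstration, and `annotation_prompt` would then be sent once per unlabeled instance.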
Key Findings
AnnoLLM outperforms crowdsourced annotators on the QK task.
AnnoLLM matches or slightly exceeds human accuracy on BoolQ (yes/no questions).
AnnoLLM underperforms humans on WiC, a hard word‑sense task.
ConIR conversations are highly fluent but only moderately relevant to, and consistent with, their paired passages.
CoT prompts made from LLM explanations are more stable across template changes than plain few‑shot prompts.
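One way to check the stability finding on your own task is to label the same evaluation set under several prompt templates and compare the spread of accuracies. The sketch below assumes you already have predictions per template; the function names and the choice of population standard deviation are illustrative, not taken from the paper.

```python
from statistics import mean, pstdev


def accuracy(preds, gold):
    """Fraction of predictions that equal the gold labels."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def template_stability(preds_by_template, gold):
    """Summarize how much accuracy moves when only the prompt template changes.

    preds_by_template maps a template name to its list of predicted labels,
    aligned with gold. A smaller std/range means a more stable prompt style.
    """
    accs = {name: accuracy(p, gold) for name, p in preds_by_template.items()}
    values = list(accs.values())
    return {
        "per_template": accs,
        "mean": mean(values),
        "std": pstdev(values),
        "range": max(values) - min(values),
    }
```

Running this once for plain few-shot prompts and once for explanation-based CoT prompts gives two summaries that can be compared directly.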
Results
Accuracy
MRR@10 (ConIR test)
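For readers unfamiliar with the retrieval metric, MRR@10 averages the reciprocal rank of the first relevant passage within the top 10 results, counting a query as 0 when none appears. A minimal sketch follows; the function name and input layout are assumptions for illustration.

```python
def mrr_at_k(ranked_ids_per_query, relevant_ids_per_query, k=10):
    """Mean reciprocal rank, truncated at rank k.

    ranked_ids_per_query: one ranked list of passage ids per query.
    relevant_ids_per_query: one set of relevant passage ids per query.
    """
    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_ids_per_query)
```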
Who Should Care
What To Try In 7 Days
Pick a binary labeling task and craft clear task + category definitions.
Use ChatGPT to produce explanations for 10–50 labeled examples, build few‑shot CoT prompts from them, and label a small batch with text‑davinci‑003.
Run a small human audit (100 instances) and compare accuracy and stability across prompt variants.
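With an audit of only 100 instances, a point estimate of accuracy is noisy, so it helps to report an interval alongside it. The sketch below uses the Wilson score interval, which is a choice made here rather than something the paper prescribes; the function name is hypothetical.

```python
from math import sqrt


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Example: the audit agrees with 90 of 100 LLM labels.
low, high = wilson_interval(90, 100)
```

If the intervals for two prompt variants overlap heavily, 100 audited instances are probably not enough to tell them apart.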
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Performance depends on LLM quality and prompt design; not uniformly better than humans.
- Underperforms on fine semantic tasks (WiC) where subtle human judgment matters.
- Requires API access and has per‑call costs; not free at scale.
When Not To Use
- Labeling that requires domain experts or high‑stakes correctness.
- Fine‑grained word‑sense or other tasks with weak explicit rules.
- When you can cheaply fine‑tune a supervised model with labeled data.
Failure Modes
- LLM hallucinations in generated explanations lead to wrong labels.
- Template sensitivity: plain few‑shot prompts can fail drastically with small wording changes.
- Biases present in the LLM may be reflected in annotations.
Core Entities
Models
- text-davinci-003
- gpt-3.5-turbo (ChatGPT)
- PaLM
- T5 11B
- DPR
- PROD
Metrics
- Accuracy
- MRR@10
- Recall@k
- Fleiss' kappa
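Of the metrics listed, Fleiss' kappa is the least self-explanatory: it measures agreement among a fixed number of raters beyond what chance would produce. Here is a minimal sketch of the standard formula; the function name and the count-matrix input layout are assumptions for illustration.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] = number of raters who put item i in category j.
    Every row must sum to the same number of raters n (n >= 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")

    # Share of all ratings that fall in each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    # Note: undefined (division by zero) if all ratings use a single category.
    return (p_bar - p_e) / (1 - p_e)
```

A value of 1 means perfect agreement, 0 means agreement at chance level, and negative values mean raters agree less often than chance would predict.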
Datasets
- QK (query-keyword relevance)
- BoolQ
- WiC
- ConIR
- MS-MARCO
Benchmarks
- SuperGLUE
Context Entities
Models
- GPT-3
- Gopher
- Chinchilla
- LLaMA
- PaLM 540B
- ST-MoE
Datasets
- MS-MARCO (used to build ConIR)

