Overview
AnnoLLM is ready for low‑risk, rule-like labeling workflows and dataset bootstrapping; validate with human audits when labels hinge on subtle semantics or are high‑stakes.
Citations: 34
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
You can cheaply scale annotation for rule-like labeling tasks by prompting LLMs with self‑generated explanations; this can cut human labeling needs for some tasks and bootstrap retrieval datasets quickly.
Summary TLDR
AnnoLLM turns GPT‑3.5 into a practical annotator by first asking an LLM to explain labeled examples, then using those self‑generated explanations as few‑shot chain‑of‑thought (CoT) prompts to label new data. Across three tasks (query-keyword relevance (QK), BoolQ, and WiC), AnnoLLM matches or beats crowdsourced labels on QK and BoolQ but lags on WiC, the harder semantic task. The authors also used AnnoLLM to build ConIR, a conversation-based retrieval dataset; human checks rate it as fluent and moderately relevant.
Problem Statement
Human data labeling is slow and costly. The paper asks whether modern LLMs (GPT‑3.5) can replace crowdsourced annotators when given the same guidance as human annotators: a task description, label definitions, and example explanations.
Main Contribution
AnnoLLM: a two‑step explain‑then‑annotate pipeline that generates explanations with an LLM and builds few‑shot CoT prompts from them.
Empirical tests on QK, BoolQ, and WiC showing AnnoLLM matches or exceeds crowdsourced accuracy on QK and BoolQ, but not on WiC.
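The two steps can be sketched as below. This is a minimal illustration, not the authors' implementation: the prompt wording, the function names, and the `llm` callable (any function that maps a prompt string to a completion string) are all assumptions made for the example.

```python
from typing import Callable, Sequence, Tuple

# `llm` is a hypothetical stand-in for whatever model client you use:
# it takes a prompt string and returns the model's text completion.
LLM = Callable[[str], str]


def explanation_prompt(task: str, text: str, label: str) -> str:
    """Step 1 (explain): ask the model why a known label is correct."""
    return (
        f"{task}\n\nInput: {text}\nLabel: {label}\n"
        "Briefly explain why this label is correct."
    )


def build_cot_prompt(
    task: str,
    demos: Sequence[Tuple[str, str, str]],  # (text, explanation, label)
    new_text: str,
) -> str:
    """Step 2 (annotate): few-shot prompt where each demonstration
    shows its explanation before its label."""
    blocks = [task]
    for text, explanation, label in demos:
        blocks.append(
            f"Input: {text}\nExplanation: {explanation}\nLabel: {label}"
        )
    # Leave the explanation open so the model reasons before labeling.
    blocks.append(f"Input: {new_text}\nExplanation:")
    return "\n\n".join(blocks)


def annotate(
    task: str,
    labeled: Sequence[Tuple[str, str]],  # (text, gold label)
    new_text: str,
    llm: LLM,
) -> str:
    """Generate one explanation per labeled example, then label new_text."""
    demos = [
        (text, llm(explanation_prompt(task, text, label)), label)
        for text, label in labeled
    ]
    return llm(build_cot_prompt(task, demos, new_text))
```

In practice the explanations would be generated once and cached, so that only the final prompt is sent for each new instance.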
Key Findings
AnnoLLM outperforms crowdsourced annotators on the QK task.
AnnoLLM matches or slightly exceeds human accuracy on BoolQ (yes/no questions).
Results
| Metric | Value | Baseline | Delta (pp) | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 75.6% | 71.5% (crowdsourced) | +4.1 | QK test | AnnoLLM (text-davinci-003 + 4-shot CoT) beats crowd | Table 2 |
| Accuracy | 89.2% | 89.0% (crowdsourced) | +0.2 | BoolQ test | AnnoLLM (8-shot CoT) matches human performance | Table 4 |
What To Try In 7 Days
Pick a binary labeling task and craft clear task + category definitions.
Use ChatGPT to produce explanations for 10–50 example labels, build few‑shot CoT prompts, and label a small batch with text‑davinci‑003.
Run a small human audit (100 instances) and compare accuracy and stability across prompt variants.
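For the audit step, a rough sketch of how accuracy on a small audited sample might be scored is shown below. The Wilson score interval is a choice made for this example, not something the paper prescribes; it is included because a point estimate from a 100-instance audit is noisy.

```python
import math
from typing import Sequence, Tuple


def accuracy(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of model labels that match the human audit labels."""
    if len(pred) != len(gold) or not gold:
        raise ValueError("pred and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def wilson_interval(correct: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Example: 76 of 100 audited labels agree with the human auditor.
low, high = wilson_interval(76, 100)
print(f"accuracy 0.76, approx. 95% interval [{low:.3f}, {high:.3f}]")
```

With 100 audited instances the interval spans roughly eight points either side of the estimate, so small differences between prompt variants should not be over-interpreted at that sample size.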
Risks & Boundaries
Limitations
Performance depends on LLM quality and prompt design; not uniformly better than humans.
Underperforms on fine semantic tasks (WiC) where subtle human judgment matters.
When Not To Use
Labeling that requires domain expertise or where errors are high‑stakes.
Fine‑grained word‑sense or other tasks with weak explicit rules.
Failure Modes
LLM hallucinations in generated explanations lead to wrong labels.
Template sensitivity: plain few‑shot prompts can fail drastically with small wording changes.
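One way to probe template sensitivity before trusting a prompt is to label the same items with several reworded templates and measure how often the labels agree. The sketch below assumes labels have already been collected per variant; the function names and the dictionary layout are illustrative, not from the paper.

```python
from typing import Dict, List, Sequence


def _columns(labels_by_variant: Dict[str, Sequence[str]]) -> List[tuple]:
    """Transpose per-variant label lists into per-item tuples."""
    runs = list(labels_by_variant.values())
    if not runs or not runs[0]:
        raise ValueError("need at least one variant with at least one label")
    if any(len(r) != len(runs[0]) for r in runs):
        raise ValueError("every variant must label the same items")
    return list(zip(*runs))


def unanimous_rate(labels_by_variant: Dict[str, Sequence[str]]) -> float:
    """Fraction of items on which every prompt variant gives the same label."""
    cols = _columns(labels_by_variant)
    return sum(len(set(col)) == 1 for col in cols) / len(cols)


def unstable_items(labels_by_variant: Dict[str, Sequence[str]]) -> List[int]:
    """Indices of items whose label changes with prompt wording."""
    cols = _columns(labels_by_variant)
    return [i for i, col in enumerate(cols) if len(set(col)) > 1]


runs = {
    "template_a": ["yes", "no", "yes"],
    "template_b": ["yes", "yes", "yes"],
}
print(unanimous_rate(runs))   # 2 of 3 items agree
print(unstable_items(runs))   # item at index 1 flips
```

Items that flip between variants are natural candidates for the human audit.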

