Cut domain-specific annotation cost: mix a small set of human labels with many GPT-3.5 labels using smart sampling and prompt retrieval

October 31, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

4

Authors

Jiaxin Zhang, Zhuohang Li, Kamalika Das, Sricharan Kumar

Why It Matters For Business

IMFL cuts expert labeling costs by replacing many expensive human labels with cheaper LLM labels while keeping near-human performance on key domain tasks, enabling faster, cheaper domain model launches.

Summary TLDR

IMFL is a practical loop for fine-tuning small domain models using a small human label budget plus many cheaper LLM (GPT-3.5) labels. Key pieces: an exploration–exploitation query (EEQ) that mixes diversity and uncertainty sampling, prompt-retrieval (Sentence-BERT) to produce better in-context examples for the LLM annotator, and a decaying human batch-size schedule. On four finance/medical tasks, IMFL (200 human + 800 GPT-3.5 labels) beats a 3× human-only baseline and approaches a 5× human upper bound on two datasets, while outperforming all-GPT annotations.

Problem Statement

Domain fine-tuning needs many high-quality human labels, which are expensive and slow. Using LLMs to auto-label is cheap but prone to hallucinations and noise. The paper asks: how to allocate a fixed annotation budget between humans and LLM auto-annotations to maximize downstream model performance?

Main Contribution

Formulate domain adaptation as interactive multi-fidelity learning that mixes high-fidelity human labels and low-fidelity LLM labels under a fixed annotation budget.

Design an exploration-exploitation query (EEQ): cluster-based diversity plus uncertainty sampling to split queries between LLM and humans.

Improve LLM auto-labels via similarity-based prompt retrieval (Sentence-BERT) that provides in-context examples from human-labeled data.

Use a variable, decaying human batch-size schedule (more humans early) to support better retrieval and knowledge distillation.

Empirically show IMFL (200 human + 800 GPT-3.5) outperforms 3× human and all-GPT baselines on four domain tasks.
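
As a rough illustration of the similarity-based prompt retrieval step listed above, the sketch below ranks human-labeled examples by cosine similarity to a query and assembles a few-shot prompt for the LLM annotator. It is not the paper's implementation: the 2-d vectors stand in for Sentence-BERT embeddings, and the function names, example sentences, and instruction text are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_examples(query_vec, labeled, k=3):
    """Return the k human-labeled (text, label) pairs closest to the query.

    labeled is a list of (text, label, embedding) triples.
    """
    ranked = sorted(labeled, key=lambda item: cosine(query_vec, item[2]), reverse=True)
    return [(text, label) for text, label, _ in ranked[:k]]

def build_prompt(query_text, examples):
    """Assemble a few-shot prompt; the instruction wording is an assumption."""
    parts = ["Label the sentiment of each sentence."]
    for text, label in examples:
        parts.append(f"Sentence: {text}\nLabel: {label}")
    parts.append(f"Sentence: {query_text}\nLabel:")
    return "\n\n".join(parts)

# Toy data: made-up sentences with hand-written 2-d "embeddings".
labeled = [
    ("Profit rose 12% year over year.", "positive", [0.9, 0.1]),
    ("The company cut 500 jobs.", "negative", [0.1, 0.9]),
    ("Revenue beat analyst estimates.", "positive", [0.8, 0.3]),
]
examples = retrieve_examples([1.0, 0.2], labeled, k=2)
prompt = build_prompt("Margins improved in the third quarter.", examples)
print(prompt)
```

In a real pipeline the embeddings would come from a sentence encoder, and the pool of retrievable examples would grow as humans label more data each round, which is one reason front-loading human labels helps retrieval.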

Key Findings

Mixing 200 human labels with 800 GPT-3.5 labels (IMFL) outperforms using 600 human labels (3×) on four domain tasks.

Numbers: FPB +4.72 F1; Headline +6.96 F1; PubMedQA +3.61 acc; MedQA +9.67 acc

IMFL comes within about 1.3 points of the 5× human-label upper bound (1,000 human labels) on Headline and PubMedQA.

Numbers: Headline −0.83 pp; PubMedQA −1.32 pp (absolute)

Combining prompt retrieval and variable human batch sizes improves LLM annotation quality.

Numbers: Headline: similarity retrieval 80.28 vs random 73.77 F1; PubMedQA: 72.05 vs 68.10

EEQ (cluster diversity + uncertainty) beats random sampling under the same budget.

Numbers: average +5.91 absolute (example: Headline 81.09 vs 74.32)

Results

FPB F1 (IMFL vs 3× Human)

Value: 47.88 (IMFL)

Baseline: 43.16 (3× Human)

Headline F1 (IMFL vs 5× Human)

Value: 81.09 (IMFL)

Baseline: 81.92 (5× Human)

PubMedQA accuracy (IMFL vs 5× Human)

Value: 73.76 (IMFL)

Baseline: 75.08 (5× Human)

IMFL vs All-GPT-3.5 (same 1000 budget)

Value: FPB +7.35 F1; Headline +8.30 F1; PubMedQA +6.89 acc; MedQA +19.95 acc

Baseline: All GPT-3.5 (1000 labels)

Who Should Care

What To Try In 7 Days

Run a small pilot: budget 200 expert-labeled samples and 800 LLM labels.

Implement Sentence-BERT retrieval to supply 3–5 nearest human examples as LLM prompts.

Use EEQ: k-means clustering for diversity then least-confidence selection for human queries each round (R≈5).
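
One possible reading of the EEQ step above, sketched in plain Python under stated assumptions: cluster the unlabeled pool with k-means for diversity, keep the most uncertain item in each cluster, then send the most uncertain of those candidates to humans and the rest to the LLM annotator. The routing rule, the helper names, and the toy data are illustrative guesses, not the paper's exact procedure.

```python
import random

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the cluster index of every point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        assign = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return [nearest(p, centroids) for p in points]

def eeq_select(embeddings, probs, k, n_human):
    """Return (human_ids, llm_ids) for one query round.

    probs[i] holds the current model's class probabilities for item i;
    least-confidence uncertainty is 1 - max probability.
    """
    uncertainty = [1.0 - max(p) for p in probs]
    assign = kmeans(embeddings, k)
    candidates = []
    for c in range(k):  # exploration: one candidate per cluster
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            candidates.append(max(members, key=lambda i: uncertainty[i]))
    # exploitation: the most uncertain candidates go to human annotators
    candidates.sort(key=lambda i: uncertainty[i], reverse=True)
    return candidates[:n_human], candidates[n_human:]

# Toy pool: two well-separated groups with made-up model probabilities.
embeddings = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
probs = [[0.9, 0.1], [0.55, 0.45], [0.8, 0.2],
         [0.7, 0.3], [0.95, 0.05], [0.6, 0.4]]
human_ids, llm_ids = eeq_select(embeddings, probs, k=3, n_human=1)
# Item 1 is the most uncertain in the whole pool, so it is always sent to a human.
print(human_ids, llm_ids)
```

In practice a library k-means and real encoder embeddings would replace the toy pieces here; the point is only the two-stage shape, diversity first and uncertainty second.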

Agent Features

Planning

  • LoRA

Tool Use

  • LLM as annotator (GPT-3.5)
  • Sentence-BERT prompt retrieval
  • LoRA

Optimization Features

Token Efficiency

  • substitute many human annotations with cheaper LLM annotations (20/80 human/LLM default)

Infra Optimization

  • use smaller base models (dolly-v2-3b) to cut compute

Model Optimization

  • LoRA

System Optimization

  • random sub-sampling of unlabeled pool to reduce acquisition compute

Training Optimization

  • interactive multi-round fine-tuning (R=5)
  • variable human batch-size decay to front-load human labels
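
The decaying human batch-size idea above can be sketched as a simple allocation function. The summary does not state the exact decay rule, so the geometric decay with ratio 0.5 used here is an assumption, as is the function name; only the 200-label human budget and R=5 come from the text.

```python
def human_batch_schedule(total, rounds, decay=0.5):
    """Split `total` human labels over `rounds`, with more labels in early rounds.

    Geometric decay is a hypothetical choice; decay must be in (0, 1].
    """
    weights = [decay ** r for r in range(rounds)]
    scale = total / sum(weights)
    sizes = [int(w * scale) for w in weights]  # round each share down
    # Hand any leftover labels to the earliest rounds to keep the front-loading.
    leftover = total - sum(sizes)
    for i in range(leftover):
        sizes[i % rounds] += 1
    return sizes

schedule = human_batch_schedule(200, 5)
print(schedule)  # [104, 52, 26, 12, 6]
```

With these settings more than half of the human budget is spent in the first round, which gives prompt retrieval a larger pool of human-labeled examples to draw on in later rounds.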

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Budget modeled as annotation count, not true multi-dimensional cost (time, admin, training).
  • Performance depends on LLM annotator quality; noisy LLM labels can hurt (MedQA example).
  • Limited by size/diversity of available unlabeled pool; it does not synthesize new data.
  • Does not reach state-of-the-art vs very large LLMs on some tasks.

When Not To Use

  • You have ample expert labels and budget—pure human labels may be simpler.
  • Tasks where the available LLM annotator is known to perform poorly (e.g., niche medical subsets).
  • When annotation cost is not roughly proportional to label count (complex admin costs dominate).

Failure Modes

  • Noisy LLM annotations overwhelm the small human seed and reduce final accuracy.
  • Too few human examples break prompt retrieval usefulness (performance drops for 0.25× humans).
  • Clustering/diversity step picks poor candidates when embeddings are weak or domain-shifted.
  • LLM annotator distributional mismatch causes systematic label bias.

Core Entities

Models

  • dolly-v2-3b
  • dolly-v2-7b
  • dolly-v2-12b
  • GPT-3.5
  • GPT-3
  • GPT-4

Metrics

  • F1 score
  • Average F1
  • Accuracy

Datasets

  • FPB (Financial Phrasebank)
  • Headline (gold commodity news)
  • PubMedQA
  • MedQA