Cut domain-specific annotation cost: mix a small set of human labels with many GPT-3.5 labels using smart sampling and prompt retrieval

October 31, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The method is practical: it uses public datasets, small models, and standard components (k-means, Sentence-BERT, LoRA). Results are consistent across four datasets but depend on LLM annotator quality and unlabeled pool size.

Citations: 4

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 60%

Authors

Jiaxin Zhang, Zhuohang Li, Kamalika Das, Sricharan Kumar

Why It Matters For Business

IMFL cuts expert labeling costs by replacing many expensive human labels with cheaper LLM labels while keeping near-human performance on key domain tasks, enabling faster, cheaper domain model launches.

Summary TLDR

IMFL is a practical loop for fine-tuning small domain models using a small human label budget plus many cheaper LLM (GPT-3.5) labels. Key pieces: an exploration–exploitation query (EEQ) that mixes diversity and uncertainty sampling, prompt-retrieval (Sentence-BERT) to produce better in-context examples for the LLM annotator, and a decaying human batch-size schedule. On four finance/medical tasks, IMFL (200 human + 800 GPT-3.5 labels) beats a 3× human-only baseline and approaches a 5× human upper bound on two datasets, while outperforming all-GPT annotations.

Problem Statement

Domain fine-tuning needs many high-quality human labels, which are expensive and slow. Using LLMs to auto-label is cheap but prone to hallucinations and noise. The paper asks: how to allocate a fixed annotation budget between humans and LLM auto-annotations to maximize downstream model performance?

Main Contribution

Formulate domain adaptation as interactive multi-fidelity learning that mixes high-fidelity human labels and low-fidelity LLM labels under a fixed annotation budget.

Design an exploration-exploitation query (EEQ): cluster-based diversity plus uncertainty sampling to split queries between LLM and humans.
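The EEQ idea can be illustrated with a minimal sketch. It assumes the model's top-class confidences and the cluster assignments (for example from k-means) are already computed. The function name `eeq_split` and the routing rule (least-confident samples go to humans, cluster-diverse samples go to the LLM) are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict


def eeq_split(confidences, cluster_ids, n_human, n_llm):
    """Split pool indices into a human query set and an LLM query set.

    confidences[i]: model's top-class probability for sample i (assumed given).
    cluster_ids[i]: cluster label for sample i, e.g. from k-means (assumed given).
    """
    indices = range(len(confidences))

    # Exploitation: least-confidence sampling fills the human budget.
    by_uncertainty = sorted(indices, key=lambda i: confidences[i])
    human = by_uncertainty[:n_human]
    chosen = set(human)

    # Exploration: group the remaining samples by cluster ...
    buckets = defaultdict(list)
    for i in indices:
        if i not in chosen:
            buckets[cluster_ids[i]].append(i)

    # ... and draw from the clusters round-robin so the LLM set stays diverse.
    queues = [buckets[c] for c in sorted(buckets)]
    llm = []
    while len(llm) < n_llm and any(queues):
        for queue in queues:
            if queue and len(llm) < n_llm:
                llm.append(queue.pop(0))
    return human, llm
```

In this sketch, scarce human effort goes to the samples the model is least sure about, and the cheaper LLM labels are spread across clusters.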

Key Findings

Mixing 200 human labels with 800 GPT-3.5 labels (IMFL) outperforms using 600 human labels (3×) on four domain tasks.

Numbers: FPB +4.72 F1; Headline +6.96 F1; PubMedQA +3.61 acc; MedQA +9.67 acc

Practical Use: Use ~20% human + 80% cheaper LLM labels and IMFL's query design to get better models than spending 3× the human labels.

Evidence Ref: Table 9 (detailed main results)

IMFL approaches the 5× human-label upper bound on some tasks with small loss.

Numbers: Headline −0.83 pp; PubMedQA −1.32 pp (absolute)

Practical Use: If budget is tight, IMFL can reach near full-human performance on some tasks while saving large annotation cost.

Evidence Ref: Main results (Fig. 4) and Table 9

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| FPB F1 (IMFL vs 3× Human) | 47.88 (IMFL) | 43.16 (3× Human) | +4.72 | FPB (financial) | Table 9 main results | Table 9 |
| Headline F1 (IMFL vs 5× Human) | 81.09 (IMFL) | 81.92 (5× Human) | -0.83 | Headline (financial) | Fig. 4 and Table 9 | Fig. 4; Table 9 |
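The deltas in the results above can be re-derived from the reported values. The snippet below is only an arithmetic check. The `delta` helper is a hypothetical name, and the numbers are copied from the two rows above.

```python
# (metric, IMFL value, human-only baseline) as reported in the results above.
rows = [
    ("FPB F1, IMFL vs 3x Human", 47.88, 43.16),
    ("Headline F1, IMFL vs 5x Human", 81.09, 81.92),
]


def delta(value, baseline):
    """Absolute difference in points, rounded to two decimals."""
    return round(value - baseline, 2)


for name, value, baseline in rows:
    print(f"{name}: {delta(value, baseline):+.2f}")
```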

What To Try In 7 Days

Run a small pilot: pick 200 expert-labeled seed samples and collect 800 LLM labels.

Implement Sentence-BERT retrieval to supply 3–5 nearest human examples as LLM prompts.

Use EEQ: k-means clustering for diversity then least-confidence selection for human queries each round (R≈5).
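The retrieval step in the list above can be sketched as follows. The sketch assumes embeddings have already been produced by a sentence encoder such as Sentence-BERT and are passed in as plain lists of floats. The helper names (`cosine`, `retrieve_examples`, `build_prompt`) and the prompt template are illustrative assumptions, not the paper's exact format.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_examples(query_vec, labeled, k=3):
    """Return the k human-labeled examples closest to the query.

    labeled: list of (embedding, text, label) tuples (assumed precomputed).
    """
    ranked = sorted(labeled, key=lambda ex: cosine(query_vec, ex[0]), reverse=True)
    return ranked[:k]


def build_prompt(query_text, examples):
    """Format retrieved examples as in-context demonstrations for the LLM annotator."""
    blocks = [f"Text: {text}\nLabel: {label}\n" for _, text, label in examples]
    blocks.append(f"Text: {query_text}\nLabel:")
    return "\n".join(blocks)
```

A call such as `build_prompt(text, retrieve_examples(vec, human_pool, k=3))` would give the prompt sent to the LLM annotator, where `human_pool` stands for the human-labeled examples collected so far.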

Agent Features

Planning: LoRA
Tool Use: LLM as annotator (GPT-3.5); Sentence-BERT prompt retrieval; LoRA

Optimization Features

Token Efficiency: substitute many human annotations with cheaper LLM annotations (20/80 human/LLM default)
Infra Optimization: use smaller base models (dolly-v2-3b) to cut compute
Model Optimization: LoRA
System Optimization: random sub-sampling of unlabeled pool to reduce acquisition compute
Training Optimization: interactive multi-round fine-tuning (R=5); variable human batch-size decay to front-load human labels
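A front-loaded human batch schedule could look like the sketch below. The geometric decay and the `decay=0.5` default are assumptions chosen for illustration. The summary states only that human batch sizes decay across rounds, not the exact rule.

```python
def human_batch_schedule(total, rounds, decay=0.5):
    """Split `total` human labels over `rounds` with geometrically shrinking batches.

    The decay rule is a hypothetical choice; the sizes always sum to `total`.
    """
    weights = [decay ** r for r in range(rounds)]
    scale = total / sum(weights)
    sizes = [int(w * scale) for w in weights]
    # Assign the rounding remainder to the first round (front-loading).
    sizes[0] += total - sum(sizes)
    return sizes


# Example: 200 human labels spread over R=5 rounds.
schedule = human_batch_schedule(200, 5)
print(schedule, sum(schedule))
```

Early rounds get most of the human labels, so the first fine-tuning passes and the retrieved in-context examples start from cleaner supervision.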

Risks & Boundaries

Limitations

Budget modeled as annotation count, not true multi-dimensional cost (time, admin, training).

Performance depends on LLM annotator quality; noisy LLM labels can hurt (MedQA example).

When Not To Use

You have ample expert labels and budget; pure human labeling may be simpler.

Tasks where the available LLM annotator is known to perform poorly (e.g., niche medical subsets).

Failure Modes

Noisy LLM annotations overwhelm the small human seed and reduce final accuracy.

Too few human examples make prompt retrieval less useful (performance drops with only 0.25× the human labels).

Core Entities

Models

dolly-v2-3b, dolly-v2-7b, dolly-v2-12b, GPT-3.5, GPT-3, GPT-4

Metrics

F1 score, Average F1, Accuracy

Datasets

FPB (Financial PhraseBank), Headline (gold commodity news), PubMedQA, MedQA