Cut domain-specific annotation cost: mix a small set of human labels with many GPT-3.5 labels using smart sampling and prompt retrieval

Overview

Decision Snapshot: Needs Validation

The method is practical: it uses public datasets, small models, and standard components (k-means, Sentence-BERT, LoRA). Results are consistent across four datasets but depend on LLM annotator quality and unlabeled pool size.

Citations: 4

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 60%

Authors

Jiaxin Zhang, Zhuohang Li, Kamalika Das, Sricharan Kumar

Links

Abstract / PDF / Data

Why It Matters For Business

IMFL cuts expert labeling costs by replacing many expensive human labels with cheaper LLM labels while keeping near-human performance on key domain tasks, enabling faster, cheaper domain model launches.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

IMFL is a practical loop for fine-tuning small domain models using a small human label budget plus many cheaper LLM (GPT-3.5) labels. Key pieces: an exploration–exploitation query (EEQ) that mixes diversity and uncertainty sampling, prompt-retrieval (Sentence-BERT) to produce better in-context examples for the LLM annotator, and a decaying human batch-size schedule. On four finance/medical tasks, IMFL (200 human + 800 GPT-3.5 labels) beats a 3× human-only baseline and approaches a 5× human upper bound on two datasets, while outperforming all-GPT annotations.

Problem Statement

Domain fine-tuning needs many high-quality human labels, which are expensive and slow. Using LLMs to auto-label is cheap but prone to hallucinations and noise. The paper asks: how to allocate a fixed annotation budget between humans and LLM auto-annotations to maximize downstream model performance?

Main Contribution

Formulate domain adaptation as interactive multi-fidelity learning that mixes high-fidelity human labels and low-fidelity LLM labels under a fixed budget.

Design an exploration-exploitation query (EEQ): cluster-based diversity plus uncertainty sampling to split queries between LLM and humans.
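The EEQ idea above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that each round takes one candidate per k-means cluster (exploration), then sends the least-confident candidates to humans and the rest to the LLM (exploitation). The names `kmeans_assign` and `eeq_select`, the first-k centroid seeding, and the toy data are all hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_assign(points: List[Vector], k: int, iters: int = 20) -> List[int]:
    """Plain Lloyd's k-means; returns one cluster id per point.

    Centroids are seeded with the first k points so the result is deterministic
    (an assumption made for this sketch, not a recommendation).
    """
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iters):
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def eeq_select(
    embeddings: List[Vector],
    confidences: List[float],
    n_queries: int,
    n_human: int,
) -> Tuple[List[int], List[int]]:
    """Return (indices for human annotation, indices for LLM annotation)."""
    clusters = kmeans_assign(embeddings, k=n_queries)
    # Exploration: one pick per cluster keeps the query batch diverse.
    picks = []
    for c in range(n_queries):
        members = [i for i, a in enumerate(clusters) if a == c]
        if members:
            picks.append(min(members, key=lambda i: confidences[i]))
    # Exploitation: the least-confident picks go to humans, the rest to the LLM.
    picks.sort(key=lambda i: confidences[i])
    return picks[:n_human], picks[n_human:]


# Toy pool: four well-separated clusters, two points each (made-up data).
pool = [(0, 0), (10, 0), (0, 10), (10, 10), (0.1, 0), (10.1, 0), (0, 10.1), (10, 10.1)]
conf = [0.9, 0.8, 0.7, 0.6, 0.5, 0.95, 0.3, 0.65]  # model's top-class probability
human_idx, llm_idx = eeq_select(pool, conf, n_queries=4, n_human=1)
print(human_idx, llm_idx)  # prints [6] [4, 3, 1]
```

In a real run the embeddings would come from the model or a sentence encoder, and the confidences from the current fine-tuned model's predictions on the unlabeled pool.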

Key Findings

Mixing 200 human labels with 800 GPT-3.5 labels (IMFL) outperforms using 600 human labels (3×) on four domain tasks.

Numbers: FPB +4.72 F1; Headline +6.96 F1; PubMedQA +3.61 acc; MedQA +9.67 acc

Practical Use: Use ~20% human + 80% cheaper LLM labels and IMFL's query design to get better models than spending 3× the human labels.

Evidence Ref: Table 9 (detailed main results)

IMFL approaches the 5× human-label upper bound on some tasks with small loss.

Numbers: Headline −0.83 pp; PubMedQA −1.32 pp (absolute)

Practical Use: If budget is tight, IMFL can reach near full-human performance on some tasks while saving large annotation cost.

Evidence Ref: Main results (Fig. 4) and Table 9

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
FPB F1 (IMFL vs 3× Human)	47.88 (IMFL)	43.16 (3× Human)	+4.72	FPB (financial)	Table 9 main results	Table 9
Headline F1 (IMFL vs 5× Human)	81.09 (IMFL)	81.92 (5× Human)	-0.83	Headline (financial)	Fig.4 and Table 9	Fig.4; Table 9
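As a quick sanity check, the Delta column can be recomputed from the Value and Baseline columns of the two rows above. The snippet only uses numbers already shown in the table; the helper name `delta` is illustrative.

```python
# (metric, IMFL value, baseline value, reported delta) copied from the table above
rows = [
    ("FPB F1, IMFL vs 3x Human", 47.88, 43.16, 4.72),
    ("Headline F1, IMFL vs 5x Human", 81.09, 81.92, -0.83),
]


def delta(value: float, baseline: float) -> float:
    """Absolute difference in points, rounded to the table's two decimals."""
    return round(value - baseline, 2)


for name, value, baseline, reported in rows:
    computed = delta(value, baseline)
    status = "matches" if computed == reported else "differs from"
    print(f"{name}: computed {computed:+.2f} {status} reported {reported:+.2f}")
```

Both rows agree with the reported deltas (+4.72 and −0.83).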

What To Try In 7 Days

Run a small pilot: collect 200 expert-labeled samples and 800 LLM-labeled samples.

Implement Sentence-BERT retrieval to supply 3–5 nearest human examples as LLM prompts.

Use EEQ: k-means clustering for diversity, then least-confidence selection for human queries in each round (R ≈ 5 rounds).
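The prompt-retrieval step in the pilot above could look roughly like the sketch below: rank the human-labeled examples by cosine similarity to the query and place the nearest ones in the LLM prompt as in-context demonstrations. To keep the example dependency-free, `toy_embed` is a bag-of-words stand-in for Sentence-BERT; in practice you would pass a function that returns Sentence-BERT embeddings. The function names, the prompt template, and the example sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple

Embedding = Dict[str, float]
Example = Tuple[str, str]  # (text, human label)


def toy_embed(text: str) -> Embedding:
    """Stand-in encoder (bag-of-words counts); swap in Sentence-BERT in practice."""
    return dict(Counter(re.findall(r"[a-z0-9]+", text.lower())))


def cosine(a: Embedding, b: Embedding) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_examples(
    query: str,
    labeled: List[Example],
    k: int = 3,
    embed: Callable[[str], Embedding] = toy_embed,
) -> List[Example]:
    """Return the k human-labeled examples most similar to the query."""
    q = embed(query)
    ranked = sorted(labeled, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, examples: List[Example]) -> str:
    """Few-shot prompt: retrieved demonstrations first, then the unlabeled query."""
    blocks = [f"Text: {text}\nLabel: {label}\n" for text, label in examples]
    blocks.append(f"Text: {query}\nLabel:")
    return "\n".join(blocks)


# Made-up finance-style examples, for illustration only.
human_labeled = [
    ("Operating profit rose to EUR 13.1 mn", "positive"),
    ("The company cut 200 jobs after sales fell", "negative"),
    ("The meeting will be held in Helsinki", "neutral"),
    ("Net sales rose 12 percent", "positive"),
]
query = "Quarterly profit rose sharply"
shots = retrieve_examples(query, human_labeled, k=2)
prompt = build_prompt(query, shots)
print(prompt)
```

The resulting `prompt` string is what would be sent to the LLM annotator; the call to the LLM itself is omitted here.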

Agent Features

Planning

LoRA

Tool Use

LLM as annotator (GPT-3.5)

Sentence-BERT prompt retrieval

LoRA

Optimization Features

Token Efficiency

substitute many human annotations with cheaper LLM annotations (20/80 human/LLM default)

Infra Optimization

use smaller base models (dolly-v2-3b) to cut compute

Model Optimization

LoRA

System Optimization

random sub-sampling of unlabeled pool to reduce acquisition compute
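A minimal sketch of that sub-sampling step: before scoring candidates in a round, draw a fixed-size random subset of the unlabeled pool so the acquisition cost does not grow with the full pool. The function name, the pool size, and the cap of 1,000 are hypothetical values chosen for the example.

```python
import random
from typing import List, Sequence


def subsample_pool(pool_ids: Sequence[int], max_size: int, seed: int = 0) -> List[int]:
    """Return at most max_size ids drawn without replacement from the pool."""
    if len(pool_ids) <= max_size:
        return list(pool_ids)
    return random.Random(seed).sample(list(pool_ids), max_size)


unlabeled_ids = list(range(50_000))          # hypothetical pool size
candidates = subsample_pool(unlabeled_ids, max_size=1_000, seed=42)
print(len(candidates))  # prints 1000
```

Only `candidates` would then be embedded, clustered, and scored for uncertainty in that round.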

Training Optimization

interactive multi-round fine-tuning (R=5)

variable human batch-size decay to front-load human labels
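One way to front-load human labels is a geometrically decaying batch size across rounds. The summary does not give the decay form, so the geometric shape and the ratio of 0.5 below are assumptions; only the totals (200 human labels, 5 rounds) come from the text above.

```python
from typing import List


def human_batch_schedule(total: int = 200, rounds: int = 5, decay: float = 0.5) -> List[int]:
    """Split `total` human labels over `rounds`, shrinking by `decay` each round."""
    weights = [decay ** r for r in range(rounds)]
    scale = total / sum(weights)
    sizes = [int(w * scale) for w in weights]
    # Rounding down loses a few labels; hand them to the earliest rounds.
    for r in range(total - sum(sizes)):
        sizes[r % rounds] += 1
    return sizes


schedule = human_batch_schedule()
print(schedule)  # prints [104, 52, 26, 12, 6]
```

With these assumed settings, more than half of the human budget is spent in the first round, when the model is least reliable and the retrieved in-context examples matter most.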

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://huggingface.co/datasets/financial_phrasebank

https://huggingface.co/datasets/pubmed_qa

https://github.com/jind11/MedQA

Risks & Boundaries

Limitations

Budget modeled as annotation count, not true multi-dimensional cost (time, admin, training).

Performance depends on LLM annotator quality; noisy LLM labels can hurt (MedQA example).

When Not To Use

You have ample expert labels and budget—pure human labels may be simpler.

Tasks where the available LLM annotator is known to perform poorly (e.g., niche medical subsets).

Failure Modes

Noisy LLM annotations overwhelm the small human seed and reduce final accuracy.

Too few human examples make prompt retrieval less useful (performance drops at 0.25× the human labels).

Core Entities

Models

dolly-v2-3b, dolly-v2-7b, dolly-v2-12b, GPT-3.5, GPT-3, GPT-4

Metrics

F1 score, Average F1, Accuracy

Datasets

FPB (Financial Phrasebank), Headline (gold commodity news), PubMedQA, MedQA
