Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
4
Why It Matters For Business
IMFL cuts expert-labeling costs by replacing most human labels with cheaper LLM labels while keeping performance close to human-only labeling on the tested domain tasks. This makes domain model launches faster and cheaper.
Summary TLDR
IMFL is a practical loop for fine-tuning small domain models using a small human label budget plus many cheaper LLM (GPT-3.5) labels. Key pieces: an exploration–exploitation query (EEQ) that mixes diversity and uncertainty sampling, prompt-retrieval (Sentence-BERT) to produce better in-context examples for the LLM annotator, and a decaying human batch-size schedule. On four finance/medical tasks, IMFL (200 human + 800 GPT-3.5 labels) beats a 3× human-only baseline and approaches a 5× human upper bound on two datasets, while outperforming all-GPT annotations.
Problem Statement
Domain fine-tuning needs many high-quality human labels, which are expensive and slow. Using LLMs to auto-label is cheap but prone to hallucinations and noise. The paper asks: how to allocate a fixed annotation budget between humans and LLM auto-annotations to maximize downstream model performance?
Main Contribution
Formulate domain adaptation as interactive multi-fidelity learning that mixes high-fidelity human labels and low-fidelity LLM labels under a fixed budget.
Design an exploration-exploitation query (EEQ) strategy: cluster-based diversity plus uncertainty sampling, used to split queries between the LLM and humans.
Improve LLM auto-labels via similarity-based prompt retrieval (Sentence-BERT) that provides in-context examples from human-labeled data.
Use a variable, decaying human batch-size schedule (more humans early) to support better retrieval and knowledge distillation.
Empirically show IMFL (200 human + 800 GPT-3.5) outperforms 3× human and all-GPT baselines on four domain tasks.
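The similarity-based prompt retrieval listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the embeddings are hand-written toy vectors standing in for Sentence-BERT outputs, the prompt layout is invented for the example, and `cosine`, `retrieve_examples`, and `build_prompt` are hypothetical helper names. The sketch ranks human-labeled examples by cosine similarity to the query and places the top k in the prompt sent to the LLM annotator.

```python
import math
from typing import List, Sequence, Tuple

# (text, human label, embedding); embeddings here are toy vectors,
# assumed to come from a sentence encoder such as Sentence-BERT in practice.
LabeledExample = Tuple[str, str, Sequence[float]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_examples(
    query_vec: Sequence[float], labeled: List[LabeledExample], k: int = 3
) -> List[Tuple[str, str]]:
    """Return the k human-labeled (text, label) pairs most similar to the query."""
    ranked = sorted(labeled, key=lambda ex: cosine(query_vec, ex[2]), reverse=True)
    return [(text, label) for text, label, _ in ranked[:k]]


def build_prompt(query_text: str, examples: List[Tuple[str, str]]) -> str:
    """Format retrieved examples as in-context demonstrations (assumed layout)."""
    parts = [f"Text: {text}\nLabel: {label}" for text, label in examples]
    parts.append(f"Text: {query_text}\nLabel:")
    return "\n\n".join(parts)


# Toy human-labeled pool with made-up 3-dimensional embeddings.
labeled = [
    ("Profit rose 12% year over year.", "positive", [0.9, 0.1, 0.0]),
    ("The company cut 500 jobs.", "negative", [0.1, 0.9, 0.0]),
    ("Shares were unchanged.", "neutral", [0.0, 0.1, 0.9]),
]

examples = retrieve_examples([0.8, 0.2, 0.0], labeled, k=2)
prompt = build_prompt("Revenue grew in the third quarter.", examples)
print(prompt)
```

Because retrieval draws only from human-labeled data, the quality of the in-context examples depends on how many human labels exist at that point in the loop, which is one motivation for collecting more human labels in early rounds.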
Key Findings
Mixing 200 human labels with 800 GPT-3.5 labels (IMFL) outperforms using 600 human labels (3×) on four domain tasks.
IMFL approaches the 5× human-label upper bound (1,000 human labels) on two of the four datasets, with only a small performance gap.
Combining prompt retrieval and variable human batch sizes improves LLM annotation quality.
EEQ (cluster diversity + uncertainty) beats random sampling under the same budget.
Results
FPB F1 (IMFL vs 3× Human)
Headline F1 (IMFL vs 5× Human)
Accuracy
IMFL vs All-GPT-3.5 (same 1000 budget)
Who Should Care
What To Try In 7 Days
Run a small pilot with a budget of 200 expert labels and 800 LLM labels.
Implement Sentence-BERT retrieval to supply 3–5 nearest human examples as LLM prompts.
Use EEQ: k-means clustering for diversity then least-confidence selection for human queries each round (R≈5).
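The EEQ step can be sketched in plain Python as follows. This is one possible reading of "cluster for diversity, then least-confidence for human queries", not the paper's exact procedure. Assumptions: the k-means here uses a deterministic farthest-point initialization to keep the example reproducible, the least-confident sample in each cluster goes to a human and the rest of the cluster goes to the LLM, and `kmeans_assign`, `eeq_split`, and `human_per_cluster` are hypothetical names. The embeddings and confidences are toy values.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_assign(points: List[Point], k: int, iters: int = 10) -> List[int]:
    """Tiny k-means; returns a cluster index for each point.

    Initialization is farthest-point (deterministic), a simplification
    chosen for this sketch rather than a detail taken from the paper.
    """
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))

    def nearest(p: Point) -> int:
        return min(range(k), key=lambda c: sq_dist(p, centroids[c]))

    for _ in range(iters):
        assign = [nearest(p) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(p) for p in points]


def eeq_split(
    embeddings: List[Point],
    confidences: List[float],
    n_clusters: int,
    human_per_cluster: int = 1,
) -> Tuple[List[int], List[int]]:
    """Split sample indices into (human queries, LLM queries).

    confidences[i] is the current model's top-class probability for sample i;
    lower means more uncertain (least-confidence criterion).
    """
    assign = kmeans_assign(embeddings, n_clusters)
    human_idx: List[int] = []
    llm_idx: List[int] = []
    for c in range(n_clusters):
        members = [i for i, a in enumerate(assign) if a == c]
        members.sort(key=lambda i: confidences[i])  # least confident first
        human_idx.extend(members[:human_per_cluster])
        llm_idx.extend(members[human_per_cluster:])
    return sorted(human_idx), sorted(llm_idx)


# Toy pool: three well-separated groups of two points each.
embeddings = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.8), (10.0, 0.0), (9.8, 0.3)]
confidences = [0.95, 0.55, 0.60, 0.90, 0.51, 0.99]

human_queries, llm_queries = eeq_split(embeddings, confidences, n_clusters=3)
print("human:", human_queries)  # human: [1, 2, 4]
print("llm:", llm_queries)      # llm: [0, 3, 5]
```

In a real pilot the embeddings would come from a sentence encoder and the confidences from the fine-tuned model's predictions on the unlabeled pool, recomputed at the start of each round.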
Agent Features
Planning
- LoRA
Tool Use
- LLM as annotator (GPT-3.5)
- Sentence-BERT prompt retrieval
- LoRA
Optimization Features
Token Efficiency
- substitute many human annotations with cheaper LLM annotations (20/80 human/LLM default)
Infra Optimization
- use smaller base models (dolly-v2-3b) to cut compute
Model Optimization
- LoRA
System Optimization
- random sub-sampling of unlabeled pool to reduce acquisition compute
Training Optimization
- interactive multi-round fine-tuning (R=5)
- variable human batch-size decay to front-load human labels
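A front-loaded human batch schedule over R=5 rounds can be sketched as below. The geometric decay factor of 0.5, the even split of LLM labels across rounds, and the function name `decaying_schedule` are assumptions made for illustration; the summary does not state the exact schedule the paper uses. The sketch only shows how a fixed budget of 200 human labels can be spread so that early rounds receive more.

```python
from typing import List


def decaying_schedule(total: int, rounds: int, decay: float = 0.5) -> List[int]:
    """Split `total` labels over `rounds` with geometrically decaying batch sizes.

    The sizes always sum to `total`; leftover labels from rounding down
    are handed to the earliest rounds first.
    """
    weights = [decay ** r for r in range(rounds)]
    scale = total / sum(weights)
    sizes = [int(w * scale) for w in weights]  # round down
    for r in range(total - sum(sizes)):        # distribute the remainder
        sizes[r % rounds] += 1
    return sizes


ROUNDS = 5
human_schedule = decaying_schedule(total=200, rounds=ROUNDS)  # [104, 52, 26, 12, 6]
llm_schedule = [800 // ROUNDS] * ROUNDS                       # even split (assumed)

for r, (n_human, n_llm) in enumerate(zip(human_schedule, llm_schedule), start=1):
    print(f"round {r}: {n_human} human labels, {n_llm} LLM labels")
```

A slower decay (for example `decay=0.8`) spreads human labels more evenly; which setting works best is an empirical question for each task.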
Reproducibility
Data Urls
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Budget modeled as annotation count, not true multi-dimensional cost (time, admin, training).
- Performance depends on LLM annotator quality; noisy LLM labels can hurt (MedQA example).
- Limited by size/diversity of available unlabeled pool; it does not synthesize new data.
- Does not reach state-of-the-art vs very large LLMs on some tasks.
When Not To Use
- You have ample expert labels and budget; pure human labeling may be simpler.
- Tasks where the available LLM annotator is known to perform poorly (e.g., niche medical subsets).
- When annotation cost is not roughly proportional to label count (complex admin costs dominate).
Failure Modes
- Noisy LLM annotations overwhelm the small human seed and reduce final accuracy.
- Too few human examples make prompt retrieval less useful (performance drops at 0.25× the human label budget).
- Clustering/diversity step picks poor candidates when embeddings are weak or domain-shifted.
- LLM annotator distributional mismatch causes systematic label bias.
Core Entities
Models
- dolly-v2-3b
- dolly-v2-7b
- dolly-v2-12b
- GPT-3.5
- GPT-3
- GPT-4
Metrics
- F1 score
- Average F1
- Accuracy
Datasets
- FPB (Financial Phrasebank)
- Headline (gold commodity news)
- PubMedQA
- MedQA

