Overview
The method is practical: it uses public datasets, small models, and standard components (k-means, Sentence-BERT, LoRA). Results are consistent across four datasets but depend on LLM annotator quality and unlabeled pool size.
Citations: 4
Evidence Strength: 0.70
Confidence: 0.90
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
IMFL cuts expert labeling costs by replacing many expensive human labels with cheaper LLM labels while keeping near-human performance on key domain tasks, enabling faster, cheaper domain model launches.
Summary TLDR
IMFL is a practical loop for fine-tuning small domain models using a small human label budget plus many cheaper LLM (GPT-3.5) labels. Key pieces: an exploration–exploitation query (EEQ) that mixes diversity and uncertainty sampling, prompt-retrieval (Sentence-BERT) to produce better in-context examples for the LLM annotator, and a decaying human batch-size schedule. On four finance/medical tasks, IMFL (200 human + 800 GPT-3.5 labels) beats a 3× human-only baseline and approaches a 5× human upper bound on two datasets, while outperforming all-GPT annotations.
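The decaying human batch-size schedule mentioned above can be illustrated with a minimal sketch. The summary does not state the decay rule, so the geometric decay, the `decay` ratio of 0.5, and the function name `decaying_human_schedule` are all assumptions made for illustration; only the 200-label human budget and the roughly five rounds come from the text.

```python
from typing import List


def decaying_human_schedule(total_human: int, rounds: int, decay: float = 0.5) -> List[int]:
    """Split a human label budget over rounds with geometrically shrinking batches.

    ASSUMPTION: geometric decay is an illustrative choice, not the paper's stated rule.
    """
    if rounds <= 0 or total_human < 0 or not (0.0 < decay <= 1.0):
        raise ValueError("need rounds > 0, total_human >= 0 and 0 < decay <= 1")

    # Unnormalized weight for each round: 1, decay, decay^2, ...
    weights = [decay ** r for r in range(rounds)]
    scale = total_human / sum(weights)

    # Round down first, then hand the leftover labels to the earliest rounds
    # so the batches still sum exactly to the budget.
    batches = [int(w * scale) for w in weights]
    remainder = total_human - sum(batches)
    for i in range(remainder):
        batches[i % rounds] += 1
    return batches


if __name__ == "__main__":
    # 200 human labels over 5 rounds, as in the pilot described in this summary.
    print(decaying_human_schedule(200, 5))
```

Front-loading human labels in this way gives the prompt-retrieval step a larger pool of trusted examples early on, which is one plausible motivation for a decaying schedule.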
Problem Statement
Domain fine-tuning needs many high-quality human labels, which are expensive and slow to collect. Auto-labeling with LLMs is cheap but prone to hallucinations and noise. The paper asks how a fixed annotation budget should be allocated between human labels and LLM auto-annotations to maximize downstream model performance.
Main Contribution
Formulate domain adaptation as interactive multi-fidelity learning that mixes high-fidelity human labels and low-fidelity LLM labels under a budget.
Design an exploration-exploitation query (EEQ): cluster-based diversity plus uncertainty sampling to split queries between LLM and humans.
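One way to picture the EEQ split is the sketch below. It assumes cluster assignments (for example from k-means) and per-sample class probabilities from the current model are already available. The exact routing rule is not given in this summary, so sending the most uncertain diverse candidates to humans and the rest to the LLM is an assumption, as is the name `eeq_split`.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def least_confidence(probs: Sequence[float]) -> float:
    """Uncertainty score: 1 minus the top predicted class probability."""
    return 1.0 - max(probs)


def eeq_split(
    cluster_ids: Sequence[int],
    probs: Sequence[Sequence[float]],
    n_human: int,
    n_llm: int,
) -> Tuple[List[int], List[int]]:
    """Return (human_indices, llm_indices) for one query round.

    ASSUMPTION: the routing rule here is illustrative, not the paper's exact method.
    """
    uncertainty = [least_confidence(p) for p in probs]

    # Group pool indices by cluster, most uncertain first within each cluster.
    by_cluster: Dict[int, List[int]] = defaultdict(list)
    for idx, cluster in enumerate(cluster_ids):
        by_cluster[cluster].append(idx)
    for members in by_cluster.values():
        members.sort(key=lambda i: uncertainty[i], reverse=True)

    # Exploration: draw round-robin across clusters so candidates are diverse.
    budget = min(n_human + n_llm, len(cluster_ids))
    candidates: List[int] = []
    depth = 0
    while len(candidates) < budget:
        for cluster in sorted(by_cluster):
            members = by_cluster[cluster]
            if depth < len(members) and len(candidates) < budget:
                candidates.append(members[depth])
        depth += 1

    # Exploitation: the least-confident candidates go to human annotators.
    candidates.sort(key=lambda i: uncertainty[i], reverse=True)
    return candidates[:n_human], candidates[n_human:]
```

The design intent in this sketch is that scarce human labels are spent where the model is least sure, while the LLM covers the remaining diverse samples.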
Key Findings
Mixing 200 human labels with 800 GPT-3.5 labels (IMFL) outperforms using 600 human labels (3×) on four domain tasks.
IMFL approaches the 5× human-label upper bound on two of the four datasets; on Headline, F1 is 81.09 versus 81.92, a gap of 0.83.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| FPB F1 (IMFL vs 3× Human) | 47.88 (IMFL) | 43.16 (3× Human) | +4.72 | FPB (financial) | Table 9 main results | Table 9 |
| Headline F1 (IMFL vs 5× Human) | 81.09 (IMFL) | 81.92 (5× Human) | -0.83 | Headline (financial) | Fig.4 and Table 9 | Fig.4; Table 9 |
What To Try In 7 Days
Run a small pilot: pick 200 expert-labeled seed samples and 800 LLM labels.
Implement Sentence-BERT retrieval to supply the 3–5 nearest human-labeled examples as in-context examples in the LLM prompt.
Use EEQ: k-means clustering for diversity, then least-confidence selection for human queries in each round (R ≈ 5 rounds).
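The prompt-retrieval step in the list above can be sketched as a nearest-neighbor lookup over precomputed embeddings. The sketch assumes the vectors were produced elsewhere by a Sentence-BERT encoder; the helper names `retrieve_examples` and `build_prompt` and the "Text / Label" prompt template are hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

# (text, label, embedding); embeddings are ASSUMED to come from a Sentence-BERT encoder.
LabeledExample = Tuple[str, str, Sequence[float]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_examples(
    query_vec: Sequence[float], labeled: List[LabeledExample], k: int = 3
) -> List[Tuple[str, str]]:
    """Return the k human-labeled examples most similar to the query."""
    ranked = sorted(labeled, key=lambda ex: cosine(query_vec, ex[2]), reverse=True)
    return [(text, label) for text, label, _ in ranked[:k]]


def build_prompt(query_text: str, examples: List[Tuple[str, str]]) -> str:
    """Format retrieved examples as in-context demonstrations (hypothetical template)."""
    blocks = [f"Text: {text}\nLabel: {label}\n" for text, label in examples]
    blocks.append(f"Text: {query_text}\nLabel:")
    return "\n".join(blocks)


if __name__ == "__main__":
    pool = [
        ("Profits rose sharply.", "positive", [1.0, 0.0]),
        ("Shares fell on weak demand.", "negative", [0.0, 1.0]),
        ("Revenue beat forecasts.", "positive", [0.9, 0.1]),
    ]
    shots = retrieve_examples([1.0, 0.05], pool, k=2)
    print(build_prompt("Earnings exceeded expectations.", shots))
```

With very few human-labeled examples in the pool, the nearest neighbors may be only loosely related to the query, which matches the failure mode noted later in this summary.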
Risks & Boundaries
Limitations
Budget modeled as annotation count, not true multi-dimensional cost (time, admin, training).
Performance depends on LLM annotator quality; noisy LLM labels can hurt (MedQA example).
When Not To Use
You have ample expert labels and budget; pure human labeling may be simpler.
Tasks where the available LLM annotator is known to perform poorly (e.g., niche medical subsets).
Failure Modes
Noisy LLM annotations overwhelm the small human seed and reduce final accuracy.
Too few human examples make prompt retrieval less useful (performance drops at 0.25× human labels).

