Pick fine‑tuning data by clustering loss curves of a small proxy model

Overview

Decision Snapshot: Ready For Pilot

S2L is simple and reproducible: collect per-example loss from a small proxy, cluster trajectories, sample across clusters, and fine‑tune. Theory bounds gradient error; experiments show cross-domain gains but testing is limited to math and clinical data and models ≤7B.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

License: MathInstruct: MIT; MIMIC-III: DUA required

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Yu Yang, Siddhartha Mishra, Jeffrey N Chiang, Baharan Mirzasoleiman

Links

Abstract / PDF / Code

Why It Matters For Business

S2L can cut fine‑tuning data by up to ~89% on the evaluated math tasks and halve data/train time in clinical summarization, lowering compute, storage, and labeling costs while keeping or improving accuracy.

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist

Summary TLDR

S2L records loss values over training for each example on a small, cheap proxy model, clusters those loss trajectories, then draws balanced samples from all clusters to make a fine‑tuning subset for a large model. The paper proves clustered trajectories imply similar gradients and gives a convergence bound for training on the subset. Empirically, S2L matches full-data performance with just 11% of MathInstruct, improves average accuracy over SOTA selection by ~4.7% across six math datasets, gives 32.7% on the hard MATH benchmark from 50K examples (+16.6% vs Phi-2), and improves clinical summarization while cutting data in half. Code is public.

Problem Statement

Supervised fine‑tuning (SFT) for specialized domains is expensive and data‑hungry. Existing selection methods rely on embeddings or confidence from large reference models, which (1) can fail when fine‑tuning data differs from pretraining data and (2) are costly to compute for large models. The paper asks: can a small proxy model's training dynamics identify a small, high‑quality subset that trains large models almost as well?

Main Contribution

S2L algorithm: cluster per-example loss trajectories from a small proxy and uniformly sample across clusters to build a training subset.

Theory: prove examples in the same loss-trajectory cluster have similar gradients and give a bounded-gradient-error convergence guarantee for incremental gradient training on the subset.
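The sampling step can be sketched as follows, assuming cluster labels for every example are already available. This is a minimal illustration, not the authors' released code: the function name `sample_across_clusters`, the smallest-cluster-first ordering, and the way unused budget is passed on to larger clusters are all assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_across_clusters(labels, budget, seed=0):
    """Pick up to `budget` example indices, spread evenly over clusters.

    labels[i] is the cluster id of example i. Clusters are visited from
    smallest to largest; a cluster smaller than its equal share is taken
    whole and the unused share is passed on to the remaining clusters.
    """
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for idx, cluster_id in enumerate(labels):
        clusters[cluster_id].append(idx)

    selected = []
    remaining = budget
    ordered = sorted(clusters.values(), key=len)
    for pos, members in enumerate(ordered):
        clusters_left = len(ordered) - pos
        share = remaining // clusters_left  # equal split of what is left
        take = min(share, len(members))
        selected.extend(rng.sample(members, take))
        remaining -= take
    return sorted(selected)


# Toy usage: three clusters of size 1, 3, and 10, with a budget of 6.
labels = [0] * 1 + [1] * 3 + [2] * 10
subset = sample_across_clusters(labels, budget=6)
per_cluster = [sum(1 for i in subset if labels[i] == c) for c in (0, 1, 2)]
```

In the toy run the singleton cluster is kept whole, so small clusters are not crowded out by the largest one, which is the point of sampling per cluster rather than from the pooled data.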

Key Findings

S2L matches full MathInstruct performance using only ~11% of the data.

Numbers: 11% of MathInstruct (~30K of 262K)

Practical Use: You can fine‑tune large models with ~90% less data on similar math tasks and keep model quality.

Evidence Ref: Abstract; Sec 5.2; Fig.4
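As a quick check on the figures in this finding, the arithmetic below uses the rounded sizes quoted here (~30K of ~262K). The exact dataset sizes may differ slightly, so treat the result as approximate.

```python
full_size = 262_000    # MathInstruct size as quoted in this finding (~262K)
subset_size = 30_000   # S2L subset size as quoted in this finding (~30K)

kept = subset_size / full_size
print(f"kept: {kept:.1%}, removed: {1 - kept:.1%}")
# kept: 11.5%, removed: 88.5%
```

This is consistent with the "~11%" and "up to ~89%" figures used elsewhere in this summary.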

S2L outperforms other open-source data selection methods on average.

Numbers: avg +4.7% accuracy vs SOTA across 6 datasets

Practical Use: Switching to S2L is likely to give consistent accuracy gains over common selection heuristics.

Evidence Ref: Abstract; Sec 5.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	S2L with 30K matches/exceeds full-data training	Full MathInstruct (262K)	match or exceed	MathInstruct -> 6 eval datasets (GSM8K, MATH, NumGLUE, SVAMP, Mathematics, SimulEq)	Fig.4; Sec 5.2	Fig.4
Accuracy	32.7% with 50K S2L-selected examples	pretrained Phi-2	+16.6% vs Phi-2 pretrained	MATH	Abstract; Table 1	Table 1

What To Try In 7 Days

Train a small proxy (≈70–160M) on your domain for a few epochs and record per-example loss every few hundred steps.

Cluster loss trajectories (K≈100 using Faiss KMeans) and sample uniformly from each cluster to build a subset for your data budget.

Fine‑tune your production model on the selected subset and compare exact match / ROUGE / BERTScore against random and full-data baselines.
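The first two steps above can be sketched in a few lines. This is an illustration under stated assumptions: the helper names `record_losses` and `kmeans` are hypothetical, the loss values are made up, and a plain pure-Python Lloyd's k-means stands in for Faiss KMeans so the example has no dependencies. In a real run the losses would come from evaluating the proxy model at each saved checkpoint, and K would be closer to 100.

```python
import random


def record_losses(trajectories, per_example_losses):
    """Append one checkpoint's losses: trajectories[i] grows by one value."""
    for example_id, loss in per_example_losses.items():
        trajectories.setdefault(example_id, []).append(float(loss))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster id per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Three checkpoints of made-up losses: "a" and "b" are learned quickly,
# while "c" and "d" stay hard throughout training.
trajectories = {}
record_losses(trajectories, {"a": 2.0, "b": 2.1, "c": 2.0, "d": 2.2})
record_losses(trajectories, {"a": 0.9, "b": 1.0, "c": 1.9, "d": 2.1})
record_losses(trajectories, {"a": 0.2, "b": 0.3, "c": 1.8, "d": 2.0})

ids = sorted(trajectories)
labels = kmeans([trajectories[i] for i in ids], k=2)
```

All four examples start at nearly the same loss, so a single-checkpoint snapshot would not separate them. The full trajectory does, which is why the whole curve is clustered rather than the final loss alone.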

Optimization Features

Infra Optimization

reduces selection cost (proxy ~100× smaller)

System Optimization

use small proxy to shrink selection compute

Training Optimization

data-efficient training

subset sampling to reduce epochs and storage

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: MathInstruct: MIT; MIMIC-III: DUA required

Code URLs

https://github.com/BigML-CS-UCLA/S2L

Risks & Boundaries

Limitations

Tested on two domains (mathematics, clinical summarization) only.

Experiments limited to models up to 7B parameters; larger targets untested here.

When Not To Use

When domain data lacks consistent training dynamics across scales (proxy may not reflect target).

If you require selection sensitive to rare but critical examples that clustering may under-represent.

Failure Modes

Proxy model fails to capture target dynamics, producing poor clusters and a low‑quality subset.

Clusters emphasize easier or common topics and under-sample rare but important cases.
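One way to spot the under-sampling failure mode before fine-tuning is to audit how much of each known group survives selection. The sketch below is an assumption-laden illustration rather than part of S2L: it presumes you have some tag per example (for instance a topic label), and the name `retention_by_group` is hypothetical.

```python
from collections import Counter


def retention_by_group(groups, selected):
    """Fraction of each group's examples that made it into the subset.

    groups[i] is a tag for example i (for example a topic label);
    selected is an iterable of chosen example indices.
    """
    totals = Counter(groups)
    kept = Counter(groups[i] for i in selected)
    return {g: kept[g] / totals[g] for g in totals}


# Toy usage: 8 common examples, 2 rare ones, and a subset that misses
# every rare example.
groups = ["common"] * 8 + ["rare"] * 2
rates = retention_by_group(groups, selected=[0, 1, 2, 3])
```

Here `rates` shows half of the common group retained and none of the rare group, which would be a signal to raise the budget, change K, or force-include the critical examples.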

Core Entities

Models

Pythia-70M, Pythia-160M, Pythia-410M, Pythia-1B, Pythia-2.8B, Phi-2 (2.7B), Phi-3-MINI (3.8B), LLaMA-2-7B, GPT-2 (124M)

Metrics

Exact match, BLEU, ROUGE-L, BERTScore

Datasets

MathInstruct, MATH, GSM8K, NumGLUE, SVAMP, SimulEq, MIMIC-III

Benchmarks

MATH, GSM8K, NumGLUE, SVAMP, SimulEq
