Pick fine‑tuning data by clustering loss curves of a small proxy model

March 12, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

S2L is simple and reproducible: collect per-example losses from a small proxy model, cluster the loss trajectories, sample across clusters, and fine‑tune. The theory bounds the gradient error of training on the subset. Experiments show gains in both evaluated domains, but testing is limited to math and clinical data and to models of at most 7B parameters.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

License: MathInstruct: MIT; MIMIC-III: DUA required

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Yu Yang, Siddhartha Mishra, Jeffrey N Chiang, Baharan Mirzasoleiman

Why It Matters For Business

S2L can cut fine‑tuning data by up to ~89% on the evaluated math tasks and halve data/train time in clinical summarization, lowering compute, storage, and labeling costs while keeping or improving accuracy.

Summary TLDR

S2L records each example's loss over the course of training a small, cheap proxy model, clusters those loss trajectories, then draws balanced samples from all clusters to build a fine‑tuning subset for a large model. The paper proves that examples with clustered trajectories have similar gradients and gives a convergence bound for training on the subset. Empirically, S2L matches full-data performance with just 11% of MathInstruct, beats SOTA selection methods by ~4.7% average accuracy across six math datasets, reaches 32.7% on the hard MATH benchmark with 50K selected examples (+16.6% vs pretrained Phi-2), and improves clinical summarization while using half the data. Code is public.

Problem Statement

Supervised fine‑tuning (SFT) for specialized domains is expensive and data‑hungry. Existing selection methods rely on embeddings or confidence from large reference models, which (1) can fail when fine‑tuning data differs from pretraining data and (2) are costly to compute for large models. The paper asks: can a small proxy model's training dynamics identify a small, high‑quality subset that trains large models almost as well?

Main Contribution

S2L algorithm: cluster per-example loss trajectories from a small proxy and uniformly sample across clusters to build a training subset.
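The sampling step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sample_across_clusters` is hypothetical, and visiting clusters from smallest to largest so that unfilled quota rolls over to larger clusters is one plausible way to make an even split hit an exact budget.

```python
import random
from collections import defaultdict


def sample_across_clusters(labels, budget, seed=0):
    """Pick `budget` example indices, spread as evenly as possible over clusters.

    labels[i] is the cluster id of example i. Clusters are visited from
    smallest to largest (an assumption of this sketch) so that quota a small
    cluster cannot fill is passed on to the larger ones.
    """
    rng = random.Random(seed)
    members = defaultdict(list)
    for idx, cluster_id in enumerate(labels):
        members[cluster_id].append(idx)

    budget = min(budget, len(labels))
    clusters = sorted(members.values(), key=len)
    selected = []
    for pos, cluster in enumerate(clusters):
        clusters_left = len(clusters) - pos
        budget_left = budget - len(selected)
        quota = -(-budget_left // clusters_left)  # ceiling division
        take = min(len(cluster), quota)
        selected.extend(rng.sample(cluster, take))
    return sorted(selected)


# Toy cluster ids: one tiny, one medium, one large cluster.
labels = [0] * 2 + [1] * 10 + [2] * 50
picked = sample_across_clusters(labels, budget=12)
per_cluster = {c: sum(labels[i] == c for i in picked) for c in (0, 1, 2)}
print(per_cluster)  # {0: 2, 1: 5, 2: 5}
```

The tiny cluster is taken whole and the rest of the budget is split evenly, so small clusters are not drowned out by large ones.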

Theory: a proof that examples in the same loss-trajectory cluster have similar gradients, plus a convergence guarantee with bounded gradient error for incremental gradient training on the subset.

Key Findings

S2L matches full MathInstruct performance using only ~11% of the data.

Numbers: 11% of MathInstruct (~30K of 262K)

Practical Use: You can fine‑tune large models with ~90% less data on similar math tasks and keep model quality.

Evidence Ref: Abstract; Sec 5.2; Fig.4
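As a quick check on how the quoted figures relate, the arithmetic can be reproduced directly. The two counts are the rounded sizes quoted above, so the resulting percentages are approximate.

```python
full_size = 262_000    # MathInstruct size as quoted (~262K, rounded)
subset_size = 30_000   # S2L subset size as quoted (~30K, rounded)

kept = subset_size / full_size
print(f"kept: {kept:.1%}, removed: {1 - kept:.1%}")
# kept: 11.5%, removed: 88.5%
```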

S2L outperforms other open-source data selection methods on average.

Numbers: avg +4.7% accuracy vs SOTA across 6 datasets

Practical Use: Switching to S2L is likely to give consistent accuracy gains over common selection heuristics.

Evidence Ref: Abstract; Sec 5.2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | S2L with 30K matches/exceeds full-data training | Full MathInstruct (262K) | match or exceed | MathInstruct -> 6 eval datasets (GSM8K, MATH, NumGLUE, SVAMP, Mathematics, SimulEq) | Fig.4; Sec 5.2 | Fig.4
Accuracy | 32.7% with 50K S2L-selected examples | pretrained Phi-2 | +16.6% vs Phi-2 pretrained | MATH | Abstract; Table 1 | Table 1

What To Try In 7 Days

Train a small proxy (≈70–160M) on your domain for a few epochs and record per-example loss every few hundred steps.
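How the losses are produced depends on your training framework, so the sketch below covers only the bookkeeping: storing one loss per example at each checkpoint and returning fixed-length trajectories. The class `LossTrajectoryRecorder` and the function `toy_per_example_loss` are hypothetical names introduced for illustration; the toy loss stands in for evaluating the proxy on each example with per-example (unreduced) losses.

```python
from collections import defaultdict


class LossTrajectoryRecorder:
    """Collect one loss value per example at each checkpoint step."""

    def __init__(self):
        self._losses = defaultdict(dict)  # example_id -> {step: loss}
        self._steps = set()

    def record(self, step, example_ids, losses):
        """Store a batch of per-example losses measured at `step`."""
        self._steps.add(step)
        for example_id, loss in zip(example_ids, losses):
            self._losses[example_id][step] = float(loss)

    def trajectories(self):
        """Return (ids, steps, rows); rows[i][j] is the loss of ids[i] at steps[j]."""
        steps = sorted(self._steps)
        ids = sorted(self._losses)
        rows = []
        for example_id in ids:
            per_step = self._losses[example_id]
            missing = [s for s in steps if s not in per_step]
            if missing:
                raise ValueError(f"example {example_id!r} has no loss at steps {missing}")
            rows.append([per_step[s] for s in steps])
        return ids, steps, rows


def toy_per_example_loss(example_id, step):
    # Hypothetical stand-in for the proxy's loss on one example at one checkpoint.
    return 1.0 / (1.0 + step * (example_id + 1) / 1000)


recorder = LossTrajectoryRecorder()
for step in range(0, 1500, 500):  # checkpoints at steps 0, 500, 1000
    batch_ids = [0, 1, 2]
    recorder.record(step, batch_ids, [toy_per_example_loss(i, step) for i in batch_ids])

ids, steps, rows = recorder.trajectories()
```

Raising on a missing checkpoint is a deliberate choice in this sketch: clustering needs every trajectory to have the same length, so a silent gap would be worse than a loud failure.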

Cluster loss trajectories (K≈100 using Faiss KMeans) and sample uniformly from each cluster to build a subset for your data budget.
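The clustering half of this step can be sketched as follows. The text above suggests Faiss KMeans; to stay dependency-light, this sketch swaps in a plain NumPy implementation of Lloyd's algorithm, which is an assumption of the example rather than what the paper runs. For a dataset of MathInstruct's size, a library such as Faiss (its `Kmeans` class) would be the practical choice, since the distance matrix below grows with examples × clusters × checkpoints. The function name `kmeans_trajectories` is hypothetical.

```python
import numpy as np


def _nearest(x, centroids):
    # Squared Euclidean distance from every trajectory to every centroid.
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def kmeans_trajectories(trajectories, k, n_iter=20, seed=0):
    """Lloyd's k-means over loss trajectories; returns (labels, centroids)."""
    x = np.asarray(trajectories, dtype=np.float32)
    if k > len(x):
        raise ValueError("k cannot exceed the number of trajectories")
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct trajectories.
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = _nearest(x, centroids)
        for c in range(k):
            mask = labels == c
            if mask.any():  # keep the old centroid if a cluster empties
                centroids[c] = x[mask].mean(axis=0)
    return _nearest(x, centroids), centroids


# Toy trajectories: five examples whose loss stays high, five whose loss drops fast.
flat = [[2.0 + 0.01 * i, 2.0, 2.0] for i in range(5)]
fast = [[2.0, 0.5 + 0.01 * i, 0.1] for i in range(5)]
labels, centroids = kmeans_trajectories(flat + fast, k=2)
```

The resulting `labels` array is what a balanced sampler would consume: one cluster id per training example, in dataset order.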

Fine‑tune your production model on the selected subset and compare exact match / ROUGE / BERTScore against random and full-data baselines.
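For the comparison itself, exact match is the easiest of the three metrics to compute by hand. The sketch below is a minimal version under assumptions: the normalization (lowercasing and collapsing whitespace) is a choice made here, not the paper's evaluation protocol, and the run names and strings are made up to exercise the code rather than taken from any model. ROUGE and BERTScore need dedicated libraries and are left out.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting does not count as a miss."""
    return " ".join(text.lower().split())


def exact_match(predictions, references):
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


def compare_runs(runs, references):
    """Score each named run and return {name: score}, best first."""
    scores = {name: exact_match(preds, references) for name, preds in runs.items()}
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


# Made-up strings, not outputs from any model.
references = ["42", "x = 3", "7/2"]
runs = {
    "subset_run": ["42", "X = 3", "3.5"],
    "random_run": ["42", "x=3", "3.5"],
    "full_run": ["42", "x = 3", "7/2"],
}
scores = compare_runs(runs, references)
```

Scoring all three runs against the same held-out references keeps the comparison fair: the subset, the random baseline of equal size, and the full-data run differ only in their training data.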

Optimization Features

Infra Optimization: reduces selection cost (proxy ~100× smaller)
System Optimization: use small proxy to shrink selection compute
Training Optimization: data-efficient training; subset sampling to reduce epochs and storage

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: MathInstruct: MIT; MIMIC-III: DUA required

Risks & Boundaries

Limitations

Tested on two domains (mathematics, clinical summarization) only.

Experiments limited to models up to 7B parameters; larger targets untested here.

When Not To Use

When domain data lacks consistent training dynamics across scales (proxy may not reflect target).

If you require selection sensitive to rare but critical examples that clustering may under-represent.

Failure Modes

Proxy model fails to capture target dynamics, producing poor clusters and a low‑quality subset.

Clusters emphasize easier or common topics and under-sample rare but important cases.
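One way to watch for the under-sampling failure mode is to compare how often a category appears in the full pool with how often it appears in the selected subset. This diagnostic is not part of S2L as described here. It assumes you already have a per-example tag (topic, source, or a "critical" flag) that you supply yourself, and the function name `representation_report` is hypothetical.

```python
from collections import Counter


def representation_report(tags, selected):
    """Compare each tag's share of the full pool with its share of the subset.

    tags[i] is a user-supplied label for example i; `selected` holds the
    indices chosen for fine-tuning.
    """
    pool_counts = Counter(tags)
    subset_counts = Counter(tags[i] for i in selected)
    report = {}
    for tag, n_pool in pool_counts.items():
        n_subset = subset_counts.get(tag, 0)
        report[tag] = {
            "pool_share": n_pool / len(tags),
            "subset_share": n_subset / len(selected) if selected else 0.0,
            "kept": n_subset,
        }
    return report


# Toy pool: 8 common examples, 2 rare ones; the subset happens to miss the rare ones.
tags = ["common"] * 8 + ["rare"] * 2
report = representation_report(tags, selected=[0, 1, 2, 3])
print(report["rare"])  # {'pool_share': 0.2, 'subset_share': 0.0, 'kept': 0}
```

A tag whose `kept` count is zero, or whose subset share falls well below its pool share, is a signal to inspect the clusters before fine-tuning.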

Core Entities

Models

Pythia-70M, Pythia-160M, Pythia-410M, Pythia-1B, Pythia-2.8B, Phi-2 (2.7B), Phi-3-MINI (3.8B), LLaMA-2-7B, GPT-2 (124M)

Metrics

Exact match, BLEU, ROUGE-L, BERTScore

Datasets

MathInstruct, MATH, GSM8K, NumGLUE, SVAMP, SimulEq, MIMIC-III

Benchmarks

MATH, GSM8K, NumGLUE, SVAMP, SimulEq