Overview
S2L is simple and reproducible: record per-example loss trajectories on a small proxy model, cluster the trajectories, sample across clusters, and fine‑tune the target model on the resulting subset. The theory bounds the gradient error of training on that subset. Experiments show gains in both domains tested, but testing is limited to math and clinical data and to models of at most 7B parameters.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Partial
License: MathInstruct: MIT; MIMIC-III: data use agreement (DUA) required
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
S2L can cut fine‑tuning data by ~89% on the evaluated math tasks and halve data and training time in clinical summarization, lowering compute, storage, and labeling costs while keeping or improving accuracy.
Summary TLDR
S2L records each training example's loss over the course of training on a small, cheap proxy model, clusters those loss trajectories, and then draws balanced samples from all clusters to form a fine‑tuning subset for a large model. The paper proves that examples in the same trajectory cluster have similar gradients and gives a convergence bound for training on the subset. Empirically, S2L matches full-data performance with just 11% of MathInstruct, improves average accuracy over state-of-the-art data selection methods by ~4.7% across six math datasets, reaches 32.7% on the hard MATH benchmark with 50K examples (+16.6% vs pretrained Phi-2), and improves clinical summarization while halving the data. Code is public.
Problem Statement
Supervised fine‑tuning (SFT) for specialized domains is expensive and data‑hungry. Existing selection methods rely on embeddings or confidence from large reference models, which (1) can fail when fine‑tuning data differs from pretraining data and (2) are costly to compute for large models. The paper asks: can a small proxy model's training dynamics identify a small, high‑quality subset that trains large models almost as well?
Main Contribution
S2L algorithm: cluster per-example loss trajectories from a small proxy and uniformly sample across clusters to build a training subset.
Theory: proves that examples in the same loss-trajectory cluster have similar gradients, and gives a bounded-gradient-error convergence guarantee for incremental gradient training on the subset.
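The sampling half of the algorithm can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `balanced_sample` and the rule for redistributing unused budget from small clusters to larger ones are assumptions made for the sketch. It only presumes that each example already has a cluster id from clustering its loss trajectory.

```python
import random
from collections import defaultdict


def balanced_sample(cluster_ids, budget, seed=0):
    """Pick `budget` example indices, spread as evenly as possible over clusters.

    cluster_ids[i] is the cluster label of example i. Clusters are visited
    smallest first; a cluster with fewer members than its equal share is taken
    whole and the unused share is passed on to the remaining, larger clusters.
    (This redistribution rule is an assumption of the sketch.)
    """
    rng = random.Random(seed)
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    selected = []
    remaining = min(budget, len(cluster_ids))
    clusters = sorted(members.values(), key=len)  # smallest cluster first
    for k, idxs in enumerate(clusters):
        share = remaining // (len(clusters) - k)  # equal split of what is left
        take = min(share, len(idxs))
        selected.extend(rng.sample(idxs, take))
        remaining -= take
    return sorted(selected)


# Toy usage: three clusters of very different sizes, budget of 30 examples.
toy_ids = [0] * 2 + [1] * 10 + [2] * 100
subset = balanced_sample(toy_ids, budget=30)
per_cluster = [sum(1 for i in subset if toy_ids[i] == c) for c in (0, 1, 2)]
print(per_cluster)  # [2, 10, 18]
```

In the toy run the two small clusters are kept whole and the large cluster fills the rest of the budget, so rare trajectory patterns are not crowded out by the dominant one.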
Key Findings
S2L matches full MathInstruct performance using only ~11% of the data.
S2L outperforms other open-source data selection methods on average, by ~4.7% across the six math evaluation datasets.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | S2L with 30K matches/exceeds full-data training | Full MathInstruct (262K) | match or exceed | MathInstruct -> 6 eval datasets (GSM8K, MATH, NumGLUE, SVAMP, Mathematics, SimulEq) | Fig.4; Sec 5.2 | Fig.4 |
| Accuracy | 32.7% with 50K S2L-selected examples | pretrained Phi-2 | +16.6% vs Phi-2 pretrained | MATH | Abstract; Table 1 | Table 1 |
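
The data-reduction percentages quoted in this summary follow from the sizes in the first table row. A quick check, using only the 30K and 262K figures listed there (rounded as given, so the result is approximate):

```python
full_size = 262_000    # MathInstruct size as listed in the table (262K)
subset_size = 30_000   # S2L budget as listed in the table (30K)

fraction_kept = subset_size / full_size
reduction = 1 - fraction_kept
print(f"kept {fraction_kept:.1%}, removed {reduction:.1%}")
# kept 11.5%, removed 88.5%
```

This is where the "~11% of the data" and "~89% less data" figures come from.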
What To Try In 7 Days
Train a small proxy model (≈70–160M parameters) on your domain data for a few epochs and record the per-example loss every few hundred steps.
Cluster loss trajectories (K≈100 using Faiss KMeans) and sample uniformly from each cluster to build a subset for your data budget.
Fine‑tune your production model on the selected subset and compare exact match / ROUGE / BERTScore against random and full-data baselines.
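The clustering step above can be prototyped before wiring in Faiss. The sketch below is a hedged stand-in: it uses a small pure-Python Lloyd's k-means in place of Faiss KMeans so that it runs with only the standard library, and the toy trajectories are invented for illustration (four loss values per example, as if recorded at four checkpoints of a proxy run). For real data you would use K≈100 and a proper k-means implementation.

```python
import random


def kmeans(trajectories, k, iters=20, seed=0):
    """Lloyd's k-means on loss trajectories (one list of floats per example).

    Stand-in for Faiss KMeans; returns one cluster id per trajectory.
    """
    rng = random.Random(seed)
    centroids = [list(t) for t in rng.sample(trajectories, k)]
    labels = [0] * len(trajectories)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, t in enumerate(trajectories):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(t, centroids[c])),
            )
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            group = [t for t, lab in zip(trajectories, labels) if lab == c]
            if group:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(group) for col in zip(*group)]
    return labels


# Hypothetical toy data: "easy" examples whose loss drops quickly and
# "hard" examples whose loss stays high throughout proxy training.
easy = [[2.0, 0.5 + 0.01 * i, 0.2, 0.1] for i in range(5)]
hard = [[2.0, 1.9, 1.8 - 0.01 * i, 1.7] for i in range(5)]
labels = kmeans(easy + hard, k=2)
print(labels[:5], labels[5:])  # easy and hard examples land in different clusters
```

The resulting cluster ids are what the uniform per-cluster sampling operates on; examples with similar learning dynamics end up grouped together regardless of their surface content.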
Risks & Boundaries
Limitations
Tested on two domains (mathematics, clinical summarization) only.
Experiments limited to models up to 7B parameters; larger targets untested here.
When Not To Use
When domain data lacks consistent training dynamics across scales (proxy may not reflect target).
When you need selection that is sensitive to rare but critical examples, which clustering may under-represent.
Failure Modes
Proxy model fails to capture target dynamics, producing poor clusters and a low‑quality subset.
Clusters emphasize easier or common topics and under-sample rare but important cases.

