Data-efficient Training Papers — Parsed & Scored for Practitioners

One-stage domain adaptation: turn varied medical corpora into instruction–response pairs and train in a single pass

0.60

0.50

24

One-stage adaptation simplifies pipelines and reduces costly two-stage tuning while delivering strong domain performance—so teams can build competitive medical models faster with less stage-specific hyperparameter work.

Key finding

One-stage training outperforms conventional two-stage adaptation across medical datasets

Numbers: 5.3%–23% relative gains on six datasets (one-stage vs two-stage)

Train quantized LLMs without original data and quantize KV cache to reach practical 4-bit weights

0.70

0.60

0.80

15

LLM-QAT lets you reduce model memory and improve throughput by quantizing weights and KV cache to 4 bits while keeping quality close to full precision, which can lower hosting cost and enable longer contexts on the same hardware.

Key finding

Fine-tuning on data-free generated samples with logit distillation outperforms fine-tuning on real data for zero-shot tasks.

Numbers: Generated-data (hybrid sampling) avg zero-shot 63.1 vs C4 61.5 (Table 3)
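
The logit distillation mentioned above can be illustrated with a minimal sketch. It assumes the student is trained to match the teacher's full next-token distribution via cross-entropy against soft labels; the function names and the plain-Python form are illustrative assumptions, not the paper's code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def distillation_loss(teacher_logits, student_logits):
    """Cross-entropy of the student against the teacher's soft labels.

    Hypothetical helper: minimized when the student distribution
    equals the teacher distribution.
    """
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits)]
    student_logp = log_softmax(student_logits)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_logp))

# Usage: a student that agrees with the teacher gets a lower loss.
teacher = [2.0, 0.0, -1.0]
loss_match = distillation_loss(teacher, [2.0, 0.0, -1.0])
loss_mismatch = distillation_loss(teacher, [-1.0, 0.0, 2.0])
```

In practice this would be computed per token over a vocabulary-sized vector and averaged across the generated sequence.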

Cut domain-specific annotation cost: mix a small set of human labels with many GPT-3.5 labels using smart sampling and prompt retrieval

0.60

0.80

4

IMFL cuts expert labeling costs by replacing many expensive human labels with cheaper LLM labels while keeping near-human performance on key domain tasks, enabling faster, cheaper domain model launches.

Key finding

Mixing 200 human labels with 800 GPT-3.5 labels (IMFL) outperforms training on 600 human labels (3× the human-labeled data) on four domain tasks.

Numbers: FPB +4.72 F1; Headline +6.96 F1; PubMedQA +3.61 acc; MedQA +9.67 acc

DistilDP: use a DP-finetuned teacher to generate private synthetic text and distill a compact student without applying DP twice

0.60

2

DistilDP lets you produce a smaller, private language model with better utility than privately fine-tuning the small model directly, reducing deployment cost while respecting strong DP budgets.

Key finding

DistilDP substantially reduces perplexity on Big Patent versus private fine-tuning baselines.

Numbers: Big Patent: DistilDP PPL 32.43 vs DP-SGD student 41.8 (−9.37 PPL)

Use small synthetic QA datasets and a PPL curriculum to boost Chinese and scientific reasoning in Llama‑3 with ~100B CPT tokens

0.60

2

You can adapt a large English-centric LLM to Chinese and science tasks with modest additional pretraining (~100B tokens) and targeted synthetic QA, improving domain value without a full retrain.

Key finding

C-Eval (Chinese) improved by 8.81 points after CPT.

Numbers: C‑Eval: 49.43 → 58.24 (+8.81)
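
A perplexity (PPL) curriculum can be sketched as scoring each document and then ordering the training stream by that score. The sketch below assumes per-token natural-log probabilities are already available from a scoring model and that data is presented from low to high perplexity; both the ordering direction and the function names are assumptions, not details confirmed by this summary.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def order_by_perplexity(docs):
    """docs: list of (doc_id, token_logprobs).

    Returns doc ids sorted from lowest to highest perplexity
    (assumed easy-to-hard ordering).
    """
    ranked = sorted(docs, key=lambda d: perplexity(d[1]))
    return [doc_id for doc_id, _ in ranked]

# Usage with made-up log-probabilities.
docs = [("hard", [-3.0, -2.0]), ("easy", [-0.1, -0.2])]
curriculum = order_by_perplexity(docs)
```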

Automatically pick high-quality instruction examples to finetune LLMs and cut training cost

0.60

0.70

2

Score and pick a few thousand high-quality instruction examples to cut finetuning time and GPU cost while keeping or improving human-facing behavior.

Key finding

A small selected subset (2,532 examples) produced a competitive finetuned model.

Numbers: Selected subset = 2,532 examples (≈2.5% of 100k)

Fine-tuning Llama 3 8B on translation memories improves translations — gains appear reliably once you have ~5k in-domain examples

0.60

0.30

0.70

1

Fine-tuning a midsize LLM on your own translation memories can give big, focused quality gains — especially for low-resource languages — but only if you have enough in-domain data (roughly ≥5k examples).

Key finding

Large-scale fine-tuning yields substantial metric gains versus the out-of-the-box model.

Numbers: avg BLEU +13.7; avg COMET +25 (100k+ vs baseline)

ISARA: iteratively self-align an LLM using retrieval-augmented in-context learning and <100 seed examples

0.60

0.70

1

You can improve model safety and truthfulness in new domains with very small labeled seeds and no extra human rules or reward models, cutting annotation cost and speeding deployment.

Key finding

ISARA can sharply reduce harmful outputs on safety prompts.

Numbers: LLaMA-7B harmful rate (discrimination): 37.6% → 1.2% (pretrain → ISARA)

Use self-distillation plus asymmetric sub-4-bit quantization to get practical 2–3 bit LLMs

0.60

0.85

1

BitDistiller makes deploying 2–3 bit LLMs practical: it keeps much of reasoning/code accuracy while slashing quantization time and GPU cost, enabling cheaper on-prem or edge inference.

Key finding

BitDistiller yields better language modeling and QA accuracy than prior PTQ and QAT on LLaMA-2-7B.

Numbers: 2-bit g128: MMLU 29.25 vs LLM-QAT 23.62 (Table 1)
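
Asymmetric group-wise quantization, as referenced in the title, can be sketched in a few lines. The sketch assumes a min/max range per group with an implicit zero-point, and reads "g128" as a group size of 128; it is a generic illustration with hypothetical function names and omits any clipping or distillation steps the paper may use.

```python
def quantize_group_asymmetric(values, bits):
    """Map one group of floats to integer codes in [0, 2**bits - 1].

    Asymmetric: the grid spans [min, max] of the group rather than
    being centered on zero.
    """
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    dequant = [lo + c * scale for c in codes]
    return codes, dequant

def quantize_asymmetric(weights, bits=2, group_size=128):
    """Quantize a flat weight list group by group; returns dequantized values."""
    out = []
    for start in range(0, len(weights), group_size):
        _, deq = quantize_group_asymmetric(weights[start:start + group_size], bits)
        out.extend(deq)
    return out

# Usage: a skewed group keeps its full range with only four levels.
codes, deq = quantize_group_asymmetric([-0.3, 0.0, 0.1, 0.9], bits=2)
```

Because the grid follows each group's own minimum and maximum, skewed weight distributions waste fewer levels than they would on a symmetric grid.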

Pick fine‑tuning data by clustering loss curves of a small proxy model

0.70

0.60

0.80

1

S2L can cut fine‑tuning data by up to ~89% on the evaluated math tasks and halve data/train time in clinical summarization, lowering compute, storage, and labeling costs while keeping or improving accuracy.

Key finding

S2L matches full MathInstruct performance using only ~11% of the data.

Numbers: 11% of MathInstruct (~30K of 262K)
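
Selecting data by clustering a proxy model's loss curves can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the plain k-means, the round-robin sampling across clusters, and all names (`kmeans`, `select_balanced`) are assumptions, and each "curve" is simply one example's loss recorded at a few checkpoints.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length curves."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(curves, k, iters=20, seed=0):
    """Cluster per-example loss curves; returns a cluster index per example."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(curves, k)]
    assign = [0] * len(curves)
    for _ in range(iters):
        for i, c in enumerate(curves):
            assign[i] = min(range(k), key=lambda j: sq_dist(c, centers[j]))
        for j in range(k):
            members = [curves[i] for i in range(len(curves)) if assign[i] == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def select_balanced(assign, k, budget, seed=0):
    """Pick `budget` example indices by cycling through clusters in turn."""
    rng = random.Random(seed)
    clusters = [[i for i, a in enumerate(assign) if a == j] for j in range(k)]
    for c in clusters:
        rng.shuffle(c)
    picked = []
    while len(picked) < budget and any(clusters):
        for c in clusters:
            if c and len(picked) < budget:
                picked.append(c.pop())
    return picked

# Usage: six fast-learned examples and two that stay hard (made-up losses).
curves = [[2.0, 1.0 - 0.01 * i, 0.2] for i in range(6)]
curves += [[2.0, 1.9, 1.8], [2.1, 2.0, 1.9]]
assign = kmeans(curves, k=2)
subset = select_balanced(assign, k=2, budget=4)
```

Cycling through clusters keeps small clusters represented in the subset instead of letting the largest cluster dominate.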

Have a strong LLM critique and rewrite your instruction data, then retrain — improves instruction-following.

0.60

0.50

1

You can raise instruction-following quality without larger models by spending on oracle-LM calls to rewrite training data, which often costs less than collecting new human-labeled data and improves model utility quickly.

Key finding

Recycled Alpaca 7B beats many open-source 7B models on AlpacaEval.

Numbers: Recycled Alpaca 7B win rate 76.99% vs Alpaca 7B 26.46%

Cut training cost for vision transformers by combining attention-based data selection and two-step sparsity pruning.

0.20

0.50

0

You can potentially cut training time and compute by pruning modestly (e.g., 30%) and training on attention-selected data. This reduces iteration time for product experiments and lowers cloud/GPU costs if accuracy loss is acceptable.

Key finding

One-shot magnitude pruning at 30% sparsity kept model accuracy high.

Numbers: Accuracy ≈ 79% at 30% sparsity (CIFAR-10)
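
One-shot magnitude pruning, as named in the key finding, zeroes the smallest-magnitude weights in a single pass. The sketch below is a generic illustration on a flat weight list with a hypothetical function name; it does not reproduce the paper's two-step procedure or its attention-based data selection.

```python
def magnitude_prune(weights, sparsity=0.3):
    """Zero out the `sparsity` fraction of weights with smallest |w|."""
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]

# Usage: 30% sparsity on ten weights removes the three smallest magnitudes.
w = [0.5, -0.1, 0.8, 0.05, -0.9, 0.3, -0.02, 0.7, 0.4, -0.6]
w_pruned = magnitude_prune(w, sparsity=0.3)
```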

Lillama: one‑shot, low‑rank feature distillation to shrink LLMs fast on one A100

0.75

0.60

0.80

0

Lillama lets teams cut model size and GPU memory quickly with only millions of calibration tokens, enabling cheaper deployment and larger context windows without large retraining costs.

Key finding

Large model compression with small calibration data retains most performance.

Numbers: Mixtral‑8x7B: 20% compression → average 96% of base performance; Phi‑3 14B: 20% compression → 97% of base performance

Make tiny weighted training sets by clustering latents and decoding with diffusion — provably consistent.

0.70

0.60

0.70

0

You can cut data storage and training compute by replacing large datasets with a small set of weighted synthetic images decoded from latent clusters, while preserving accuracy when you use a good latent diffusion prior.

Key finding

Optimal quantization in latent space pushes forward to consistent approximations in image space.

Numbers: Convergence rate O(K^{-1/d}) (Corollary 1).

Automated candidate labeling plus query-only PEFT lets you adapt search per tenant without re-indexing

0.70

0.60

0.80

0

You can adapt search models per tenant without re-indexing documents, cutting compute, operational risk, and per-tenant storage by orders of magnitude while keeping retrieval quality close to joint fine-tuning.

Key finding

The automated benchmark provides dense relevance labels per query.

Numbers: 291 train / 92 test queries; mean 13.61 golden chunks/query (σ=21.41)

Pick 5–15% of instruction data using gradient signal-to-noise from a LoRA ensemble to match or beat full-data fine-tuning

0.60

0.50

0.60

0

Cut fine-tuning cost by selecting a small, high-value subset (5–15%) that preserves or improves model quality and reduces training time.

Key finding

GRADFILTERING-selected 5–15% subsets match or outperform Random and Superfiltering in most judged cases.

Numbers: 19/24 LLM-as-a-judge cases

A 1,000+ task environment and benchmark that shows training on many verifiable tasks boosts LLM reasoning and efficiency

0.60

0

Training on many verifiable tasks yields broader reasoning, faster RL training, and better generalization—useful for robust assistants, automated QA, and data-synthesis pipelines.

Key finding

INTERNBOOTCAMP supports 1000+ reasoning task classes and the authors used a core set of 704 tasks for experiments.

Numbers: 1000+ tasks total; 704 tasks retained for experiments

Recover lost accuracy in corrupted small LMs by training tiny LoRA adapters with synthetic data and logit distillation

0.60

0.70

0

Deployments can silently corrupt weights during conversion or serialization. Recover-LoRA offers a low-cost way to restore accuracy without labeled data or full retraining, saving time and lowering risk for edge and on-device models.

Key finding

Recover-LoRA recovered non-zero accuracy on three of four tested SLMs.

Numbers: AR% = +17.24 (AMD-OLMO-SFT 1B), +13.38 (Llama3.2 1B), +4.95 (DeepSeek-R1 1.5B)

DrICL: tune objectives and reweight noisy demonstrations to stabilize many-shot in‑context learning

0.60

0.45

0

If you push LLMs to use hundreds of examples, performance can fall; DrICL stabilizes many-shot behavior and reduces variability across tasks, so production systems that batch many examples (search reranking, large retrieval contexts, document clustering) get more predictable results.

Key finding

DrICL yields lower cross-dataset performance variance than baselines.

Numbers: variance avg DrICL=1.56e-03 vs MetaICL=2.38e-03 (Table 7)

UrduLLaMA 1.0: fine-tuning LLaMA-3.1 for Urdu with 128M tokens and LoRA

0.40

0.50

0.60

0

Targeted continual pretraining plus LoRA fine-tuning can give large in-domain translation gains with modest compute, enabling localized Urdu services without training from scratch.

Key finding

UrduLLaMA 1.0 raises in-house MT BLEU from 10.87 to 28.01.

Numbers: BLEU 28.01 vs 10.87 (Table 6)

DUAL: pick samples that are both representative and uncertain to label fewer summaries more effectively

0.60

0.65

0

DUAL cuts labeling waste by choosing representative but model-informative documents, improving robustness and lowering selection compute compared to full uncertainty methods.

Key finding

DUAL frequently matches or yields the best ROUGE-1 among compared strategies on evaluated benchmarks.

Numbers: FLAN-T5 AESLC Iter15: DUAL R1=35.57 vs Random 35.51 (Table B2)

DPO + generated trajectories: train recommender RL agents with very little human data and short compute

0.25

0.45

0.60

0

You can reduce expensive human trajectory collection and training compute by using DPO plus synthetic rollouts, letting teams prototype RL-based recommenders faster and cheaper in simulation.

Key finding

DPO trains faster and achieves higher task performance than PPO in the WebShop simulator under short training budgets.

Numbers: DPO ~19% success after ~3,000 steps (30–60 min) vs PPO ~15% after 2 hours
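
For readers unfamiliar with DPO, the standard per-pair objective can be sketched as below. It assumes summed sequence log-probabilities for a preferred ("chosen") and a dispreferred ("rejected") trajectory under both the trained policy and a frozen reference model. The value `beta=0.1` and the function name are illustrative assumptions, not settings reported here.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Usage: the loss drops below log(2) once the policy favors the chosen
# trajectory more strongly than the reference does.
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)
```

No reward model or on-policy rollout is needed to evaluate this loss, which is one reason DPO tends to be cheaper per step than PPO.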

Fine-tune on the model's own correct answers to avoid forgetting and keep generality

0.60

0.40

0

SSR lets you specialize a model for a task without erasing its existing skills, reducing risk when deploying fine-tuned LLMs across multiple use cases.

Key finding

SSR sharply reduces catastrophic forgetting on broad benchmarks compared to SFT.

Numbers: SFT avg change −16.7% vs SSR −2.3% (trained on MD2D) across MMLU/TruthfulQA/GSM8k/Hellaswag
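
Building a fine-tuning set from the model's own correct answers can be sketched as a simple filter. The sketch assumes the model's response is used as the target when it is judged correct and that the gold response is used otherwise; that fallback, and the `generate` and `is_correct` callables, are hypothetical placeholders rather than details confirmed by this summary.

```python
def build_ssr_dataset(examples, generate, is_correct):
    """Prefer the model's own correct output as the training target.

    examples: list of {"prompt": str, "gold": str}
    generate: callable prompt -> model response (hypothetical)
    is_correct: callable (response, gold) -> bool (hypothetical judge)
    """
    dataset = []
    for ex in examples:
        pred = generate(ex["prompt"])
        target = pred if is_correct(pred, ex["gold"]) else ex["gold"]
        dataset.append({"prompt": ex["prompt"], "target": target})
    return dataset

# Usage with a stub model and a substring-match judge.
stub_answers = {"2+2?": "The answer is 4", "3+3?": "The answer is 5"}
data = build_ssr_dataset(
    [{"prompt": "2+2?", "gold": "4"}, {"prompt": "3+3?", "gold": "6"}],
    generate=lambda p: stub_answers[p],
    is_correct=lambda pred, gold: gold in pred,
)
```

Training on targets the model already produces keeps the update close to its current output distribution, which is the intuition behind the reduced forgetting.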

RIRO: reshape inputs then refine outputs to boost LLMs on tiny domain datasets

0.50

0.60

0

RIRO improves LLM output quality when labeled data is scarce, reducing manual test-writing and correction time while keeping compute costs lower via QLoRA.

Key finding

Full RIRO pipeline increases BLEU from 0.55 (Phi-2 baseline) to 0.72

Numbers: BLEU: Phi-2 0.55 → RIRO 0.72