Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
2
Why It Matters For Business
Score and pick a few thousand high-quality instruction examples to cut finetuning time and GPU cost while keeping or improving human-facing behavior.
Summary TLDR
The paper introduces INSTRUCTMINING, a lightweight rule that scores instruction–response pairs using simple language indicators (reward-model score, UniEval scores, lengths, perplexity), then uses BLENDSEARCH to find a small, high-quality subset for finetuning. On LLAMA-2-7B, selecting a few thousand high-scoring examples yields similar or better human-facing behavior than much larger randomly selected sets, reducing finetuning time and cost. The work also documents a double-descent effect in finetuning: performance can worsen as moderate amounts of data are added, then recover with much larger sets.
Problem Statement
Instruction-tuning helps LLMs follow human prompts, but picking which instruction examples to finetune on is time-consuming and expensive. The paper asks: can we automatically score and select a small, high-quality subset of instruction data to get good finetuned models faster and cheaper?
Main Contribution
INSTRUCTMINING: a linear scoring rule that predicts a finetuned model's inference loss from simple indicators computed on examples
A data-selector pipeline that ranks examples by the rule and uses BLENDSEARCH to find the best subset size
Empirical evidence of double descent in instruction finetuning, plus demonstrations that small selected subsets (thousands of examples) can match larger training sets on human-facing benchmarks
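As a rough illustration of what a linear scoring rule of this kind looks like, the sketch below predicts a log loss from a handful of indicator values and ranks examples by it. The intercept, the weights, their signs, and the indicator key names are all hypothetical placeholders chosen for the example; the paper fits its own coefficients, which are not reproduced here.

```python
# Hypothetical coefficients for illustration only; the paper fits its own
# values, which are not reproduced here.
ASSUMED_INTERCEPT = 1.0
ASSUMED_WEIGHTS = {
    "reward": -0.05,         # reward-model score
    "unieval": -0.20,        # a UniEval quality score
    "output_length": 0.0001, # response length
    "perplexity": 0.01,      # response perplexity
}


def predicted_log_loss(indicators):
    """Linear rule: intercept plus a weighted sum of indicator values."""
    return ASSUMED_INTERCEPT + sum(
        weight * indicators[name] for name, weight in ASSUMED_WEIGHTS.items()
    )


def rank_by_rule(pool):
    """Sort examples so the lowest predicted loss comes first."""
    return sorted(pool, key=lambda ex: predicted_log_loss(ex["indicators"]))


# Two made-up examples with made-up indicator values.
pool = [
    {"id": "a", "indicators": {"reward": 0.5, "unieval": 0.6,
                               "output_length": 120, "perplexity": 9.0}},
    {"id": "b", "indicators": {"reward": 3.0, "unieval": 0.9,
                               "output_length": 200, "perplexity": 5.0}},
]
ranked = rank_by_rule(pool)
```

In practice the weights would come from regressing observed inference loss on indicator values, which is the part this sketch leaves out.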
Key Findings
A small selected subset (≈2,532 examples) produced a competitive finetuned model.
INSTRUCTMINING improved the OPENLLM average metric over base LLAMA-2-7B by about 4.93 points in the authors' tests.
GPT-4 preference evaluation judged outputs from the INSTRUCTMINING-finetuned model equal to or better than those of Vicuna-1.5-7B in ~64.7% of cases.
Reward-model score and UniEval signals were the strongest predictors in the linear rule; removing the reward indicator raised MT-BENCH loss by 0.051.
Finetuning performance is non-monotonic (double descent) at around 10k examples.
Results
OPENLLM average metric
GPT-4 pairwise preference (win or tie)
Inference loss (MT-BENCH)
Selected-subset size found by BLENDSEARCH
Who Should Care
What To Try In 7 Days
Compute reward-model and UniEval scores on your instruction pool.
Sort examples by the INSTRUCTMINING-style rule (reward + UniEval signals).
Finetune a small model on the top 1k–5k examples and compare with a random baseline on a held-out evaluation set (e.g., MT-BENCH).
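The selection step in this experiment can be sketched as follows, assuming each example already carries a single precomputed quality score where higher is better (for instance, the negated output of a loss-predicting rule). The field name `score`, the helper names, and the synthetic pool are hypothetical; finetuning and evaluation are outside the scope of the sketch.

```python
import random


def select_top_k(scored_pool, k):
    """Keep the k examples with the highest quality score."""
    return sorted(scored_pool, key=lambda ex: ex["score"], reverse=True)[:k]


def select_random_k(scored_pool, k, seed=0):
    """Random baseline of the same size, seeded so the run is repeatable."""
    return random.Random(seed).sample(scored_pool, k)


# Tiny synthetic pool; the scores are made up for illustration.
scored_pool = [{"id": i, "score": float(i % 7)} for i in range(50)]
top = select_top_k(scored_pool, k=5)
baseline = select_random_k(scored_pool, k=5)
```

Keeping both subsets the same size means any difference on the held-out set can be attributed to which examples were chosen rather than how many.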
Optimization Features
Infra Optimization
- reduced GPU hours by training on small subsets (e.g., 15 min–10 hrs vs 30 hrs)
System Optimization
- treat dataset size as cost in Bayesian search
Training Optimization
- subset selection with BLENDSEARCH
- indicator-based filtering to reduce training size
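The idea of treating dataset size as a search cost can be illustrated with a much simpler stand-in than BLENDSEARCH: try candidate subset sizes from cheapest to most expensive until a budget of training examples is spent, and keep the size with the lowest evaluation loss. This is not the BLENDSEARCH algorithm. The function names, the budget unit, and the toy loss curve are assumptions made for the example.

```python
def search_subset_size(candidate_sizes, evaluate_loss, cost_budget):
    """Cheapest-first search over subset sizes under a total cost budget.

    The cost of one trial is taken to be the number of training examples.
    """
    best_size, best_loss, spent = None, float("inf"), 0
    for size in sorted(candidate_sizes):
        if spent + size > cost_budget:
            break  # the next trial would exceed the budget
        spent += size
        loss = evaluate_loss(size)  # stands in for finetune + evaluate
        if loss < best_loss:
            best_size, best_loss = size, loss
    return best_size, best_loss, spent


def toy_loss(size):
    """Synthetic loss curve for demonstration; not taken from the paper."""
    return abs(size - 2500) / 10000 + 0.7


best_size, best_loss, spent = search_subset_size(
    [1000, 2500, 5000, 10000, 40000], toy_loss, cost_budget=20000
)
```

With this budget the 40,000-example trial is never run, which is the point of cost-aware search: expensive configurations are only tried if cheaper ones leave room for them.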
Reproducibility
Data Urls
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments focused on LLAMA-family models; transfer to very different architectures is untested
- Evaluation uses single-turn examples; multi-turn dialogue behavior is not measured
- Rule assumes a linear relationship between indicators and log loss, which may not hold universally
- Selection depends on reward and UniEval models that carry their own biases
When Not To Use
- You already have a large, high-quality human-curated instruction dataset
- Your application requires multi-turn or conversational context not present in the eval set
- You cannot compute the required indicators (reward/UniEval/perplexity) for privacy or cost reasons
Failure Modes
- Selected subset overfits the golden evaluation set and generalizes poorly
- Indicator models (reward / UniEval) misjudge niche domains, leading to mistaken selection
- Double descent: choosing an intermediate data size can reduce performance
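One hedged way to guard against the double-descent failure mode is to evaluate several subset sizes and flag any size whose loss is higher than at both neighbouring sizes. The helper below is a minimal sketch of that check; its name and the example curve are hypothetical, and the loss values are synthetic rather than results from the paper.

```python
def find_loss_bumps(curve):
    """Return sizes whose loss exceeds the loss at both neighbouring sizes.

    `curve` is a list of (subset_size, eval_loss) pairs in any order.
    """
    points = sorted(curve)
    return [
        size
        for (_, prev_loss), (size, loss), (_, next_loss)
        in zip(points, points[1:], points[2:])
        if loss > prev_loss and loss > next_loss
    ]


# Synthetic curve for illustration only.
synthetic_curve = [(1000, 0.80), (5000, 0.74), (10000, 0.79), (40000, 0.72)]
bumps = find_loss_bumps(synthetic_curve)
```

A flagged size suggests either stopping at a smaller subset or moving to a much larger one, rather than settling in between.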
Core Entities
Models
- LLAMA-2-7B
- LLAMA-2-13B
- LLAMA-1-7B
- VICUNA-1.5-7B
- STABLEBELUGA-7B
Metrics
- inference loss (on SELF-INSTRUCT and MT-BENCH)
- OPENLLM average metric
- GPT-4 pairwise preference
- ARC / HellaSwag / MMLU / TruthfulQA scores
Datasets
- OpenOrca (OpenOrca-GPT3.5 / GPT4 subsets)
- Dolly-15K
- Alpaca
- Open-Assistant
- StackExchange
- Wikihow
- Self-Instruct
- MT-BENCH
- OPENLLM (ARC, HellaSwag, MMLU, TruthfulQA)
Benchmarks
- LLM-AS-A-JUDGE
- OPENLLM
- MT-BENCH

