Automatically pick high-quality instruction examples to finetune LLMs and cut training cost

Overview

Decision Snapshot: Needs Validation

Method uses simple, measurable signals and standard search (BLENDSEARCH); evidence covers multiple datasets and ablations but is limited to LLAMA-family models and specific evaluation sets.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Yihan Cao, Yanbin Kang, Chi Wang, Lichao Sun

Links

Abstract / PDF / Data

Why It Matters For Business

Score and pick a few thousand high-quality instruction examples to cut finetuning time and GPU cost while keeping or improving human-facing behavior.

Who Should Care

ML Engineer

Product Manager

CTO

Data Scientist

Summary TLDR

The paper introduces INSTRUCTMINING: a lightweight rule that scores instruction–response pairs with simple language indicators (reward-model score, UniEval scores, lengths, perplexity) and then uses BLENDSEARCH to find a small, high-quality subset for finetuning. On LLAMA-2-7B, selecting a few thousand high-scoring examples gives similar or better human-facing behavior versus much larger random sets, reducing finetune time and cost. The work also documents a double-descent effect in finetuning: performance can worsen as you add medium amounts of data and then recover with much larger sets.

Problem Statement

Instruction-tuning helps LLMs follow human prompts, but picking which instruction examples to finetune on is time-consuming and expensive. The paper asks: can we automatically score and select a small, high-quality subset of instruction data to get good finetuned models faster and cheaper?

Main Contribution

INSTRUCTMINING: a linear scoring rule that predicts a finetuned model's inference loss from simple indicators computed on examples

A data-selector pipeline that ranks examples by the rule and uses BLENDSEARCH to find the best subset size
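The two contributions above can be pictured with a minimal sketch of a linear scoring rule followed by ranking. The indicator names (`reward`, `unieval`), the weights, and the intercept are hypothetical placeholders chosen for illustration; they are not the coefficients fitted in the paper, and the paper's rule uses more indicators than shown here.

```python
from dataclasses import dataclass, field


@dataclass
class Example:
    instruction: str
    response: str
    # Indicator name -> value, e.g. a reward-model score or a UniEval score.
    indicators: dict = field(default_factory=dict)


# ASSUMPTION: illustrative weights only, not the paper's fitted coefficients.
# Negative weights mean "a higher indicator value predicts a lower loss".
WEIGHTS = {"reward": -0.10, "unieval": -0.20}
INTERCEPT = 1.0


def predicted_loss(example):
    """Linear rule: estimated inference loss if the model is finetuned on data like this."""
    return INTERCEPT + sum(w * example.indicators[name] for name, w in WEIGHTS.items())


def rank_examples(pool):
    """Sort so that the lowest predicted loss (highest estimated quality) comes first."""
    return sorted(pool, key=predicted_loss)


pool = [
    Example("Summarize the text.", "ok", {"reward": 0.5, "unieval": 0.4}),
    Example("Explain recursion.", "A function that calls itself.", {"reward": 2.0, "unieval": 0.9}),
]
ranked = rank_examples(pool)
print([ex.instruction for ex in ranked])
```

Because the rule is linear, each indicator's contribution can be inspected directly, which is part of why this kind of selector is cheap to audit.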

Key Findings

A small selected subset (≈2,532 examples) produced a competitive finetuned model.

Numbers: Selected subset = 2,532 examples (≈2.5% of 100k)

Practical Use: You can often train a useful instruction-following model with a few thousand high-quality examples instead of tens or hundreds of thousands, cutting training time and cost.

Evidence Ref: Table 3; BlendSearch results

INSTRUCTMINING improved OPENLLM average metric vs base LLAMA-2-7B by about 4.93 points on their tests.

Numbers: OPENLLM Avg: LLAMA-2-7B 54.32 → INSTRUCTMINING-Selected (40k) 59.25 (+4.93)

Practical Use: Scoring and selecting examples can materially raise standard QA/general benchmarks without reworking the model architecture.

Evidence Ref: Table 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
OPENLLM average metric	59.25	LLAMA-2-7B 54.32	+4.93	OPENLLM aggregated (ARC, HellaSwag, MMLU, TruthfulQA)	Table 4 shows INSTRUCTMINING-Selected (40k) = 59.25 vs LLAMA-2-7B = 54.32	Table 4
GPT-4 pairwise preference (win or tie)	64.67%	VICUNA-1.5-7B	—	LLM-AS-A-JUDGE head-to-head	Figure 3 and text: INSTRUCTMINING model equal-or-better vs Vicuna in 64.67% of cases	Section 4.2 / Figure 3

What To Try In 7 Days

Compute reward-model and UniEval scores on your instruction pool.

Sort examples by the INSTRUCTMINING-style rule (reward + UniEval signals).

Finetune a small model on the top 1k–5k examples and compare with a random baseline on a held-out evaluation set (e.g., MT-BENCH).
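The comparison in the last step is only fair if the random baseline has the same size as the selected subset. A minimal sketch of building both subsets follows; the pool layout, the `quality` field, and the function names are hypothetical, and the finetuning and evaluation steps are left out.

```python
import random


def select_top_k(pool, quality, k):
    """Return the k examples with the highest quality score."""
    return sorted(pool, key=quality, reverse=True)[:k]


def select_random_k(pool, k, seed=0):
    """Return k examples sampled without replacement; seeded so the baseline is repeatable."""
    return random.Random(seed).sample(pool, k)


# ASSUMPTION: a toy pool where each example already carries a precomputed score.
toy_pool = [{"id": i, "quality": i / 10} for i in range(10)]

top = select_top_k(toy_pool, lambda ex: ex["quality"], k=3)
baseline = select_random_k(toy_pool, k=3, seed=0)

print([ex["id"] for ex in top])
print([ex["id"] for ex in baseline])
```

Both subsets would then be finetuned with the same recipe and scored on the same held-out set, so that any difference can be attributed to selection rather than to data volume.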

Optimization Features

Infra Optimization

reduced GPU hours by training on small subsets (examples: 15 min–10 hrs vs 30 hrs)

System Optimization

treat dataset size as cost in Bayesian search

Training Optimization

subset selection with BLENDSEARCH

indicator-based filtering to reduce training size
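One simple way to read "treat dataset size as cost" is to prefer the smallest subset whose loss is close to the best loss found. The sketch below uses a plain exhaustive search over a handful of candidate sizes as a stand-in; it is not BLENDSEARCH, and the loss values are made-up numbers for illustration, not results from the paper. The toy curve is deliberately non-monotone to mimic the double-descent effect described in the summary, which is why the search evaluates every candidate instead of stopping at the first increase in loss.

```python
def pick_subset_size(candidate_sizes, evaluate, tolerance=0.01):
    """Evaluate every candidate size; return the smallest size whose loss is
    within `tolerance` of the best loss, plus all measured losses."""
    losses = {n: evaluate(n) for n in candidate_sizes}
    best = min(losses.values())
    chosen = min(n for n, loss in losses.items() if loss <= best + tolerance)
    return chosen, losses


# ASSUMPTION: invented loss values shaped like a double-descent curve
# (loss drops, rises for medium sizes, then recovers for large sizes).
TOY_LOSSES = {1000: 0.80, 2500: 0.70, 10000: 0.78, 40000: 0.74, 100000: 0.705}


def toy_evaluate(n):
    """Hypothetical stand-in for 'finetune on the top-n examples and measure eval loss'."""
    return TOY_LOSSES[n]


chosen_size, measured = pick_subset_size(sorted(TOY_LOSSES), toy_evaluate)
print(chosen_size)
```

In practice each call to the evaluation function is a full finetuning run, so a budget-aware search method is used to avoid evaluating every size; the selection criterion, not the search strategy, is what this sketch illustrates.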

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://huggingface.co/Open-Orca/OpenOrca

https://github.com/tatsu-lab/stanford_alpaca

https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm

Risks & Boundaries

Limitations

Experiments focused on LLAMA-family models; transfer to very different architectures is untested

Evaluation uses single-turn examples; multi-turn dialogue behavior is not measured

When Not To Use

You already have a large, high-quality human-curated instruction dataset

Your application requires multi-turn or conversational context not present in the eval set

Failure Modes

Selected subset overfits the golden evaluation set and generalizes poorly

Indicator models (reward / UniEval) misjudge niche domains, leading to mistaken selection

Core Entities

Models

LLAMA-2-7B

LLAMA-2-13B

LLAMA-1-7B

VICUNA-1.5-7B

STABLEBELUGA-7B

Metrics

inference loss (on SELF-INSTRUCT and MT-BENCH)

OPENLLM average metric

GPT-4 pairwise preference

ARC / HellaSwag / MMLU / TruthfulQA scores

Datasets

OpenOrca (OpenOrca-GPT3.5 / GPT4 subsets)

Dolly-15K

Alpaca

Open-Assistant

StackExchange

Wikihow

Self-Instruct

MT-BENCH

OPENLLM (ARC, HellaSwag, MMLU, TruthfulQA)

Benchmarks

LLM-AS-A-JUDGE

OPENLLM

MT-BENCH

