Automatically pick high-quality instruction examples to finetune LLMs and cut training cost

July 12, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

Method uses simple, measurable signals and standard search (BLENDSEARCH); evidence covers multiple datasets and ablations but is limited to LLAMA-family models and specific evaluation sets.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Yihan Cao, Yanbin Kang, Chi Wang, Lichao Sun

Links

Abstract / PDF / Data

Why It Matters For Business

Score and pick a few thousand high-quality instruction examples to cut finetuning time and GPU cost while keeping or improving human-facing behavior.

Who Should Care

Summary TLDR

The paper introduces INSTRUCTMINING: a lightweight rule that scores instruction–response pairs with simple language indicators (reward-model score, UniEval scores, lengths, perplexity) and then uses BLENDSEARCH to find a small, high-quality subset for finetuning. On LLAMA-2-7B, selecting a few thousand high-scoring examples gives similar or better human-facing behavior versus much larger random sets, reducing finetune time and cost. The work also documents a double-descent effect in finetuning: performance can worsen as you add medium amounts of data and then recover with much larger sets.

Problem Statement

Instruction-tuning helps LLMs follow human prompts, but picking which instruction examples to finetune on is time-consuming and expensive. The paper asks: can we automatically score and select a small, high-quality subset of instruction data to get good finetuned models faster and cheaper?

Main Contribution

INSTRUCTMINING: a linear scoring rule that predicts a finetuned model's inference loss from simple indicators computed on examples

A data-selector pipeline that ranks examples by the rule and uses BLENDSEARCH to find the best subset size
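A minimal sketch of what a linear scoring rule of this kind could look like, assuming four indicators per example. The indicator names, weights, and intercept below are hypothetical placeholders, not the coefficients fitted in the paper; the only idea taken from the summary is that a weighted sum of indicators predicts inference loss, so lower predicted loss means a better example.

```python
from typing import Dict, List

# Hypothetical weights for illustration only; NOT the paper's fitted coefficients.
WEIGHTS: Dict[str, float] = {
    "reward": -0.05,            # reward-model score (assumed name)
    "understandability": -0.9,  # UniEval-style score (assumed name)
    "naturalness": -0.3,        # UniEval-style score (assumed name)
    "coherence": -0.2,          # UniEval-style score (assumed name)
}
INTERCEPT = 1.5  # hypothetical


def predicted_loss(indicators: Dict[str, float]) -> float:
    """Linear rule: intercept plus the weighted sum of indicator values."""
    return INTERCEPT + sum(w * indicators[name] for name, w in WEIGHTS.items())


def rank_examples(pool: List[Dict[str, float]]) -> List[int]:
    """Return pool indices ordered from lowest to highest predicted loss."""
    return sorted(range(len(pool)), key=lambda i: predicted_loss(pool[i]))
```

In practice the weights would come from regressing measured inference loss on the indicators; here they are fixed by hand so the ranking step can be read in isolation.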

Key Findings

A small selected subset (2,532 examples) produced a competitive finetuned model.

Numbers: Selected subset = 2,532 examples (≈2.5% of 100k)

Practical Use: You can often train a useful instruction-following model with a few thousand high-quality examples instead of tens or hundreds of thousands, cutting training time and cost.

Evidence Ref: Table 3; BlendSearch results

INSTRUCTMINING improved OPENLLM average metric vs base LLAMA-2-7B by about 4.93 points on their tests.

Numbers: OPENLLM Avg: LLAMA-2-7B 54.32 → INSTRUCTMINING-Selected (40k) 59.25 (+4.93)

Practical Use: Scoring and selecting examples can materially raise standard QA/general benchmarks without reworking the model architecture.

Evidence Ref: Table 4

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| OPENLLM average metric | 59.25 | LLAMA-2-7B 54.32 | +4.93 | OPENLLM aggregated (ARC, HellaSwag, MMLU, TruthfulQA) | Table 4 shows INSTRUCTMINING-Selected (40k) = 59.25 vs LLAMA-2-7B = 54.32 | Table 4 |
| GPT-4 pairwise preference (win or tie) | 64.67% | VICUNA-1.5-7B | | LLM-AS-A-JUDGE head-to-head | Figure 3 and text: INSTRUCTMINING model equal-or-better vs Vicuna in 64.67% of cases | Section 4.2 / Figure 3 |
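For readers reproducing these two figures on their own runs, a small sketch of how a win-or-tie rate and a benchmark delta are typically computed. The verdict labels ("win", "tie", "loss") and function names are assumptions for illustration; the judge output format used in the paper is not given here.

```python
from collections import Counter
from typing import Iterable


def win_or_tie_rate(verdicts: Iterable[str]) -> float:
    """Fraction of pairwise comparisons judged 'win' or 'tie' for the candidate."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verdicts to aggregate")
    return (counts["win"] + counts["tie"]) / total


def benchmark_delta(value: float, baseline: float) -> float:
    """Difference between a model's score and the baseline, rounded to 2 decimals."""
    return round(value - baseline, 2)


# Delta for the OPENLLM average reported above: 59.25 vs 54.32.
openllm_delta = benchmark_delta(59.25, 54.32)
```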

What To Try In 7 Days

Compute reward-model and UniEval scores on your instruction pool.

Sort examples by the INSTRUCTMINING-style rule (reward + UniEval signals).

Finetune a small model on the top 1k–5k examples and compare with a random baseline on a held-out evaluation set (e.g., MT-BENCH).
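The selection step in this plan can be sketched as follows, assuming each example already has a single quality score where higher is better. The function names are hypothetical, and the finetuning and evaluation themselves are left out; the sketch only produces the two index sets to compare.

```python
import random
from typing import List, Sequence


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring examples (higher score = better)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def random_indices(pool_size: int, k: int, seed: int = 0) -> List[int]:
    """Size-matched random baseline, seeded so the comparison is repeatable."""
    rng = random.Random(seed)
    return rng.sample(range(pool_size), k)
```

Keeping the random baseline the same size as the selected subset isolates the effect of selection from the effect of data volume.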

Optimization Features

Infra Optimization
reduced GPU hours by training on small subsets (examples: 15 min–10 hrs vs 30 hrs)
System Optimization
treat dataset size as cost in Bayesian search
Training Optimization
subset selection with BLENDSEARCH
indicator-based filtering to reduce training size
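To illustrate the idea of searching over subset size with size counted as cost, here is a toy sketch. It uses a plain exhaustive search over candidate sizes, not BLENDSEARCH, and the additive size penalty is an assumption about how cost might enter the objective rather than the paper's formulation. `toy_loss` is a made-up curve standing in for "finetune on the top-N examples and measure evaluation loss".

```python
from typing import Callable, Iterable, Optional, Tuple


def search_subset_size(
    candidate_sizes: Iterable[int],
    eval_loss: Callable[[int], float],
    size_penalty: float = 0.0,
) -> Tuple[int, float]:
    """Exhaustive stand-in for a real search: minimize loss + penalty * size."""
    best: Optional[Tuple[int, float]] = None
    for size in candidate_sizes:
        objective = eval_loss(size) + size_penalty * size
        if best is None or objective < best[1]:
            best = (size, objective)
    if best is None:
        raise ValueError("no candidate sizes given")
    return best


def toy_loss(size: int) -> float:
    """Hypothetical U-shaped loss curve with its minimum at 2,500 examples."""
    return 1.0 + ((size - 2500) / 10000) ** 2


sizes = [500, 1000, 2500, 5000, 10000]
best_size, _ = search_subset_size(sizes, toy_loss)
cheap_size, _ = search_subset_size(sizes, toy_loss, size_penalty=1e-4)
```

With no penalty the search picks the size that minimizes loss alone; a positive penalty pushes the choice toward smaller, cheaper subsets.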

Risks & Boundaries

Limitations

Experiments focused on LLAMA-family models; transfer to very different architectures is untested

Evaluation uses single-turn examples; multi-turn dialogue behavior is not measured

When Not To Use

You already have a large, high-quality human-curated instruction dataset

Your application requires multi-turn or conversational context not present in the eval set

Failure Modes

Selected subset overfits the golden evaluation set and generalizes poorly

Indicator models (reward / UniEval) misjudge niche domains, leading to mistaken selection

Core Entities

Models

LLAMA-2-7B, LLAMA-2-13B, LLAMA-1-7B, VICUNA-1.5-7B, STABLEBELUGA-7B

Metrics

inference loss (on SELF-INSTRUCT and MT-BENCH), OPENLLM average metric, GPT-4 pairwise preference, ARC / HellaSwag / MMLU / TruthfulQA scores

Datasets

OpenOrca (OpenOrca-GPT3.5 / GPT4 subsets), Dolly-15K, Alpaca, Open-Assistant, StackExchange, Wikihow, Self-Instruct, MT-BENCH, OPENLLM (ARC, HellaSwag, MMLU, TruthfulQA)

Benchmarks

LLM-AS-A-JUDGE, OPENLLM, MT-BENCH