Automatically pick high-quality instruction examples to finetune LLMs and cut training cost

July 12, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

2

Authors

Yihan Cao, Yanbin Kang, Chi Wang, Lichao Sun

Why It Matters For Business

Score and pick a few thousand high-quality instruction examples to cut finetuning time and GPU cost while keeping or improving human-facing behavior.

Summary TLDR

The paper introduces INSTRUCTMINING: a lightweight rule that scores instruction–response pairs with simple language indicators (reward-model score, UniEval scores, lengths, perplexity), paired with BLENDSEARCH to find a small, high-quality subset for finetuning. On LLAMA-2-7B, selecting a few thousand high-scoring examples gives similar or better human-facing behavior than much larger random sets, reducing finetuning time and cost. The work also documents a double-descent effect in finetuning: performance can worsen as you add medium amounts of data and then recover with much larger sets.
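As a rough illustration of the scoring idea, the sketch below applies a linear rule to per-example indicators and ranks a pool by the result. The indicator keys mirror the list in the summary; the weights, the intercept, and the names `Example`, `predicted_log_loss`, and `rank_examples` are hypothetical placeholders, not the coefficients or code from the paper.

```python
from dataclasses import dataclass

# Hypothetical weights for illustration only. The paper fits its own
# coefficients by regression; they are not reproduced here.
WEIGHTS = {
    "reward": -0.5,
    "unieval": -0.3,
    "length": 0.0,
    "perplexity": 0.05,
}
INTERCEPT = 1.0  # also a placeholder


@dataclass
class Example:
    instruction: str
    response: str
    indicators: dict  # indicator name -> value, computed upstream


def predicted_log_loss(example: Example) -> float:
    """Linear rule: intercept plus the weighted sum of indicator values."""
    return INTERCEPT + sum(
        weight * example.indicators[name] for name, weight in WEIGHTS.items()
    )


def rank_examples(pool: list) -> list:
    """Sort so the examples with the lowest predicted loss come first."""
    return sorted(pool, key=predicted_log_loss)
```

Keeping the rule linear means each example can be scored once, independently of the others, before any finetuning run is launched.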

Problem Statement

Instruction-tuning helps LLMs follow human prompts, but picking which instruction examples to finetune on is time-consuming and expensive. The paper asks: can we automatically score and select a small, high-quality subset of instruction data to get good finetuned models faster and cheaper?

Main Contribution

INSTRUCTMINING: a linear scoring rule that predicts a finetuned model's inference loss from simple indicators computed on examples

A data-selector pipeline that ranks examples by the rule and uses BLENDSEARCH to find the best subset size

Empirical finding of double descent in instruction finetuning and demonstrations that small, selected subsets (thousands of examples) can match larger training sets on human-facing benchmarks
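The subset-size step of the selector can be sketched as follows. This is a simplification under stated assumptions: a plain scan over candidate sizes stands in for BLENDSEARCH, which is not reproduced here, and `evaluate_subset` is a hypothetical callback that would finetune on the given examples and return an evaluation loss.

```python
from typing import Callable, Sequence


def search_subset_size(
    ranked: Sequence,
    evaluate_subset: Callable[[Sequence], float],
    candidate_sizes: Sequence[int],
) -> tuple:
    """Return (best_size, best_loss) over the candidate sizes.

    `ranked` must already be sorted best-first by the scoring rule.
    `evaluate_subset` is a hypothetical callback: in practice it would
    finetune on the subset and return an evaluation loss.
    """
    best_size, best_loss = None, float("inf")
    for size in candidate_sizes:
        if size <= 0 or size > len(ranked):
            continue  # skip sizes the pool cannot supply
        loss = evaluate_subset(ranked[:size])
        if loss < best_loss:
            best_size, best_loss = size, loss
    return best_size, best_loss
```

Every call to `evaluate_subset` is a full finetuning run, so a scan like this gets expensive quickly; that is presumably where a sample-efficient search pays off.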

Key Findings

A small selected subset (≈2,532 examples) produced a competitive finetuned model.

Numbers: Selected subset = 2,532 examples (≈2.5% of 100k)

Finetuning on INSTRUCTMINING-selected data raised the OPENLLM average over base LLAMA-2-7B by 4.93 points in the authors' tests.

Numbers: OPENLLM Avg: LLAMA-2-7B 54.32 → INSTRUCTMINING-Selected (40k) 59.25 (+4.93)

GPT-4 preference evaluation judged INSTRUCTMINING-finetuned outputs equal to or better than those of Vicuna-1.5-7B in ~64.7% of cases.

Numbers: Win-or-tie rate = 64.67% (GPT-4 head-to-head)

Reward-model score and UniEval signals were the strongest predictors in the linear rule; removing reward raised MT-BENCH loss by 0.051.

Numbers: Ablation: remove Rew → Loss(MT-BENCH) ↑ 0.051

Finetuning performance shows non-monotonic behavior (double descent) around ~10k examples.

Numbers: Performance worsened after data grew to ≈10,000 examples, then improved with larger sizes (plots in Figures 1 and 4)

Results

OPENLLM average metric

Value: 59.25

Baseline: LLAMA-2-7B 54.32

GPT-4 pairwise preference (win or tie)

Value: 64.67%

Baseline: VICUNA-1.5-7B

Inference loss (MT-BENCH)

Value: 0.711

Baseline: Random 1,000 from OpenOrca = 0.746

Selected-subset size found by BLENDSEARCH

Value: 2,532

Baseline: manual top-K

Who Should Care

What To Try In 7 Days

Compute reward-model and UniEval scores on your instruction pool.

Sort examples by the INSTRUCTMINING-style rule (reward + UniEval signals).

Finetune a small model on the top 1k–5k examples and compare with a random baseline on a held-out evaluation set (e.g., MT-BENCH).
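For the comparison in the last step, one way to build the two training sets is sketched below. It assumes the pool has already been sorted best-first by the scoring rule; `build_comparison_sets` and its parameters are hypothetical names, and the finetuning and evaluation themselves are left to your own tooling.

```python
import random


def build_comparison_sets(ranked, k, seed=0):
    """Return (selected, baseline): the top-k examples and k random ones.

    `ranked` is the instruction pool sorted best-first by the scoring rule.
    Both sets have the same size, so the comparison is not confounded by
    the amount of training data.
    """
    if not 0 < k <= len(ranked):
        raise ValueError("k must be between 1 and the pool size")
    selected = list(ranked[:k])
    rng = random.Random(seed)  # fixed seed keeps the baseline reproducible
    baseline = rng.sample(list(ranked), k)
    return selected, baseline
```

Finetuning once on each set with identical hyperparameters, then scoring both models on the same held-out set, gives a like-for-like check of whether selection helps on your data.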

Optimization Features

Infra Optimization

  • reduced GPU hours by training on small subsets (examples: 15 min–10 hrs vs 30 hrs)

System Optimization

  • treat dataset size as cost in Bayesian search

Training Optimization

  • subset selection with BLENDSEARCH
  • indicator-based filtering to reduce training size

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments focused on LLAMA-family models; transfer to very different architectures is untested
  • Evaluation uses single-turn examples; multi-turn dialogue behavior is not measured
  • Rule assumes a linear relationship between indicators and log loss, which may not hold universally
  • Selection depends on reward and UniEval models that carry their own biases

When Not To Use

  • You already have a large, high-quality human-curated instruction dataset
  • Your application requires multi-turn or conversational context not present in the eval set
  • You cannot compute the required indicators (reward/UniEval/perplexity) for privacy or cost reasons

Failure Modes

  • Selected subset overfits the golden evaluation set and generalizes poorly
  • Indicator models (reward / UniEval) misjudge niche domains, leading to mistaken selection
  • Double descent: choosing an intermediate data size can reduce performance
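To guard against the double-descent failure mode, a simple check over your own measured loss curve might look like the sketch below. It assumes you have already recorded evaluation loss at several dataset sizes; `find_interior_peaks` is a hypothetical helper, and an interior peak is only a hint of non-monotonic behavior, not a statistical test.

```python
def find_interior_peaks(curve):
    """Return dataset sizes where loss is higher than at both neighbors.

    `curve` is a list of (dataset_size, eval_loss) pairs measured by the
    caller. Points are sorted by size before neighbors are compared.
    """
    points = sorted(curve)
    peaks = []
    for prev, cur, nxt in zip(points, points[1:], points[2:]):
        if cur[1] > prev[1] and cur[1] > nxt[1]:
            peaks.append(cur[0])
    return peaks
```

If the function returns any sizes, it may be worth evaluating both a smaller and a larger subset before settling on a training size near a peak.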

Core Entities

Models

  • LLAMA-2-7B
  • LLAMA-2-13B
  • LLAMA-1-7B
  • VICUNA-1.5-7B
  • STABLEBELUGA-7B

Metrics

  • inference loss (on SELF-INSTRUCT and MT-BENCH)
  • OPENLLM average metric
  • GPT-4 pairwise preference
  • ARC / HellaSwag / MMLU / TruthfulQA scores

Datasets

  • OpenOrca (OpenOrca-GPT3.5 / GPT4 subsets)
  • Dolly-15K
  • Alpaca
  • Open-Assistant
  • StackExchange
  • Wikihow
  • Self-Instruct
  • MT-BENCH
  • OPENLLM (ARC, HellaSwag, MMLU, TruthfulQA)

Benchmarks

  • LLM-AS-A-JUDGE
  • OPENLLM
  • MT-BENCH