Have a small instruction-tuned LLM? Make it a task expert by letting it synthesize its own training data and finetune on it.

July 16, 2024 · 7 min

Overview

Production Readiness: 0.6

Novelty Score: 0.6

Cost Impact Score: 0.7

Citation Count: 2

Authors

Chenyang Zhao, Xueying Jia, Vijay Viswanathan, Tongshuang Wu, Graham Neubig

Links

Abstract / PDF

Why It Matters For Business

You can boost a deployed 7B instruction-tuned LLM for a target task without buying a stronger teacher model or large labeled sets, cutting data costs and legal dependency while improving task accuracy.

Summary TLDR

SELF-GUIDE is a multi-stage recipe: ask the target model to generate many task-specific input-output pairs from a few examples, filter them (noise and length), then finetune the same model on that synthetic dataset. On Super-NaturalInstructions V2 using Vicuna-7b-1.5, SELF-GUIDE raises average Exact Match for classification by +14.5 points and ROUGE‑L for generation by +17.9 points versus prompting. Finetuning on self-generated data beats using the same data in-context by ~20 points on average. Limitations: tested only in English and on a 7B model; misuse risk acknowledged.

Problem Statement

Prompting a large language model often underperforms supervised finetuning, but task-specific labeled data and stronger 'teacher' models are costly or unavailable. Can a model bootstrap itself into a task expert using only a task instruction and a few examples?

Main Contribution

SELF-GUIDE: a practical pipeline where the student LLM self-generates input-output pairs from a few examples, filters them, then finetunes itself.

Empirical evidence that finetuning on self-synthesized data materially improves task accuracy over prompting and few-shot finetuning on the same few gold examples.

A small set of simple filters (noise terms, length-based range) and temperature tuning shown to be important to data quality and final performance.
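The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `generate(prompt, temperature)` callable, the prompt layout, the `keep` predicate, and the default temperatures are all assumptions made for the example, and `toy_generate` is a stand-in for a real model call.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical interface: generate(prompt, temperature) -> completion text.
Generate = Callable[[str, float], str]


def self_guide_dataset(
    generate: Generate,
    instruction: str,
    examples: List[Tuple[str, str]],
    n_inputs: int = 8,
    input_temperature: float = 1.0,   # assumed value; tune per task
    output_temperature: float = 0.0,  # assumed value; tune per task
    keep: Callable[[str, str], bool] = lambda x, y: True,  # filter hook
) -> List[Tuple[str, str]]:
    """Build synthetic (input, output) pairs using the target model itself."""
    demos = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    pairs, seen = [], set()
    for _ in range(n_inputs):
        # Stage 1: sample a new task input (higher temperature for diversity).
        new_input = generate(f"{instruction}\n{demos}\nInput:", input_temperature).strip()
        if not new_input or new_input in seen:
            continue  # skip empty or duplicate inputs
        seen.add(new_input)
        # Stage 2: annotate the input with the same model (lower temperature).
        prompt = f"{instruction}\n{demos}\nInput: {new_input}\nOutput:"
        output = generate(prompt, output_temperature).strip()
        # Stage 3: apply the filter before keeping the pair.
        if output and keep(new_input, output):
            pairs.append((new_input, output))
    return pairs


def toy_generate(prompt: str, temperature: float) -> str:
    """Stand-in for a real LLM call so the sketch runs without a model."""
    if prompt.endswith("Output:"):
        last_input = prompt.rsplit("Input: ", 1)[1].split("\n", 1)[0]
        return "positive" if "good" in last_input else "negative"
    return random.choice(["good movie", "bad movie", "good food", "bad service"])


if __name__ == "__main__":
    data = self_guide_dataset(toy_generate, "Classify sentiment.", [("good day", "positive")])
    print(data)
```

The returned pairs would then be used to finetune the same model that produced them; that step depends on the training stack and is left out here.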

Key Findings

SELF-GUIDE improves classification Exact Match by ~14.5 absolute points over prompting on evaluated held-out tasks.

Numbers: Exact Match: baseline 33.2 → SELF-GUIDE 47.7; ∆=+14.5

SELF-GUIDE improves generation quality (ROUGE‑L) by ~17.9 absolute points over prompting on evaluated held-out tasks.

Numbers: ROUGE-L: baseline 41.6 → SELF-GUIDE 59.4; ∆=+17.9

Finetuning on self-generated examples outperforms using the same examples in-context by a wide margin.

Numbers: Finetuning vs Self-ICL: ~+20 absolute points average improvement

SELF-GUIDE brings the distribution of predicted classification labels closer to the true label distribution and eliminates irrelevant answers.

Numbers: L1 distance: baseline 1.10 → SELF-GUIDE 0.75; Irrelevant ratio: baseline 0.45 → SELF-GUIDE 0.00
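One plausible way to compute these two diagnostics is sketched below. The exact normalization used in the paper is not stated here, so this sketch assumes that irrelevant predictions (anything outside the label set) still count toward the total, and that the L1 distance is the sum of absolute differences over the valid labels. The function names are illustrative.

```python
from collections import Counter
from typing import Dict, Sequence


def label_distribution(values: Sequence[str], labels: Sequence[str]) -> Dict[str, float]:
    """Fraction of `values` equal to each valid label (assumes values is non-empty)."""
    counts = Counter(values)
    return {lab: counts[lab] / len(values) for lab in labels}


def l1_distance(preds: Sequence[str], gold: Sequence[str], labels: Sequence[str]) -> float:
    """Sum of absolute differences between predicted and gold label frequencies."""
    p = label_distribution(preds, labels)
    q = label_distribution(gold, labels)
    return sum(abs(p[lab] - q[lab]) for lab in labels)


def irrelevant_ratio(preds: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that are not one of the valid labels."""
    valid = set(labels)
    return sum(1 for p in preds if p not in valid) / len(preds)


if __name__ == "__main__":
    labels = ["yes", "no"]
    gold = ["yes"] * 4 + ["no"] * 4
    preds = ["yes"] * 6 + ["no"] + ["maybe"]
    print(l1_distance(preds, gold, labels), irrelevant_ratio(preds, labels))
```

Under this definition the distance ranges from 0 (identical distributions) to 2 (disjoint ones), which is consistent with the reported baseline value being above 1.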

Simple filters matter: removing the ablation (noise) filter drops classification performance by 4.1%, and removing the length filter drops generation performance by 3.7%.

Numbers: Ablation (noise) filter removed: -4.1% classification; Length filter removed: -3.7% generation
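The two filters could look something like the sketch below. The noise-term list is hypothetical, and the length rule (keep texts within two standard deviations of the mean word count) is an assumption chosen to illustrate a "length-based range"; the summary does not give the exact thresholds.

```python
import statistics
from typing import List, Sequence

# Hypothetical noise terms; a real list would be curated for the model at hand.
NOISE_TERMS = ("as an ai", "sure, here", "i'm sorry")


def noise_filter(texts: Sequence[str], noise_terms: Sequence[str] = NOISE_TERMS) -> List[str]:
    """Drop texts that contain any noise term (case-insensitive substring match)."""
    return [t for t in texts if not any(term in t.lower() for term in noise_terms)]


def length_filter(texts: Sequence[str], num_std: float = 2.0) -> List[str]:
    """Keep texts whose word count lies within mean +/- num_std * std (assumed rule)."""
    if len(texts) < 2:
        return list(texts)
    lengths = [len(t.split()) for t in texts]
    mean = statistics.mean(lengths)
    std = statistics.pstdev(lengths)
    low, high = mean - num_std * std, mean + num_std * std
    return [t for t, n in zip(texts, lengths) if low <= n <= high]


if __name__ == "__main__":
    samples = ["short answer"] * 9 + ["word " * 200, "Sure, here is the answer"]
    print(len(length_filter(noise_filter(samples))))
```

Both filters are cheap string checks, so they can be run on every generated input and every annotated output before finetuning.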

Results

Exact Match (classification avg)

Value: 47.7

Baseline: 33.2 (prompting)

ROUGE-L (generation avg)

Value: 59.4

Baseline: 41.6 (prompting)

Finetuning vs Self-ICL (avg improvement)

Value: ≈20.0

Baseline: Self-ICL (in-context on same synthetic examples)

L1 distance to true label distribution (avg)

Value: 0.75 (SELF-GUIDE)

Baseline: 1.10 (raw model)

Who Should Care

What To Try In 7 Days

Pick a target task and 1–3 good examples from your use case.

Use the model to self-generate ~20–60 inputs per label, conditioning generation on each label and using a higher sampling temperature for this input stage.

Annotate those inputs with the same model at a lower temperature, apply the noise and length filters, then finetune briefly (a few epochs). Compare against the prompting baseline and few-shot finetuning.
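Before the finetuning step, the filtered pairs need to be written in whatever format the training stack expects. The sketch below serializes them as JSON Lines with `prompt` and `completion` fields; those field names and the prompt layout are assumptions for illustration, not a format prescribed by the paper.

```python
import json
from typing import Iterable, Tuple


def to_jsonl(instruction: str, pairs: Iterable[Tuple[str, str]]) -> str:
    """Serialize (input, output) pairs as one JSON object per line."""
    lines = []
    for task_input, task_output in pairs:
        record = {
            # Hypothetical field names; adapt to the finetuning tool in use.
            "prompt": f"{instruction}\nInput: {task_input}\nOutput:",
            "completion": f" {task_output}",
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_jsonl("Classify sentiment.", [("good movie", "positive")]))
```

Using the same prompt layout at finetuning time and at inference time keeps the comparison with the prompting baseline fair.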

Optimization Features

Training Optimization

  • Finetuning on synthetic data
  • Temperature tuning for generation stages

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • All experiments are in English; cross-lingual performance is unknown.
  • Evaluations use a single base model (Vicuna-7b-1.5); behavior on larger or smaller models is untested.
  • Self-generated data quality depends on the base model; poor base models may produce low-quality supervision.
  • Open-source release raises dual-use risks for specialization of harmful capabilities.

When Not To Use

  • When you already have a large, high-quality labeled dataset for the task.
  • For safety-critical tasks that require human-verified labels and traceability.
  • If the base model consistently produces garbage or refuses to answer even in few-shot prompts.

Failure Modes

  • Model learns superficial formatting or label patterns rather than task semantics if synthetic labels are low quality.
  • Synthetic data can be biased or repetitive, causing overfitting to generated artifacts.
  • Filters may fail to catch subtle noise, letting bad examples corrupt finetuning.
  • Limited generalization beyond the evaluated instruction templates or languages.

Core Entities

Models

  • Vicuna-7b-1.5

Metrics

  • Exact Match
  • ROUGE-L
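For readers unfamiliar with the two metrics, here is a simplified, dependency-free sketch. It assumes lowercase, whitespace-tokenized text and the F1 form of ROUGE-L based on the longest common subsequence; published scores typically come from a standard evaluation package with its own tokenization, so values may differ slightly.

```python
from typing import List


def exact_match(pred: str, ref: str) -> bool:
    """True if prediction and reference match after trimming and lowercasing."""
    return pred.strip().lower() == ref.strip().lower()


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(pred: str, ref: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    p, r = pred.lower().split(), ref.lower().split()
    if not p or not r:
        return 0.0
    lcs = lcs_length(p, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match(" Yes ", "yes"))
    print(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"))
```

Reported numbers such as 47.7 or 59.4 correspond to these per-example scores averaged over a task's test set and multiplied by 100.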

Datasets

  • Super-NaturalInstructions V2

Benchmarks

  • Super-NaturalInstructions V2

Context Entities

Models

  • Self-Instruct (related prior work)

Datasets

  • Natural Instructions V2 (referenced)