Have a strong LLM critique and rewrite your instruction data, then retrain on the rewritten data to improve instruction following.

October 18, 2023 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

1

Authors

Ming Li, Lichang Chen, Jiuhai Chen, Shwai He, Heng Huang, Jiuxiang Gu, Tianyi Zhou


Why It Matters For Business

You can raise instruction-following quality without moving to a larger model: spend on oracle-LLM calls to rewrite your existing training data, which is often cheaper and faster than collecting new human-labeled data.

Summary TLDR

Reflection-Tuning uses an oracle LLM (e.g., ChatGPT/GPT-4) to critique and rewrite each instruction-response pair along defined criteria, producing a "recycled" dataset. Models fine-tuned on recycled data (Recycled Alpaca/WizardLM) show large gains on automatic instruction-following benchmarks (AlpacaEval, Open LLM Leaderboard); for example, Recycled Alpaca 7B reaches a 76.99% AlpacaEval win rate versus 26.46% for Alpaca 7B. The recycled data also scores better on data quality metrics (lower perplexity, higher coherence). The method is a data-centric, post-hoc pipeline that requires API access to a strong LLM and modest retraining of base models.
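The critique-and-rewrite loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: the criteria lists, the prompt wording, the `[REWRITE]` marker, and the `oracle` callable (any function that maps a prompt string to a completion string, such as a wrapped API call) are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical criteria; the paper's exact criteria are not reproduced here.
INSTRUCTION_CRITERIA = ["clarity", "complexity", "feasibility"]
RESPONSE_CRITERIA = ["helpfulness", "accuracy", "level of detail"]

# Assumption: the oracle is any prompt -> completion function.
Oracle = Callable[[str], str]


def reflect(oracle: Oracle, kind: str, text: str, context: str,
            criteria: List[str]) -> str:
    """Ask the oracle to critique `text` on `criteria` and return its rewrite."""
    prompt = (
        f"Critique the following {kind} on: {', '.join(criteria)}.\n"
        f"Context: {context}\n"
        f"{kind.capitalize()}: {text}\n"
        f"Then output an improved {kind} after the marker [REWRITE]."
    )
    completion = oracle(prompt)
    # Keep the original text if the oracle ignored the output format.
    if "[REWRITE]" not in completion:
        return text
    rewritten = completion.split("[REWRITE]", 1)[1].strip()
    return rewritten or text


def recycle_pair(oracle: Oracle, pair: Dict[str, str]) -> Dict[str, str]:
    """Rewrite the instruction first, then the response against the new instruction."""
    new_instruction = reflect(
        oracle, "instruction", pair["instruction"], pair["response"],
        INSTRUCTION_CRITERIA,
    )
    new_response = reflect(
        oracle, "response", pair["response"], new_instruction,
        RESPONSE_CRITERIA,
    )
    return {"instruction": new_instruction, "response": new_response}
```

Falling back to the original text when the oracle output cannot be parsed keeps the recycled dataset the same size as the source dataset, which makes before/after comparisons simpler.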

Problem Statement

Instruction-tuning quality depends heavily on the quality of instruction-response pairs. Low-quality or inconsistent examples harm model behavior. The paper asks: can an oracle LLM automatically inspect and improve existing instruction data to make instruction-tuning more effective?

Main Contribution

Reflection-Tuning pipeline that uses an oracle LLM to critique and rewrite both instructions and responses under explicit criteria.

Recycled instruction-response datasets for Alpaca and WizardLM and open release of code, data, and models.

Empirical results showing recycled-data models outperform originals on AlpacaEval and Huggingface Open LLM benchmarks, plus analyses of perplexity, coherence, and instruction difficulty.

Key Findings

Recycled Alpaca 7B beats many open-source 7B models on AlpacaEval.

Numbers: Recycled Alpaca 7B win rate 76.99% vs Alpaca 7B 26.46%

Recycled WizardLM 7B is top among 7B open models on AlpacaEval.

Numbers: Recycled WizardLM 7B win rate 78.88% on AlpacaEval

Recycled data improves instruction–response alignment and lowers model surprisal.

Numbers: Alpaca instruction perplexity 34.3→13.6; response perplexity with context 49.2→2.9; coherence 0.53→0.67
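For readers who want to reproduce statistics of this kind on their own data, the two quantities can be computed as sketched below. Perplexity is the standard exponential of the mean negative log-likelihood per token. Treating coherence as the cosine similarity between Sentence-BERT embeddings of the instruction and the response is an assumption here; the embeddings and the per-token log-probabilities are expected to come from your own models and are passed in as plain lists.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp(mean negative log-likelihood), given natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("zero-norm vector has no direction")
    return dot / norm
```

"Response perplexity with context" would then use only the log-probabilities of the response tokens, scored while the instruction is present in the model's input.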

Results

AlpacaEval win rate

Value: Recycled WizardLM 7B: 78.88%

Baseline: WizardLM (original)

AlpacaEval win rate

Value: Recycled Alpaca 7B: 76.99%

Baseline: Alpaca 7B: 26.46%

Vicuna test set win rate

Value: Recycled Alpaca 7B: 88.75%; Recycled WizardLM 7B: 81.25%

Baseline: Original models (same size/data count)

Huggingface Open LLM avg score

Value: Recycled Alpaca 7B: 56.18; Recycled WizardLM 7B: 56.21

Baseline: Alpaca 7B: 50.21

Perplexity and coherence (data stats)

Value: Alpaca instruction perplexity 34.3→13.6; response perplexity with context 49.2→2.9; coherence 0.53→0.67

Baseline: Original Alpaca

Who Should Care

What To Try In 7 Days

Pick a 1k–5k subset of your instruction data and run the reflection pipeline with ChatGPT/GPT-4.

Measure perplexity (scored with your base model) and instruction-response coherence (scored with Sentence-BERT embeddings) before and after rewriting.

Fine-tune a 7B model for a few epochs on the original subset and on the recycled subset, then compare the two via AlpacaEval or pairwise GPT-4 judgments.
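Two small helpers cover the bookkeeping for this experiment: drawing a reproducible subset and turning pairwise judgments into a win rate. This is a sketch under stated assumptions: the function names are hypothetical, the judgment labels `"win"`, `"tie"`, and `"lose"` are a convention chosen for the example, and counting a tie as half a win is an assumption about scoring rather than something the paper specifies.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_subset(data: Sequence[T], k: int, seed: int = 0) -> List[T]:
    """Draw up to k examples without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(list(data), min(k, len(data)))


def win_rate(judgments: Sequence[str]) -> float:
    """Fraction of comparisons won; ties count as half a win (assumption)."""
    if not judgments:
        raise ValueError("need at least one judgment")
    score = {"win": 1.0, "tie": 0.5, "lose": 0.0}
    return sum(score[j] for j in judgments) / len(judgments)
```

When a GPT-4-style judge is used, it is common to run each comparison twice with the answer order swapped, since judges can favor whichever answer appears first.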

Optimization Features

Training Optimization

  • Data-efficient Training

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Relies on a strong oracle LLM; improvements depend on that model's preferences and biases.
  • Evaluation uses GPT-4/ChatGPT as judge, so reported gains may reflect judge-specific tastes rather than human preferences.
  • Pipeline increases average response length dramatically, which may be undesirable for some applications.

When Not To Use

  • If you lack access or budget for reliable oracle-LM API calls.
  • When you require diverse human voice or highly domain-specific labeling that an oracle may not replicate.
  • If short, concise responses are a hard product constraint (recycled responses tend to be longer).

Failure Modes

  • Overfitting to judge style: model may align to oracle/LLM preferences rather than real users.
  • Loss of diversity: rewriting can make data more uniform and reduce edge-case behaviors.
  • Garbage-in garbage-out: if the oracle LLM is poor on domain content, recycled data will be weak.
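Two of these risks, longer responses and reduced diversity, can be monitored with cheap corpus statistics computed on the original and recycled data. The sketch below uses whitespace tokenization and the distinct-n ratio (unique n-grams divided by total n-grams); both are choices made for this example, not metrics reported in the source.

```python
from typing import List, Sequence, Tuple


def average_length(responses: Sequence[str]) -> float:
    """Mean number of whitespace-separated words per response."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(len(r.split()) for r in responses) / len(responses)


def distinct_n(responses: Sequence[str], n: int = 2) -> float:
    """Unique n-grams divided by total n-grams across all responses."""
    ngrams: List[Tuple[str, ...]] = []
    for response in responses:
        words = response.split()
        ngrams.extend(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)
```

A large jump in `average_length` or a clear drop in `distinct_n` after rewriting is a signal to inspect samples by hand before retraining.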

Core Entities

Models

  • Llama2-7b
  • Recycled Alpaca 7B
  • Recycled WizardLM 7B
  • Recycled Alpaca 13B

Metrics

  • Win rate
  • Average score
  • Perplexity
  • Coherence (Sentence-BERT)
  • IFD score (Instruction-Following Difficulty)
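The IFD score is only named in this list. As a rough guide, and assuming the definition from earlier work by some of the same authors (the ratio of the model's average cross-entropy on the response when conditioned on the instruction to its average cross-entropy on the response alone), it could be computed from per-token log-probabilities like this. The function names are hypothetical.

```python
from typing import Sequence


def mean_nll(token_logprobs: Sequence[float]) -> float:
    """Average negative log-likelihood per token (natural-log probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


def ifd_score(logprobs_given_instruction: Sequence[float],
              logprobs_response_alone: Sequence[float]) -> float:
    """Assumed definition: loss(response | instruction) / loss(response)."""
    conditioned = mean_nll(logprobs_given_instruction)
    unconditioned = mean_nll(logprobs_response_alone)
    if unconditioned == 0.0:
        raise ValueError("unconditioned loss is zero; ratio is undefined")
    return conditioned / unconditioned
```

Under this definition, a value close to 1 means the instruction does little to help the model predict the response, while a lower value means the instruction makes the response easier to predict.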

Datasets

  • Alpaca (52k)
  • WizardLM (70k subset)
  • AlpacaEval
  • Huggingface Open LLM Leaderboard

Benchmarks

  • AlpacaEval
  • Huggingface Open LLM Leaderboard
  • ARC
  • HellaSwag
  • MMLU
  • TruthfulQA
  • Vicuna test set

Context Entities

Models

  • GPT-4
  • ChatGPT
  • Xwin-LM
  • Vicuna 7B

Metrics

  • Win rate
  • Standard error
  • Average response length

Datasets

  • AlpacaFarm (evaluation)
  • Davinci003 responses (benchmarks)

Benchmarks

  • AlpacaEval leaderboard
  • Huggingface Open LLM Leaderboard