SPIN: let a supervised-finetuned LLM play against itself to improve without new human labels

January 2, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

11

Authors

Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, Quanquan Gu

Links

Abstract / PDF

Why It Matters For Business

SPIN can raise model quality using only existing supervised labels. It removes the cost of collecting preference labels, but it needs extra compute to generate synthetic data.

Summary TLDR

SPIN is a simple iterative fine-tuning recipe that lets a supervised-finetuned model generate synthetic responses and then fine-tunes itself to prefer human responses over its own. Starting from zephyr-7b-sft-full, SPIN used only subsets of the existing Ultrachat200k SFT data plus self-generated replies and raised leaderboard scores (Open LLM Leaderboard average 58.14 → 63.16) and MT-Bench (5.94 → 6.78) without extra human preferences. SPIN is computationally heavier (adds synthetic-data generation) but reduces the need to collect preference labels.

Problem Statement

How to turn a weak but supervised-finetuned LLM into a stronger model without collecting any new human preference labels or using stronger LLMs as teachers?

Main Contribution

A self-play fine-tuning method (SPIN) where the model generates synthetic responses and trains a next-iteration model to prefer human responses over the previous iteration's responses.

A theoretical proof showing the method's objective is minimized only when the model distribution matches the target (human) data distribution.

Empirical results on Open LLM Leaderboard, MT-Bench and BigBench showing SPIN improves a SFT checkpoint and matches or exceeds DPO that used extra GPT-4 preference data.
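The training objective behind the first two contributions can be written compactly. The notation here is an assumption for illustration, not something fixed by this summary: \(p_{\theta_t}\) is the previous-iteration model, \(p_{\mathrm{data}}\) the human SFT distribution, \(q\) the prompt distribution, \(\lambda > 0\) the regularization strength, and \(\ell\) the logistic loss.

```latex
\[
L_{\mathrm{SPIN}}(\theta;\theta_t)
  = \mathbb{E}_{x \sim q,\; y \sim p_{\mathrm{data}}(\cdot \mid x),\; y' \sim p_{\theta_t}(\cdot \mid x)}
    \left[
      \ell\!\left(
        \lambda \log \frac{p_{\theta}(y \mid x)}{p_{\theta_t}(y \mid x)}
        - \lambda \log \frac{p_{\theta}(y' \mid x)}{p_{\theta_t}(y' \mid x)}
      \right)
    \right],
\qquad
\ell(t) = \log\!\left(1 + e^{-t}\right).
\]
```

Minimizing this pushes the new model to assign relatively more probability to the human response \(y\) than to the previous model's own response \(y'\). When \(p_{\theta_t} = p_{\mathrm{data}}\), the two responses are statistically indistinguishable and there is nothing left to gain, which is the fixed point the theoretical result describes.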

Key Findings

SPIN raises average Open LLM Leaderboard score starting from a SFT checkpoint.

Numbers: 58.14 → 63.16 average (Open LLM Leaderboard)

Large gains on specific tasks: math and truthfulness improved strongly.

Numbers: GSM8k 26.76 → 38.97; TruthfulQA 43.73 → 54.90 (absolute points)

SPIN comes close to DPO trained on ~62k GPT-4 preferences after iteration 0 and surpasses it after iteration 1.

Numbers: DPO average 61.31 vs SPIN iter0 60.80, SPIN iter1 62.12

MT-Bench improved measurably with SPIN.

Numbers: 5.94 → 6.78 (MT-Bench average)

Results

Open LLM Leaderboard average

Value: 63.16

Baseline: 58.14 (zephyr-7b-sft-full)

GSM8k accuracy

Value: 38.97

Baseline: 26.76

TruthfulQA score

Value: 54.90

Baseline: 43.73

MT-Bench average

Value: 6.78

Baseline: 5.94

Comparison with DPO (Open LLM Leaderboard average)

Value: 62.12 (SPIN iter1)

Baseline: 61.31 (zephyr-7b-dpo-full)
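To put the reported scores on a common footing, the short sketch below computes absolute and relative gains from the baseline/value pairs listed above. The numbers are copied from this page; the helper name `gains` and the dictionary layout are illustrative choices only.

```python
# Scores copied from the results above, stored as (baseline, after SPIN).
results = {
    "Open LLM Leaderboard average": (58.14, 63.16),
    "GSM8k": (26.76, 38.97),
    "TruthfulQA": (43.73, 54.90),
    "MT-Bench average": (5.94, 6.78),
}


def gains(baseline, value):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = round(value - baseline, 2)
    relative_pct = round(100.0 * (value - baseline) / baseline, 1)
    return absolute, relative_pct


summary = {name: gains(b, v) for name, (b, v) in results.items()}

for name, (abs_gain, rel_gain) in summary.items():
    print(f"{name}: +{abs_gain} points ({rel_gain}% relative)")
```

Relative gains are largest on GSM8k, where the baseline was lowest, so absolute and relative views are worth reading together.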

Who Should Care

What To Try In 7 Days

Run SPIN iter0: sample 50k SFT prompts, generate replies, fine-tune the SFT checkpoint for 1–2 epochs and compare benchmarks.

Compare SPIN iter0 against continued SFT epochs on the same data to check whether SPIN breaks the SFT plateau.

If your preference-labeling budget is small, run SPIN first and add DPO on top to see whether the combined gains beat direct preference collection.
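The data-preparation step for SPIN iter0 can be sketched as below: sample a subset of SFT examples, keep the human reply as the preferred response, and pair it with the current model's own reply as the dispreferred one. This is a minimal sketch under stated assumptions: the field names (`prompt`, `response`, `chosen`, `rejected`), the helper `build_spin_pairs`, and the stand-in generator `dummy_generate` are all hypothetical, and a real run would call the SFT checkpoint's generation routine instead.

```python
import random


def build_spin_pairs(sft_examples, generate_fn, sample_size, seed=0):
    """Pair each sampled human reply with a model-generated reply.

    sft_examples: list of dicts with "prompt" and "response" keys (assumed layout).
    generate_fn:  callable mapping a prompt string to the current model's reply.
    """
    rng = random.Random(seed)
    subset = rng.sample(sft_examples, min(sample_size, len(sft_examples)))
    pairs = []
    for example in subset:
        pairs.append(
            {
                "prompt": example["prompt"],
                "chosen": example["response"],              # human-written SFT reply
                "rejected": generate_fn(example["prompt"]),  # model's own reply
            }
        )
    return pairs


def dummy_generate(prompt):
    # Hypothetical stand-in for sampling from the SFT checkpoint.
    return "model reply to: " + prompt


toy_sft = [
    {"prompt": f"question {i}", "response": f"human answer {i}"} for i in range(10)
]
pairs = build_spin_pairs(toy_sft, dummy_generate, sample_size=4)
print(pairs[0])
```

The resulting records have the same shape as a preference dataset, which is why a DPO-style trainer is a natural fit for the fine-tuning step; no human ranking is involved in producing them.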

Optimization Features

System Optimization

  • Uses DeepSpeed ZeRO-3 and FlashAttention-2 to lower training cost

Training Optimization

  • Iterative fine-tuning
  • Self-generated synthetic data
  • KL regularization to stabilize updates
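A per-example version of the loss that these three ingredients feed into can be sketched as follows. It assumes sequence-level log-probabilities have already been computed under the new model and under the frozen previous-iteration model; the function name `spin_loss` and the default `lam=0.1` are illustrative assumptions, with `lam` playing the role of the KL regularization strength.

```python
import math


def spin_loss(logp_new_human, logp_old_human, logp_new_synth, logp_old_synth, lam=0.1):
    """Logistic loss on the gap between two log-probability ratios.

    The margin grows when the new model raises the human reply's probability
    (relative to the old model) more than it raises the synthetic reply's.
    """
    margin = lam * (
        (logp_new_human - logp_old_human) - (logp_new_synth - logp_old_synth)
    )
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# At the start of an iteration the new model equals the old one, so the margin is 0.
start = spin_loss(-10.0, -10.0, -12.0, -12.0)

# After an update that favors the human reply and disfavors the synthetic one.
improved = spin_loss(-8.0, -10.0, -14.0, -12.0)

print(start, improved)
```

A smaller `lam` flattens the loss and keeps each iteration closer to the previous model, which is the stabilizing effect attributed to the KL term.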

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • SPIN learns toward a fixed human target distribution, which sets an upper performance ceiling.
  • Requires extra compute and wall-clock time to generate synthetic data each iteration (reported ~1.45h generation + 4–8.6h training per iteration for 50k examples on 8×A100).
  • Synthetic data comes from the model itself and can propagate existing biases or errors.
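The compute figures in the second limitation can be turned into a rough GPU-hour budget. The sketch below assumes the reported hours are wall-clock times during which all 8 GPUs are busy for both generation and training; that is an assumption about how the figures were measured, and the helper name is illustrative.

```python
NUM_GPUS = 8                     # 8×A100, as reported above
GEN_HOURS = 1.45                 # synthetic-data generation per iteration
TRAIN_HOURS_RANGE = (4.0, 8.6)   # reported training time range per iteration


def gpu_hours_per_iteration(gen_hours, train_hours, num_gpus):
    """Wall-clock hours times GPU count, assuming every GPU is busy throughout."""
    return (gen_hours + train_hours) * num_gpus


low = gpu_hours_per_iteration(GEN_HOURS, TRAIN_HOURS_RANGE[0], NUM_GPUS)
high = gpu_hours_per_iteration(GEN_HOURS, TRAIN_HOURS_RANGE[1], NUM_GPUS)

print(f"Per iteration: about {low:.1f} to {high:.1f} A100 GPU-hours")
```

Multiplying by the number of planned iterations and by a local GPU-hour price gives a first-order budget to weigh against the cost of collecting preference labels.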

When Not To Use

  • When you already have plentiful, high-quality human preference data; direct preference methods may be more efficient.
  • When the desired target behavior changes dynamically; SPIN assumes a fixed target distribution.
  • When compute budget is too tight to generate and train on synthetic examples.

Failure Modes

  • Convergence to the human SFT distribution limits further gains (ceiling effect).
  • Model may amplify its own mistakes if the SFT data or early synthetic outputs are low quality.
  • Excessive update steps without KL regularization could destabilize training (the authors include a KL term to avoid this).
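The ceiling effect in the first failure mode can be illustrated on a toy problem with a single prompt and two possible replies, A and B. The sketch below evaluates the expected self-play loss exactly and estimates its slope at the previous model by finite differences. Everything here is an illustrative assumption (the two-reply setup, the probabilities, `lam=1.0`, and the function names); it is not the authors' experiment.

```python
import math


def logistic_loss(t):
    return math.log1p(math.exp(-t))


def expected_spin_loss(p_a, q_a, data_a, lam=1.0):
    """Exact expected loss when replies are A (index 0) or B (index 1).

    p_a:    probability of A under the new model
    q_a:    probability of A under the previous-iteration model
    data_a: probability of A under the human data distribution
    """
    p, q, d = (p_a, 1 - p_a), (q_a, 1 - q_a), (data_a, 1 - data_a)
    total = 0.0
    for y in range(2):            # human reply
        for y_s in range(2):      # synthetic reply from the previous model
            margin = lam * (
                (math.log(p[y]) - math.log(q[y]))
                - (math.log(p[y_s]) - math.log(q[y_s]))
            )
            total += d[y] * q[y_s] * logistic_loss(margin)
    return total


def slope_at_old_model(q_a, data_a, eps=1e-5):
    """Central finite-difference slope of the loss in p_a, evaluated at p_a = q_a."""
    up = expected_spin_loss(q_a + eps, q_a, data_a)
    down = expected_spin_loss(q_a - eps, q_a, data_a)
    return (up - down) / (2 * eps)


slope_mismatch = slope_at_old_model(q_a=0.4, data_a=0.7)  # model differs from data
slope_matched = slope_at_old_model(q_a=0.7, data_a=0.7)   # model already matches data

print(slope_mismatch, slope_matched)
```

When the previous model differs from the data distribution, the slope is clearly nonzero, so an update can still reduce the loss. Once the model matches the data, the slope vanishes and further iterations have no training signal, which is the ceiling described above.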

Core Entities

Models

  • zephyr-7b-sft-full (SFT checkpoint)
  • Mistral-7B
  • zephyr-7b-dpo-full
  • vicuna-13b-v1.5

Metrics

  • Average leaderboard score (Open LLM Leaderboard)
  • GSM8k accuracy
  • TruthfulQA score
  • MT-Bench average

Datasets

  • Ultrachat200k (HuggingFaceH4/ultrachat_200k)
  • UltraFeedback Binarized (ultrafeedback_binarized, ~62k prefs)
  • Open LLM Leaderboard (suite)
  • MT-Bench
  • GSM8k
  • TruthfulQA
  • Big-Bench-Hard subsets

Benchmarks

  • Open LLM Leaderboard
  • MT-Bench
  • Big-Bench-Hard