TPO: one-step triple-preference finetuning that improves both instruction following and reasoning

May 26, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

1

Authors

Amir Saeidi, Shivanshu Verma, Aswin RRV, Kashif Rasul, Chitta Baral


Why It Matters For Business

TPO reduces the need for large supervised fine-tuning datasets and is more robust to noisy preference labels, so you can align models faster and cheaper while keeping or improving reasoning and chat quality.

Summary TLDR

This paper introduces Triple Preference Optimization (TPO), a single-step preference learning method that trains a policy model from three ranked responses per prompt (gold, preferred, rejected). Compared to DPO and recent variants, TPO and its length-controlled variant TPO-L improve both instruction following and reasoning on standard benchmarks in Llama-3 and Mistral setups. Key wins: large accuracy gains on GSM8K and MMLU-Pro, better robustness to noisy judgments, and competitive performance with less SFT data. TPO-L adds a reward margin to control verbosity.

Problem Statement

Direct Preference Optimization (DPO) and its variants can improve instruction following but often hurt reasoning, need multi-step training or an extra reference model, and are sensitive to noisy preference labels and to dataset size. The paper aims to fix these problems with a single-step, reference-free method that uses three ranked responses per prompt.

Main Contribution

Triple Preference Optimization (TPO): a single-step, reference-free objective that combines behavioral cloning on a gold response with preference pushes/pulls using preferred and rejected responses.

TPO-L: a length-controlled variant that scores responses by average (length-normalized) log-likelihood and adds a reward margin γ to reduce verbosity.

Extensive empirical evaluation showing TPO/TPO-L outperform DPO and variants on instruction-following and reasoning tasks, while being more robust to noisy judgments and needing less SFT data.

Theoretical link: derive TPO from maximum-entropy RL and argue log π(y|x) is an effective implicit reward.
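The objective described above can be sketched in a few lines. This is a minimal illustration, not the paper's reference implementation: it assumes the loss is a log-sigmoid preference term on the gap between log π(y_preferred|x) and log π(y_rejected|x), plus a behavioral-cloning term on log π(y_gold|x). The function names, the weights `beta` and `alpha`, their default values, and the choice to leave the gold term un-normalized in the TPO-L variant are all assumptions made for the example.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def tpo_loss(logp_gold, logp_pref, logp_rej, beta=1.0, alpha=1.0):
    """Sketch of a reference-free triple-preference loss for one prompt.

    Inputs are summed token log-probabilities log pi(y|x) under the policy.
    beta and alpha are hypothetical weights, not values from the paper.
    """
    # Push the preferred response above the rejected one (no reference model).
    preference_term = -log_sigmoid(beta * (logp_pref - logp_rej))
    # Behavioral cloning: raise the likelihood of the gold response.
    cloning_term = -alpha * logp_gold
    return preference_term + cloning_term


def tpo_l_loss(logp_gold, logp_pref, logp_rej, len_pref, len_rej,
               beta=1.0, alpha=1.0, gamma=0.5):
    """Length-controlled sketch: average log-likelihoods plus a margin gamma."""
    # Dividing by response length removes the incentive to win by being long.
    gap = beta * (logp_pref / len_pref - logp_rej / len_rej) - gamma
    return -log_sigmoid(gap) - alpha * logp_gold
```

With `alpha=0` the sketch reduces to a pure preference loss; raising `gamma` demands a larger gap between the preferred and rejected responses before the preference term is satisfied.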

Key Findings

TPO yields large gains over DPO on reasoning tasks in small-data settings

Numbers: GSM8K +19.0 pts (5k Base); MMLU-Pro +10.4 pts (5k Base)

TPO improves instruction-following and reasoning simultaneously

Numbers: Arena-Hard +7.0 pts vs DPO; MixEval-Hard +12.2 pts (claimed max gains)

TPO is more robust to judgment noise than DPO

Numbers: DPO checkpoints collapse to near-zero on some tasks under 100% swapped noise; TPO retains non-zero performance

TPO can match or beat DPO while using less SFT data

Numbers: TPO (10k preference data, no extra SFT) GSM8K = 52.2 vs DPO (10k SFT + 10k preference data) GSM8K = 36.7

Using log π(y|x) (sequence likelihood) as implicit reward improves reward modeling over DPO's KL-style term

Numbers: TPO shows higher reward accuracy and a broader reward margin across data sizes
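To make the last finding concrete, the sketch below shows one common way to compute reward accuracy and reward margin when log π(y|x) is used as the implicit reward. The definitions are assumptions for illustration (accuracy is the share of pairs where the preferred response gets the higher reward, margin is the mean reward gap); the function name `reward_stats` and the scale `beta` are hypothetical, and the source does not state the exact formulas used.

```python
def reward_stats(pairs, beta=1.0):
    """Reward accuracy and mean reward margin for (preferred, rejected) pairs.

    pairs: list of (logp_preferred, logp_rejected) sequence log-likelihoods.
    The implicit reward is assumed to be beta * log pi(y|x).
    """
    if not pairs:
        raise ValueError("need at least one pair")
    margins = [beta * (lp_pref - lp_rej) for lp_pref, lp_rej in pairs]
    # A pair counts as correct when the preferred response scores higher.
    accuracy = sum(m > 0 for m in margins) / len(margins)
    mean_margin = sum(margins) / len(margins)
    return accuracy, mean_margin
```

For example, `reward_stats([(-1.0, -2.0), (-3.0, -2.0), (-1.0, -4.0)])` ranks two of the three pairs correctly and has a mean margin of 1.0.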

Results

Accuracy (GSM8K, 5k Base)

Value: TPO 51.9%

Baseline: DPO 32.9%

Accuracy (MMLU-Pro, 5k Base)

Value: TPO 37.5%

Baseline: DPO 27.1%

Arena-Hard win rate (5k Base)

Value: TPO 52.1%

Baseline: DPO <0.5%

Robustness to judgment noise (40k Base)

Value: TPO performance degrades but remains non-zero at 100% swapped noise

Baseline: DPO collapsed to near-zero on some benchmarks at 100% swapped noise
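A robustness check of this kind can be reproduced on your own data by corrupting a chosen fraction of preference labels. The sketch below assumes that "swapped noise" means exchanging the preferred and rejected responses while leaving the gold response untouched; that reading, the dictionary keys, and the function name `swap_noise` are assumptions made for illustration.

```python
import random


def swap_noise(triples, noise_rate, seed=0):
    """Return a copy of the triples with preferred/rejected swapped at random.

    triples: list of dicts with "gold", "preferred" and "rejected" keys.
    noise_rate: probability in [0, 1] that a given triple is corrupted.
    """
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be in [0, 1]")
    rng = random.Random(seed)  # fixed seed keeps the corruption reproducible
    noisy = []
    for triple in triples:
        item = dict(triple)
        if rng.random() < noise_rate:
            # Flip the judgment; the gold response is assumed to stay clean.
            item["preferred"], item["rejected"] = item["rejected"], item["preferred"]
        noisy.append(item)
    return noisy
```

Training once per noise level (for example 0.0, 0.5 and 1.0) and plotting benchmark scores against the rate gives a simple degradation curve for each method.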

What To Try In 7 Days

Generate 3 responses per prompt from your current SFT or instruction model and rank them to form (gold, preferred, rejected) triples.

Run TPO fine-tuning on a small subset (5k–10k triples) and compare GSM8K and a preferred chat benchmark to your DPO baseline.

If verbosity is a problem, run TPO-L and sweep the reward margin γ (e.g., 0.5, 3, 5.4) to balance length vs quality on a validation set.
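The first step above can be sketched as follows. This is a minimal illustration that assumes each candidate response already carries a numeric quality score from a judge or reward model; mapping the best score to gold, the second best to preferred, and the worst to rejected is one plausible reading of the ranking, and the function name `build_triple` and the dictionary keys are hypothetical.

```python
def build_triple(prompt, scored_responses):
    """Form a (gold, preferred, rejected) triple from scored candidates.

    scored_responses: list of (response_text, score) with at least 3 entries.
    Assumed mapping: best -> gold, second best -> preferred, worst -> rejected.
    """
    if len(scored_responses) < 3:
        raise ValueError("need at least three scored responses per prompt")
    ranked = sorted(scored_responses, key=lambda pair: pair[1], reverse=True)
    return {
        "prompt": prompt,
        "gold": ranked[0][0],
        "preferred": ranked[1][0],
        "rejected": ranked[-1][0],
    }


example = build_triple(
    "What is 12 * 7?",
    [("It is 74.", 0.1), ("12 * 7 = 84.", 0.9), ("84", 0.6)],
)
```

Here `example["gold"]` is the highest-scored answer and `example["rejected"]` the lowest. Prompts whose top two scores are nearly tied may be worth filtering out, since the Failure Modes section notes that a gold response with no real distinction from the preferred one reduces the benefit.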

Optimization Features

Token Efficiency

  • length control via TPO-L

Training Optimization

  • single-step preference optimization
  • behavioral cloning combined with preference objective

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • TPO assumes availability of three ranked responses per prompt; producing a clean gold response set can be costly.
  • TPO-L requires tuning the reward margin γ; wrong settings can degrade instruction or reasoning performance.
  • The approach was evaluated offline; online/iterative training dynamics are not tested here.
  • Safety and honesty impacts were not deeply explored; multi-objective safety trade-offs remain open.

When Not To Use

  • You cannot produce reliable gold/preferred/rejected triplets for prompts.
  • Your pipeline needs an online RLHF loop rather than offline single-step finetuning.
  • You need strict, proven safety guarantees (TPO's safety effects are not fully studied).

Failure Modes

  • If judged preferences are systematically biased, TPO still degrades, though less catastrophically than DPO.
  • Excessive reward margin γ causes verbosity and lower instruction performance on some benchmarks.
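  • Using the same response as both gold and preferred (no real distinction) reduces the benefit and can reproduce the conflict seen in CPO, where one response serves as both the cloning target and the preference winner.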

Core Entities

Models

  • Llama-3-8B
  • Llama-3-8B-Instruct
  • Mistral-7B-v0.3
  • Mistral-7B-Instruct

Metrics

  • Accuracy
  • Win rate
  • Average likelihood (log π(y|x))

Datasets

  • UltraFeedback
  • UltraFeedback-ArmoRM
  • Mistral-UltraFeedback-PairRM
  • GSM8K
  • MMLU-Pro
  • MMLU
  • Arena-Hard
  • MT-Bench
  • MixEval-Hard

Benchmarks

  • GSM8K
  • MMLU-Pro
  • MMLU
  • Arena-Hard
  • MT-Bench
  • MixEval-Hard
  • HellaSwag
  • ARC
  • TruthfulQA
  • Winogrande