Practical comparison of DPO, KTO, IPO and CPO: KTO often wins, small preference sets suffice, instruction tuning helps truthfulness

April 23, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The study runs many standard benchmarks across three realistic scenarios, so conclusions about relative algorithm behavior are moderately reliable for similar model sizes and datasets.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Amir Saeidi, Shivanshu Verma, Md Nayem Uddin, Chitta Baral

Why It Matters For Business

You can improve math and truthfulness of mid-sized chat models with preference tuning; use KTO and start with a small (1k–10k) curated preference set to save labeling cost.

Summary TLDR

This paper compares four RL-free alignment algorithms (DPO, KTO, IPO, CPO) across 13 public benchmarks and three fine-tuning scenarios: (1) SFT then align, (2) align directly from the pretrained model, and (3) align from an instruction-tuned base. Main takeaways: KTO usually gives the best aggregate gains, alignment methods improve math and truthfulness more than general reasoning, many methods work well with small preference datasets (1k–10k), and instruction-tuned bases improve truthfulness most. The study uses Mistral variants, UltraChat/UltraFeedback preference data, and MT-Bench / academic benchmarks for evaluation.
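For orientation, the four objectives can be sketched on scalar sequence log-probabilities. This is a minimal sketch based on the published loss definitions, not the study's training code. The function names, the default `beta` and `tau` values, and the `kl_ref` argument (a stand-in for KTO's batch-level KL estimate) are assumptions made for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid(x: float) -> float:
    return math.exp(log_sigmoid(x))


def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    # pol_* / ref_*: log-probs of the chosen (w) and rejected (l) responses
    # under the policy and the frozen reference model.
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    return -log_sigmoid(beta * margin)


def ipo_loss(pol_w, pol_l, ref_w, ref_l, tau=0.1):
    # Squared distance between the log-ratio margin and the target 1 / (2 * tau).
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    return (margin - 1.0 / (2.0 * tau)) ** 2


def cpo_loss(pol_w, pol_l, beta=0.1):
    # No reference model: preference term on policy log-probs
    # plus a negative log-likelihood term on the chosen response.
    return -log_sigmoid(beta * (pol_w - pol_l)) - pol_w


def kto_loss(pol, ref, desirable, kl_ref=0.0, beta=0.1, lambda_d=1.0, lambda_u=1.0):
    # Works on single (unpaired) responses labeled desirable or undesirable.
    # kl_ref is an assumed placeholder for the KL reference point.
    reward = pol - ref
    if desirable:
        return lambda_d * (1.0 - sigmoid(beta * (reward - kl_ref)))
    return lambda_u * (1.0 - sigmoid(beta * (kl_ref - reward)))
```

The structural difference worth noticing is in the inputs: DPO, IPO and CPO need a chosen/rejected pair per prompt, while the KTO sketch scores one labeled response at a time.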

Problem Statement

Alignment methods that learn from human preferences are widely used, but questions remain: do RL-free methods need an SFT step, how do different algorithms compare fairly across tasks, how much preference data is required, and where do these methods help or hurt performance?

Main Contribution

A broad empirical comparison of DPO, KTO, IPO and CPO across 13 benchmarks and three fine-tuning scenarios.

Finding that KTO consistently outperforms other RL-free alignment methods in aggregate, with especially strong gains on math problems.

Key Findings

KTO gives the strongest overall gains across tasks and scenarios.

Numbers: Mistral+KTO GSM8K 42.15 vs Mistral+SFT 26.76 (scenario 2)

Practical Use: Try KTO first when optimizing small LLMs on preference data; it often yields the biggest gain, especially on math problems.

Evidence Ref: Table 9 (scenario 2 GSM8K)

Alignment methods boost math problem solving more than general reasoning.

Numbers: GSM8K improved from 26.76 (SFT) to 42.15 (Mistral+KTO) in scenario 2

Practical Use: If your main goal is better math or quantitative QA, preference-based alignment (KTO/DPO/CPO) is worth the effort.

Evidence Ref: Table 9 (GSM8K)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
GSM8K (math) | Mistral+KTO 42.15 (scenario 2) | Mistral+SFT 26.76 | +15.39 | GSM8K | Table 9 (scenario 2 GSM8K) | Table 9
TruthfulQA | Mistral+KTO 52.98 (scenario 2) | Mistral+SFT 43.73 | +9.25 | TruthfulQA | Table 9 (scenario 2 TruthfulQA) | Table 9
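The deltas in the rows above are plain differences against the SFT baseline. As a quick check, the short sketch below recomputes them and adds the relative gain, which the source does not report. The helper name `deltas` is made up for this example.

```python
# (metric, aligned score, SFT baseline) copied from the results rows above
rows = [
    ("GSM8K", 42.15, 26.76),
    ("TruthfulQA", 52.98, 43.73),
]


def deltas(aligned, baseline):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = round(aligned - baseline, 2)
    relative_pct = round(100.0 * (aligned - baseline) / baseline, 1)
    return absolute, relative_pct


for name, aligned, baseline in rows:
    absolute, relative_pct = deltas(aligned, baseline)
    print(f"{name}: +{absolute:.2f} points ({relative_pct:.1f}% relative)")
# GSM8K: +15.39 points (57.5% relative)
# TruthfulQA: +9.25 points (21.2% relative)
```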

What To Try In 7 Days

Run KTO on a 2k–5k preference subset after your SFT baseline and compare GSM8K/TruthfulQA performance.

If truthfulness matters, start from an instruction-tuned base before preference tuning.

Score MT-Bench with GPT-4 as the judge, or have humans rate a small subset, to catch dialogue regressions early.
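The first step above needs a reproducible small subset, and KTO-style training consumes single labeled responses rather than pairs. The sketch below shows one way to prepare both. The record fields (`prompt`, `chosen`, `rejected`, `completion`, `label`) and both helper names are assumptions for illustration; adapt them to whatever schema your dataset and trainer use.

```python
import random


def sample_preferences(pairs, k, seed=0):
    """Return a reproducible random subset of k preference pairs."""
    if k > len(pairs):
        raise ValueError("k exceeds dataset size")
    return random.Random(seed).sample(pairs, k)


def to_unpaired(pairs):
    """Split each (prompt, chosen, rejected) pair into two labeled examples."""
    examples = []
    for pair in pairs:
        examples.append(
            {"prompt": pair["prompt"], "completion": pair["chosen"], "label": True}
        )
        examples.append(
            {"prompt": pair["prompt"], "completion": pair["rejected"], "label": False}
        )
    return examples
```

Fixing the seed keeps the subset identical across runs, so a later comparison between algorithms is not confounded by different samples.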

Optimization Features

Training Optimization

Recommend a small preference set (1k–10k).

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Experiments focus on Mistral 7B variants; results may not transfer to much larger or different architectures.

Preference datasets and many benchmarks are specific to chat tasks (UltraChat/UltraFeedback), limiting generality to other domains.

When Not To Use

If you need broad reasoning gains across many tasks: alignment methods gave limited improvements in general reasoning.

If you cannot afford human preference curation: preference datasets remain costly to prepare.

Failure Modes

Overfitting when using large preference sets after SFT; performance often drops versus small subsets.

Domain regressions: gains in one domain (math/truthfulness) can reduce ability in others (dialogue or specific MT-Bench slices).
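Both failure modes show up as per-benchmark score drops, so a simple guard is to compare every benchmark before and after alignment instead of looking only at the aggregate. The sketch below is one minimal way to do that. `find_regressions` and its `tolerance` parameter are hypothetical names, and the example uses only the two scenario-2 score pairs reported above.

```python
def find_regressions(baseline, aligned, tolerance=0.0):
    """Return {benchmark: delta} where the aligned score fell by more than tolerance."""
    regressions = {}
    for name, base_score in baseline.items():
        if name not in aligned:
            continue  # benchmark not re-run after alignment
        delta = aligned[name] - base_score
        if delta < -tolerance:
            regressions[name] = round(delta, 2)
    return regressions


# Scenario-2 scores reported above; both improved, so nothing is flagged.
baseline = {"GSM8K": 26.76, "TruthfulQA": 43.73}
aligned = {"GSM8K": 42.15, "TruthfulQA": 52.98}
print(find_regressions(baseline, aligned))  # {}
```

In practice you would add your own MT-Bench categories and domain benchmarks to both dictionaries, and set `tolerance` to roughly the run-to-run noise you observe.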

Core Entities

Models

DPO, KTO, IPO, CPO, Mistral-7B-v0.1, Mistral-Instruct-7B-v0.2, SFT

Metrics

Accuracy, TruthfulQA score, MT-Bench GPT-4 score (0-10)

Datasets

UltraFeedback-binarized, UltraChat, GSM8K, TruthfulQA, MMLU, MT-Bench, Big Bench, Open LLM Leaderboard

Benchmarks

MT-Bench, GSM8K, TruthfulQA, MMLU, ARC, HellaSwag, Winogrande, OpenBookQA, BoolQ, PIQA, Big Bench subsets