Add a positive log‑likelihood term to DPO to stop it from reducing the probability of preferred answers

February 20, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

6

Authors

Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, Colin White

Why It Matters For Business

If you fine‑tune models with pairwise preference data, standard DPO can unintentionally degrade correct outputs; DPOP is a low‑cost fix that yields more reliable improvements and better leaderboard scores.

Summary TLDR

DPO (Direct Preference Optimisation) can lower the model's likelihood of preferred completions when the preferred and dispreferred completions in a pair differ by only a few tokens. The authors prove this token-level failure mode, demonstrate it empirically on new paired datasets (MetaMath, ARC, HellaSwag), and propose DPO‑Positive (DPOP): a simple extra penalty that keeps preferred-completion likelihoods high. DPOP prevents the token-level collapse, improves downstream task scores, and was used to build the Smaug models (Smaug‑34B, Smaug‑72B). Smaug‑72B reaches an 80.48% average on the HuggingFace Open LLM Leaderboard, and DPOP outperforms DPO on MT‑Bench in controlled comparisons.

Problem Statement

Standard DPO optimises relative preference but can reduce the absolute probability of the preferred completion. This is especially likely when preference pairs differ by only a few tokens, causing later-token probabilities to drop and degrading task accuracy.
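A small worked example makes the problem concrete. The sketch below uses the standard per-example DPO loss, -log σ(β · margin), with hypothetical sequence log-probabilities and an illustrative β of 0.1 (neither comes from the paper). It shows the loss falling even though the preferred completion became less likely, because the dispreferred completion fell further.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss for one (preferred, dispreferred) pair.

    logp_w / logp_l are sequence log-probs under the policy,
    ref_logp_w / ref_logp_l the same under the frozen reference model.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # -log(sigmoid(z)) is the same as log(1 + exp(-z))
    return math.log1p(math.exp(-beta * margin))

# Hypothetical numbers, chosen only to illustrate the effect.
ref_w, ref_l = -10.0, -11.0

# Policy identical to the reference: margin is 0, loss is ln 2.
before = dpo_loss(-10.0, -11.0, ref_w, ref_l)

# Preferred log-prob drops by 4, dispreferred drops by 14: margin is +10.
after = dpo_loss(-14.0, -25.0, ref_w, ref_l)

print(f"loss before: {before:.3f}")  # about 0.693
print(f"loss after:  {after:.3f}")   # about 0.313, lower despite logp_w falling
```

The optimiser sees only the margin, so nothing in this objective discourages the drop from -10 to -14 on the preferred completion.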

Main Contribution

Theoretical proof that DPO can decrease preferred-completion likelihood while improving relative preference.

DPO‑Positive (DPOP): a modified loss that penalises lowering preferred-completion likelihood and fixes the token-level failure.

New paired preference datasets derived from MetaMath, ARC, and HellaSwag and empirical token-level analyses.

Smaug models (7B/34B/72B) fine-tuned with DPOP; Smaug‑72B reaches 80.48% on the HuggingFace Open LLM Leaderboard.
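For reference, the DPOP objective can be written as the DPO objective with one extra term inside the sigmoid. This summary does not spell the formula out, so the form below is an assumption based on the formulation in the DPOP paper. Check the exact placement of β and λ against the paper before relying on it. Here y_w is the preferred completion, y_l the dispreferred one, and π_ref the frozen reference model.

```latex
\mathcal{L}_{\text{DPOP}}(\pi_\theta; \pi_{\text{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim D}
    \left[ \log \sigma\!\left( \beta \left(
        \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
      - \lambda \cdot \max\!\left( 0,\;
          \log \frac{\pi_{\text{ref}}(y_w \mid x)}{\pi_\theta(y_w \mid x)} \right)
    \right) \right) \right]
```

The max(0, ·) term is zero whenever the policy assigns the preferred completion at least as much probability as the reference model does, so in that regime the loss reduces to plain DPO. The term only becomes active when the preferred-completion likelihood falls below the reference.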

Key Findings

DPO can reduce the model log‑prob of preferred completions on low edit‑distance pairs

Numbers: -1.82 vs -0.26 vs -0.37 log-prob (DPO vs DPOP vs ref) on tokens after edit (MetaMath)

DPOP improves across low and high edit‑distance datasets compared to DPO and alternatives

Numbers: DPOP outperforms DPO/IPO/SLiC on MetaMath (6.5% edit) and ARC (90% edit)

DPOP gives measurable downstream gains on an independent, LLM‑judged benchmark

Numbers: MT‑Bench first‑turn score: DPO 7.032±0.043 vs DPOP 7.292±0.037

Smaug‑72B (DPOP fine‑tuned) reached top open‑weight leaderboard numbers

Numbers: 80.48% average accuracy on HuggingFace Open LLM Leaderboard

Results

Accuracy (Smaug‑72B, HuggingFace Open LLM Leaderboard average)

Value: 80.48%

Baseline: MoMo-72B-lora-1.8.7-DPO (78.55%)

MT-Bench first-turn score (Llama-2-7B finetune)

Value: DPOP 7.292 ± 0.037

Baseline: DPO 7.032 ± 0.043

Token log-prob after edit (preferred completion)

Value: DPO -1.82, DPOP -0.26, reference -0.37 (average)

Baseline: reference model

Normalized edit distance

Value: MetaMath 6.5%, ARC 90%
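To gauge where your own preference pairs sit on this scale, a rough normalized edit distance can be computed per pair. The sketch below assumes token-level Levenshtein distance divided by the length of the longer completion. The summary does not state the exact normalization the authors used, so treat this definition (and the whitespace tokenization in the example) as an assumption.

```python
def normalized_edit_distance(a, b):
    """Token-level Levenshtein distance divided by the longer sequence length.

    `a` and `b` are sequences of tokens (strings or token ids).
    Returns 0.0 for identical sequences and 1.0 when nothing lines up.
    """
    if not a and not b:
        return 0.0
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1] / max(len(a), len(b))

# Hypothetical pair that differs in a single token out of eight.
preferred = "the answer is 12 so x = 4".split()
dispreferred = "the answer is 13 so x = 4".split()
print(normalized_edit_distance(preferred, dispreferred))  # 0.125
```

Pairs with low values, like the MetaMath-style example here, are the ones where the failure mode is most likely to show up.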

Who Should Care

What To Try In 7 Days

Run a small DPO vs DPOP fine‑tune on an existing paired dataset and compare token log‑probs after edits.

If you use DPO in production tuning, add the DPOP penalty (λ) and test general benchmarks like MT‑Bench.

Convert an important labelled dataset into paired preferences (small edit pairs) and validate DPOP prevents collapse.
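The token log-prob comparison in the first step can be sketched as follows. The helper names and the log-prob values are hypothetical. In practice the per-token log-probs of the preferred completion would come from scoring it under the reference, DPO, and DPOP checkpoints. "After the edit" is taken here to mean strictly after the first token where the two completions differ.

```python
def first_diff_index(preferred, dispreferred):
    """Index of the first token where the two completions differ."""
    for i, (p, d) in enumerate(zip(preferred, dispreferred)):
        if p != d:
            return i
    # One completion is a prefix of the other (or they are identical).
    return min(len(preferred), len(dispreferred))

def mean_logprob_after_edit(token_logprobs, preferred, dispreferred):
    """Average log-prob of preferred-completion tokens after the first differing token."""
    start = first_diff_index(preferred, dispreferred) + 1
    tail = token_logprobs[start:]
    return sum(tail) / len(tail) if tail else float("nan")

# Hypothetical pair and hypothetical per-token log-probs of the preferred completion.
preferred = ["x", "=", "4", ",", "so", "y", "=", "8"]
dispreferred = ["x", "=", "5", ",", "so", "y", "=", "8"]
ref_logprobs = [-0.5, -0.1, -0.9, -0.4, -0.3, -0.4, -0.2, -0.5]
tuned_logprobs = [-0.5, -0.1, -0.3, -1.9, -1.6, -2.0, -1.5, -2.0]

ref_mean = mean_logprob_after_edit(ref_logprobs, preferred, dispreferred)
tuned_mean = mean_logprob_after_edit(tuned_logprobs, preferred, dispreferred)
print(f"reference: {ref_mean:.2f}, tuned: {tuned_mean:.2f}")
```

A tuned mean well below the reference mean, averaged over many pairs, is the pattern reported for DPO on MetaMath. A tuned mean at or above the reference is what DPOP is meant to produce.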

Optimization Features

Training Optimization

  • adds a loss penalty term to preserve preferred-completion likelihood
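A minimal per-example sketch of that penalty is shown below. It assumes the penalty is λ · max(0, log π_ref(y_w|x) − log π_θ(y_w|x)) subtracted inside the DPO sigmoid, which is how the DPOP paper formulates it as far as this summary can confirm. λ=50 matches the default reported for Smaug, while β=0.3 and all log-prob values are illustrative assumptions.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpop_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.3, lam=50.0):
    """Per-example DPOP loss (sketch); lam=0.0 recovers plain DPO."""
    ratio_w = logp_w - ref_logp_w              # preferred log-ratio
    ratio_l = logp_l - ref_logp_l              # dispreferred log-ratio
    # Active only when the policy rates y_w below the reference model.
    penalty = max(0.0, ref_logp_w - logp_w)
    z = beta * (ratio_w - ratio_l - lam * penalty)
    return softplus(-z)                        # equals -log(sigmoid(z))

ref_w, ref_l = -10.0, -11.0  # hypothetical reference log-probs

# Two hypothetical policies with the same margin (+10) over the reference.
collapsed = dict(logp_w=-14.0, logp_l=-25.0)   # preferred likelihood fell
healthy = dict(logp_w=-9.0, logp_l=-20.0)      # preferred likelihood rose

for name, p in (("collapsed", collapsed), ("healthy", healthy)):
    dpo = dpop_loss(p["logp_w"], p["logp_l"], ref_w, ref_l, lam=0.0)
    dpop = dpop_loss(p["logp_w"], p["logp_l"], ref_w, ref_l)
    print(f"{name}: DPO={dpo:.3f} DPOP={dpop:.3f}")
```

With λ set to 0 the two policies receive the same loss, since plain DPO only sees the margin. With the penalty switched on, the collapsed policy is charged heavily while the healthy one is unaffected.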

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only six English datasets were evaluated; non‑English behaviour is untested.
  • Full ablation at 72B scale was not done due to compute limits; some scale assumptions are extrapolated from smaller models.
  • DPOP hyperparameter λ was lightly tuned (default λ=50 used for Smaug) and may need retuning for different bases.

When Not To Use

  • You do not need DPOP if you never fine‑tune on paired preference data.
  • If your preference pairs are synthetic and guaranteed to be far apart in text and you already validated DPO, gains may be smaller.

Failure Modes

  • DPO can decrease the likelihood of the preferred completion while still increasing relative preference, especially when pairs differ by few tokens.
  • Token-level 'wrong-way' gradients: tokens after the first differing token can see reduced log‑prob under DPO, breaking autoregressive modeling.

Core Entities

Models

  • Smaug-72B
  • Smaug-34B
  • Smaug-7B
  • LoRA
  • Bagel-34B-v0.2
  • Llama-2-7B-Chat
  • Mistral7B
  • Yi-34B-200k
  • Qwen72B

Metrics

  • Accuracy
  • MT-Bench first-turn score
  • token log-prob change
  • normalized edit distance

Datasets

  • MetaMath (paired)
  • ARC-Challenge (paired)
  • HellaSwag (paired)
  • ORCA DPO
  • Truthy DPO
  • UltraFeedback_binarized
  • GSM8K
  • MT-Bench
  • HuggingFace Open LLM Leaderboard

Benchmarks

  • HuggingFace Open LLM Leaderboard
  • MT-Bench
  • MMLU
  • GSM8K
  • ARC
  • HellaSwag
  • TruthfulQA
  • Winogrande