Add a positive log‑likelihood term to DPO to stop it from reducing the probability of preferred answers

Overview

Decision Snapshot: Ready For Pilot

The paper gives both a mathematical derivation and multiple empirical tests showing failure of DPO and improvement from DPOP; results include token‑level diagnostics and independent benchmark gains.

Citations: 6

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 7

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, Colin White


Why It Matters For Business

If you fine‑tune models with pairwise preference data, standard DPO can unintentionally degrade correct outputs; DPOP is a low‑cost fix that yields more reliable improvements and better leaderboard scores.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead, Data Scientist

Summary TLDR

DPO (Direct Preference Optimisation) can lower the model likelihood of preferred completions when the preferred and dispreferred pair differ by few tokens. The authors prove the token-level failure mode, show it empirically on new paired datasets (MetaMath, ARC, HellaSwag), and propose DPO‑Positive (DPOP): a simple extra penalty that keeps preferred-completion likelihoods high. DPOP prevents the token-level collapse, improves downstream task scores, and was used to build Smaug models (Smaug‑34B, Smaug‑72B). Smaug‑72B hits 80.48% average on the HuggingFace Open LLM Leaderboard and DPOP outperforms DPO on MT‑Bench in controlled comparisons.

Problem Statement

Standard DPO optimises relative preference but can reduce the absolute probability of the preferred completion. This is especially likely when preference pairs differ by only a few tokens, causing later-token probabilities to drop and degrading task accuracy.
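A small numeric illustration of the statement above, assuming the standard DPO objective (negative log-sigmoid of β times the difference between the policy-to-reference log-ratios of the preferred and dispreferred completions). The log-prob values and β = 0.1 are made up for illustration and are not taken from the paper.

```python
import math

def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one pair, given sequence-level log-probs.

    pol_* are log-probs under the policy, ref_* under the reference model;
    *_w is the preferred completion, *_l the dispreferred one.
    """
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    return math.log1p(math.exp(-beta * margin))  # equals -log(sigmoid(beta * margin))

# Hypothetical sequence log-probs under the reference model.
ref_w, ref_l = -10.0, -12.0

# Policy A keeps the preferred log-prob and pushes the dispreferred one down by 2.
loss_a = dpo_loss(pol_w=-10.0, pol_l=-14.0, ref_w=ref_w, ref_l=ref_l)

# Policy B LOWERS the preferred log-prob by 1 and the dispreferred one by 3.
loss_b = dpo_loss(pol_w=-11.0, pol_l=-15.0, ref_w=ref_w, ref_l=ref_l)

# Both policies have the same relative margin, so DPO cannot tell them apart.
print(loss_a, loss_b)
```

Because only the difference of the two log-ratios enters the loss, policy B is scored exactly like policy A even though it makes the preferred completion less likely in absolute terms.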

Main Contribution

Theoretical proof that DPO can decrease preferred-completion likelihood while improving relative preference.

DPO‑Positive (DPOP): a modified loss that penalises lowering preferred-completion likelihood and fixes the token-level failure.
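A minimal sketch of how such a penalty could look for a single pair. It assumes the penalty is a hinge on the preferred completion, max(0, log π_ref(y_w|x) − log π_θ(y_w|x)), scaled by λ and subtracted inside the sigmoid; this summary does not spell out the formula, so check the paper for the exact form. The function names and the default values of `beta` and `lam` are illustrative placeholders.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpop_loss(pol_w, pol_l, ref_w, ref_l, beta=0.3, lam=5.0):
    """DPOP-style loss for one pair, given sequence-level log-probs.

    Assumption: the penalty is active only when the policy assigns the
    preferred completion a lower log-prob than the reference does.
    Setting lam=0 recovers the plain DPO loss.
    """
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    penalty = max(0.0, ref_w - pol_w)
    return -log_sigmoid(beta * (margin - lam * penalty))

# Same relative margin in both cases; only the second lowers the preferred log-prob.
kept = dpop_loss(pol_w=-10.0, pol_l=-14.0, ref_w=-10.0, ref_l=-12.0)
dropped = dpop_loss(pol_w=-11.0, pol_l=-15.0, ref_w=-10.0, ref_l=-12.0)
print(kept, dropped)  # the second value is larger because the penalty fires
```

The hinge leaves the loss unchanged whenever the preferred completion is at least as likely as under the reference, so the extra term only acts in the failure case.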

Key Findings

DPO can reduce the model log‑prob of preferred completions on low edit‑distance pairs

Numbers: -1.82 vs -0.26 vs -0.37 log-prob (DPO vs DPOP vs ref) on tokens after edit (MetaMath)

Practical Use: If your preference pairs differ by few tokens (e.g., minor arithmetic fix), standard DPO can make the model worse; use a corrective loss like DPOP.

Evidence Ref: Fig. 4 (left); Sec. 5 token-level analysis
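A sketch of the kind of token-level diagnostic this finding relies on: find the first position where the preferred and dispreferred completions differ, then average the preferred completion's per-token log-probs after that position. The helper names are hypothetical, and averaging per token over the tail is an assumption; the paper may aggregate differently.

```python
def first_diff_index(preferred, dispreferred):
    """Index of the first token at which the two token sequences differ."""
    for i, (a, b) in enumerate(zip(preferred, dispreferred)):
        if a != b:
            return i
    # One sequence is a prefix of the other (or they are identical).
    return min(len(preferred), len(dispreferred))

def mean_logprob_after_edit(preferred, dispreferred, token_logprobs):
    """Mean log-prob of preferred tokens strictly after the first differing token.

    token_logprobs[i] is the model's log-prob of preferred[i] given its prefix.
    Returns NaN when no tokens follow the edit.
    """
    start = first_diff_index(preferred, dispreferred) + 1
    tail = token_logprobs[start:]
    return sum(tail) / len(tail) if tail else float("nan")

# Toy example: the pair differs only at the arithmetic result (index 4).
preferred = ["2", "+", "3", "=", "5", ".", "Answer", "5"]
dispreferred = ["2", "+", "3", "=", "6", ".", "Answer", "6"]
logprobs = [-0.1, -0.1, -0.1, -0.1, -1.0, -0.5, -0.25, -0.75]
print(mean_logprob_after_edit(preferred, dispreferred, logprobs))
```

Running this for the reference model, a DPO-tuned model, and a DPOP-tuned model on the same pairs gives three numbers that can be compared directly.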

DPOP improves across low and high edit‑distance datasets compared to DPO and alternatives

Numbers: DPOP outperforms DPO/IPO/SLiC on MetaMath (6.5% edit) and ARC (90% edit)

Practical Use: Replace DPO with DPOP for preference fine‑tuning to gain robustness whether pairs are very similar or very different.

Evidence Ref: Fig. 2; Sec. 5.2 loss comparisons
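To check where your own pairs fall between the low and high edit-distance regimes quoted above, a normalized edit distance can be computed per pair. The sketch below uses token-level Levenshtein distance divided by the longer sequence length; that normalization is an assumption, and the paper's exact convention may differ.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def normalized_edit_distance(a, b):
    """Edit distance divided by the longer length; 0.0 for two empty inputs."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0

# One changed token out of four.
chosen = ["x", "=", "5", "."]
rejected = ["x", "=", "6", "."]
print(normalized_edit_distance(chosen, rejected))
```

Averaging this value over a dataset gives a single figure to compare against the low-edit and high-edit cases reported in the finding.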

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average accuracy	80.48%	MoMo-72B-lora-1.8.7-DPO (78.55%)	+1.93 pp	HuggingFace Open LLM Leaderboard (aggregate of six tasks)	Table 1; Sec.6.2	Table 1
MT-Bench first-turn score (Llama-2-7B finetune)	DPOP 7.292 ± 0.037	DPO 7.032 ± 0.043	+0.260	MT-Bench (10 trials)	Sec.6.1 MT-Bench	Sec.6.1

What To Try In 7 Days

Run a small DPO vs DPOP fine‑tune on an existing paired dataset and compare token log‑probs after edits.

If you use DPO in production tuning, add the DPOP penalty (λ) and test general benchmarks like MT‑Bench.

Convert an important labelled dataset into paired preferences (small edit pairs) and validate DPOP prevents collapse.
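For the last item, one hypothetical way to turn a labelled dataset into small-edit preference pairs is to corrupt a single number in the correct solution. This is an illustrative scheme, not the paper's dataset construction; the function name and the `prompt`/`chosen`/`rejected` field names are assumptions chosen to match a common preference-data layout.

```python
import random
import re

def make_small_edit_pair(prompt, solution, rng):
    """Build a preference record whose rejected answer differs from the
    chosen one by a single corrupted integer.

    Returns None when the solution contains no digits to corrupt.
    """
    numbers = list(re.finditer(r"\d+", solution))
    if not numbers:
        return None
    target = rng.choice(numbers)
    # Shift the selected number by a small non-zero offset.
    wrong = str(int(target.group()) + rng.choice([-2, -1, 1, 2]))
    rejected = solution[:target.start()] + wrong + solution[target.end():]
    return {"prompt": prompt, "chosen": solution, "rejected": rejected}

rng = random.Random(0)  # fixed seed so the pairs are reproducible
pair = make_small_edit_pair("What is 12 + 30?", "12 + 30 = 42, so the answer is 42.", rng)
print(pair["chosen"])
print(pair["rejected"])
```

Pairs built this way differ by only a few tokens, which is the regime where the collapse is reported, so they make a quick test bed for comparing DPO and DPOP.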

Optimization Features

Training Optimization

adds a loss penalty term to preserve preferred-completion likelihood

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/abacusai/smaug

Data URLs

https://huggingface.co/datasets/Intel/orca_dpo_pairs

https://huggingface.co/datasets/allenai/ultrafeedback_binarized_cleaned

MetaMath/ARC/HellaSwag paired datasets (linked in repository)

Risks & Boundaries

Limitations

Only six English datasets were evaluated; non‑English behaviour is untested.

Full ablation at 72B scale was not done due to compute limits; some scale assumptions are extrapolated from smaller models.

When Not To Use

You do not need DPOP if you never fine‑tune on paired preference data.

If your preference pairs are synthetic and guaranteed to be far apart in text and you already validated DPO, gains may be smaller.

Failure Modes

DPO can decrease the likelihood of the preferred completion while still increasing relative preference, especially when pairs differ by few tokens.

Token-level 'wrong-way' gradients: tokens after the first differing token can see reduced log‑prob under DPO, breaking autoregressive modeling.

Core Entities

Models

Smaug-72B, Smaug-34B, Smaug-7B, LoRA, Bagel-34B-v0.2, Llama-2-7B-Chat, Mistral7B, Yi-34B-200k, Qwen72B

Metrics

Accuracy, MT-Bench first-turn score, token log-prob change, normalized edit distance

Datasets

MetaMath (paired), ARC-Challenge (paired), HellaSwag (paired), ORCA DPO, Truthy DPO, UltraFeedback_binarized, GSM8K, MT-Bench, HuggingFace Open LLM Leaderboard

Benchmarks

HuggingFace Open LLM Leaderboard, MT-Bench, MMLU, GSM8K, ARC, HellaSwag, TruthfulQA, Winogrande

