Overview
The paper gives both a mathematical derivation and multiple empirical tests showing that DPO can fail and that DPOP improves on it; results include token‑level diagnostics and independent benchmark gains.
Citations: 6
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 7
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
If you fine‑tune models with pairwise preference data, standard DPO can unintentionally degrade correct outputs; DPOP is a low‑cost fix that yields more reliable improvements and better leaderboard scores.
Summary TLDR
DPO (Direct Preference Optimisation) can lower the model's likelihood of preferred completions when the preferred and dispreferred completions in a pair differ by only a few tokens. The authors prove this token-level failure mode, demonstrate it empirically on new paired datasets (MetaMath, ARC, HellaSwag), and propose DPO‑Positive (DPOP), which adds a simple penalty that keeps preferred-completion likelihoods high. DPOP prevents the token-level collapse, improves downstream task scores, and was used to build the Smaug models (Smaug‑34B, Smaug‑72B). Smaug‑72B reaches an 80.48% average on the HuggingFace Open LLM Leaderboard, and DPOP outperforms DPO on MT‑Bench in controlled comparisons.
Problem Statement
Standard DPO optimises relative preference but can reduce the absolute probability of the preferred completion. This is especially likely when preference pairs differ by only a few tokens, causing later-token probabilities to drop and degrading task accuracy.
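A worked example with hypothetical log-probabilities may make the gap between relative and absolute likelihood concrete. The sketch below evaluates the standard DPO loss for a single pair; the numbers and the `beta` value are illustrative assumptions, not figures from the paper.

```python
import math


def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one pair, from sequence-level log-probs.

    w = preferred completion, l = dispreferred completion.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


# Hypothetical reference model: both completions have log-prob -10.
ref_w, ref_l = -10.0, -10.0

# At the start of training the policy equals the reference.
loss_start = dpo_loss(-10.0, -10.0, ref_w, ref_l)

# Later, BOTH completions have become less likely,
# but the dispreferred one has dropped further.
loss_after = dpo_loss(-12.0, -16.0, ref_w, ref_l)

print(f"loss at start: {loss_start:.4f}")
print(f"loss after:    {loss_after:.4f}")
```

The loss falls even though the preferred completion's log-prob dropped from -10 to -12, because the loss depends only on the margin between the two completions.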
Main Contribution
Theoretical proof that DPO can decrease preferred-completion likelihood while improving relative preference.
DPO‑Positive (DPOP): a modified loss that penalises lowering preferred-completion likelihood and fixes the token-level failure.
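This summary does not reproduce the DPOP loss itself, so the following is only a minimal sketch under a stated assumption: that the penalty is a hinge term, scaled by λ, that becomes positive when the policy assigns the preferred completion a lower log-prob than the reference model does. The function name, the default `beta`, and the default `lam` are hypothetical.

```python
import math


def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpop_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, lam=5.0):
    """DPO loss with an ASSUMED hinge penalty on the preferred completion.

    w = preferred completion, l = dispreferred completion.
    Setting lam=0 recovers the standard DPO loss.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # Assumption: penalise only when the policy rates the preferred
    # completion as less likely than the reference model does.
    penalty = max(0.0, ref_logp_w - logp_w)
    return -log_sigmoid(beta * (margin - lam * penalty))


# Hypothetical case: same margin (4.0), reached in two different ways.
lowered = dpop_loss(-12.0, -16.0, -10.0, -10.0)  # preferred log-prob fell
raised = dpop_loss(-9.0, -13.0, -10.0, -10.0)    # preferred log-prob rose

print(f"preferred lowered: {lowered:.4f}")
print(f"preferred raised:  {raised:.4f}")
```

Under this assumed form, two policies with the same margin no longer receive the same loss: the one that reached the margin by lowering the preferred completion's likelihood is penalised.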
Key Findings
DPO can reduce the model log‑prob of preferred completions on low edit‑distance pairs.
DPOP outperforms DPO and other alternatives on both low and high edit‑distance datasets.
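To judge whether a dataset falls in the low edit‑distance regime, one option is to compute a normalised edit distance for each pair. The sketch below uses a token-level Levenshtein distance with whitespace tokenisation; both choices are assumptions made for illustration and may differ from the measure used in the paper.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the longer sequence length (0.0 to 1.0)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Hypothetical pair that differs in two of ten tokens.
chosen = "3 + 4 = 7 so the answer is 7".split()
rejected = "3 + 4 = 8 so the answer is 8".split()

print(edit_distance(chosen, rejected))
print(normalized_edit_distance(chosen, rejected))
```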
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Average accuracy | Smaug‑72B 80.48% | MoMo-72B-lora-1.8.7-DPO (78.55%) | +1.93 pp | HuggingFace Open LLM Leaderboard (aggregate of six tasks) | Table 1; Sec.6.2 | Table 1 |
| MT-Bench first-turn score (Llama-2-7B finetune) | DPOP 7.292 ± 0.037 | DPO 7.032 ± 0.043 | +0.260 | MT-Bench (10 trials) | Sec.6.1 MT-Bench | Sec.6.1 |
What To Try In 7 Days
Run a small DPO vs DPOP fine‑tune on an existing paired dataset and compare the per-token log‑probs of preferred completions, especially at positions after the first differing token.
If you use DPO in production tuning, add the DPOP penalty (λ) and test general benchmarks like MT‑Bench.
Convert an important labelled dataset into paired preferences (small edit pairs) and validate DPOP prevents collapse.
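For the last item, one hypothetical way to turn a labelled example into a small-edit preference pair is to copy the correct solution and swap only its final answer for a wrong one. The helper name and the `prompt`/`chosen`/`rejected` field names below are assumptions, and this is not necessarily how the paper built its datasets.

```python
def make_small_edit_pair(prompt, solution, correct, wrong):
    """Build a preference pair whose two completions differ in one span.

    The rejected completion is the solution with the LAST occurrence of
    the correct answer replaced by a wrong one.
    """
    if correct == wrong:
        raise ValueError("wrong answer must differ from the correct one")
    if correct not in solution:
        raise ValueError("solution does not contain the correct answer")
    head, _, tail = solution.rpartition(correct)
    return {
        "prompt": prompt,
        "chosen": solution,
        "rejected": head + wrong + tail,
    }


pair = make_small_edit_pair(
    prompt="What is 2 + 2?",
    solution="2 + 2 = 4. The answer is 4",
    correct="4",
    wrong="5",
)
print(pair["chosen"])
print(pair["rejected"])
```

Pairs built this way share most of their tokens, which is the regime where the summary reports DPO failing, so they make a reasonable stress test for a DPO vs DPOP comparison.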
Risks & Boundaries
Limitations
Only six English datasets were evaluated; non‑English behaviour is untested.
Full ablation at 72B scale was not done due to compute limits; some scale assumptions are extrapolated from smaller models.
When Not To Use
You do not need DPOP if you never fine‑tune on paired preference data.
If your preference pairs are synthetic and guaranteed to be far apart in text and you already validated DPO, gains may be smaller.
Failure Modes
DPO can decrease the likelihood of the preferred completion while still increasing relative preference, especially when pairs differ by few tokens.
Token-level 'wrong-way' gradients: tokens after the first differing token can see reduced log‑prob under DPO, breaking autoregressive modeling.
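A simple diagnostic for this failure mode is to split the preferred completion at the first token where it differs from the dispreferred one and compare how per-token log‑probs moved on each side. The sketch below assumes per-token log‑probs have already been extracted from the model before and after fine‑tuning; the function names and all numbers are hypothetical.

```python
def first_divergence(chosen, rejected):
    """Index of the first position where the two token sequences differ."""
    for i, (c, r) in enumerate(zip(chosen, rejected)):
        if c != r:
            return i
    # One sequence is a prefix of the other (or they are identical).
    return min(len(chosen), len(rejected))


def mean_logprob_shift(before, after, chosen, rejected):
    """Mean change in per-token log-prob of the preferred completion,
    reported separately for tokens up to and after the first difference."""
    if not (len(before) == len(after) == len(chosen)):
        raise ValueError("need one log-prob per preferred token")
    k = first_divergence(chosen, rejected)
    deltas = [a - b for a, b in zip(after, before)]

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return {
        "index": k,
        "up_to_divergence": mean(deltas[:k + 1]),
        "after_divergence": mean(deltas[k + 1:]),
    }


# Hypothetical tokens and per-token log-probs of the preferred completion.
chosen = ["2", "+", "2", "=", "4", "so", "4"]
rejected = ["2", "+", "2", "=", "5", "so", "5"]
before = [-0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5]
after = [-0.5, -0.5, -0.5, -0.5, -0.25, -1.5, -2.5]

shift = mean_logprob_shift(before, after, chosen, rejected)
print(shift)
```

A clearly negative `after_divergence` value alongside a flat or positive `up_to_divergence` value would match the pattern described above.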

