Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
6
Why It Matters For Business
If you fine‑tune models with pairwise preference data, standard DPO can unintentionally degrade correct outputs; DPOP is a low‑cost fix that yields more reliable improvements and better leaderboard scores.
Summary TLDR
DPO (Direct Preference Optimisation) can lower the model's likelihood of preferred completions when the preferred and dispreferred completions in a pair differ by only a few tokens. The authors prove this token-level failure mode, demonstrate it empirically on new paired datasets (MetaMath, ARC, HellaSwag), and propose DPO‑Positive (DPOP): a simple extra penalty that keeps preferred-completion likelihoods high. DPOP prevents the token-level collapse, improves downstream task scores, and was used to build the Smaug models (Smaug‑34B, Smaug‑72B). Smaug‑72B reaches an 80.48% average on the HuggingFace Open LLM Leaderboard, and DPOP outperforms DPO on MT‑Bench in controlled comparisons.
Problem Statement
Standard DPO optimises relative preference but can reduce the absolute probability of the preferred completion. This is especially likely when preference pairs differ by only a few tokens, causing later-token probabilities to drop and degrading task accuracy.
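A small numeric sketch can make this concrete. It assumes the usual per-pair DPO loss, -log σ(β · margin), where the margin is the difference of the policy-to-reference log-ratios for the preferred and dispreferred completions. All log-probabilities and the β value are hypothetical numbers chosen for illustration, not results from the paper.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss from sequence-level log-probabilities (sketch)."""
    # Only the *difference* of the two log-ratios enters the loss.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable form.
    return math.log1p(math.exp(-z)) if z > 0 else -z + math.log1p(math.exp(z))

# Hypothetical starting point: the policy equals the reference model.
before = dpo_loss(logp_w=-10.0, logp_l=-10.5, ref_logp_w=-10.0, ref_logp_l=-10.5)

# Hypothetical update: the preferred completion became LESS likely
# (-10.0 -> -12.0), but the dispreferred one fell even further (-10.5 -> -16.0).
after = dpo_loss(logp_w=-12.0, logp_l=-16.0, ref_logp_w=-10.0, ref_logp_l=-10.5)

# The loss went down even though the preferred completion lost probability.
print(after < before)
```

Because the loss sees only the margin, nothing in it stops the preferred completion's absolute log-probability from falling.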
Main Contribution
- Theoretical proof that DPO can decrease preferred-completion likelihood while improving relative preference.
- DPO‑Positive (DPOP): a modified loss that penalises lowering preferred-completion likelihood and fixes the token-level failure.
- New paired preference datasets derived from MetaMath, ARC, and HellaSwag and empirical token-level analyses.
- Smaug models (7B/34B/72B) fine-tuned with DPOP; Smaug‑72B reaches 80.48% on the HuggingFace Open LLM Leaderboard.
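This summary does not spell out the DPOP objective, so the sketch below is an assumption about its form: the standard DPO margin, minus λ times max(0, log π_ref(y_w|x) − log π_θ(y_w|x)), all inside the usual -log σ(β · …). The function name and the β value are placeholders; λ=50 is the Smaug default mentioned in this summary. Check the exact formulation against the paper before relying on it.

```python
import math

def neg_log_sigmoid(z):
    """Numerically stable -log(sigmoid(z))."""
    return math.log1p(math.exp(-z)) if z > 0 else -z + math.log1p(math.exp(z))

def dpop_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.3, lam=50.0):
    """Per-pair DPOP-style loss from sequence-level log-probs (assumed form)."""
    # Standard DPO margin: difference of policy-to-reference log-ratios.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # Penalty is active only when the policy assigns the preferred
    # completion a LOWER log-prob than the reference model does.
    penalty = max(0.0, ref_logp_w - logp_w)
    return neg_log_sigmoid(beta * (margin - lam * penalty))

# Hypothetical numbers: the preferred log-prob dropped from -10 to -12.
collapsed = dpop_loss(logp_w=-12.0, logp_l=-16.0, ref_logp_w=-10.0, ref_logp_l=-10.5)
at_reference = dpop_loss(logp_w=-10.0, logp_l=-10.5, ref_logp_w=-10.0, ref_logp_l=-10.5)
print(collapsed > at_reference)  # the collapse is now penalised
```

When the preferred completion is at or above its reference log-probability, the penalty term is zero and this reduces to the plain DPO loss, so setting `lam=0` recovers DPO.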
Key Findings
- DPO can reduce the model log‑prob of preferred completions on low edit‑distance pairs.
- DPOP improves across low and high edit‑distance datasets compared to DPO and alternatives.
- DPOP gives measurable downstream gains on an independent, LLM‑judged benchmark.
- Smaug‑72B (DPOP fine‑tuned) reached top open‑weight leaderboard numbers.
Results
- Accuracy
- MT-Bench first-turn score (Llama-2-7B finetune)
- Token log-prob after edit (preferred completion)
- Normalized edit distance
Who Should Care
What To Try In 7 Days
- Run a small DPO vs DPOP fine‑tune on an existing paired dataset and compare token log‑probs after edits.
- If you use DPO in production tuning, add the DPOP penalty (λ) and test general benchmarks like MT‑Bench.
- Convert an important labelled dataset into paired preferences (small edit pairs) and validate DPOP prevents collapse.
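Before running these experiments it helps to know how close your pairs actually are. The sketch below computes a normalised edit distance between a preferred and a dispreferred completion. Two things are assumptions here, since this summary does not define the metric: the distance is token-level Levenshtein, and it is normalised by the length of the longer sequence. Whitespace splitting stands in for a real tokenizer, and the example pair is made up.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute (free if tokens match)
            ))
        prev = curr
    return prev[-1]

def normalized_edit_distance(a, b):
    """Edit distance divided by the longer length (assumed normalisation)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0

# Hypothetical pair that differs in a single token.
chosen = "The answer is 42".split()
rejected = "The answer is 41".split()
print(normalized_edit_distance(chosen, rejected))  # 0.25
```

Pairs with values near zero are the ones this summary flags as most at risk under plain DPO, so they are the natural place to start a DPO vs DPOP comparison.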
Optimization Features
Training Optimization
- adds a loss penalty term to preserve preferred-completion likelihood
Reproducibility
Code Urls
Data Urls
- https://huggingface.co/datasets/Intel/orca_dpo_pairs
- https://huggingface.co/datasets/allenai/ultrafeedback_binarized_cleaned
- MetaMath/ARC/HellaSwag paired datasets (linked in repository)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Only six English datasets were evaluated; non‑English behaviour is untested.
- Full ablation at 72B scale was not done due to compute limits; some scale assumptions are extrapolated from smaller models.
- DPOP hyperparameter λ was lightly tuned (default λ=50 used for Smaug) and may need retuning for different bases.
When Not To Use
- You do not need DPOP if you never fine‑tune on paired preference data.
- If your preference pairs are synthetic and guaranteed to be far apart in text and you already validated DPO, gains may be smaller.
Failure Modes
- DPO can decrease the likelihood of the preferred completion while still increasing relative preference, especially when pairs differ by few tokens.
- Token-level 'wrong-way' gradients: tokens after the first differing token can see reduced log‑prob under DPO, breaking autoregressive modeling.
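One way to check for the second failure mode is to compare per-token log-probs of the preferred completion before and after fine-tuning, looking only at tokens that come after the first point where the pair diverges. The sketch below assumes you already have per-token log-probs as plain lists of floats. The helper names and all numbers are hypothetical.

```python
def first_diff_index(chosen, rejected):
    """Index of the first position where the two token sequences differ."""
    for i, (c, r) in enumerate(zip(chosen, rejected)):
        if c != r:
            return i
    # One sequence is a prefix of the other (or they are identical).
    return min(len(chosen), len(rejected))

def mean_logprob_change_after_diff(chosen, rejected, logps_before, logps_after):
    """Mean change in preferred-token log-prob for tokens AFTER the first
    differing token. A negative value means those tokens became less likely."""
    start = first_diff_index(chosen, rejected) + 1
    deltas = [after - before
              for before, after in zip(logps_before[start:], logps_after[start:])]
    return sum(deltas) / len(deltas) if deltas else 0.0

# Hypothetical pair and per-token log-probs for the preferred completion.
chosen = ["2", "+", "2", "=", "4", "."]
rejected = ["2", "+", "2", "=", "5", "."]
logps_before = [-0.1, -0.1, -0.1, -0.1, -0.5, -0.2]
logps_after = [-0.1, -0.1, -0.1, -0.1, -0.3, -0.9]

change = mean_logprob_change_after_diff(chosen, rejected, logps_before, logps_after)
print(change < 0)  # later tokens of the preferred completion lost probability
```

Running the same check on a DPO-tuned and a DPOP-tuned checkpoint gives a direct, per-pair view of whether the penalty is doing its job.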
Core Entities
Models
- Smaug-72B
- Smaug-34B
- Smaug-7B
- LoRA
- Bagel-34B-v0.2
- Llama-2-7B-Chat
- Mistral7B
- Yi-34B-200k
- Qwen72B
Metrics
- Accuracy
- MT-Bench first-turn score
- token log-prob change
- normalized edit distance
Datasets
- MetaMath (paired)
- ARC-Challenge (paired)
- HellaSwag (paired)
- ORCA DPO
- Truthy DPO
- UltraFeedback_binarized
- GSM8K
- MT-Bench
- HuggingFace Open LLM Leaderboard
Benchmarks
- HuggingFace Open LLM Leaderboard
- MT-Bench
- MMLU
- GSM8K
- ARC
- HellaSwag
- TruthfulQA
- Winogrande

