Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
1
Why It Matters For Business
RPO lets you improve model alignment using both paired and cheaper unpaired preference data, boosting perceived helpfulness in chat and summarization tasks while reducing the need for costly paired annotations.
Summary TLDR
RPO (Relative Preference Optimization) extends Direct Preference Optimization (DPO) by forming a contrast matrix across responses from identical and semantically related prompts. It weights each comparison by prompt similarity (computed from sentence embeddings), so models can learn from both paired and unpaired preference data. On dialogue and summarization tests (LLaMA2, Mistral), RPO improves GPT-4-judged win rates over DPO, especially with embedding reweighting and larger batch sizes. Key trade-offs: it needs a good embedding model and larger mini-batches (more GPU memory).
Problem Statement
Current pairwise preference tuning (e.g., DPO) only compares responses from the same prompt and ignores useful contrasts across related prompts. This limits learning from non-paired preference data and from semantic connections between different prompts.
Main Contribution
RPO method: build a contrast matrix across win/lose responses from identical and semantically related prompts.
Embedding-based reweighting: use prompt embeddings to upweight meaningful cross-prompt comparisons and downweight unrelated ones.
Practical evaluation: show that RPO (paired and unpaired modes) raises GPT-4 win rates on dialogue, summarization, and AlpacaEval2.0 versus DPO, IPO, KTO, and other baselines.
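The contrast-matrix idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the per-response scores (beta times the policy-to-reference log-ratio) are assumed to be precomputed, and applying the weight inside the sigmoid is an assumption this summary does not confirm.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # Branch on the sign so exp() never receives a large positive argument.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def rpo_loss(win_scores, lose_scores, weights):
    """Mean contrastive loss over an M x N contrast matrix (hypothetical helper).

    win_scores[i]  : beta * log(pi(y_w_i | x_i) / pi_ref(y_w_i | x_i))
    lose_scores[j] : beta * log(pi(y_l_j | x_j) / pi_ref(y_l_j | x_j))
    weights[i][j]  : similarity weight between prompt x_i and prompt x_j
    """
    m, n = len(win_scores), len(lose_scores)
    total = 0.0
    for i in range(m):
        for j in range(n):
            # Assumption: the weight scales the score gap inside the sigmoid.
            gap = win_scores[i] - lose_scores[j]
            total += log_sigmoid(weights[i][j] * gap)
    return -total / (m * n)

def dpo_loss(win_scores, lose_scores):
    """DPO-style baseline: only same-prompt (diagonal) pairs are compared."""
    gaps = [w - l for w, l in zip(win_scores, lose_scores)]
    return -sum(log_sigmoid(g) for g in gaps) / len(gaps)
```

With a single pair and a weight of 1, `rpo_loss` reduces to `dpo_loss`; the off-diagonal entries are what add cross-prompt signal.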
Key Findings
RPO (paired) outperforms DPO on dialogue (Mistral-7B).
RPO can use unpaired preference data and still improve performance.
Naive weighting hurts performance; semantic weighting helps.
RPO benefits from larger per-GPU batch sizes.
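The batch-size sensitivity follows from simple counting. As a rough sketch for the paired setting (the helper name is hypothetical), a batch of B prompts gives DPO only B same-prompt comparisons, while a full contrast matrix has B × B entries, of which B × (B − 1) are cross-prompt:

```python
def comparison_counts(batch_size: int) -> dict:
    """Count comparisons available in one paired mini-batch (illustrative only)."""
    same_prompt = batch_size                 # diagonal entries: what DPO uses
    total = batch_size * batch_size          # every winner against every loser
    cross_prompt = total - same_prompt       # off-diagonal entries
    return {"same_prompt": same_prompt, "cross_prompt": cross_prompt, "total": total}

for b in (1, 2, 4, 8):
    print(b, comparison_counts(b))
```

At a batch size of 1 there are no cross-prompt entries at all, so the contrast matrix adds nothing beyond the same-prompt comparison.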
Results
GPT-4 win rate (Anthropic-HH)
GPT-4 win rate (Anthropic-HH, unpaired)
GPT-4 win rate (AlpacaEval2.0)
GPT-4 win rate (Summarization)
Who Should Care
What To Try In 7 Days
Run RPO-Unpaired on existing winner/loser logs with all-MiniLM-L6 prompt embeddings and τ≈0.75.
Compare outputs vs your current model using GPT-4 or a small human panel on 200 samples.
If using paired data, try RPO-Paired with τ≈0.5 and increase per-GPU batch size to 4–8 to see gains.
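To see what the temperature τ controls, here is a small sketch of turning prompt embeddings into comparison weights. It assumes weights of the form exp(−cosine distance / τ), normalized so that each winner's row sums to 1; both the exact formula and the normalization axis are assumptions, and the function names are hypothetical. Embeddings are passed in as plain lists of floats; in practice they would come from a sentence encoder such as the all-MiniLM-L6 model mentioned above.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def contrast_weights(win_prompt_embs, lose_prompt_embs, tau=0.5):
    """Return an M x N weight matrix (hypothetical helper).

    Entry [i][j] is large when the prompt behind winner i is close to the
    prompt behind loser j. Assumption: each row is normalized to sum to 1.
    """
    weights = []
    for u in win_prompt_embs:
        raw = [math.exp(-cosine_distance(u, v) / tau) for v in lose_prompt_embs]
        row_sum = sum(raw)
        weights.append([r / row_sum for r in raw])
    return weights

# Toy 2-D "embeddings": two unrelated prompts.
embs = [[1.0, 0.0], [0.0, 1.0]]
print(contrast_weights(embs, embs, tau=0.5))
print(contrast_weights(embs, embs, tau=0.75))
```

A smaller τ concentrates weight on the most similar prompts; a larger τ spreads it more evenly across the batch.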
Optimization Features
Infra Optimization
- Requires larger batch sizes or cross-GPU aggregation for best results
Training Optimization
- Contrast-matrix training across mini-batch win/lose pairs
- Embedding-based reweighting of comparisons
Reproducibility
Data Urls
- Anthropic HH dataset (publicly referenced)
- OpenAI Summarization dataset (publicly referenced)
- AlpacaEval2.0 (public benchmark)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Relies on the quality of the sentence embedding model to find meaningful prompt pairs.
- Contrast matrix size limited by per-GPU mini-batch memory; large batches or cross-GPU aggregation needed.
- Assumes that differences in the normalization term Z(x) between prompts are negligible; the method does not model them, which could matter when the prompts in a batch are diverse.
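Where the Z(x) term comes from can be seen in the standard DPO reward parameterization. The sketch below assumes RPO inherits that parameterization; the indices i and j for the two prompts are notation introduced here for illustration.

```latex
% DPO reward parameterization, with partition function Z(x):
\[
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
\]

% Same prompt: the Z(x) terms cancel exactly.
\[
r(x, y_w) - r(x, y_l)
  = \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\]

% Different prompts x_i and x_j: a residual term remains.
\[
r(x_i, y_{w,i}) - r(x_j, y_{l,j})
  = \beta \log \frac{\pi_\theta(y_{w,i} \mid x_i)}{\pi_{\mathrm{ref}}(y_{w,i} \mid x_i)}
  - \beta \log \frac{\pi_\theta(y_{l,j} \mid x_j)}{\pi_{\mathrm{ref}}(y_{l,j} \mid x_j)}
  + \beta \left[ \log Z(x_i) - \log Z(x_j) \right]
\]
```

Dropping the bracketed term is exact only when the two prompts share the same Z(x), which is why the approximation is more plausible for semantically close prompts.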
When Not To Use
- When you cannot run larger mini-batches or aggregate across GPUs
- If you lack a reliable sentence embedding model for your domain
- For workloads where true pairwise human labels are scarce and embedding similarity is unreliable
Failure Modes
- Weak embeddings pair unrelated prompts and inject noise, lowering alignment.
- Too-small batches produce sparse contrast matrices and worse results than DPO.
- Overfitting to the GPT-4 judge signal if that judge differs from target users.
Core Entities
Models
- LLaMA2-7B
- LLaMA2-13B
- Mistral-7B
Metrics
- GPT-4 win rate
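For reference, a win rate over judge verdicts can be computed as below. This is a sketch under stated assumptions: the label strings and the half-credit treatment of ties are hypothetical choices, not details given in this summary.

```python
def win_rate(judgments, tie_credit=0.5):
    """Fraction of comparisons the candidate model wins.

    judgments  : list of 'win', 'lose', or 'tie', from the candidate's side
    tie_credit : credit given to a tie (assumption: half a win)
    """
    if not judgments:
        raise ValueError("no judgments provided")
    credit = {"win": 1.0, "lose": 0.0, "tie": tie_credit}
    total = 0.0
    for verdict in judgments:
        if verdict not in credit:
            raise ValueError(f"unknown verdict: {verdict!r}")
        total += credit[verdict]
    return total / len(judgments)

print(win_rate(["win", "win", "lose", "tie"]))  # 0.625
```

When a judge model is used, it can help to randomize which response is shown first, since position may influence the verdict.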
Datasets
- Anthropic Helpful and Harmless (HH)
- OpenAI Summarization (Stiennon et al.)
- AlpacaEval2.0 (benchmark)
Benchmarks
- AlpacaEval2.0

