Learn preferences by contrasting responses across similar prompts, not just identical ones

February 12, 2024 (6 min read)

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

1

Authors

Yueqin Yin, Zhendong Wang, Yi Gu, Hai Huang, Weizhu Chen, Mingyuan Zhou

Why It Matters For Business

RPO lets you improve model alignment using both paired and cheaper unpaired preference data. In the reported tests it raises GPT-4-judged win rates on dialogue and summarization tasks while reducing the need for costly paired annotations.

Summary TLDR

RPO (Relative Preference Optimization) extends Direct Preference Optimization (DPO) by forming a contrast matrix across responses from identical and semantically related prompts. It weights each comparison by prompt similarity (computed from sentence embeddings), so models can learn from both paired and unpaired preference data. On dialogue and summarization tests (LLaMA2, Mistral), RPO improves GPT-4-judged win rates over DPO, especially with embedding-based reweighting and larger batch sizes. Key trade-offs: it needs a good embedding model and larger mini-batches, which cost GPU memory.

Problem Statement

Current pairwise preference tuning (e.g., DPO) only compares responses from the same prompt and ignores useful contrasts across related prompts. This limits learning from non-paired preference data and from semantic connections between different prompts.

Main Contribution

RPO method: build a contrast matrix across win/lose responses from identical and semantically related prompts.

Embedding-based reweighting: use prompt embeddings to upweight meaningful cross-prompt comparisons and downweight unrelated ones.

Practical evaluation: show that RPO (paired and unpaired modes) raises GPT-4 win rates on dialogue, summarization, and AlpacaEval2.0 versus DPO, IPO, KTO, and other baselines.
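
The contrast matrix and the embedding-based weights can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the weight form `exp(-cosine_distance / tau)` with row normalization, the division by the number of rows, and the names `contrast_weights`, `rpo_style_loss`, and `beta` are all assumptions made for the example. The inputs `win_logratios[i]` and `lose_logratios[j]` stand for the policy-versus-reference log-probability ratios of the preferred response to prompt i and the rejected response to prompt j.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def contrast_weights(win_prompt_embs, lose_prompt_embs, tau):
    """ASSUMED weighting: exp(-distance / tau), normalized so each row sums to 1."""
    weights = []
    for e_w in win_prompt_embs:
        row = [math.exp(-cosine_distance(e_w, e_l) / tau) for e_l in lose_prompt_embs]
        total = sum(row)
        weights.append([w / total for w in row])
    return weights

def rpo_style_loss(win_logratios, lose_logratios, weights, beta=0.1):
    """Weighted -log sigmoid(beta * (win_i - lose_j)) over the whole contrast matrix."""
    total = 0.0
    for i, win in enumerate(win_logratios):
        for j, lose in enumerate(lose_logratios):
            margin = beta * (win - lose)
            total -= weights[i][j] * log_sigmoid(margin)
    # Rows are normalized, so dividing by the row count gives a per-row average.
    return total / len(win_logratios)

# Toy usage with made-up 2-d "embeddings" and log-ratios (illustrative values only).
prompt_embs = [[1.0, 0.0], [0.0, 1.0]]
w = contrast_weights(prompt_embs, prompt_embs, tau=0.5)
loss = rpo_style_loss([0.4, 0.1], [-0.3, -0.2], w)
```

With identical prompts on the diagonal, the diagonal entries receive the largest weights, so same-prompt comparisons dominate and related prompts contribute in proportion to their similarity. A smaller `tau` sharpens that preference; a larger one moves the weights toward uniform.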

Key Findings

RPO (paired) outperforms DPO on dialogue (Mistral-7B).

Numbers: RPO-Paired 78.52% vs DPO 72.26% win rate (Anthropic-HH, Mistral-7B).

RPO can use unpaired preference data and still improve performance.

Numbers: RPO-Unpaired 75.00% vs DPO 72.26% win rate (Anthropic-HH, Mistral-7B).

Naive weighting hurts performance; semantic weighting helps.

Numbers: Uniform 68.36% and Diagonal 69.92% vs DPO 72.26% (Mistral-7B).

RPO benefits from larger per-GPU batch sizes.

Numbers: Win rate rises from 71.48% (batch 2) to 78.52% (batch 8) on Anthropic-HH (Mistral-7B).
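
One way to see why batch size matters is to count the comparisons a single mini-batch offers. The sketch below assumes the paired setting, where each item in the batch contributes one preferred and one rejected response, so the contrast matrix is `batch_size` by `batch_size`; the function name `contrast_pairs` is made up for this example.

```python
def contrast_pairs(batch_size):
    """Count win/lose comparisons in one mini-batch (paired setting assumed)."""
    total = batch_size * batch_size   # every win response against every lose response
    same_prompt = batch_size          # the diagonal: what DPO-style pairing uses
    return {
        "same_prompt": same_prompt,
        "cross_prompt": total - same_prompt,
        "total": total,
    }

small = contrast_pairs(2)   # 2 same-prompt, 2 cross-prompt, 4 total
large = contrast_pairs(8)   # 8 same-prompt, 56 cross-prompt, 64 total
```

The number of cross-prompt comparisons grows quadratically with the per-GPU batch, which is consistent with the reported gain between batch 2 and batch 8, although the count alone does not prove that this is the cause.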

Results

GPT-4 win rate (Anthropic-HH)

Value: 78.52%

Baseline: DPO 72.26%

GPT-4 win rate (Anthropic-HH, unpaired)

Value: 75.00%

Baseline: DPO 72.26%

GPT-4 win rate (AlpacaEval2.0)

Value: 38.88%

Baseline: DPO 30.84%

GPT-4 win rate (Summarization)

Value: 50.39%

Baseline: DPO 48.83%
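
To make the size of each improvement explicit, the win rates listed above can be turned into absolute gains in percentage points. The figures are copied from this section; the dictionary keys and the helper name `absolute_gains` are labels chosen for the example.

```python
# (RPO win rate %, DPO baseline %) as listed in the Results section.
RESULTS = {
    "Anthropic-HH, paired": (78.52, 72.26),
    "Anthropic-HH, unpaired": (75.00, 72.26),
    "AlpacaEval2.0": (38.88, 30.84),
    "Summarization": (50.39, 48.83),
}

def absolute_gains(results):
    """Gain over the DPO baseline in percentage points, rounded to 2 decimals."""
    return {name: round(rpo - dpo, 2) for name, (rpo, dpo) in results.items()}

gains = absolute_gains(RESULTS)
# Anthropic-HH paired: 6.26, unpaired: 2.74, AlpacaEval2.0: 8.04, Summarization: 1.56
```

The gain is largest on AlpacaEval2.0 and smallest on summarization, where the 1.56-point difference is small enough that it is worth checking against evaluation noise before relying on it.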

Who Should Care

What To Try In 7 Days

Run RPO-Unpaired on existing winner/loser logs with all-MiniLM-L6 prompt embeddings and τ≈0.75.

Compare outputs vs your current model using GPT-4 or a small human panel on 200 samples.

If using paired data, try RPO-Paired with τ≈0.5 and increase per-GPU batch size to 4–8 to see gains.
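
When comparing models on a 200-sample panel, it helps to attach an uncertainty band to the win rate before drawing conclusions. The sketch below uses a normal-approximation 95% interval and assumes every comparison ends in a clear win or loss (no ties); the function name `win_rate_with_interval` and the example count of 110 wins are hypothetical.

```python
import math

def win_rate_with_interval(wins, total, z=1.96):
    """Win rate with a normal-approximation confidence interval (no ties assumed)."""
    rate = wins / total
    half_width = z * math.sqrt(rate * (1.0 - rate) / total)
    return rate, max(0.0, rate - half_width), min(1.0, rate + half_width)

# Hypothetical outcome: the new model wins 110 of 200 head-to-head comparisons.
rate, low, high = win_rate_with_interval(110, 200)
```

At 200 samples the half-width is roughly 7 percentage points, so a 55% win rate still has an interval that includes 50%. Differences of only a few points need a larger sample to be convincing.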

Optimization Features

Infra Optimization

  • Requires larger batch sizes or cross-GPU aggregation for best results

Training Optimization

  • Contrast-matrix training across mini-batch win/lose pairs
  • Embedding-based reweighting of comparisons

Reproducibility

Data Urls

  • Anthropic HH dataset (publicly referenced)
  • OpenAI Summarization dataset (publicly referenced)
  • AlpacaEval2.0 (public benchmark)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on the quality of the sentence embedding model to find meaningful prompt pairs.
  • Contrast matrix size is limited by per-GPU mini-batch memory; large batches or cross-GPU aggregation are needed.
  • Assumes that differences in the normalization term Z(x) between prompts are negligible; this is not fully modeled and could matter for diverse prompts.
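
The Z(x) limitation follows from the standard DPO reparameterization of the reward. The notation below (\beta, \pi_\theta, \pi_{\mathrm{ref}}, and the subscripts w and l for preferred and rejected responses) is DPO's usual notation and is not defined in this summary, so treat it as an assumption about how the terms line up.

```latex
\begin{align*}
% Implied reward, with Z(x) the prompt-dependent normalization term
r(x, y) &= \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\[4pt]
% Same prompt: the Z(x) terms cancel
r(x, y_w) - r(x, y_l) &= \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \\[4pt]
% Different prompts: a residual term in Z remains
r(x_i, y_{w,i}) - r(x_j, y_{l,j}) &= \beta \log \frac{\pi_\theta(y_{w,i} \mid x_i)}{\pi_{\mathrm{ref}}(y_{w,i} \mid x_i)}
  - \beta \log \frac{\pi_\theta(y_{l,j} \mid x_j)}{\pi_{\mathrm{ref}}(y_{l,j} \mid x_j)}
  + \beta \left[ \log Z(x_i) - \log Z(x_j) \right]
\end{align*}
```

Comparing responses across prompts while using only the log-ratio terms amounts to dropping the final bracket. That is harmless when the two prompts have similar Z(x), and less so when the prompts are very different.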

When Not To Use

  • When you cannot run larger mini-batches or aggregate across GPUs
  • If you lack a reliable sentence embedding model for your domain
  • For workloads where true pairwise human labels are scarce and embedding similarity is unreliable

Failure Modes

  • Weak embeddings pair unrelated prompts and inject noise, lowering alignment.
  • Too-small batches produce sparse contrast matrices and worse results than DPO.
  • Overfitting to the GPT-4 judge signal if that judge differs from target users.

Core Entities

Models

  • LLaMA2-7B
  • LLaMA2-13B
  • Mistral-7B

Metrics

  • GPT-4 win rate

Datasets

  • Anthropic Helpful and Harmless (HH)
  • OpenAI Summarization (Stiennon et al.)
  • AlpacaEval2.0 (benchmark)

Benchmarks

  • AlpacaEval2.0