Create short, human-readable persona prompts from a few user preference pairs to improve personalized reward judgments

June 5, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

0

Authors

Michael J Ryan, Omar Shaikh, Aditri Bhagirath, Daniel Frees, William Held, Diyi Yang

Why It Matters For Business

SynthesizeMe yields interpretable persona prompts that improve in-context judgment of user preferences without full model finetuning; useful when collecting a few pairwise judgments is feasible but large-scale retraining is not.

Summary TLDR

SynthesizeMe is a three-step method that uses an LLM to (1) generate reasoning about a user's pairwise preferences, (2) synthesize a short natural-language persona from that reasoning, and (3) pick informative prior examples to include as demonstrations. The resulting persona-based prompt improves in-context LLM-as-a-judge accuracy (up to +4.4% on Chatbot Arena; +3.41% on PRISM) and helps produce state-of-the-art personalized reward-model performance on a new benchmark (PersonalRewardBench) built from filtered Chatbot Arena and PRISM users. The method is interpretable, transfers across models, and benefits from prompt optimization and distillation from larger LLMs.
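The three steps above can be sketched as plain functions. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the `PreferencePair` type, the function names, and all prompt wording are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the user preferred
    rejected: str  # response the user did not prefer


# Hypothetical LLM interface: takes a prompt string, returns generated text.
LLM = Callable[[str], str]


def bootstrap_reasoning(llm: LLM, pairs: List[PreferencePair]) -> List[str]:
    """Step 1: ask the LLM to explain each observed preference."""
    return [
        llm(
            f"User prompt: {p.prompt}\n"
            f"Preferred: {p.chosen}\n"
            f"Rejected: {p.rejected}\n"
            "Explain why this user might prefer the first response."
        )
        for p in pairs
    ]


def synthesize_persona(llm: LLM, reasoning: List[str]) -> str:
    """Step 2: compress the per-pair reasoning into a short persona."""
    joined = "\n".join(f"- {r}" for r in reasoning)
    return llm("Summarize these observations into a short persona:\n" + joined)


def build_judge_prompt(
    persona: str,
    demos: List[PreferencePair],
    prompt: str,
    response_a: str,
    response_b: str,
) -> str:
    """Step 3: combine persona and selected demonstrations into a judge prompt.

    For brevity the chosen response is always shown as "A" in the demos;
    a real pipeline would likely randomize the order to avoid position bias.
    """
    shots = "\n\n".join(
        f"Prompt: {d.prompt}\nA: {d.chosen}\nB: {d.rejected}\nUser prefers: A"
        for d in demos
    )
    return (
        f"Persona: {persona}\n\n{shots}\n\n"
        f"Prompt: {prompt}\nA: {response_a}\nB: {response_b}\nUser prefers:"
    )
```

Any function that maps a string to a string can stand in for `llm`, which keeps the sketch independent of a particular provider or framework.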

Problem Statement

Modern reward models train on pooled human preferences, but real users have diverse, individual tastes. Learning a personalized reward model from few pairwise comparisons per user faces two problems: (1) data scarcity (typically 5–15 pairs) and (2) preference attribution (why did the user choose one response?). We need a way to bootstrap interpretable personal signals from limited pairwise feedback and use them to personalize reward judgments without heavy finetuning.

Main Contribution

Define personalized reward modeling from user-level pairwise preferences and formalize evaluation on per-user target sets.

Propose SynthesizeMe: bootstrap chain-of-thought reasoning over user pairs, synthesize a natural-language persona, and select informative demonstrations to build a personalized prompt.

Introduce PersonalRewardBench, a set of user-stratified, filtered splits from Chatbot Arena and PRISM for evaluating personalized reward models, and release optimized persona prompts.
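Evaluation on per-user target sets can be pictured as scoring a judge separately for each user. The sketch below is illustrative only: the `Pair` and `Judge` types and both function names are assumptions, and this summary does not say whether the benchmark averages per user or over all pairs, so the per-user (macro) average here is one possible choice.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# (prompt, response_a, response_b, label), where label is "A" or "B".
Pair = Tuple[str, str, str, str]
# Hypothetical judge: (user_id, prompt, response_a, response_b) -> "A" or "B".
Judge = Callable[[str, str, str, str], str]


def per_user_accuracy(judge: Judge, targets: Dict[str, List[Pair]]) -> Dict[str, float]:
    """Fraction of held-out target pairs the judge gets right, per user."""
    scores: Dict[str, float] = {}
    for user_id, pairs in targets.items():
        if not pairs:
            continue  # skip users with no target pairs
        correct = sum(judge(user_id, p, a, b) == label for p, a, b, label in pairs)
        scores[user_id] = correct / len(pairs)
    return scores


def macro_accuracy(scores: Dict[str, float]) -> float:
    """Average of per-user accuracies, so every user counts equally."""
    return mean(scores.values())
```

Averaging per user keeps a handful of very active users from dominating the score, which matters when pair counts differ widely between users.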

Key Findings

SynthesizeMe boosts LLM-as-a-judge accuracy on Chatbot Arena.

Numbers: up to +4.4% absolute accuracy (Chatbot Arena)

SynthesizeMe also improves performance on PRISM.

Numbers: +3.41% absolute accuracy (PRISM)

Finetuned in-distribution reward models remain strongest when large data exists.

Numbers: Bradley–Terry reward models ≈ 61–72% accuracy vs. LLM-as-a-judge ≈ 52–62% (varies by model)

Persona fidelity grows with model size.

Numbers: persona 'true match' rate rises 26.5% → 50.2% (3B → 8B) and 50.2% → 56.1% (8B → 70B)

More context preferences steadily improve accuracy.

Numbers: ≈ +0.8% absolute accuracy per extra context preference on Chatbot Arena
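For readers unfamiliar with the Bradley–Terry reward models mentioned in the findings above: the standard Bradley–Terry formulation turns two scalar rewards into a preference probability via a sigmoid of their difference and trains on the negative log-likelihood. The sketch below shows only that textbook formula; the function names are illustrative, and nothing here reflects the specific training setup used in the paper.

```python
import math


def bt_preference_prob(r_chosen: float, r_rejected: float) -> float:
    """P(chosen is preferred) = sigmoid(r_chosen - r_rejected)."""
    d = r_chosen - r_rejected
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)  # avoids overflow for large negative differences
    return e / (1.0 + e)


def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of the observed preference, -log sigmoid(d)."""
    d = r_chosen - r_rejected
    if d > 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))
```

Equal rewards give a probability of 0.5, and the loss shrinks as the chosen response's reward pulls ahead of the rejected one's.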

Results

Accuracy (Chatbot Arena, SynthesizeMe)

Value: up to +4.4% absolute

Baseline: default LLM-as-a-judge prompts

Accuracy (PRISM, SynthesizeMe)

Value: +3.41% absolute

Baseline: default LLM-as-a-judge prompts

Accuracy (finetuned Bradley–Terry reward models)

Value: ≈ 61–72% (varies by model size and dataset)

Baseline: LLM-as-a-judge methods (≈ 52–62%)

Persona fidelity (match rate to stated preferences)

Value: 26.5% → 50.2% → 56.1% as model size goes 3B → 8B → 70B

Baseline: random pairing

Accuracy (per extra context preference, Chatbot Arena)

Value: ≈ +0.8% absolute per extra preference

Baseline: users with fewer context preferences

What To Try In 7 Days

Collect 5–15 pairwise preference examples per active user and store them.

Run SynthesizeMe persona synthesis (bootstrap reasoning → persona → demos) with a mid-size LLM and test its LLM-as-a-judge accuracy on held-out pairs.

If you have a large model, distill the persona-generation prompt and deploy the distilled prompt with smaller models for cost savings.
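To test on held-out pairs as suggested above, each user's stored pairs need to be split into a context set (used to synthesize the persona) and a held-out set (used only for scoring). A minimal sketch follows; the function name, the 30% holdout fraction, and the random split are assumptions for illustration, and a chronological split may be more realistic if timestamps are available.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_user_pairs(
    pairs: Sequence[T],
    holdout_fraction: float = 0.3,  # assumed value, not from the paper
    seed: int = 0,
) -> Tuple[List[T], List[T]]:
    """Return (context_pairs, heldout_pairs) for a single user."""
    if len(pairs) < 2:
        raise ValueError("need at least 2 pairs to form both splits")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    n_holdout = min(n_holdout, len(shuffled) - 1)  # keep at least 1 context pair
    return shuffled[n_holdout:], shuffled[:n_holdout]
```

Splitting per user, rather than pooling all pairs first, keeps one user's held-out judgments from leaking into their own persona.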

Agent Features

Memory

  • uses prior user interactions as context (short-term preference history)

Frameworks

  • DSPy-based prompting pipeline

Optimization Features

Token Efficiency

  • limit bootstrapped reasoning + demos to budgeted trials (n=10, m=10)
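This summary does not define n and m. The sketch below assumes one plausible reading: n is the number of candidate demonstration sets tried, and m is the maximum number of demonstrations per set. Under that assumption, a budgeted random search over demo sets could look like the following; `select_demos` and the `score` callback (for example, judge accuracy on a user's validation pairs) are hypothetical names for this example.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def select_demos(
    candidates: Sequence[T],
    score: Callable[[List[T]], float],
    n_trials: int = 10,   # assumed meaning of n: number of demo sets tried
    max_demos: int = 10,  # assumed meaning of m: cap on demos per set
    seed: int = 0,
) -> Tuple[List[T], float]:
    """Try up to n_trials random demo sets and keep the best-scoring one."""
    rng = random.Random(seed)
    best_set: List[T] = []
    best_score = score([])  # zero-shot baseline: no demonstrations
    if not candidates:
        return best_set, best_score
    pool = list(candidates)
    for _ in range(n_trials):
        k = rng.randint(1, min(max_demos, len(pool)))
        demo_set = rng.sample(pool, k)
        s = score(demo_set)
        if s > best_score:
            best_set, best_score = demo_set, s
    return best_set, best_score
```

Capping both the number of trials and the set size bounds the number of LLM calls and the prompt length, which is where the token savings come from.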

Model Optimization

  • distill persona prompt Θ from larger LLM to smaller LLMs

System Optimization

  • optimize persona generation prompt with MIPROv2

Training Optimization

  • LoRA

Inference Optimization

  • use distilled persona prompts for cheaper inference

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires pairwise preference data; cannot run without user judgments.
  • Evaluated primarily in low-data personal settings (typically <25 pairs per user).
  • Improvements for fine-tuned reward models are small and sometimes within confidence intervals.
  • Potential ethical harms (amplification, sycophancy, anthropomorphism) when tailoring outputs to individuals.
  • Unclear if PersonalRewardBench and prompt artifacts are publicly released for reproduction.

When Not To Use

  • You have abundant in-distribution preference logs and can train a fine-tuned reward model.
  • You cannot collect any pairwise preference labels from users.
  • Personalization poses unacceptable privacy or safety risks without strong oversight.

Failure Modes

  • Synthesized persona does not reflect true user preferences (noisy or adversarial labels).
  • Overfitting to a few examples yields degraded generalization to new queries.
  • Distilled persona prompts from small teachers fail to transfer to larger student models.
  • Persona-driven prompts amplify user biases or reinforce extreme preferences.

Core Entities

Models

  • Llama-3.2-3B
  • Llama-3.1-8B
  • Llama-3.3-70B
  • GPT-4o-mini
  • Gemini-2.5
  • Qwen3-8B

Metrics

  • Accuracy
  • persona match rate
  • median preference pairs

Datasets

  • Chatbot Arena
  • PRISM
  • PersonalRewardBench

Benchmarks

  • PersonalRewardBench

Context Entities

Models

  • Gemini-2.0-Flash
  • Qwen3-30B
  • Qwen3-32B