Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
SynthesizeMe yields interpretable persona prompts that improve in-context judgment of user preferences without full model finetuning; useful when collecting a few pairwise judgments is feasible but large-scale retraining is not.
Summary TLDR
SynthesizeMe is a three-step method that uses an LLM to (1) generate reasoning about a user's pairwise preferences, (2) synthesize a short natural-language persona from that reasoning, and (3) pick informative prior examples to include as demonstrations. The resulting persona-based prompt improves in-context LLM-as-a-judge accuracy (up to +4.4% on Chatbot Arena; +3.41% on PRISM) and helps produce state-of-the-art personalized reward-model performance on a new benchmark (PersonalRewardBench) built from filtered Chatbot Arena and PRISM users. The method is interpretable, transfers across models, and benefits from prompt optimization and distillation from larger LLMs.
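The three steps can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `PreferencePair` record, the `LLM` callable type, the prompt wording, and the "take the first k" demo selection are all assumptions made for the example. The source says demonstrations are chosen to be informative but does not give the criterion here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    """One pairwise judgment from a user (hypothetical record layout)."""
    prompt: str
    chosen: str
    rejected: str


# Hypothetical LLM interface: takes a prompt string, returns generated text.
LLM = Callable[[str], str]


def bootstrap_reasoning(llm: LLM, pairs: List[PreferencePair]) -> List[str]:
    """Step 1: ask the LLM why the user preferred the chosen response."""
    return [
        llm(
            f"Prompt: {p.prompt}\nPreferred: {p.chosen}\nRejected: {p.rejected}\n"
            "Explain briefly why this user preferred the first response."
        )
        for p in pairs
    ]


def synthesize_persona(llm: LLM, reasonings: List[str]) -> str:
    """Step 2: condense the per-pair reasoning into a short persona."""
    joined = "\n".join(f"- {r}" for r in reasonings)
    return llm(
        f"Observations about one user:\n{joined}\n"
        "Write a short persona describing this user's preferences."
    )


def select_demos(pairs: List[PreferencePair], k: int) -> List[PreferencePair]:
    """Step 3 placeholder: keeps the first k pairs (assumed, not the paper's rule)."""
    return pairs[:k]


def build_personalized_prompt(llm: LLM, pairs: List[PreferencePair], k: int = 3) -> str:
    """Combine the persona and the selected demonstrations into one prompt."""
    persona = synthesize_persona(llm, bootstrap_reasoning(llm, pairs))
    demo_text = "\n\n".join(
        f"Prompt: {d.prompt}\nPreferred: {d.chosen}\nRejected: {d.rejected}"
        for d in select_demos(pairs, k)
    )
    return f"User persona:\n{persona}\n\nPast preferences:\n{demo_text}"
```

Passing the LLM in as a plain callable keeps the sketch independent of any particular client library, so the same functions can be exercised with a stub during testing.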
Problem Statement
Modern reward models train on pooled human preferences, but real users have diverse, individual tastes. Learning a personalized reward model from few pairwise comparisons per user faces two problems: (1) data scarcity (typically 5–15 pairs) and (2) preference attribution (why did the user choose one response?). We need a way to bootstrap interpretable personal signals from limited pairwise feedback and use them to personalize reward judgments without heavy finetuning.
Main Contribution
Define personalized reward modeling from user-level pairwise preferences and formalize evaluation on per-user target sets.
Propose SynthesizeMe: bootstrap chain-of-thought reasoning over user pairs, synthesize a natural-language persona, and select informative demonstrations to build a personalized prompt.
Introduce PersonalRewardBench, a set of user-stratified, filtered splits from Chatbot Arena and PRISM for evaluating personalized reward models, and release optimized persona prompts.
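A user-stratified, filtered split could look something like the sketch below. It illustrates the general idea only: the `min_pairs` threshold, the `test_frac` value, and the `(user_id, pair)` record layout are assumptions for the example, not the benchmark's actual filtering rules.

```python
import random
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def user_stratified_split(
    records: List[Tuple[str, Any]],
    min_pairs: int = 5,      # assumed filter threshold, not from the source
    test_frac: float = 0.2,  # assumed test fraction, not from the source
    seed: int = 0,
) -> Tuple[Dict[str, List[Any]], Dict[str, List[Any]]]:
    """Split by user so that no user appears in both train and test.

    records: (user_id, pair) tuples, where `pair` is any preference record.
    Users with fewer than `min_pairs` records are dropped entirely.
    """
    by_user: Dict[str, List[Any]] = defaultdict(list)
    for user_id, pair in records:
        by_user[user_id].append(pair)

    # Filter, then sort before shuffling so the split is reproducible.
    eligible = sorted(u for u, ps in by_user.items() if len(ps) >= min_pairs)
    random.Random(seed).shuffle(eligible)

    n_test = max(1, round(len(eligible) * test_frac)) if eligible else 0
    test_users = set(eligible[:n_test])

    train = {u: by_user[u] for u in eligible if u not in test_users}
    test = {u: by_user[u] for u in test_users}
    return train, test
```

Splitting at the user level, instead of at the pair level, means test accuracy reflects generalization to unseen users and not to unseen pairs from users already observed.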
Key Findings
SynthesizeMe boosts LLM-as-a-judge accuracy on Chatbot Arena (up to +4.4%).
SynthesizeMe also improves accuracy on PRISM (+3.41%).
Fine-tuned, in-distribution reward models remain the strongest option when abundant preference data is available.
Persona fidelity grows with model size.
Accuracy improves steadily as more of a user's preference pairs are supplied as context.
Results
Accuracy
Persona fidelity (match rate to stated preferences)
Who Should Care
What To Try In 7 Days
Collect 5–15 pairwise preference examples per active user and store them.
Run SynthesizeMe persona synthesis (bootstrap reasoning → persona → demos) with a mid-size LLM and test its LLM-as-a-judge accuracy on held-out pairs.
If you have a large model, distill the persona-generation prompt and deploy the distilled prompt with smaller models for cost savings.
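For the held-out test in the second step, a small evaluation harness might look like the sketch below. The `Pair` tuple layout, the `Judge` signature, and the "last n pairs are the target set" split are assumptions made for illustration; any judge (for example, an LLM call that receives the personalized prompt) can be plugged in as long as it returns 0 or 1.

```python
from typing import Callable, List, Sequence, Tuple

# (prompt, response_a, response_b, label); label is 0 if the user
# preferred response_a and 1 if the user preferred response_b.
Pair = Tuple[str, str, str, int]

# Hypothetical judge interface:
# (personalized_prompt, prompt, response_a, response_b) -> 0 or 1
Judge = Callable[[str, str, str, str], int]


def split_context_target(
    pairs: Sequence[Pair], n_target: int
) -> Tuple[List[Pair], List[Pair]]:
    """Hold out the last n_target pairs for evaluation; the rest are context."""
    if not 0 < n_target < len(pairs):
        raise ValueError("n_target must leave at least one context pair")
    return list(pairs[:-n_target]), list(pairs[-n_target:])


def heldout_accuracy(
    judge: Judge, personalized_prompt: str, target: Sequence[Pair]
) -> float:
    """Fraction of held-out pairs where the judge agrees with the user."""
    correct = sum(
        judge(personalized_prompt, q, a, b) == label for q, a, b, label in target
    )
    return correct / len(target)
```

Running the same harness with and without the persona in `personalized_prompt` gives a direct per-user comparison of whether the synthesized persona helps.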
Agent Features
Memory
- uses prior user interactions as context (short-term preference history)
Frameworks
- DSPy-based prompting pipeline
Optimization Features
Token Efficiency
- limit bootstrapped reasoning + demos to budgeted trials (n=10, m=10)
Model Optimization
- distill persona prompt Θ from larger LLM to smaller LLMs
System Optimization
- optimize persona generation prompt with MIPROv2
Training Optimization
- LoRA
Inference Optimization
- use distilled persona prompts for cheaper inference
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires pairwise preference data; cannot run without user judgments.
- Evaluated primarily in low-data personal settings (typically <25 pairs per user).
- Improvements for fine-tuned reward models are small and sometimes within confidence intervals.
- Potential ethical harms (amplification, sycophancy, anthropomorphism) when tailoring outputs to individuals.
- Unclear if PersonalRewardBench and prompt artifacts are publicly released for reproduction.
When Not To Use
- You have abundant in-distribution preference logs and can train a fine-tuned reward model.
- You cannot collect any pairwise preference labels from users.
- Personalization poses unacceptable privacy or safety risks without strong oversight.
Failure Modes
- Synthesized persona does not reflect true user preferences (noisy or adversarial labels).
- Overfitting to a few examples yields degraded generalization to new queries.
- Distilled persona prompts from small teachers fail to transfer to larger student models.
- Persona-driven prompts amplify user biases or reinforce extreme preferences.
Core Entities
Models
- Llama-3.2-3B
- Llama-3.1-8B
- Llama-3.3-70B
- GPT-4o-mini
- Gemini-2.5
- Qwen3-8B
Metrics
- Accuracy
- persona match rate
- median preference pairs
Datasets
- Chatbot Arena
- PRISM
- PersonalRewardBench
Benchmarks
- PersonalRewardBench
Context Entities
Models
- Gemini-2.0-Flash
- Qwen3-30B
- Qwen3-32B

