Reward models that follow natural-language principles to generalize across preferences

June 4, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

Strongly validated on two benchmarks and ablations; practical but sensitive to principle quality and phrasing.

Citations: 1

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 75%

Authors

Zhuohao Yu, Jiali Zeng, Weizheng Gu, Yidong Wang, Jindong Wang, Fandong Meng, Jie Zhou, Yue Zhang, Shikun Zhang, Wei Ye

Links

Abstract / PDF / Code / Data

Why It Matters For Business

A single RM that follows user-written principles lets teams switch evaluation goals quickly (e.g., prioritize accuracy or brevity) without costly relabeling, speeding product iteration and reducing alignment bias.

Who Should Care

Summary TLDR

RewardAnything is a reward model trained to follow explicit natural-language principles (short rules like “accuracy first”) so it can score and rank many candidate responses without retraining. The paper introduces RABENCH, a 1,002-case benchmark to test this ability, shows RewardAnything achieves state-of-the-art on RM-Bench (using principles to reduce bias), generalizes well to unseen principles on RABENCH, and can be used as the sole reward source to align an LLM in practice.

Problem Statement

Current reward models learn a fixed, implicit preference from pairwise labels. They struggle to adapt to new or conflicting user goals (e.g., brevity vs detail) and can inherit biases. Collecting new preferences and retraining is costly. The paper asks: can a single RM follow arbitrary natural-language principles at inference time to adapt to diverse preferences without retraining?

Main Contribution

Define and formalize principle-following reward modeling and curate 200 concrete principles covering content, structure, tone, logic, and style.

Release RABENCH, a benchmark of 1,002 human-verified principle+prompt ranking tasks to measure generalization to novel principles.

Key Findings

RewardAnything achieves state-of-the-art on RM-Bench when given an explicit principle.

Numbers: 86.4% overall accuracy on RM-Bench (Table 2)

Practical Use: Pass a clear principle like “accuracy only” to the RM to reduce dataset-induced biases; you can reach SotA RM performance without collecting new preference labels.

Evidence Ref: Table 2

RewardAnything generalizes to unseen natural-language principles on RABENCH.

Numbers: 81.9% pairwise ranking accuracy; Kendall's τ = 65.27; NDCG = 97.84 (Table 3)

Practical Use: You can reuse one RM across different evaluation criteria: score or rank against a new objective by swapping the principle, with no retraining.

Evidence Ref: Table 3
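The three ranking metrics quoted for RABENCH can be illustrated with a minimal sketch. It assumes each candidate has a gold score and a predicted score, uses Kendall's tau-a (no tie correction), and uses NDCG with linear gain; the paper's exact variants are not stated here, and the reported 65.27 and 97.84 appear to be values scaled by 100, which is also an assumption. The function names are illustrative only.

```python
from itertools import combinations
from math import log2


def pairwise_accuracy(gold, pred):
    """Share of candidate pairs that gold and predicted scores order the same way.

    Pairs tied in the gold scores are skipped.
    """
    agree = total = 0
    for i, j in combinations(range(len(gold)), 2):
        if gold[i] == gold[j]:
            continue
        total += 1
        if (gold[i] - gold[j]) * (pred[i] - pred[j]) > 0:
            agree += 1
    return agree / total if total else 0.0


def kendall_tau(gold, pred):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    pairs = list(combinations(range(len(gold)), 2))
    concordant = discordant = 0
    for i, j in pairs:
        sign = (gold[i] - gold[j]) * (pred[i] - pred[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs) if pairs else 0.0


def ndcg(gold, pred):
    """NDCG with linear gain: order candidates by pred, read gains from gold."""
    order = sorted(range(len(gold)), key=lambda i: pred[i], reverse=True)
    dcg = sum(gold[i] / log2(rank + 2) for rank, i in enumerate(order))
    ideal = sum(g / log2(rank + 2)
                for rank, g in enumerate(sorted(gold, reverse=True)))
    return dcg / ideal if ideal else 0.0


# Made-up scores for four candidates under one principle.
gold = [5, 3, 4, 1]
pred = [0.9, 0.2, 0.4, 0.3]
print(pairwise_accuracy(gold, pred), kendall_tau(gold, pred), ndcg(gold, pred))
```

In this toy case one of the six pairs is ordered incorrectly, so pairwise accuracy is 5/6 and tau is 4/6, while NDCG stays close to 1 because the top-ranked candidates are still correct. This is one reason NDCG figures tend to sit much higher than tau on the same predictions.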

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 86.4% | best prior generative/discriminative RMs in Table 2 | ≈ +26% vs strong baselines | RM-Bench | REWARDANYTHING-8B overall 86.4% (Table 2) | Table 2 |
| Accuracy | 81.9% | GPT-4.1 82.5% (Table 3) | comparable to top LLM evaluators | RABENCH | REWARDANYTHING overall 81.9; Kendall's τ 65.27; NDCG 97.84 (Table 3) | Table 3 |

What To Try In 7 Days

Run RewardAnything or an instructable RM to score a production prompt set using one clear principle (e.g., 'accuracy first') and compare with your current RM.

Create 10–50 targeted principles (priority + structured rules) and test how outputs rank differently—identify a principle that matches your product goal.

Use RewardAnything as the reward signal for a short GRPO alignment run on a small model with 2k prompts to prototype updated safety or tone behavior.
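The first experiment above (score a prompt set under one principle and compare against the current RM) could be prototyped with a small harness like the following. This is a sketch under stated assumptions: the `Scorer` signature is hypothetical and is not RewardAnything's actual API, and the two stub scorers are toy stand-ins that only look at response length so the example runs without a model.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer signature: (principle, prompt, candidates) -> one score per candidate.
Scorer = Callable[[str, str, Sequence[str]], List[float]]


def top_choice(scores: Sequence[float]) -> int:
    """Index of the highest-scoring candidate (the first index wins ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def top1_agreement(
    items: Sequence[Tuple[str, Sequence[str]]],
    principle: str,
    scorer_a: Scorer,
    scorer_b: Scorer,
) -> Tuple[float, List[Tuple[str, str, str]]]:
    """Return the share of prompts where both scorers pick the same winner,
    plus the disagreements as (prompt, choice_a, choice_b) for manual review."""
    if not items:
        raise ValueError("items must not be empty")
    same = 0
    disagreements = []
    for prompt, candidates in items:
        a = top_choice(scorer_a(principle, prompt, candidates))
        b = top_choice(scorer_b(principle, prompt, candidates))
        if a == b:
            same += 1
        else:
            disagreements.append((prompt, candidates[a], candidates[b]))
    return same / len(items), disagreements


# Toy stand-ins (assumptions): replace with calls to the principle-following RM
# and to your current RM.
def brevity_stub(principle, prompt, candidates):
    return [-float(len(c)) for c in candidates]


def verbosity_stub(principle, prompt, candidates):
    return [float(len(c)) for c in candidates]


items = [
    ("What is 2 + 2?", ["4", "The answer is 4 because 2 + 2 = 4."]),
    ("Say hi.", ["Hello!", "Hello!"]),
]
rate, diffs = top1_agreement(items, "Prefer concise answers.",
                             brevity_stub, verbosity_stub)
print(rate, diffs)
```

Reading the disagreement list by hand is usually more informative than the agreement rate alone, since it shows which of the two scorers tracks the stated principle on the prompts where they differ.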

Agent Features

Tool Use: GRPO; single-call listwise scoring (Θ(1) LLM calls)
Frameworks: GRPO; GRPL; PPO-style KL regularization
Architectures: generative RM (listwise); LLM backbone (e.g., Qwen3-8B)
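GRPO (Group Relative Policy Optimization) appears above as the training framework. As a rough sketch of its group-relative step in the standard formulation, and not necessarily the paper's exact configuration: rewards for a group of responses sampled for the same prompt are normalized by the group mean and standard deviation, so no separate value network is needed. The use of the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    eps avoids division by zero when every response gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up reward-model scores for three responses to the same prompt.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(advantages)
```

Responses scored above the group average get a positive advantage and those below get a negative one; a group with identical rewards yields all zeros and therefore contributes no learning signal.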

Optimization Features

Token Efficiency: Θ(1) LLM calls and Θ(n) tokens to score n candidates (Table 6)
Infra Optimization: trained on NVIDIA A100 cluster; consumer GPUs viable for inference
Model Optimization: generative listwise scoring model
System Optimization: use vLLM for faster inference
Training Optimization: GRPO; listwise preference learning; Accuracy
Inference Optimization: one-shot listwise scoring (Θ(1) calls); inference-time scaling for reasoning
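The Θ(1)-call claim follows from putting the principle, the prompt, and all n candidates into one request, rather than issuing one request per candidate or per pair. The sketch below illustrates the idea only: the prompt layout and the JSON reply format are hypothetical and are not RewardAnything's actual template.

```python
import json
from typing import List, Sequence


def build_listwise_prompt(principle: str, prompt: str,
                          candidates: Sequence[str]) -> str:
    """Pack the principle, the prompt, and every candidate into one request.

    The layout is a hypothetical illustration, not the model's real template.
    """
    lines = [f"Principle: {principle}", f"Prompt: {prompt}", "Candidates:"]
    lines += [f"[{i}] {c}" for i, c in enumerate(candidates, 1)]
    lines.append('Return JSON mapping each candidate number to a score, '
                 'e.g. {"1": 4, "2": 2}.')
    return "\n".join(lines)


def parse_scores(raw: str, n: int) -> List[float]:
    """Read one score per candidate from the (assumed) JSON reply."""
    data = json.loads(raw)
    return [float(data[str(i)]) for i in range(1, n + 1)]


def calls_needed(n: int, mode: str) -> int:
    """Number of model calls required to score n candidates."""
    if mode == "listwise":
        return 1                  # one request covers all candidates
    if mode == "pointwise":
        return n                  # one request per candidate
    if mode == "pairwise":
        return n * (n - 1) // 2   # one request per unordered pair
    raise ValueError(f"unknown mode: {mode}")


request = build_listwise_prompt("Accuracy first.", "What is 2 + 2?",
                                ["4", "5", "Four"])
scores = parse_scores('{"1": 5, "2": 1, "3": 4}', 3)
print(calls_needed(8, "listwise"), calls_needed(8, "pairwise"))
```

The single request still grows linearly in length with the number of candidates, which matches the Θ(n) token figure, so very long candidate lists remain bounded by the backbone's context window.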

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Performance depends on principle clarity and explicit priority between goals.

Training used synthetic consensus data, so some distributional gaps relative to fully human-labeled sets may remain.

When Not To Use

When you cannot craft a clear, prioritized principle for the task.

When extreme low-latency scoring is required and even single-call LLM inference is too slow.

Failure Modes

Mis-specified or ambiguous principles produce unpredictable rankings.

Adversarial or malicious principles can steer models undesirably.

Core Entities

Models

RewardAnything-8B, GRPO, Qwen3-8B, GPT-4.1, Claude-3.7 Sonnet, DeepSeek-V3, Skywork-Reward-Llama-3.1-8B-v0.2, RM-R1-DeepSeek-Distilled-Qwen-32B

Metrics

Accuracy, Kendall's τ, NDCG, score variance, refusal rate (safety)

Datasets

RABENCH, RM-Bench, RewardBench, Skywork-Reward training set, PKU-SafeRLHF, XSTest, MT-Bench

Benchmarks

RABENCH, RM-Bench, RewardBench, XSTest, MT-Bench