Reward models that follow natural-language principles to generalize across preferences

June 4, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.75

Cost Impact Score

0.6

Citation Count

1

Authors

Zhuohao Yu, Jiali Zeng, Weizheng Gu, Yidong Wang, Jindong Wang, Fandong Meng, Jie Zhou, Yue Zhang, Shikun Zhang, Wei Ye

Why It Matters For Business

A single RM that follows user-written principles lets teams switch evaluation goals quickly (e.g., prioritize accuracy or brevity) without costly relabeling, speeding product iteration and reducing alignment bias.

Summary TLDR

RewardAnything is a reward model trained to follow explicit natural-language principles (short rules like “accuracy first”), so it can score and rank many candidate responses without retraining. The paper introduces RABENCH, a 1,002-case benchmark that tests this ability. It reports that RewardAnything achieves state-of-the-art results on RM-Bench (using principles to reduce bias), generalizes well to unseen principles on RABENCH, and can serve as the sole reward source when aligning an LLM in practice.

Problem Statement

Current reward models learn a fixed, implicit preference from pairwise labels. They struggle to adapt to new or conflicting user goals (e.g., brevity vs detail) and can inherit biases. Collecting new preferences and retraining is costly. The paper asks: can a single RM follow arbitrary natural-language principles at inference time to adapt to diverse preferences without retraining?

Main Contribution

Define and formalize principle-following reward modeling and curate 200 concrete principles covering content, structure, tone, logic, and style.

Release RABENCH, a benchmark of 1,002 human-verified principle+prompt ranking tasks to measure generalization to novel principles.

Build RewardAnything: a generative, listwise reward model trained with GRPO and custom listwise rewards. It scores and ranks many candidates in one call and can steer RLHF without retraining the RM.
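The single-call, listwise design can be illustrated with a minimal Python sketch. The prompt layout, the JSON reply schema, and the function names `build_listwise_prompt` and `parse_scores` are assumptions made for illustration; they are not the paper's actual template or API.

```python
import json


def build_listwise_prompt(principle, prompt, candidates):
    """Pack one principle, one prompt, and all candidates into a single request.

    Hypothetical layout: the real model's input template may differ.
    """
    lines = [
        f"Principle: {principle}",
        f"Prompt: {prompt}",
        "Candidates:",
    ]
    for i, text in enumerate(candidates, start=1):
        lines.append(f"[{i}] {text}")
    # Hypothetical reply schema: one numeric score per candidate index.
    lines.append('Return JSON: {"scores": {"1": <1-5>, ...}}')
    return "\n".join(lines)


def parse_scores(raw_output, n_candidates):
    """Read per-candidate scores and derive a best-to-worst ranking."""
    data = json.loads(raw_output)
    scores = [float(data["scores"][str(i)]) for i in range(1, n_candidates + 1)]
    # Sort candidate indices by descending score; ties keep their original order.
    ranking = sorted(range(1, n_candidates + 1), key=lambda i: -scores[i - 1])
    return scores, ranking
```

Because every candidate travels in the same request, swapping the principle string is the only change needed to re-rank the same candidates under a different goal.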

Key Findings

RewardAnything achieves state-of-the-art on RM-Bench when given an explicit principle.

Numbers: 86.4% overall accuracy on RM-Bench (Table 2)

RewardAnything generalizes to unseen natural-language principles on RABENCH.

Numbers: 81.9% pairwise ranking accuracy; Kendall's τ = 65.27; NDCG = 97.84 (Table 3)
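For readers less familiar with these three ranking metrics, the following Python sketch computes them for a single ranking task. It assumes gold and predicted scores are given as equal-length lists, uses the tau-a variant of Kendall's τ, and uses linear gains for NDCG; the paper's exact evaluation code may differ. The figures reported above appear to be on a 0 to 100 scale, whereas these functions return values on a 0 to 1 (or −1 to 1) scale.

```python
import math
from itertools import combinations


def pairwise_accuracy(gold, pred):
    """Fraction of candidate pairs (with distinct gold scores) ordered correctly."""
    pairs = [(i, j) for i, j in combinations(range(len(gold)), 2) if gold[i] != gold[j]]
    correct = sum((gold[i] - gold[j]) * (pred[i] - pred[j]) > 0 for i, j in pairs)
    return correct / len(pairs)


def kendall_tau(gold, pred):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(gold)), 2):
        sign = (gold[i] - gold[j]) * (pred[i] - pred[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(gold) * (len(gold) - 1) // 2
    return (concordant - discordant) / n_pairs


def ndcg(gold, pred):
    """NDCG with linear gains: gold relevance discounted by predicted rank."""
    order = sorted(range(len(gold)), key=lambda i: -pred[i])
    dcg = sum(gold[i] / math.log2(rank + 2) for rank, i in enumerate(order))
    ideal = sum(g / math.log2(rank + 2) for rank, g in enumerate(sorted(gold, reverse=True)))
    return dcg / ideal


gold = [4, 3, 2, 1]          # human-verified relevance, higher is better
pred = [0.9, 0.5, 0.7, 0.1]  # reward model scores; candidates 2 and 3 are swapped
print(pairwise_accuracy(gold, pred), kendall_tau(gold, pred), ndcg(gold, pred))
```

In this toy case one of six pairs is misordered, so pairwise accuracy is 5/6 and τ is 4/6, while NDCG stays close to 1 because the top-ranked candidate is still correct.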

Explicit principles and listwise GRPO training materially improve performance; removing them hurts results.

Numbers: overall accuracy drops to ~67.4 without principles, and to ~73.2 when listwise training is replaced with pairwise training (Table 4)

A practical alignment case used RewardAnything as the sole reward source to align an LLM for nuanced safety.

Results

Accuracy (RM-Bench, overall)

Value: 86.4%

Baseline: best prior generative/discriminative RMs in Table 2

Accuracy (RABENCH, pairwise ranking)

Value: 81.9%

Baseline: GPT-4.1, 82.5% (Table 3)

Ablation: remove principles

Value: ≈ 67.4% overall

Baseline: RewardAnything, 81.9%

Training regime: listwise → pairwise

Value: ≈ 73.2% overall

Baseline: RewardAnything, 81.9%

What To Try In 7 Days

Run RewardAnything or an instructable RM to score a production prompt set using one clear principle (e.g., 'accuracy first') and compare with your current RM.

Create 10–50 targeted principles (priority + structured rules), test how outputs rank differently under each, and identify a principle that matches your product goal.

Use RewardAnything as the reward signal for a short GRPO alignment run on a small model with 2k prompts to prototype updated safety or tone behavior.
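As background for the GRPO prototype, the sketch below shows the group-relative advantage step that gives GRPO its name: rewards for several responses sampled from the same prompt are normalized against that group's mean and standard deviation. The function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions for illustration; in this setting the rewards would be the scores returned by the reward model.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize one prompt's sampled-response rewards within their own group.

    Responses scoring above the group mean receive a positive advantage,
    those below receive a negative one. `eps` avoids division by zero
    when every response gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four responses to one prompt.
advantages = group_relative_advantages([2.0, 4.0, 4.0, 5.0])
print(advantages)
```

Because advantages are computed relative to the group, no separate value network is needed, which helps keep a small-model prototype cheap.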

Agent Features

Tool Use

  • GRPO
  • single-call listwise scoring (Θ(1) LLM calls)

Frameworks

  • GRPO
  • GRPL
  • PPO-style KL regularization

Architectures

  • generative RM (listwise)
  • LLM backbone (e.g., Qwen3-8B)

Optimization Features

Token Efficiency

  • Θ(1) LLM calls and Θ(n) tokens to score n candidates (Table 6)
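To put the Θ(1) call count in context, this small Python sketch compares it with two common alternatives. The pointwise (one call per candidate) and all-pairs pairwise baselines are illustrative assumptions about how other reward models are typically invoked, not figures taken from the paper's Table 6.

```python
from math import comb


def llm_calls(n_candidates, mode):
    """Number of reward-model calls needed to rank n candidates for one prompt."""
    if mode == "listwise":
        return 1                      # all candidates in a single request
    if mode == "pointwise":
        return n_candidates           # one request per candidate
    if mode == "pairwise":
        return comb(n_candidates, 2)  # one request per unordered pair
    raise ValueError(f"unknown mode: {mode}")


for mode in ("listwise", "pointwise", "pairwise"):
    print(mode, llm_calls(8, mode))
```

The listwise request still grows with the number of candidates in tokens, which matches the Θ(n) token cost noted above.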

Infra Optimization

  • trained on NVIDIA A100 cluster; consumer GPUs viable for inference

Model Optimization

  • generative listwise scoring model

System Optimization

  • use vLLM for faster inference

Training Optimization

  • GRPO
  • listwise preference learning
  • Accuracy

Inference Optimization

  • one-shot listwise scoring (Θ(1) calls)
  • inference-time scaling for reasoning

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Performance depends on principle clarity and explicit priority between goals.
  • Training used synthetic consensus data, so distributional gaps versus fully human-labeled sets may remain.
  • Principles themselves can be adversarially manipulated or mis-specified.
  • Human verification for RABENCH was limited; Cohen's κ = 0.57 indicates moderate agreement.
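Cohen's κ measures agreement between two annotators after discounting the agreement expected by chance. The following Python sketch shows the standard computation on made-up labels; the annotation data here is purely illustrative and is not drawn from RABENCH.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    # Observed agreement: share of items where both annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)


# Illustrative labels only: 1 = "ranking accepted", 0 = "ranking rejected".
annotator_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
annotator_b = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0]
print(cohens_kappa(annotator_a, annotator_b))
```

Here the annotators agree on 8 of 10 items, chance agreement is 0.5, and κ is 0.6. Under the commonly used Landis and Koch bands, values from 0.41 to 0.60 count as moderate agreement, which is the range the reported 0.57 falls in.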

When Not To Use

  • When you cannot craft a clear, prioritized principle for the task.
  • When extreme low-latency scoring is required and even single-call LLM inference is too slow.
  • When you need fully human-collected preference signals for legal/regulatory audits.

Failure Modes

  • Mis-specified or ambiguous principles produce unpredictable rankings.
  • Adversarial or malicious principles can steer models undesirably.
  • Reward hacking if downstream policy optimizes around loopholes in the principle.
  • Hallucinations in the model's reasoning stage can distort the final scores.

Core Entities

Models

  • RewardAnything-8B
  • GRPO
  • Qwen3-8B
  • GPT-4.1
  • Claude-3.7 Sonnet
  • DeepSeek-V3
  • Skywork-Reward-Llama-3.1-8B-v0.2
  • RM-R1-DeepSeek-Distilled-Qwen-32B

Metrics

  • Accuracy
  • Kendall's τ
  • NDCG
  • score variance
  • refusal rate (safety)

Datasets

  • RABENCH
  • RM-Bench
  • RewardBench
  • Skywork-Reward trainset
  • PKU-SafeRLHF
  • XSTest
  • MT-Bench

Benchmarks

  • RABENCH
  • RM-Bench
  • RewardBench
  • XSTest
  • MT-Bench