Decouple helpfulness and harmlessness, then use a Lagrangian Safe-RL step to trade off both during RLHF

October 19, 2023 · 8 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

20

Authors

Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang

Links

Abstract / PDF

Why It Matters For Business

Safe RLHF lets you improve usefulness without sacrificing safety by labeling helpfulness and harmlessness separately and enforcing safety through a dynamically weighted constraint. In the paper's evaluation this sharply reduced harmful outputs while preserving or increasing helpfulness, which can lower moderation load and risk.

Summary TLDR

Safe RLHF separates human labels for helpfulness and harmlessness, trains two preference models (reward and cost), and uses a constrained RL objective solved by a Lagrangian update to balance reward vs. safety. Applied to Alpaca-7B over three iterative fine-tuning rounds with red-teaming and human labels, the method reduced harmful outputs sharply (53.08% → 2.45% on the paper's eval set) while improving helpfulness by large Elo margins versus the SFT baseline. Code and datasets are released. The method is single-turn, requires human labeling and red-teaming, and incurs non-trivial cost.

Problem Statement

Standard RLHF mixes helpfulness and safety into one signal, which confuses annotators and makes training fragile when objectives conflict. The paper asks: can we decouple helpfulness and harmlessness in annotation and solve fine-tuning as a constrained RL problem so models become both more helpful and safer?

Main Contribution

A Safe RLHF pipeline that decouples helpfulness and harmlessness at annotation, trains separate reward and cost preference models, and optimizes the policy under a cost constraint using a Lagrangian update.

A cost (harmlessness) preference model that both ranks responses and predicts binary safety labels; this lets the RL step adjust safety pressure dynamically.

Iterative, red-team informed training and a released dataset+code demonstrating large reductions in harmful outputs and improvements in Elo scores on a custom safety-focused evaluation.
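The dual role of the cost model (ranking responses and predicting binary safety labels) can be sketched as a loss with two terms. This is a minimal illustration, not the authors' released code: the function name `cost_model_loss`, the sign convention (+1 for harmful, -1 for harmless), and the equal weighting of the two terms are assumptions.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def cost_model_loss(cost_more_harmful: float, cost_less_harmful: float,
                    sign_more: int, sign_less: int) -> float:
    """Hypothetical per-pair loss for a cost model.

    cost_more_harmful / cost_less_harmful: scalar cost scores for the response
    annotators ranked as more / less harmful.
    sign_more / sign_less: +1 if the response was labeled harmful, -1 if
    labeled harmless (assumed convention).
    """
    # Ranking term: the more harmful response should receive the higher cost.
    ranking = -log_sigmoid(cost_more_harmful - cost_less_harmful)
    # Classification term: push harmful costs above 0 and harmless costs below 0.
    classification = -(log_sigmoid(sign_more * cost_more_harmful)
                       + log_sigmoid(sign_less * cost_less_harmful))
    return ranking + classification
```

Under this convention the sign of the cost score acts as a safety classifier, so a constraint of the form "expected cost at most zero" has a fixed meaning that the RL step can enforce.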

Key Findings

Iterative Safe RLHF reduced measured harmful responses from Alpaca-7B's 53.08% to 2.45% on the paper's evaluation set.

Numbers: Harmful probability 53.08% → 2.45%

Beaver-v3 improved helpfulness Elo by large margins over Alpaca-7B: GPT-4 +244.91 and human +363.86.

Numbers: Helpfulness Elo +244.91 (GPT-4), +363.86 (human)
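To make these Elo gaps easier to interpret, the sketch below converts a gap into an expected head-to-head win rate. It assumes the standard logistic Elo formula with a scale of 400; this summary does not state how the paper computed its Elo scores, so treat the conversion as indicative only. The function name `elo_expected_win_rate` is illustrative.

```python
def elo_expected_win_rate(elo_gap: float, scale: float = 400.0) -> float:
    """Expected win probability of the higher-rated model for a given Elo gap.

    Uses the standard Elo logistic curve (assumed here, not confirmed by the paper).
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / scale))


# Gaps reported for Beaver-v3 over Alpaca-7B on helpfulness.
for evaluator, gap in {"GPT-4": 244.91, "human": 363.86}.items():
    print(f"{evaluator}: {elo_expected_win_rate(gap):.3f}")
```

Under that assumption, the reported gaps correspond to Beaver-v3 being preferred in roughly 80% (GPT-4) and 89% (human) of pairwise comparisons.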

Reward and cost preference models achieve reasonable ranking and classification accuracy (reward ranking ≈77–78%; cost safety classification up to 95.62% for Beaver-v1).

Numbers: Reward ranking ~78%; cost classification 95.62% (v1), ~85% (unified)
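The two accuracy figures measure different things: whether a model orders a labeled pair correctly, and whether the sign of the cost matches the human safety label. A minimal sketch of both checks follows; the function names and the "cost above zero means harmful" convention are assumptions for illustration.

```python
from typing import Sequence, Tuple


def ranking_accuracy(score_pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs where the preferred response scored strictly higher.

    Each pair is (score_of_preferred, score_of_rejected).
    """
    correct = sum(1 for preferred, rejected in score_pairs if preferred > rejected)
    return correct / len(score_pairs)


def safety_classification_accuracy(costs: Sequence[float],
                                   is_harmful: Sequence[bool]) -> float:
    """Fraction of responses whose cost sign matches the human safety label.

    Assumes cost > 0 means "predicted harmful" (illustrative convention).
    """
    correct = sum(1 for cost, harmful in zip(costs, is_harmful)
                  if (cost > 0) == harmful)
    return correct / len(costs)
```

Both functions expect held-out labeled data; evaluating on the training pairs would overstate signal quality.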

Decoupled (two-dimensional) annotation raised inter-rater agreement on helpfulness compared with single-dimension preference labeling: 69.00% vs 61.65%.

Numbers: Inter-rater agreement on helpfulness 69.00% vs 61.65%

Static reward shaping with a fixed cost weight underperforms Safe RLHF across all tested weights: whether extreme or moderate, a fixed weight still sacrifices one objective for the other.

Numbers: Reward-shaping (RS) weights ν in {0.01..100} gave worse trade-offs than Safe RLHF
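The difference between a fixed weight and a dynamic multiplier can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the function names, the step size, the projected update rule, and the division by (1 + λ) are all choices made here, and the mean cost is assumed to already include the safety threshold (positive means the constraint is violated).

```python
from typing import List, Sequence


def shaped_signal(reward: float, cost: float, nu: float) -> float:
    """Static reward shaping: the cost weight nu never changes."""
    return reward - nu * cost


def lagrangian_signal(reward: float, cost: float, lam: float) -> float:
    """Dynamic combination; dividing by (1 + lam) keeps the scale bounded
    as lam grows (assumed normalization)."""
    return (reward - lam * cost) / (1.0 + lam)


def update_multiplier(lam: float, mean_cost: float, step_size: float = 0.5) -> float:
    """Projected ascent on lam: rises while mean cost > 0, decays otherwise."""
    return max(0.0, lam + step_size * mean_cost)


def multiplier_trace(mean_costs: Sequence[float], lam: float = 1.0,
                     step_size: float = 0.5) -> List[float]:
    """Multiplier values after each observed batch-level mean cost."""
    trace = []
    for mean_cost in mean_costs:
        lam = update_multiplier(lam, mean_cost, step_size)
        trace.append(lam)
    return trace
```

Feeding `multiplier_trace` a run of positive mean costs followed by negative ones shows the safety pressure rising while the policy is unsafe and relaxing once it is safe, which a fixed ν cannot do.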

Results

Probability of harmful responses (human labels)

Value: 53.08% → 2.45%

Baseline: Alpaca-7B 53.08%

Helpfulness Elo (GPT-4 evaluator)

Value: Alpaca-7B → Beaver-v3: +244.91

Baseline: Alpaca-7B

Harmlessness Elo (GPT-4 evaluator)

Value: Alpaca-7B → Beaver-v3: +268.31

Baseline: Alpaca-7B

Reward model (RM) ranking accuracy

Value: ≈77–78%

Baseline: per-round RMs trained on respective preference data

Cost model (CM) safety classification accuracy

Value: 95.62% (Beaver-v1), 84–86% (others/unified)

Baseline: per-round CMs

Who Should Care

What To Try In 7 Days

Collect a small set of prompts and label helpfulness and harmlessness separately for a sample of model outputs.

Train quick reward and cost preference models on that sample to verify signal quality (check ranking & safety accuracy).

Run a short constrained PPO run with a Lagrange multiplier on a small model or subset to see safety vs. utility trade-offs.
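Before committing to a PPO run, the multiplier dynamics can be checked on a toy problem. The sketch below is not PPO and not the paper's setup: it replaces the policy with a single parameter `theta`, and the reward objective, cost objective, learning rates, and function name are all hypothetical stand-ins chosen so the behavior is easy to inspect.

```python
def constrained_toy_run(steps: int = 2000, lr_theta: float = 0.1,
                        lr_lambda: float = 0.1, threshold: float = 0.6):
    """Alternate descent on theta and projected ascent on lambda.

    Toy stand-ins (not from the paper):
      reward objective J_R(theta) = -(theta - 1)^2    (maximized at theta = 1)
      cost objective   J_C(theta) = theta - threshold (constraint: J_C <= 0)
    """
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        # Gradient of the Lagrangian -J_R(theta) + lam * J_C(theta) w.r.t. theta.
        grad_theta = 2.0 * (theta - 1.0) + lam
        theta -= lr_theta * grad_theta
        # Projected gradient ascent on lam: grows while the constraint is violated.
        lam = max(0.0, lam + lr_lambda * (theta - threshold))
    return theta, lam


theta, lam = constrained_toy_run()
print(f"theta={theta:.3f}, lambda={lam:.3f}")
```

With these toy objectives the run should settle at the constraint boundary (theta near the threshold) with a positive multiplier, while a loose threshold leaves the multiplier at zero and recovers the unconstrained optimum. In a real run, `theta` would be the policy weights updated by PPO and the cost term would come from the cost model.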

Optimization Features

Training Optimization

  • PPO fine-tuning with Lagrangian multiplier updates
  • SFT
  • Alternating min-max updates for θ and λ
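The alternating updates listed above target a min-max problem of roughly the following form. This is a sketch in generic constrained-RL notation: the symbols $J_R$, $J_C$, the threshold $d$, and the step sizes $\eta_\theta$, $\eta_\lambda$ are notational assumptions, and the paper's exact parameterization of the multiplier update may differ.

```latex
% Constrained objective: maximize expected reward subject to a cost constraint.
\max_{\theta}\; J_R(\theta) \quad \text{s.t.} \quad J_C(\theta) \le 0,
\qquad
J_R(\theta) = \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}\big[R(y, x)\big],
\quad
J_C(\theta) = \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}\big[C(y, x)\big] + d

% Lagrangian relaxation solved by alternating updates.
\min_{\theta}\; \max_{\lambda \ge 0}\; \big[-J_R(\theta) + \lambda\, J_C(\theta)\big]

% One alternating step: descend in theta, then projected ascent in lambda.
\theta_{k+1} = \theta_k - \eta_\theta \nabla_\theta \big[-J_R(\theta_k) + \lambda_k J_C(\theta_k)\big],
\qquad
\lambda_{k+1} = \max\!\big(0,\; \lambda_k + \eta_\lambda\, J_C(\theta_{k+1})\big)
```

When the constraint is violated ($J_C > 0$), $\lambda$ grows and the policy update weights cost more heavily; when the constraint holds, $\lambda$ shrinks toward zero and the update favors reward.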

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Single-turn setting only; multi-turn conversations not addressed.
  • Relies on costly human labeling and red-teaming across iterations.
  • Pretraining data (original LLaMA corpus) unavailable to authors; results depend on Alpaca/LLaMA-1 base.
  • Public red-team data poses reuse/misuse risks that the authors acknowledge.

When Not To Use

  • If you cannot afford iterative human labeling and red-team effort.
  • For multi-turn dialogue without adapting the method first.
  • If you need zero-shot safety guarantees outside the covered safety categories.

Failure Modes

  • Partial harmfulness: refusals that still leak harmful content in the explanation.
  • Role-play or instruction-following prompts can still elicit unsafe outputs.
  • Overfitting to the red-team prompts and missing unseen attack types.
  • Misuse of released red-team data to fine-tune harmful models.

Core Entities

Models

  • Alpaca-7B
  • LLaMA-7B
  • Beaver-v1
  • Beaver-v2
  • Beaver-v3
  • Reward Model (RM)
  • Cost Model (CM)

Metrics

  • Elo score
  • Accuracy
  • Probability of harmful response

Datasets

  • SFT
  • Safe RLHF preference datasets (D_R helpfulness, D_C harmlessness)
  • Released red-team prompts (paper dataset)

Benchmarks

  • Custom safety evaluation prompts (14 harm categories + red-team prompts)