RLHF Papers — Parsed & Scored for Practitioners

Llama 2: open release of 7B–70B pretrained models and RLHF‑tuned chat models that are competitive in human evaluations

0.70

0.30

0.60

2,595

Llama 2 provides openly available pretrained and RLHF‑tuned chat models that are competitive with closed models on many human-evaluated tasks, lowering the entry cost for companies that need high-quality chat AI while allowing customization and internal safety tuning.

Key finding

Llama 2 was pretrained on ~2 trillion tokens; models range from 7B to 70B parameters.

Numbers: 2.0T tokens; sizes 7B, 13B, 34B, 70B

Practical review of data, training, and evaluation methods to align LLMs with human preferences

0.60

0.40

0.70

54

Aligning LLMs reduces risky outputs and increases usefulness; using parameter-efficient tuning cuts compute costs and enables faster iteration.

Key finding

Small sets of high-quality instructions can suffice to produce alignment effects.

Numbers: LLaMA needs ~8K instructions (IFS); other work reports ~6K high-quality instructions

Practical, end-to-end guide to fine-tuning LLMs: pipelines, PEFT, RAG, alignment and deployment

0.70

0.35

0.70

39

Fine-tuning and RAG let you customise LLM behavior and accuracy while controlling cost; PEFT and quantisation let you ship tailored models without enterprise-scale GPU fleets.

Key finding

QLoRA compresses model parameters and enables 4-bit fine-tuning while retaining near-16-bit performance.

Numbers: Reduces to ~5.2 bits/parameter (from 96 bits); ~18x memory reduction
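
The figures above can be checked with quick arithmetic. This is a minimal sketch that takes the 96 and 5.2 bits-per-parameter values from the card as given; the 7B parameter count is an illustrative assumption, not a number from the guide.

```python
# Back-of-the-envelope check of the memory figures quoted above.
BITS_FULL = 96.0    # bits per parameter quoted for full fine-tuning
BITS_QLORA = 5.2    # bits per parameter quoted for QLoRA


def memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


reduction = BITS_FULL / BITS_QLORA          # ratio of the two quoted figures
n_params = 7e9                              # hypothetical 7B-parameter model
full_gb = memory_gb(n_params, BITS_FULL)
qlora_gb = memory_gb(n_params, BITS_QLORA)

print(f"reduction: {reduction:.1f}x")
print(f"7B model: {full_gb:.1f} GB -> {qlora_gb:.2f} GB")
```

Under these assumptions the ratio comes out at roughly 18.5x, in line with the ~18x quoted in the card.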

Practical survey of methods, attacks, and evaluations for aligning large language models

0.45

0.40

0.50

34

Misaligned LLMs can produce legal, reputational, and safety failures. Alignment methods reduce harmful outputs but need governance, red-teaming, and evaluation to manage adversarial and privacy risks.

Key finding

Reinforcement Learning from Human Feedback (RLHF) is the most common non-recursive oversight method for aligning LLMs.

Decouple helpfulness and harmlessness, then use a Lagrangian Safe-RL step to trade off both during RLHF

0.50

0.60

0.40

20

Safe RLHF lets you improve usefulness without sacrificing safety by labeling helpfulness and harmlessness separately and enforcing a dynamic constraint; this sharply reduces harmful outputs while preserving or increasing helpfulness, lowering moderation load and risk.

Key finding

Iterative Safe RLHF reduced measured harmful responses from Alpaca-7B's 53.08% to 2.45% on the paper's evaluation set.

Numbers: Harmful probability 53.08% → 2.45%
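
The dynamic trade-off described in this card can be illustrated with a toy dual-ascent loop. This is a minimal sketch of a generic Lagrangian constraint update, not the paper's implementation: the cost values, the threshold, and the learning rate are made-up illustrative numbers, and the function names are hypothetical.

```python
def lagrangian_objective(reward: float, cost: float, lam: float) -> float:
    """Scalar the policy would maximise: helpfulness minus weighted harm."""
    return reward - lam * cost


def update_multiplier(lam: float, cost: float, threshold: float, lr: float) -> float:
    """Dual ascent: raise lam while cost exceeds the threshold, never below 0."""
    return max(0.0, lam + lr * (cost - threshold))


# Hypothetical per-iteration average costs from a cost model (illustrative only).
observed_costs = [0.8, 0.6, 0.3, 0.1, -0.2, -0.4]
threshold = 0.0   # assumed constraint: average cost must stay at or below 0
lam = 0.0
history = []
for cost in observed_costs:
    lam = update_multiplier(lam, cost, threshold, lr=0.5)
    history.append(round(lam, 3))

print(history)
```

The multiplier grows while the measured cost is above the threshold, which pushes the policy toward harmlessness, and shrinks again once the constraint is satisfied, which lets helpfulness dominate.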

ChiMed‑GPT: a 13B Chinese medical LLM trained with pretraining, SFT and RLHF for safer, better medical answers

0.60

0.45

0.50

10

ChiMed‑GPT is a practical open-source Chinese medical LLM that gives clearer patient-facing answers, handles longer clinical text (4,096 tokens), and produces fewer risky, biased replies. It is useful for telemedicine, triage bots, and medical content generation.

Key finding

Open-ended QA (BLEU-1): ChiMed‑GPT scored higher than GPT-4 on the tested dataset.

Numbers: BLEU-1 33.14 (ChiMed‑GPT) vs 24.29 (GPT-4)

Teach an LLM to 'forget' bad behaviors using only negative examples and cheap finetuning

0.60

0.80

9

If your priority is to stop a model from producing specific harmful or copyrighted outputs quickly and cheaply, unlearning cuts those outputs dramatically with only finetune-level compute and no costly human-written positive examples.

Key finding

Unlearning can reduce harmful output rates to near zero on evaluated harmful prompts.

Numbers: harmful rate 47% → 1% (OPT-1.3B, Table 3)

Train LLMs on private data with federated learning; OpenFedLLM shows FL beats local training and can beat GPT‑4 in finance

0.60

0.55

0.65

6

Companies with private domain data can jointly fine-tune LLMs without sharing raw data and get measurable gains over solo training; finance firms, hospitals, and other holders of sensitive data can build domain-leading models this way.

Key finding

Federated learning consistently improves over single-client local fine-tuning across tasks.

Numbers: reported across multiple tables, e.g., Table 4 (open-ended): MT-Avg 3.346 (FedAvg) vs 2.844 (Local)
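
The FedAvg baseline named above aggregates client updates by weighted averaging. This is a minimal sketch of that aggregation step on plain Python lists, assuming each client is weighted by its local sample count; the parameter name `adapter`, the client sizes, and the values are hypothetical and do not come from OpenFedLLM.

```python
from typing import Dict, List


def fedavg(client_weights: List[Dict[str, List[float]]],
           client_sizes: List[int]) -> Dict[str, List[float]]:
    """Average client parameters, weighting each client by its sample count."""
    total = sum(client_sizes)
    averaged: Dict[str, List[float]] = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


# Two hypothetical clients holding 100 and 300 local examples.
clients = [
    {"adapter": [1.0, 2.0]},
    {"adapter": [3.0, 6.0]},
]
global_weights = fedavg(clients, [100, 300])
print(global_weights)  # {'adapter': [2.5, 5.0]}
```

Only parameters travel to the server in this scheme; the raw training examples stay with each client.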

Reduce multimodal model hallucinations by learning from segment-level human corrections

0.60

0.70

5

RLHF-V makes multimodal models more trustworthy with far less labeled data and short retrain time, lowering risk when deploying vision-language assistants in customer-facing or safety-critical products.

Key finding

Fine-grained corrections cut hallucinations on a human-eval benchmark.

Numbers: 34.8% reduction on MHumanEval (object hallucination, 1.4k prefs)

Make open-source multimodal models far more truthful using AI feedback and self-reward at inference

0.70

0.60

3

RLAIF-V lets teams reduce multimodal hallucination without expensive human labeling or proprietary APIs, lowering alignment costs and improving product trust where visual accuracy matters.

Key finding

RLAIF-V 7B cuts object hallucination on Object HalBench by roughly 80% in relative terms.

Numbers: object hallucination reduced by 80.7% relative (response-level rate 54.5 → 10.5)
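
Several cards in this list mix relative reductions with absolute point drops, so it helps to see how the two are computed. A minimal sketch using the two rates quoted in this card; the helper names are hypothetical.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage drop relative to the starting value."""
    return (before - after) / before * 100


def absolute_drop(before: float, after: float) -> float:
    """Drop in raw points (same units as the inputs)."""
    return before - after


rel = relative_reduction(54.5, 10.5)
abs_pts = absolute_drop(54.5, 10.5)
print(f"{abs_pts:.1f} points absolute, {rel:.1f}% relative")
```

The 80.7% figure is the relative reduction; the same change is a 44.0-point absolute drop.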

Alignment reshapes who LLMs serve: widens English dialect gaps, helps some languages, and skews country opinions.

0.60

0.30

3

Alignment choices change who a model helps: biased supervised fine-tuning (SFT) or preference tuning (PT) can reduce utility for non‑US dialects, misrepresent global opinions, and harm product adoption in key markets.

Key finding

Alignment raises English dialect performance unevenly, favoring US English.

Numbers: Dialect disparity grew from ~1% before alignment to up to 17.1% after alignment

Hybrid RLAIF (HRLAIF): use task-aware AI labeling + AI red teaming to keep helpfulness while improving harmlessness

0.60

0.50

0.80

3

AI-labeling massively cuts annotation cost and speeds model iteration, but naive use can teach models to prioritize style over correctness; adding lightweight, category-specific verification and AI red-teaming preserves helpfulness and lowers toxicity with little extra cost.

Key finding

Hybrid AI labeling raises AI-vs-human label agreement on multiple-choice and math.

Numbers: Multiple-choice: +34.08pp (48.13%→82.21%); Math: +24.45pp (55.55%→80.00%)

Okapi: first open-source RLHF instruction-tuned LLMs across 26 languages

0.50

0.60

3

If you need multilingual chat or QA features, investing in translated instructions and RLHF can yield measurable accuracy gains and broader language coverage while keeping models open-source.

Key finding

RLHF improves multilingual instruction-following over SFT on average.

Numbers: BLOOM average accuracy: SFT 28.4 → RLHF 30.0 (+1.6)

How alignment choices change LLMs' ability to prod groups to think slowly and reach correct shared conclusions

0.40

0.60

0.50

2

If you deploy LLMs as in-team helpers or moderators, align them to account for how people or other agents reinterpret suggestions; friction-aware alignment yields more accurate shared decisions than methods that only optimize immediate preference labels.

Key finding

FAAF achieves the highest task accuracy on the Wason/DeliData task under collaborator-modification.

Numbers: Coarse accuracy FAAF 52.6% vs DPO 42.8% (MAMDP, Table 1).

Survey: aligning diffusion models to human preferences — methods, benchmarks, and open problems

0.60

0.40

0.70

2

Aligning diffusion models cuts customer friction and reduces safety risks; aligned models produce outputs that match user intent and lower moderation costs.

Key finding

Alignment research is heavily concentrated on language models; diffusion model alignment is a small fraction.

Numbers: LLMs: 89.4% of studies; diffusion models: 10.6% (Google Scholar, Jan 15, 2026)

A small plug-and-play model learns to 'correct' LLM outputs, improving helpfulness and safety without retraining big models

0.60

0.75

2

Train one small Aligner once to improve safety and usefulness of many deployed models (including API models) while avoiding heavy RLHF pipelines, cutting alignment cost and speeding iteration cycles.

Key finding

Aligner-7B improves average helpfulness and harmlessness across evaluated upstream models.

Numbers: helpfulness +21.9%, harmlessness +23.8% (across tested models)

Reduce VLLM hallucinations by fine-tuning with AI-generated 'wrong' answers

0.60

0.70

2

POVID reduces image-driven hallucination and raises overall VLLM reliability while avoiding costly human preference annotation, enabling faster, cheaper deployment of multimodal assistants.

Key finding

POVID substantially reduces object-hallucination on captioning benchmarks.

Numbers: CHAIR_S: 66.8 → 31.8 (absolute -35.0)

Reward models that follow natural-language principles to generalize across preferences

0.70

0.75

0.60

1

A single RM that follows user-written principles lets teams switch evaluation goals quickly (e.g., prioritize accuracy or brevity) without costly relabeling, speeding product iteration and reducing alignment bias.

Key finding

RewardAnything achieves state-of-the-art on RM-Bench when given an explicit principle.

Numbers: 86.4% overall accuracy on RM-Bench (Table 2)

MM-RLHF: 120k human preference pairs, a critique-based reward model, and dynamic reward scaling to align multimodal LLMs

0.60

0.70

0.60

1

MM-RLHF provides large, human-quality preference data and practical training recipes that reduce unsafe outputs and boost conversation quality, so teams can make multimodal products more reliable without depending only on massive closed-source reward models.

Key finding

The preference dataset is distilled from a much larger raw pool into ranked comparison pairs.

Numbers: 120k ranked pairs; sampled from 10M raw instances and ~30k queries

Argues for hybrid moral alignment: combine explicit moral principles with learning to get safer, adaptable agents

0.35

0.60

0.45

1

Hybrid moral alignment helps build AI that is both controllable (auditable rules) and adaptable (learned behavior), reducing legal, reputational and safety risks in agentic products.

Key finding

Most existing approaches lie at two extremes: fully top-down rules or fully bottom-up learned preferences.

CARDS: segment-level rejection sampling cuts decoding-time alignment cost by ~70%

0.70

0.60

0.80

1

CARDS cuts runtime and total forward calls by roughly 3x while improving judged helpfulness and safety, making decoding-time alignment far more practical for production without model fine-tuning.

Key finding

CARDS cuts decoding inference time by about 70% compared to common baselines on evaluated setups.

Numbers: llama-7b: Best-of-N (BoN) 234.7 min → CARDS 75.8 min (Table 1)
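
The idea of rejecting at the segment level, rather than resampling whole responses, can be sketched in a few lines. This is a simplified illustration and not the CARDS algorithm itself: it assumes a fixed number of segments, a hypothetical `propose` function standing in for the LLM, a hypothetical `score` function standing in for a reward model, and made-up reward values.

```python
from typing import Callable, List, Tuple


def segment_rejection_sample(
    propose: Callable[[List[str]], str],
    score: Callable[[List[str]], float],
    n_segments: int,
    threshold: float,
    max_tries: int = 8,
) -> Tuple[List[str], int]:
    """Build a response segment by segment, resampling rejected segments.

    Returns the accepted segments and the number of proposal calls made.
    """
    accepted: List[str] = []
    calls = 0
    for _ in range(n_segments):
        best, best_score = "", float("-inf")
        for _ in range(max_tries):
            candidate = propose(accepted)
            calls += 1
            s = score(accepted + [candidate])
            if s > best_score:
                best, best_score = candidate, s
            if s >= threshold:  # accept and move on to the next segment
                break
        accepted.append(best)  # falls back to the best try if none passed
    return accepted, calls


# Scripted stand-ins so the run is deterministic (all values are made up).
REWARD = {"helpful": 1.0, "neutral": 0.5, "rude": -1.0}
script = iter(["rude", "helpful", "rude", "rude", "neutral"])


def toy_propose(prefix: List[str]) -> str:
    return next(script)


def toy_score(segments: List[str]) -> float:
    return sum(REWARD[s] for s in segments) / len(segments)


segments, calls = segment_rejection_sample(
    toy_propose, toy_score, n_segments=2, threshold=0.5
)
print(segments, calls)  # ['helpful', 'neutral'] 5
```

In this toy run the first accepted segment is never regenerated; only the rejected continuations are resampled, which is where a segment-level scheme can save forward calls compared with scoring complete responses.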

Survey: how uncertainty moved from a passive confidence score to an active control signal in LLM systems

0.60

0.70

0

Turning uncertainty into an active control signal can make LLMs safer and more efficient in production: fewer costly tool calls, targeted extra computation only when needed, and more robust policy learning that resists reward hacking.

Key finding

Uncertainty is already being used as an active control signal in three main areas: advanced reasoning, autonomous agents, and RL/reward modeling.
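
One simple way uncertainty can act as a control signal is to gate an expensive action on predictive entropy. This is a minimal sketch of that general idea, not a method taken from the survey: the probability vectors, the 0.5-nat threshold, and the function names are all illustrative assumptions.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_call_tool(probs: Sequence[float], threshold: float) -> bool:
    """Spend extra computation only when the model looks uncertain."""
    return entropy(probs) > threshold


confident = [0.97, 0.01, 0.01, 0.01]   # hypothetical answer distribution
uncertain = [0.25, 0.25, 0.25, 0.25]   # hypothetical answer distribution
THRESHOLD = 0.5                        # assumed cut-off in nats

print(should_call_tool(confident, THRESHOLD))  # False
print(should_call_tool(uncertain, THRESHOLD))  # True
```

The peaked distribution stays below the threshold and skips the tool call, while the uniform one (entropy ln 4) triggers it.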

Multi-agent LLM pipeline that auto-generates themes from clinical transcripts and optionally adapts with RLHF

0.40

0.60

0

Auto-TA can turn large sets of interview transcripts into actionable themes quickly. That lets health services, product teams, and research groups scale qualitative analysis without hiring proportional human coders. However, you must validate output quality and watch for domain drift.

Key finding

Assigning domain identities to coder agents substantially improved credibility scores.

Numbers: Credibility baseline 82.13 → Cardiac Surgeon 98.41 (+16.28)

CANOE: use synthetic short QA + rule-based RL to cut hallucinations and improve long-form faithfulness

0.60

0.50

0

CANOE improves context-grounded answers without human labels, lowering hallucination risk for production assistants and RAG systems while keeping costs down by tuning smaller open models.

Key finding

CANOE raised average EM/Acc across 11 faithfulness tasks for LLaMA-3-Instruct-8B by +22.6 percentage points.

Numbers: Avg EM/Acc +22.6 pp (LLaMA-3-Instruct-8B), Table 1