KLQ: a token-level Q-learning alternative to PPO that matches reward performance and wins LLM-as-a-judge tests

Overview

Decision Snapshot: Needs Validation

KLQ has a clearer theoretical motivation than PPO and matches it on small-scale tests. The evidence is promising but limited: the experiments are noisy, use few repeats, and run at modest compute scale.

Citations: 0

Evidence Strength: 0.50

Confidence: 0.72

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 2/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 50%

Novelty: 70%

Authors

Jason R Brown, Lennie Wells, Edward James Young, Sergio Bacallado

Links

Abstract / PDF / Data

Why It Matters For Business

KLQ offers a theoretically grounded alternative to PPO that matches compute and reward performance and can produce outputs preferred by LLM judges, potentially improving product output quality without added infrastructure cost.

Who Should Care

ML Engineer, Product Manager, CTO, Founder, Data Scientist

Summary TLDR

The authors introduce KLQ, an on-policy token-level Q-learning method for RLHF that uses a KL-regularised action-value decomposition and λ-returns. KLQ has a clearer theoretical motivation and is shown analytically to be equivalent to a modified PPO. Empirically on TL;DR summarisation and Anthropic-HH single-turn dialogue, KLQ matches PPO on reward metrics and per-update compute, and yields higher win-rates in LLM-as-a-judge comparisons, though experiments are small-scale and noisy.

Problem Statement

PPO is the de-facto algorithm for LM-RLHF but handles the KL constraint heuristically. We need a token-level action-value method that (1) fits the KL-regularised LM setting, (2) lets you initialise from SFT policies, and (3) offers clearer theory and comparable or better empirical outcomes.

Main Contribution

KLQ: a token-level, KL-regularised Q-learning algorithm for LM-RLHF that trains an action-value function parametrised by the policy and a value head.

A λ-return based value estimator adapted to action-value learning to propagate sparse final reward signals across tokens.
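To make the role of λ-returns concrete, here is a minimal sketch of how a sparse final reward can be spread across tokens. It uses the standard state-value form of the λ-return together with the usual per-token KL penalty from RLHF; the paper adapts the estimator to action-values, and that exact variant is not reproduced here. The function names, the `beta` coefficient, and the default `gamma=1.0` are assumptions made for illustration.

```python
from typing import List


def kl_shaped_rewards(logp_policy: List[float], logp_ref: List[float],
                      final_reward: float, beta: float) -> List[float]:
    """Per-token rewards: a KL penalty at every token, plus the reward-model
    score added at the last token only (assumed RLHF-style shaping)."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += final_reward
    return rewards


def lambda_returns(rewards: List[float], values: List[float],
                   lam: float, gamma: float = 1.0) -> List[float]:
    """Backward recursion G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}),
    with both V and G taken to be zero after the final token."""
    returns = [0.0] * len(rewards)
    next_return = 0.0  # G_{t+1}; zero past the end of the episode
    next_value = 0.0   # V(s_{t+1}); zero past the end of the episode
    for t in reversed(range(len(rewards))):
        returns[t] = rewards[t] + gamma * ((1.0 - lam) * next_value + lam * next_return)
        next_return = returns[t]
        next_value = values[t]
    return returns


if __name__ == "__main__":
    # Hypothetical three-token completion with a reward only at the end.
    r = kl_shaped_rewards([-1.0, -2.0, -0.5], [-1.2, -2.0, -0.7],
                          final_reward=1.0, beta=0.1)
    print(lambda_returns(r, values=[0.4, 0.6, 0.8], lam=0.95))
```

With `lam=1.0` the recursion reduces to the plain reward-to-go, so every token sees the final reward directly; with `lam=0.0` each token bootstraps only from the next value estimate. Intermediate values trade variance against bias.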

Key Findings

KLQ matches PPO on the LM-RLHF reward objective and uses about the same compute per run.

Numbers: 75,000 episodes per run; training ~5 hours on 4 A100 GPUs; near-identical final validation rewards (figures shown).

Practical Use: If you currently use PPO for RLHF, you can switch to KLQ without extra compute and expect similar reward optimization out of the box.

Evidence Ref: Sections 5.1, E.1; Table 3; Figure 1

KLQ's final policies are consistently preferred by an LLM judge compared to PPO across tested KL penalties.

Numbers: Evaluations used 32 validation prompts in both presentation orderings (64 comparisons) per KL coefficient; KLQ had the higher win-rate.

Practical Use: Use KLQ if human-like judgments (via LLM-as-a-judge) matter; it may yield higher-quality completions while matching numeric reward.

Evidence Ref: Section 5.3; Figure 3; Appendix F
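Because each KL coefficient is judged on only 64 comparisons, it helps to see how wide the uncertainty on a win-rate is at that sample size. The sketch below scores one prompt under both orderings and computes a Wilson score interval. The verdict labels (`"first"`, `"second"`), the function names, and the choice of a 95% interval are assumptions made for illustration; they are not the paper's evaluation code.

```python
import math
from typing import List, Tuple


def score_prompt(verdict_klq_first: str, verdict_ppo_first: str) -> List[int]:
    """Turn two judge verdicts for one prompt (one per ordering) into
    KLQ wins (1) or losses (0). Each verdict is 'first' or 'second',
    naming the position of the completion the judge preferred."""
    return [
        1 if verdict_klq_first == "first" else 0,   # KLQ was shown first
        1 if verdict_ppo_first == "second" else 0,  # KLQ was shown second
    ]


def wilson_interval(wins: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = wins / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Even split over 64 comparisons: shows the resolution of the test,
    # not a result from the paper.
    low, high = wilson_interval(wins=32, n=64)
    print(f"95% interval at 32/64: [{low:.3f}, {high:.3f}]")
```

Running both orderings cancels position bias in the judge, but it does not add independent prompts, so the interval computed over 64 comparisons is, if anything, optimistic.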

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Final RLHF reward (validation)	KLQ ≈ PPO (similar final validation reward curves)	PPO	no significant gap	TL;DR, Anthropic-HH (validation)	Figure 1; Figure 4; Section 5.1	Sections 5.1, E.1; Figures 1 and 4
Compute per run (wall-clock)	≈5 hours on 4 A100 GPUs	PPO	negligible difference (per Table 3)	TL;DR and HH	Table 3; Section 5	Appendix E.1 Table 3

What To Try In 7 Days

Run a small-scale KLQ finetune from your SFT model on a representative dataset to compare reward and judged quality.

Use the same hyperparameters and compute budget as your PPO runs to isolate algorithmic differences.

Run an LLM-as-a-judge comparison on ~30 prompts to get a quick signal of qualitative preference.

Optimization Features

Token Efficiency

token-level (per-token) reward attribution via KL regularisation

Training Optimization

λ-returns for action-values (propagates sparse final rewards)

action-value decomposition linking Q to policy and value head

on-policy minibatch gradient updates (multiple epochs per batch)
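The decomposition linking Q to the policy and a value head can be illustrated with the standard identity from KL-regularised RL: Q(s, a) = V(s) + β log(π(a|s) / π_ref(a|s)). The sketch below builds Q from policy log-probabilities and a scalar value, then recovers the policy from Q. It assumes this standard form; the paper's exact parametrisation may differ, and all names here are hypothetical.

```python
import math
from typing import List


def q_from_policy_and_value(logp_policy: List[float], logp_ref: List[float],
                            value: float, beta: float) -> List[float]:
    """Q(s, a) = V(s) + beta * (log pi(a|s) - log pi_ref(a|s)) for every action a."""
    return [value + beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]


def policy_from_q(q_values: List[float], logp_ref: List[float],
                  beta: float) -> List[float]:
    """Recover log pi(a|s), using pi(a|s) proportional to pi_ref(a|s) * exp(Q(s, a) / beta)."""
    logits = [lr + q / beta for q, lr in zip(q_values, logp_ref)]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


if __name__ == "__main__":
    # Hypothetical three-token vocabulary at a single state.
    logp_pi = [math.log(p) for p in (0.7, 0.2, 0.1)]
    logp_ref = [math.log(p) for p in (0.5, 0.3, 0.2)]
    q = q_from_policy_and_value(logp_pi, logp_ref, value=1.3, beta=0.1)
    print([math.exp(lp) for lp in policy_from_q(q, logp_ref, beta=0.1)])
```

One consequence of this form is that when the policy equals the reference (SFT) policy, Q reduces to V for every action, which is consistent with the stated goal of initialising from an SFT policy.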

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

TL;DR dataset

Anthropic-HH dataset

HuggingFace model hub (used SFT and reward models)

Risks & Boundaries

Limitations

Experiments are small-scale: 75k episodes per run and ~5 hours on 4 A100 GPUs; few repeats.

Used off-the-shelf SFT and reward models from HuggingFace; reward model quality and biases were not controlled.

When Not To Use

If you need thoroughly validated large-scale RLHF results; KLQ currently has only preliminary-scale evidence.

If you cannot add a token-level value head or prefer completion-level grouped-rollout methods without value heads.

Failure Modes

Training noise can dominate small differences between KLQ and PPO, leading to ambiguous outcomes.

Poor reward models or SFT policies may bias both methods and hide algorithmic benefits.

Core Entities

Models

SFT

Pythia-1B based reward model (TL;DR)

GPT2-large based reward model (Anthropic-HH)

GPT-4o mini (LLM-as-a-judge)

TRL (custom fork) training library

Metrics

RLHF reward model score

SFT

LLM-as-a-judge win-rate

wall-clock training time

Datasets

TL;DR (summarisation)

Anthropic-HH (single-turn dialogue)

