KLQ: a token-level Q-learning alternative to PPO that matches reward performance and wins LLM-as-a-judge tests

August 23, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

KLQ is theoretically clearer and matches PPO on small-scale tests; evidence is promising but limited by noisy, single-run style experiments and modest compute scale.

Citations: 0

Evidence Strength: 0.50

Confidence: 0.72

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 2/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 50%

Novelty: 70%

Authors

Jason R Brown, Lennie Wells, Edward James Young, Sergio Bacallado

Why It Matters For Business

KLQ offers a theoretically grounded alternative to PPO that matches compute and reward performance and can produce outputs preferred by LLM judges, potentially improving product output quality without added infrastructure cost.

Summary TLDR

The authors introduce KLQ, an on-policy token-level Q-learning method for RLHF that uses a KL-regularised action-value decomposition and λ-returns. KLQ has a clearer theoretical motivation and is shown analytically to be equivalent to a modified PPO. Empirically on TL;DR summarisation and Anthropic-HH single-turn dialogue, KLQ matches PPO on reward metrics and per-update compute, and yields higher win-rates in LLM-as-a-judge comparisons, though experiments are small-scale and noisy.
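The KL-regularised action-value decomposition mentioned above can be illustrated with a small sketch. It assumes the standard soft-RL identity Q(s, a) = V(s) + β log(π(a|s) / π_ref(a|s)), which this summary does not spell out. The function names, the symbol β for the KL coefficient, and the toy distributions are illustrative assumptions, not the authors' code.

```python
import math

def q_from_policy_and_value(logp_policy, logp_ref, value, beta):
    """Assumed decomposition: Q(s, a) = V(s) + beta * log(pi(a|s) / pi_ref(a|s))."""
    return [value + beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]

def policy_from_q(q_values, logp_ref, value, beta):
    """Invert the decomposition: pi(a|s) = pi_ref(a|s) * exp((Q(s, a) - V(s)) / beta)."""
    return [math.exp(lr + (q - value) / beta) for q, lr in zip(q_values, logp_ref)]

def soft_value(q_values, logp_ref, beta):
    """Soft value implied by Q: V(s) = beta * log(sum_a pi_ref(a|s) * exp(Q(s, a) / beta))."""
    return beta * math.log(sum(math.exp(lr + q / beta) for q, lr in zip(q_values, logp_ref)))

# Toy three-token vocabulary (hypothetical numbers).
policy = [0.7, 0.2, 0.1]
reference = [0.5, 0.3, 0.2]
logp_policy = [math.log(p) for p in policy]
logp_ref = [math.log(p) for p in reference]

q = q_from_policy_and_value(logp_policy, logp_ref, value=1.0, beta=0.1)
recovered = policy_from_q(q, logp_ref, value=1.0, beta=0.1)
print(q, recovered, soft_value(q, logp_ref, beta=0.1))
```

Under this assumed form, a policy that equals the reference gives Q(s, a) = V(s) for every token, which is one way to read how such a parametrisation can start from an SFT policy.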

Problem Statement

PPO is the de facto algorithm for LM-RLHF but handles the KL constraint heuristically. We need a token-level action-value method that (1) fits the KL-regularised LM setting, (2) lets you initialise from SFT policies, and (3) offers clearer theory and comparable or better empirical outcomes.

Main Contribution

KLQ: a token-level, KL-regularised Q-learning algorithm for LM-RLHF that trains an action-value function parametrised by the policy and a value head.

A λ-return based value estimator adapted to action-value learning to propagate sparse final reward signals across tokens.
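How a λ-return spreads a sparse final reward across tokens can be sketched with the textbook TD(λ) recursion on state values. This is a generic illustration, not the paper's exact estimator: KLQ adapts the idea to action-values, and in the KL-regularised setting the per-token rewards would also carry a KL penalty. The function name, the discount `gamma`, and the toy numbers are assumptions.

```python
def lambda_returns(rewards, values, lam, gamma=1.0):
    """Backward recursion G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}).

    rewards[t] is the reward received after emitting token t.
    values[t] is the value estimate of the state before token t.
    The value and return after the final token are taken to be zero.
    """
    returns = [0.0] * len(rewards)
    next_return = 0.0
    next_value = 0.0
    for t in reversed(range(len(rewards))):
        returns[t] = rewards[t] + gamma * ((1 - lam) * next_value + lam * next_return)
        next_return = returns[t]
        next_value = values[t]
    return returns

# Sparse reward: only the last token is scored (hypothetical numbers).
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5]
for lam in (0.0, 0.5, 1.0):
    print(lam, lambda_returns(rewards, values, lam))
```

With `lam=1.0` every token sees the full final reward (Monte Carlo), while `lam=0.0` relies entirely on the next-step value estimate; intermediate values trade variance against bias.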

Key Findings

KLQ matches PPO on the LM-RLHF reward objective and uses about the same compute per run.

Numbers: 75,000 episodes per run; training ~5 hours on 4 A100 GPUs; near-identical final validation rewards (figures shown).

Practical Use: If you currently use PPO for RLHF, you can switch to KLQ without extra compute and expect similar reward optimization out of the box.

Evidence Ref: Sections 5.1, E.1; Table 3; Figure 1

KLQ's final policies are consistently preferred by an LLM judge compared to PPO across tested KL penalties.

Numbers: Evaluations used 32 validation prompts in both orderings (64 comparisons) per KL coefficient; KLQ had the higher win-rate.

Practical Use: Use KLQ if human-like judgments (via LLM-as-a-judge) matter; it may yield higher-quality completions while matching numeric reward.

Evidence Ref: Section 5.3; Figure 3; Appendix F
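The reason for judging each prompt in both orderings can be shown with a short sketch: swapping positions cancels a judge's preference for whichever completion appears first. The data layout (pairs of booleans) and the function name are hypothetical, and the verdicts below are synthetic, not results from the paper.

```python
def klq_win_rate(verdicts):
    """Fraction of comparisons in which the judge picked the KLQ completion.

    verdicts: iterable of (klq_shown_first, judge_picked_first) booleans,
    one entry per prompt and ordering.
    """
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("no comparisons supplied")
    wins = sum(1 for klq_first, picked_first in verdicts if klq_first == picked_first)
    return wins / len(verdicts)

# A purely position-biased judge that always picks the first completion:
# 32 prompts x 2 orderings = 64 comparisons.
biased_judge = [(True, True), (False, True)] * 32
print(len(biased_judge), klq_win_rate(biased_judge))
```

Because each prompt appears once with KLQ first and once with PPO first, a judge driven only by position lands on a 50% win-rate, so any departure from 50% reflects something other than position.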

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Final RLHF reward (validation) | KLQ ≈ PPO (similar final validation reward curves) | PPO | no significant gap | TL;DR, Anthropic-HH (validation) | Figure 1; Figure 4; Section 5.1 | Sections 5.1, E.1; Figures 1 and 4 |
| Compute per run (wall-clock) | ≈5 hours on 4 A100 GPUs | PPO | negligible difference (per Table 3) | TL;DR and HH | Table 3; Section 5 | Appendix E.1 Table 3 |

What To Try In 7 Days

Run a small-scale KLQ finetune from your SFT model on a representative dataset to compare reward and judged quality.

Use the same hyperparameters and compute budget as your PPO runs to isolate algorithmic differences.

Run an LLM-as-a-judge comparison on ~30 prompts to get a quick signal of qualitative preference.
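Before reading much into a ~30-prompt judge comparison, it helps to know how lopsided the result must be to stand out from a coin flip. The sketch below uses an exact one-sided binomial test against a 50% win-rate; the function names and the 0.05 threshold are assumptions, and treating the two orderings of a prompt as independent comparisons is a simplification.

```python
from math import comb

def binomial_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def min_wins_for_significance(n, alpha=0.05):
    """Smallest win count whose one-sided tail probability under p = 0.5 is <= alpha."""
    for k in range(n + 1):
        if binomial_tail(k, n) <= alpha:
            return k
    return None  # no win count is significant at this sample size

# 30 prompts judged in both orderings gives 60 comparisons.
n_comparisons = 60
threshold = min_wins_for_significance(n_comparisons)
print(threshold, threshold / n_comparisons)
```

If the observed win count falls below the printed threshold, the quick signal is compatible with no real preference, and more prompts or repeated runs are needed.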

Optimization Features

Token Efficiency
token-level (per-token) reward attribution via KL regularisation
Training Optimization
λ-returns for action-values (propagates sparse final rewards)
action-value decomposition linking Q to policy and value head
on-policy minibatch gradient updates (multiple epochs per batch)

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

TL;DR dataset
Anthropic-HH dataset
HuggingFace model hub (used SFT and reward models)

Risks & Boundaries

Limitations

Experiments are small-scale: 75k episodes per run and ~5 hours on 4 A100 GPUs; few repeats.

Used off-the-shelf SFT and reward models from HuggingFace; reward model quality and biases were not controlled.

When Not To Use

If you need thoroughly validated large-scale RLHF results: KLQ currently has only preliminary-scale evidence.

If you cannot add a token-level value head, or you prefer completion-level grouped-rollout methods that need no value head.

Failure Modes

Training noise can dominate small differences between KLQ and PPO, leading to ambiguous outcomes.

Poor reward models or SFT policies may bias both methods and hide algorithmic benefits.

Core Entities

Models

SFT
Pythia-1B based reward model (TL;DR)
GPT2-large based reward model (Anthropic-HH)
GPT-4o mini (LLM-as-a-judge)
TRL (custom fork) training library

Metrics

RLHF reward model score
SFT
LLM-as-a-judge win-rate
wall-clock training time

Datasets

TL;DR (summarisation)
Anthropic-HH (single-turn dialogue)