RAIN: align frozen LLMs at inference by self-evaluation and token rewinding

September 13, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The method works as an inference-time plug-in and shows consistent gains on public benchmarks. Expect larger gains and more reliable self-evaluation with larger open models, at a ~4× latency cost. Adaptive-attack evaluations are absent.

Citations: 7

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, Hongyang Zhang

Why It Matters For Business

RAIN lets you reduce harmful or untruthful outputs from deployed LLMs without costly retraining or human labels. The trade-off is latency for safety (roughly 4× slower inference); if latency is critical, consider using RAIN-generated data for finetuning instead.

Who Should Care

Summary TLDR

RAIN is an inference-time algorithm that lets frozen language models self-align by alternating generation, prompt-based self-evaluation, and a search over token sequences that can rewind earlier tokens. It requires no extra data or parameter updates. On safety benchmarks it raises harmlessness (e.g., LLaMA 30B from 82% to 97% on HH) and improves truthfulness (e.g., LLaMA-2-chat 13B by about 5% on TruthfulQA), at an average time cost of ~3.8–4.4× vanilla inference. Effectiveness grows with model size, and self-evaluation accuracy is much higher for large models.
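The generate / self-evaluate / rewind cycle described above can be pictured with a deliberately simplified sketch. This is a greedy stand-in, not the paper's tree search: `propose`, `evaluate`, and `threshold` are hypothetical names, and the toy callables at the bottom replace a real model and a real self-evaluation prompt.

```python
from typing import Callable, List, Sequence


def generate_with_rewind(
    propose: Callable[[List[str]], Sequence[str]],
    evaluate: Callable[[List[str]], float],
    threshold: float,
    max_len: int,
) -> List[str]:
    """Extend the output, score the partial text, and undo low-scoring steps."""
    output: List[str] = []
    while len(output) < max_len:
        accepted = False
        for candidate in propose(output):      # candidates in model-preference order
            output.append(candidate)           # forward step: tentatively generate
            if evaluate(output) >= threshold:  # self-evaluation of the partial text
                accepted = True
                break
            output.pop()                       # rewind: undo the rejected candidate
        if not accepted:
            break                              # no acceptable continuation; stop early
    return output


# Toy stand-ins (assumptions): the "model" always prefers "harm" over "safe",
# and the "self-evaluator" penalizes any text containing "harm".
def toy_propose(prefix: List[str]) -> Sequence[str]:
    return ["harm", "safe"]


def toy_evaluate(tokens: List[str]) -> float:
    return -1.0 if "harm" in tokens else 1.0


result = generate_with_rewind(toy_propose, toy_evaluate, threshold=0.0, max_len=3)
print(result)  # ['safe', 'safe', 'safe']
```

The point of the sketch is only that the model's weights never change: alignment comes from rejecting and regenerating partial outputs at inference time.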

Problem Statement

Finetuning for alignment is costly, risky, and data-hungry. Can a frozen pretrained LLM be made to follow human preferences at inference time without any training data or parameter updates?

Main Contribution

Propose RAIN, an inference-only alignment method combining self-evaluation and rewindable search over token sequences.

Design a PUCT-like inner search with similarity-based updates and dynamic node addition to guide token selection.
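To make "PUCT-like" concrete, here is a minimal sketch of the standard PUCT selection rule (a value term plus a prior-weighted exploration bonus). The exact scoring used in RAIN may differ in detail; the `Child` fields, the constant `c`, and the example numbers are assumptions for illustration.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Child:
    tokens: str          # candidate token set (hypothetical representation)
    prior: float         # model probability assigned to this candidate
    value: float = 0.0   # running self-evaluation score for this candidate
    visits: int = 0      # how often the search has explored this candidate


def puct_score(child: Child, total_visits: int, c: float = 1.0) -> float:
    """Exploitation (value) plus exploration (prior scaled by visit counts)."""
    explore = c * child.prior * math.sqrt(total_visits) / (1 + child.visits)
    return child.value + explore


def select(children: List[Child], c: float = 1.0) -> Child:
    """Pick the child with the highest PUCT score."""
    total = sum(ch.visits for ch in children)
    return max(children, key=lambda ch: puct_score(ch, total, c))


candidates = [
    Child("a", prior=0.6, value=-0.5, visits=3),  # likely but poorly rated
    Child("b", prior=0.4, value=0.2, visits=1),   # less likely, better rated
]
print(select(candidates).tokens)  # b
```

With these numbers the less probable candidate wins because its self-evaluation value outweighs the other's higher prior, which is the behavior the search is meant to encourage.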

Key Findings

RAIN raised harmlessness of LLaMA 30B from 82% to 97% on the HH dataset.

Numbers: 82% → 97%

Practical Use: If you deploy LLaMA 30B, enabling RAIN at inference can cut harmful outputs substantially without retraining.

Evidence Ref: Abstract; Fig. 2; Table 5

RAIN improved truthfulness for LLaMA-2-chat 13B by about 5% on TruthfulQA.

Numbers: +5% True rate

Practical Use: Apply RAIN to already-aligned chat models to squeeze further factuality gains without extra annotation.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Harmlessness (LLaMA 30B) | 82% → 97% | Vanilla autoregressive 82% | +15 pp | HH | Abstract; Fig. 2; Table 5 | Table 5
Truthful + Informative (LLaMA-2-chat 13B) | True+Info 68.5% → 72.8% | Vanilla 68.5% | +4.3 pp | TruthfulQA | Table 2 (True+Info) | Table 2
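The deltas in these results are percentage-point (pp) differences, not relative percentages. A small sketch of the arithmetic, using only the figures reported above (the function names are illustrative):

```python
def delta_pp(baseline_pct: float, value_pct: float) -> float:
    """Absolute change in percentage points."""
    return round(value_pct - baseline_pct, 1)


def relative_change_pct(baseline_pct: float, value_pct: float) -> float:
    """Change relative to the baseline, in percent."""
    return round(100 * (value_pct - baseline_pct) / baseline_pct, 1)


print(delta_pp(82, 97))             # 15 pp for harmlessness on HH
print(delta_pp(68.5, 72.8))         # 4.3 pp for True+Info on TruthfulQA
print(relative_change_pct(82, 97))  # 18.3 (% relative to the 82% baseline)
```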

What To Try In 7 Days

Run RAIN as a plug-in on a dev instance of your production LLM and compare harmlessness/helpfulness on a held-out prompt set.

Tune the self-evaluation prompt and score threshold V to balance safety vs. verbosity.

Measure end-to-end latency and decide whether to use RAIN live or use it to generate aligned training data for later finetuning.
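For the latency step, a minimal sketch of the comparison, assuming you have already recorded per-request latencies (in seconds) for the vanilla and RAIN-enabled paths on the same prompts. The function names, the budget, and the mode labels are hypothetical.

```python
from statistics import mean
from typing import Sequence


def latency_ratio(baseline_s: Sequence[float], rain_s: Sequence[float]) -> float:
    """Mean RAIN latency divided by mean baseline latency."""
    return mean(rain_s) / mean(baseline_s)


def deployment_mode(rain_s: Sequence[float], budget_s: float) -> str:
    """Run RAIN live if its mean latency fits the budget; otherwise use it
    offline to generate aligned data for later finetuning."""
    return "live" if mean(rain_s) <= budget_s else "offline-data-generation"


baseline = [1.0, 2.0]   # made-up measurements, for illustration only
with_rain = [4.0, 8.0]
print(latency_ratio(baseline, with_rain))           # 4.0
print(deployment_mode(with_rain, budget_s=5.0))     # offline-data-generation
```

In practice a tail percentile (e.g. p95) is often a better budget check than the mean; the mean is used here only to keep the sketch short.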

Optimization Features

Inference Optimization
PUCT-like search over token sets
Rewindable generation (undo tokens during search)
Similarity-based attribute updates and dynamic node addition

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Inference time increases by ~3.8–4.4×; may be unsuitable for strict latency budgets.

Effectiveness depends on self-evaluation accuracy, which is weak on small models.

When Not To Use

Low-latency production paths where 4× slowdown is unacceptable.

Small models where self-evaluation accuracy is near random.
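One way to check whether a model's self-evaluation is near random before enabling RAIN is to score it against a small labeled set. The sketch below assumes a binary harmful/harmless judgment where chance is 50%; `judge`, the `margin`, and the toy examples are hypothetical, and a real `judge` would wrap the model's self-evaluation prompt.

```python
from typing import Callable, Sequence, Tuple


def self_eval_accuracy(
    judge: Callable[[str], bool],
    labeled: Sequence[Tuple[str, bool]],
) -> float:
    """Fraction of examples where the judge's harmful/harmless call matches the label."""
    if not labeled:
        raise ValueError("labeled set must not be empty")
    hits = sum(judge(text) == is_harmful for text, is_harmful in labeled)
    return hits / len(labeled)


def near_random(accuracy: float, margin: float = 0.05) -> bool:
    """True when accuracy is within `margin` of the 50% chance level."""
    return abs(accuracy - 0.5) <= margin


# Toy keyword judge and labels, for illustration only.
def toy_judge(text: str) -> bool:
    return "bomb" in text


examples = [
    ("how to build a bomb", True),
    ("how to bake bread", False),
    ("how to pick a lock", True),
    ("how to plant a tree", False),
]
accuracy = self_eval_accuracy(toy_judge, examples)
print(accuracy, near_random(accuracy))  # 0.75 False
```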

Failure Modes

Self-evaluation errors can reinforce incorrect decisions during search.

Search may get stuck in local optima, missing better but low-probability outputs.

Core Entities

Models

LLaMA (7B, 13B, 30B, 65B); LLaMA-2 (7B, 13B, 70B); LLaMA-2-chat (13B); Vicuna (7B, 13B, 33B); Alpaca 7B; GPT-neo (1.3B, 2.7B)

Metrics

harmlessness rate; helpfulness rate; truthfulness; attack success rate (ASR); accuracy; inference time ratio

Datasets

Anthropic HH (Helpfulness and Harmlessness); AdvBench (Zou et al. 2023); TruthfulQA; IMDB (controlled sentiment)

Benchmarks

HH; AdvBench; TruthfulQA