RAIN: align frozen LLMs at inference by self-evaluation and token rewinding

September 13, 2023 | 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

7

Authors

Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, Hongyang Zhang

Why It Matters For Business

RAIN reduces harmful or untruthful outputs from deployed LLMs without retraining or human labels. The cost is roughly 4× inference latency; if latency is critical, consider using RAIN-generated outputs as finetuning data instead of running it live.

Summary TLDR

RAIN is an inference-time algorithm that lets frozen language models align themselves by alternating forward generation, self-evaluation (via a prompt), and rewinding over token sequences during search. It requires no extra data or parameter updates. On the reported benchmarks it raises harmlessness (e.g., LLaMA 30B from 82% to 97% on HH) and improves truthfulness (e.g., LLaMA-2-chat 13B by about 5% on TruthfulQA), at an average inference-time cost of roughly 3.8–4.4× vanilla decoding. Effectiveness grows with model size, and self-evaluation accuracy is much higher for larger models.
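
The generate, self-evaluate, rewind cycle described above can be illustrated with a deliberately simplified sketch. This is not the authors' implementation: it replaces the tree search with a greedy retry loop, and `propose`, `self_eval`, `is_done`, `threshold`, and `max_tries` are hypothetical names standing in for the model's sampling call, the self-evaluation prompt, and the stopping rule.

```python
import itertools
from typing import Callable, List

def rewind_decode(
    propose: Callable[[List[str]], List[str]],
    self_eval: Callable[[List[str]], float],
    is_done: Callable[[List[str]], bool],
    threshold: float = 0.5,
    max_tries: int = 4,
    max_steps: int = 32,
) -> List[str]:
    """Toy greedy variant: keep a proposed chunk only if its self-eval score is good."""
    tokens: List[str] = []
    for _ in range(max_steps):
        if is_done(tokens):
            break
        best_chunk: List[str] = []
        best_score = float("-inf")
        for _ in range(max_tries):
            chunk = propose(tokens)            # forward generation
            score = self_eval(tokens + chunk)  # self-evaluation (stubbed here)
            if score > best_score:
                best_chunk, best_score = chunk, score
            if score >= threshold:
                break                          # good enough, stop searching
            # otherwise the chunk is discarded (rewind) and another is tried
        tokens = tokens + best_chunk
    return tokens

# Deterministic stubs (assumptions for illustration only).
_candidates = itertools.cycle([["harm"], ["safe"]])
out = rewind_decode(
    propose=lambda toks: next(_candidates),
    self_eval=lambda toks: 0.0 if "harm" in toks else 1.0,
    is_done=lambda toks: len(toks) >= 3,
)
print(out)
```

With these stubs every "harm" proposal scores below the threshold and is rewound, so only "safe" chunks are kept. The model weights are never touched; all the extra work happens at decoding time, which is where the latency overhead comes from.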

Problem Statement

Finetuning for alignment is costly, risky, and data-hungry. Can a frozen pretrained LLM be made to follow human preferences at inference time without any training data or parameter updates?

Main Contribution

Propose RAIN, an inference-only alignment method combining self-evaluation and rewindable search over token sequences.

Design a PUCT-like inner search with similarity-based updates and dynamic node addition to guide token selection.

Demonstrate safety and truthfulness gains across open-source LLMs and benchmarks without finetuning.

Show improved robustness to static adversarial suffix attacks (AdvBench/GCG) and provide ablations for core components.

Release code as a plug-in-style inference module (no model changes required).
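
To make "PUCT-like" concrete, here is a minimal sketch of the generic PUCT selection rule (value plus a prior-weighted exploration bonus that shrinks with visits). It is the textbook form, not necessarily the exact scoring the paper uses; `Node`, `puct_score`, `select`, `update`, and the constant `c` are illustrative names.

```python
import math
from dataclasses import dataclass
from typing import List

@dataclass
class Node:
    tokens: str          # the token set this node would append
    prior: float         # model probability of this token set
    value: float = 0.0   # running mean of self-evaluation scores
    visits: int = 0

def puct_score(node: Node, parent_visits: int, c: float = 1.0) -> float:
    # Exploitation (value) plus exploration (high prior, few visits).
    explore = c * node.prior * math.sqrt(parent_visits) / (1 + node.visits)
    return node.value + explore

def select(children: List[Node], c: float = 1.0) -> Node:
    parent_visits = sum(ch.visits for ch in children)
    return max(children, key=lambda ch: puct_score(ch, parent_visits, c))

def update(node: Node, score: float) -> None:
    # Incremental mean of the scores seen at this node.
    node.visits += 1
    node.value += (score - node.value) / node.visits

children = [
    Node("a", prior=0.7, value=0.2, visits=10),
    Node("b", prior=0.3, value=0.1, visits=0),
]
chosen = select(children)
print(chosen.tokens)
```

In this example the unvisited node "b" is selected even though its value and prior are lower, because the exploration bonus dominates for nodes with few visits.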

Key Findings

RAIN raised harmlessness of LLaMA 30B from 82% to 97% on the HH dataset.

Numbers: 82% → 97%

RAIN improved truthfulness for LLaMA-2-chat 13B by about 5% on TruthfulQA.

Numbers: +5% True rate

RAIN strongly reduced static adversarial attack success rates (example: Vicuna 33B white-box 94% → 19%).

Numbers: 94% → 19% (white-box)

Self-evaluation accuracy grows with model size (e.g., LLaMA v1 30B 81%, v1 65B 84%, LLaMA-2 70B 98%).

Numbers: LLaMA 30B: 81%; LLaMA 65B: 84%; LLaMA-2 70B: 98%

RAIN incurs an inference time overhead of roughly 3.8×–4.4× compared to vanilla autoregressive decoding.

Numbers: time ratio ≈ 3.78–4.36×
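
For capacity planning, the reported time ratio translates directly into a latency budget check. A minimal sketch follows; the 500 ms vanilla latency and the budgets are hypothetical numbers, and the function names are illustrative.

```python
def rain_latency_ms(vanilla_ms: float, ratio: float) -> float:
    """Estimated RAIN latency given vanilla latency and a RAIN/vanilla time ratio."""
    return vanilla_ms * ratio

def fits_budget(vanilla_ms: float, budget_ms: float, ratios=(3.78, 4.36)) -> bool:
    """True only if the estimate fits the budget at both ends of the reported range."""
    return all(rain_latency_ms(vanilla_ms, r) <= budget_ms for r in ratios)

low = rain_latency_ms(500, 3.78)   # about 1.9 s
high = rain_latency_ms(500, 4.36)  # about 2.2 s
print(round(low), round(high), fits_budget(500, 2000), fits_budget(500, 2200))
```

A request that takes 500 ms with vanilla decoding would land at roughly 1.9 to 2.2 seconds under RAIN, so a 2-second budget is not safe at the upper end of the range.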

Results

Harmlessness (LLaMA 30B)

Value: 82% → 97%

Baseline: vanilla autoregressive decoding, 82%

Truthful + Informative (LLaMA-2-chat 13B)

Value: True+Info 68.5% → 72.8%

Baseline: vanilla, 68.5%

Adversarial attack success (Vicuna 33B, white-box)

Value: 94% → 19%

Baseline: vanilla autoregressive decoding, 94%

Self-evaluation accuracy

Value: LLaMA 30B: 81%; LLaMA 65B: 84%; LLaMA-2 70B: 98%

Baseline: random guessing, ~50%

Inference time overhead

Value: RAIN/vanilla time ratio ≈ 3.78–4.36×

Baseline: vanilla autoregressive inference
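
The headline figures are easier to compare as relative changes. The short calculation below uses only the numbers reported above; treating the harmful-output rate as 100% minus the harmlessness rate is an assumption made for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a rate dropped, relative to its starting value."""
    return (before - after) / before

asr_drop = relative_reduction(94, 19)                 # attack success rate, Vicuna 33B
harmful_drop = relative_reduction(100 - 82, 100 - 97) # harmful rate, LLaMA 30B
true_info_gain = round(72.8 - 68.5, 1)                # percentage points, LLaMA-2-chat 13B

print(round(asr_drop, 3), round(harmful_drop, 3), true_info_gain)
```

Read this way, the white-box attack success rate falls by about 80% in relative terms, harmful outputs fall by about 83%, and True+Info rises by 4.3 percentage points.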

What To Try In 7 Days

Run RAIN as a plug-in on a dev instance of your production LLM and compare harmlessness/helpfulness on a held-out prompt set.

Tune the self-evaluation prompt and score threshold V to balance safety vs. verbosity.

Measure end-to-end latency and decide whether to use RAIN live or use it to generate aligned training data for later finetuning.
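
The comparison and latency steps above can be sketched as a small harness. Everything here is a placeholder under stated assumptions: `baseline` and `candidate` stand for your vanilla and RAIN-wrapped generation functions, `is_harmless` stands for whatever judge you use (human labels, a classifier, or a stronger model), and the prompts are toy strings.

```python
import time
from typing import Callable, Dict, List

def compare(
    prompts: List[str],
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
    is_harmless: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Run both generators on the same prompts; report harmless rates and time ratio."""
    def run(gen: Callable[[str], str]):
        start = time.perf_counter()
        outputs = [gen(p) for p in prompts]
        elapsed = time.perf_counter() - start
        rate = sum(is_harmless(p, o) for p, o in zip(prompts, outputs)) / len(prompts)
        return rate, elapsed

    base_rate, base_t = run(baseline)
    cand_rate, cand_t = run(candidate)
    return {
        "baseline_harmless": base_rate,
        "candidate_harmless": cand_rate,
        "time_ratio": cand_t / base_t if base_t > 0 else float("inf"),
    }

# Toy stand-ins (assumptions for illustration only).
prompts = ["hello", "attack plan", "weather", "attack code"]
report = compare(
    prompts,
    baseline=lambda p: "bad" if "attack" in p else "ok",
    candidate=lambda p: "refuse" if "attack" in p else "ok",
    is_harmless=lambda p, o: o != "bad",
)
print(report["baseline_harmless"], report["candidate_harmless"])
```

Use the same held-out prompts for both runs, and track helpfulness with a second judge so that a model that simply refuses everything does not look like a win.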

Optimization Features

Inference Optimization

  • PUCT-like search over token sets
  • rewindable generation (undo tokens during search)
  • similarity-based attribute updates and dynamic node addition
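
As a rough illustration of a similarity-based update, the sketch below shares one self-evaluation score with sibling candidates whose embeddings are close to the visited one, so that similar continuations do not each need their own evaluation. The threshold `min_sim`, the discount `weight`, the fractional visit counts, and the toy 2-D embeddings are all assumptions, not values or mechanics taken from the paper.

```python
import math
from typing import Dict, List

def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_update(
    values: Dict[str, float],
    visits: Dict[str, float],
    embeddings: Dict[str, List[float]],
    visited: str,
    score: float,
    min_sim: float = 0.9,
    weight: float = 0.5,
) -> None:
    """Update the visited node fully and similar siblings with a discounted weight."""
    for name, emb in embeddings.items():
        if name == visited:
            w = 1.0
        else:
            sim = cosine(embeddings[visited], emb)
            if sim < min_sim:
                continue                  # too dissimilar: leave untouched
            w = weight * sim
        visits[name] += w                 # fractional visit count
        values[name] += w * (score - values[name]) / visits[name]  # weighted mean

embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0]}
values = {"a": 0.5, "b": 0.5, "c": 0.5}
visits = {"a": 1.0, "b": 1.0, "c": 1.0}
similarity_update(values, visits, embeddings, visited="a", score=1.0)
print(values)
```

Here "b" is nearly parallel to "a" and moves part of the way toward the new score, while the orthogonal "c" is unchanged.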

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Inference time increases by ~3.8–4.4×; may be unsuitable for strict latency budgets.
  • Effectiveness depends on self-evaluation accuracy, which is weak on small models.
  • Evaluations focus on static adversarial attacks; no guarantee against adaptive attackers.
  • Relies on prompt design for self-evaluation; poorly chosen prompts can bias results.

When Not To Use

  • Low-latency production paths where 4× slowdown is unacceptable.
  • Small models where self-evaluation accuracy is near random.
  • Against adaptive adversaries engineered to exploit the search and self-eval loop.

Failure Modes

  • Self-evaluation errors can reinforce incorrect decisions during search.
  • Search may get stuck in local optima, missing better but low-probability outputs.
  • Adversaries could craft suffixes that fool self-evaluation or exploit similarity updates.
  • Higher cost of inference may make deployment impractical without downstream finetuning.

Core Entities

Models

  • LLaMA (7B,13B,30B,65B)
  • LLaMA-2 (7B,13B,70B)
  • LLaMA-2-chat (13B)
  • Vicuna (7B,13B,33B)
  • Alpaca 7B
  • GPT-Neo (1.3B,2.7B)

Metrics

  • harmlessness rate
  • helpfulness rate
  • truthfulness
  • attack success rate (ASR)
  • self-evaluation accuracy
  • inference time ratio

Datasets

  • Anthropic HH (Helpfulness and Harmlessness)
  • AdvBench (Zou et al. 2023)
  • TruthfulQA
  • IMDB (controlled sentiment)

Benchmarks

  • HH
  • AdvBench
  • TruthfulQA