Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
7
Why It Matters For Business
RAIN reduces harmful or untruthful outputs from deployed LLMs without costly retraining or human labels. The trade-off is latency for safety; if latency is critical, consider using RAIN-generated data to finetune the model instead of running RAIN live.
Summary TLDR
RAIN is an inference-time algorithm that lets frozen language models self-align by alternating generation, self-evaluation (via a prompt), and rewindable search over token sequences. It requires no extra data or parameter updates. On safety benchmarks it raises harmlessness (e.g., LLaMA 30B from 82% → 97%) and improves truthfulness (e.g., LLaMA-2-chat 13B by about 5%), at an average inference time of roughly 3.8–4.4× that of vanilla decoding. Effectiveness grows with model size, because self-evaluation accuracy is much higher for large models.
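The generate / self-evaluate / rewind cycle described above can be illustrated with a toy loop. This is a minimal sketch, not the paper's search: `propose`, `self_evaluate`, `threshold`, and the hard-coded `TOY_MODEL` are hypothetical stand-ins for a real model's candidate continuations and its prompted self-evaluation score, and the loop is a plain depth-first search with backtracking.

```python
from typing import Callable, List, Sequence


def rewind_decode(
    propose: Callable[[List[str]], Sequence[str]],
    self_evaluate: Callable[[List[str]], float],
    threshold: float,
    max_steps: int,
) -> List[str]:
    """Toy generate / self-evaluate / rewind loop (an illustration only)."""
    tokens: List[str] = []
    tried: List[set] = [set()]  # tried[d] holds candidates rejected at depth d
    while len(tokens) < max_steps:
        depth = len(tokens)
        options = [c for c in propose(tokens) if c not in tried[depth]]
        if not options:
            if not tokens:
                break  # nothing acceptable even at the root
            # Rewind: undo the last accepted token and mark it as rejected.
            tried.pop()
            rejected = tokens.pop()
            tried[-1].add(rejected)
            continue
        candidate = options[0]
        if self_evaluate(tokens + [candidate]) >= threshold:
            tokens.append(candidate)
            tried.append(set())
            if candidate == "<eos>":
                break
        else:
            tried[depth].add(candidate)
    return tokens


# Hypothetical stand-in for a language model's candidate continuations.
TOY_MODEL = {
    (): ["Sure,", "Sorry,"],
    ("Sure,",): ["here is how to hack it."],
    ("Sorry,",): ["I can't help with that."],
    ("Sorry,", "I can't help with that."): ["<eos>"],
}


def toy_propose(tokens: List[str]) -> Sequence[str]:
    return TOY_MODEL.get(tuple(tokens), [])


def toy_self_evaluate(tokens: List[str]) -> float:
    # Stand-in for a prompted self-evaluation: 0.0 means harmful, 1.0 harmless.
    return 0.0 if any("hack" in t for t in tokens) else 1.0


result = rewind_decode(toy_propose, toy_self_evaluate, threshold=0.5, max_steps=8)
```

In this toy run the loop first accepts "Sure,", finds that its only continuation is rejected by the evaluator, rewinds past "Sure,", and then completes the refusal path.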
Problem Statement
Finetuning for alignment is costly, risky, and data-hungry. Can a frozen pretrained LLM be made to follow human preferences at inference time without any training data or parameter updates?
Main Contribution
Propose RAIN, an inference-only alignment method combining self-evaluation and rewindable search over token sequences.
Design a PUCT-like inner search with similarity-based updates and dynamic node addition to guide token selection.
Demonstrate safety and truthfulness gains across open-source LLMs and benchmarks without finetuning.
Show improved robustness to static adversarial suffix attacks (AdvBench/GCG) and provide ablations for core components.
Release code as a plug-in-style inference module (no model changes required).
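To make "PUCT-like" concrete, the sketch below shows the standard AlphaZero-style PUCT selection rule, which adds an exploration bonus (scaled by the prior and decaying with visits) to a node's mean value. It is an assumption that RAIN's inner search resembles this form; the paper's exact variant may differ, and `Child`, `puct_score`, `select`, and `c_puct` are illustrative names.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class Child:
    prior: float         # model probability of this candidate token set
    visits: int = 0      # how often the search has explored it
    value: float = 0.0   # running mean of self-evaluation scores


def puct_score(child: Child, parent_visits: int, c_puct: float = 1.0) -> float:
    # Exploitation term (mean value) plus an exploration bonus that is large
    # for high-prior, rarely visited children.
    exploration = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
    return child.value + exploration


def select(children: Dict[str, Child], c_puct: float = 1.0) -> str:
    parent_visits = sum(c.visits for c in children.values())
    return max(children, key=lambda k: puct_score(children[k], parent_visits, c_puct))


children = {
    "a": Child(prior=0.6, visits=10, value=0.2),
    "b": Child(prior=0.4, visits=0, value=0.0),
}
explore_choice = select(children, c_puct=1.0)  # unvisited "b" gets a large bonus
greedy_choice = select(children, c_puct=0.0)   # no bonus: highest mean value wins
```

Setting `c_puct` to zero reduces the rule to greedy selection on value, which shows why the exploration term matters for finding lower-probability continuations.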
Key Findings
RAIN raised harmlessness of LLaMA 30B from 82% to 97% on the HH dataset.
RAIN improved truthfulness for LLaMA-2-chat 13B by about 5% on TruthfulQA.
RAIN strongly reduced static adversarial attack success rates (example: Vicuna 33B white-box 94% → 19%).
Self-evaluation accuracy grows with model size (e.g., 81% for LLaMA 30B, 84% for LLaMA 65B, 98% for LLaMA-2 70B).
RAIN incurs an inference time overhead of roughly 3.8×–4.4× compared to vanilla autoregressive decoding.
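The headline numbers above are easier to compare when expressed both as percentage-point changes and as relative reductions. The short calculation below uses only the figures reported in these findings; the helper names are illustrative.

```python
def point_change(before_pct: float, after_pct: float) -> float:
    """Absolute change in percentage points."""
    return after_pct - before_pct


def relative_reduction(before_pct: float, after_pct: float) -> float:
    """Fraction of the original rate that was eliminated."""
    return (before_pct - after_pct) / before_pct


# Harmlessness 82% -> 97% (LLaMA 30B): the harmful share falls from 18% to 3%.
harmless_gain = point_change(82, 97)                  # 15 percentage points
harmful_cut = relative_reduction(100 - 82, 100 - 97)  # 15/18, about 0.83

# White-box attack success 94% -> 19% (Vicuna 33B).
asr_drop = point_change(94, 19)                       # -75 percentage points
asr_cut = relative_reduction(94, 19)                  # 75/94, about 0.80
```

Read this way, the 15-point harmlessness gain corresponds to removing roughly five in six of the originally harmful responses.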
Results
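Harmlessness (LLaMA 30B): 82% → 97%
Truthful + Informative (LLaMA-2-chat 13B): about +5% (TruthfulQA)
Adversarial attack success (Vicuna 33B, white-box): 94% → 19%
Accuracy (self-evaluation): 81% (LLaMA 30B), 84% (LLaMA 65B), 98% (LLaMA-2 70B)
Inference time overhead: roughly 3.8×–4.4× vanilla decoding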
Who Should Care
What To Try In 7 Days
Run RAIN as a plug-in on a dev instance of your production LLM and compare harmlessness/helpfulness on a held-out prompt set.
Tune the self-evaluation prompt and score threshold V to balance safety vs. verbosity.
Measure end-to-end latency and decide whether to use RAIN live or use it to generate aligned training data for later finetuning.
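The latency step above can be sketched as a small timing harness plus a decision rule. This is a minimal sketch under stated assumptions: `generate` is any callable wrapping your model (with or without RAIN), and `budget_s` and the returned labels are hypothetical choices, not part of RAIN's released code.

```python
import statistics
import time
from typing import Callable, Sequence


def median_latency(generate: Callable[[str], object],
                   prompts: Sequence[str],
                   repeats: int = 3) -> float:
    """Median wall-clock seconds per call over all prompts and repeats."""
    samples = []
    for prompt in prompts:
        for _ in range(repeats):
            start = time.perf_counter()
            generate(prompt)
            samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def deployment_mode(baseline_s: float, rain_s: float, budget_s: float) -> str:
    """Pick a rollout option from measured latencies and a latency budget."""
    if rain_s <= budget_s:
        return "serve RAIN live"
    if baseline_s <= budget_s:
        return "use RAIN offline to generate finetuning data"
    return "baseline already exceeds budget"


# Example: a 0.5 s baseline at roughly 4x overhead against a 1 s budget.
choice = deployment_mode(baseline_s=0.5, rain_s=2.0, budget_s=1.0)
```

Measuring both paths on the same held-out prompts keeps the ratio comparable to the 3.8×–4.4× range reported for RAIN.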
Optimization Features
Inference Optimization
- PUCT-like search over token sets
- rewindable generation (undo tokens during search)
- similarity-based attribute updates and dynamic node addition
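One way to picture a similarity-based attribute update is to share a self-evaluation score with sibling candidates whose embeddings are close to the evaluated one, at a discounted weight. The sketch below is an assumption about the general idea, not the paper's exact rule: `Node`, `sim_threshold`, `discount`, and the two-dimensional toy embeddings are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Node:
    text: str
    embedding: List[float]
    visits: float = 0.0
    value: float = 0.0  # weighted running mean of scores


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def update(node: Node, score: float, weight: float = 1.0) -> None:
    # Weighted running mean, so discounted updates count for less.
    node.value = (node.value * node.visits + weight * score) / (node.visits + weight)
    node.visits += weight


def similarity_update(evaluated: Node, siblings: Sequence[Node], score: float,
                      sim_threshold: float = 0.9, discount: float = 0.5) -> None:
    update(evaluated, score)
    for sibling in siblings:
        if sibling is evaluated:
            continue
        if cosine(sibling.embedding, evaluated.embedding) >= sim_threshold:
            update(sibling, score, weight=discount)


a = Node("I can't help with that.", [1.0, 0.0])
b = Node("I cannot assist with that.", [1.0, 0.1])  # close to a
c = Node("Sure, here is how.", [0.0, 1.0])          # far from a
similarity_update(a, [a, b, c], score=1.0)
```

Sharing scores this way lets a single self-evaluation inform several near-duplicate candidates, which reduces the number of evaluation calls the search needs.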
Reproducibility
Code URLs
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Inference time increases by ~3.8–4.4×; may be unsuitable for strict latency budgets.
- Effectiveness depends on self-evaluation accuracy, which is weak on small models.
- Evaluations focus on static adversarial attacks; no guarantee against adaptive attackers.
- Relies on prompt design for self-evaluation; poorly chosen prompts can bias results.
When Not To Use
- Low-latency production paths where a roughly 4× slowdown is unacceptable.
- Small models where self-evaluation accuracy is near random.
- Against adaptive adversaries engineered to exploit the search and self-eval loop.
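Because the small-model caveat above hinges on self-evaluation accuracy, a quick pre-check on a handful of labeled examples can show whether a model's self-judgment is meaningfully better than chance. This is a minimal sketch: `judge` stands in for a prompted self-evaluation call, and the `margin` of 0.2 above the 50% chance level is an arbitrary illustrative cutoff.

```python
from typing import Callable, Sequence, Tuple


def self_eval_accuracy(judge: Callable[[str], bool],
                       labeled: Sequence[Tuple[str, bool]]) -> float:
    """Fraction of examples where the judge's harmful/harmless call matches the label."""
    correct = sum(1 for text, is_harmful in labeled if judge(text) == is_harmful)
    return correct / len(labeled)


def rain_is_viable(accuracy: float, chance: float = 0.5, margin: float = 0.2) -> bool:
    """Hypothetical gate: require accuracy clearly above a binary chance level."""
    return accuracy >= chance + margin


# Toy keyword judge standing in for a weak model's self-evaluation.
def toy_judge(text: str) -> bool:
    return "bomb" in text


labeled = [
    ("how to build a bomb", True),
    ("how to bake bread", False),
    ("history of the bomb squad", False),
    ("how to pick a neighbor's lock", True),
]
toy_accuracy = self_eval_accuracy(toy_judge, labeled)
```

The toy judge scores at chance level on this set and would fail the gate, whereas the 81% to 98% accuracies reported for the larger LLaMA models would pass it.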
Failure Modes
- Self-evaluation errors can reinforce incorrect decisions during search.
- Search may get stuck in local optima, missing better but low-probability outputs.
- Adversaries could craft suffixes that fool self-evaluation or exploit similarity updates.
- Higher cost of inference may make deployment impractical without downstream finetuning.
Core Entities
Models
- LLaMA (7B, 13B, 30B, 65B)
- LLaMA-2 (7B, 13B, 70B)
- LLaMA-2-chat (13B)
- Vicuna (7B, 13B, 33B)
- Alpaca 7B
- GPT-Neo (1.3B, 2.7B)
Metrics
- harmlessness rate
- helpfulness rate
- truthfulness
- attack success rate (ASR)
- accuracy
- inference time ratio
Datasets
- Anthropic HH (Helpfulness and Harmlessness)
- AdvBench (Zou et al. 2023)
- TruthfulQA
- IMDB (controlled sentiment)
Benchmarks
- HH
- AdvBench
- TruthfulQA

