Practical survey of methods, attacks, and evaluations for aligning large language models

September 26, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

This is a synthesis of existing literature, not a report of new experiments. It gives a clear map of methods and failure modes, but many of the proposals it covers (scalable oversight, inner-alignment fixes) remain empirically unproven.

Citations: 34

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/0

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 45%

Novelty: 40%

Authors

Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, Deyi Xiong

Why It Matters For Business

Misaligned LLMs can produce legal, reputational, and safety failures. Alignment methods reduce harmful outputs but need governance, red-teaming, and evaluation to manage adversarial and privacy risks.

Who Should Care

Summary TLDR

This 76-page survey organizes LLM alignment work into outer alignment (how we specify goals), inner alignment (how learned objectives can differ), and mechanistic interpretability (reverse-engineering models). It reviews dominant tools (RLHF and supervised fine-tuning), scalable oversight approaches (task decomposition, constitutional AI, debate), vulnerabilities (privacy leaks, backdoors, adversarial prompts), and evaluation resources (benchmarks and human/LLM evaluators). The paper highlights practical gaps: human feedback is a bottleneck, interpretability is mostly toy-scale, and scalable oversight still needs empirical validation.

Problem Statement

LLMs are powerful but can produce biased, false, private, or harmful outputs. Training objectives (e.g., next-token prediction) do not guarantee human-aligned behavior. The survey collects and organizes methods, failure modes, attacks, and evaluation tools to guide research and practice on aligning LLMs to human values.

Main Contribution

Taxonomy that splits LLM alignment into outer alignment, inner alignment, and mechanistic interpretability.

Comprehensive review of outer alignment methods: RLHF, supervised approaches, and scalable oversight proposals.

Key Findings

Reinforcement Learning from Human Feedback (RLHF) is the most common non-recursive oversight method for aligning LLMs.

Practical Use: If you fine-tune chat assistants, expect to use RLHF or comparable reward-ranking methods and plan for reward-model design and stability testing.

Evidence Ref: Section 4.3.1 describes RLHF as 'currently the most commonly used non-recursive' oversight method.
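
To make the reward-model part of the practical note more concrete, here is a minimal sketch of the pairwise ranking loss that is commonly used to train RLHF reward models (a Bradley-Terry style objective). It is an illustration only, not code from the survey; the function names and the idea of passing plain scalar rewards are assumptions made for this example.

```python
import math
from typing import Iterable, Tuple


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model scores the human-preferred
    response above the rejected one, and grows as the ordering is violated.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_accuracy(scored_pairs: Iterable[Tuple[float, float]]) -> float:
    """Fraction of (chosen, rejected) reward pairs ranked in the human order."""
    pairs = list(scored_pairs)
    if not pairs:
        return 0.0
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)


if __name__ == "__main__":
    # Hypothetical reward scores for three preference pairs.
    pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
    mean_loss = sum(pairwise_reward_loss(c, r) for c, r in pairs) / len(pairs)
    print(f"mean loss: {mean_loss:.3f}, accuracy: {preference_accuracy(pairs):.2f}")
```

Tracking held-out preference accuracy alongside the loss is one simple way to approach the stability testing mentioned above; a real pipeline would compute the same quantities over batched tensors.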

A very small supervised instruction dataset can produce strong alignment gains in practice (LIMA).

Numbers: LIMA fine-tunes on 1,000 instruction–response pairs

Practical Use: Before investing in large-scale RLHF, try small, carefully curated SFT instruction datasets to get quick alignment improvements.

Evidence Ref: Section 4.3.2 cites LIMA using 1,000 pairs
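
As a rough illustration of what "small, carefully curated" could mean in practice, the sketch below filters raw instruction–response pairs and serializes them for supervised fine-tuning. The filtering rules, the prompt template, and the `prompt`/`completion` field names are all assumptions for this example; they are not taken from the survey or from LIMA.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical prompt template; adjust to whatever format your trainer expects.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"


def curate_sft_examples(
    raw_pairs: Iterable[Tuple[str, str]],
    max_examples: int = 1000,
    min_response_chars: int = 20,
) -> List[Dict[str, str]]:
    """Deduplicate, filter, and format instruction-response pairs."""
    seen_instructions = set()
    curated: List[Dict[str, str]] = []
    for instruction, response in raw_pairs:
        # Normalize case and whitespace so near-identical instructions collide.
        key = " ".join(instruction.lower().split())
        if not key or key in seen_instructions:
            continue
        # Drop very short responses, a crude stand-in for manual quality review.
        if len(response.strip()) < min_response_chars:
            continue
        seen_instructions.add(key)
        curated.append(
            {
                "prompt": PROMPT_TEMPLATE.format(instruction=instruction.strip()),
                "completion": response.strip(),
            }
        )
        if len(curated) >= max_examples:
            break
    return curated
```

Automated filters like these only narrow the pool; the curation step itself still depends on human review of the remaining examples.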

What To Try In 7 Days

Run red-team prompt attacks and jailbreak tests against deployed chat endpoints to find easy failures.

Evaluate model outputs with pairwise LLM comparisons plus spot human checks to detect bias and hallucination.

Fine-tune a small instruction dataset (≈1,000 examples) to see fast alignment gains before large-scale RLHF.
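
For the pairwise LLM comparison step, a minimal sketch of the aggregation logic is given below. It judges each pair twice with the answer order swapped and counts a win only when both orderings agree, which is one common way to reduce the position bias that LLM judges can show. The `judge` callable is hypothetical: it stands in for whatever LLM call you use and is assumed to return "first", "second", or "tie".

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical judge interface: (prompt, first_answer, second_answer) -> verdict.
Judge = Callable[[str, str, str], str]


def compare_pair(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie" after judging in both presentation orders."""
    forward = judge(prompt, answer_a, answer_b)   # A shown first
    backward = judge(prompt, answer_b, answer_a)  # B shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    # Disagreement between orderings, or explicit ties, count as a tie.
    return "tie"


def win_rates(
    judge: Judge, triples: Iterable[Tuple[str, str, str]]
) -> Dict[str, float]:
    """Aggregate order-swapped verdicts over (prompt, answer_a, answer_b) triples."""
    counts = Counter(compare_pair(judge, p, a, b) for p, a, b in triples)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: counts[label] / total for label in ("A", "B", "tie")}
```

A high tie rate from this procedure is itself a signal worth passing to the human spot checks, since it can indicate either genuinely close answers or an unreliable judge.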

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Survey summarizes literature; it contains no new experiments or reproducible code.

Mechanistic interpretability evidence is limited to small or toy models and may not scale to production LLMs.

When Not To Use

When you need reproducible experimental results or a code recipe to replicate an alignment method.

When you need production-ready interpretability tools for large LLMs.

Failure Modes

Reward modeling / RLHF reward hacking and misgeneralization

Deceptive alignment where learned objectives diverge from specified goals

Core Entities

Models

GPT-4, ChatGPT, GPT-3, Vicuna, GLM-130B

Metrics

pairwise preference, single-answer grading, factual consistency (ALIGNSCORE), toxicity scores

Datasets

LIMA (1k instructions), RealToxicityPrompts, TOXIGEN, ETHICS, MT-bench

Benchmarks

TruthfulQA, ALIGNSCORE, FLASK, MT-bench, BIG-bench HHH

Context Entities

Models

BLOOM, LLaMA/Llama 2, PaLM

Metrics

Accuracy, F1 (coreference), safety score (Sun et al. leaderboard)

Datasets

RealToxicityPrompts (reference), ETHICS (reference)

Benchmarks

StereoSet, CrowS-Pairs, BBQ, Winogender / WinoBias