Practical review of data, training, and evaluation methods to align LLMs with human preferences

July 24, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.7

Citation Count

54

Authors

Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, Qun Liu

Why It Matters For Business

Aligning LLMs reduces risky outputs and increases usefulness; using parameter-efficient tuning cuts compute costs and enables faster iteration.

Summary TLDR

This survey summarizes how researchers collect instruction data, train LLMs to follow human preferences, and evaluate alignment. It covers supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF) and offline ranking/language-based alternatives, plus parameter-efficient tuning (LoRA/QLoRA). The paper reviews closed- and open-set benchmarks, human and LLM-based evaluators, known evaluator biases, and gaps like non-English support and fine-grained instruction management.

Problem Statement

Large pretrained LLMs can produce fluent but misaligned outputs: they may ignore instructions, be biased, or hallucinate facts. Aligning them requires better training data, stable training methods that encode human preferences, and evaluation protocols that capture real-world behavior.

Main Contribution

Survey of instruction data sources: human benchmarks, crowd collections, and synthetic data from strong LLMs

Review of alignment training: SFT, RLHF, offline ranking, language-prefix methods, and parameter-efficient approaches

Summary of evaluation: closed/open benchmarks, human and LLM-based evaluation, and evaluator biases

Catalog of popular aligned models and a shortlist of open research directions

Key Findings

Small sets of high-quality instructions can suffice to produce alignment effects.

Numbers: LLaMA needs ~8K instructions (IFS); other work reports ~6K high-quality instructions

Adding programming instructions can boost reasoning without hurting conversational skills.

Numbers: ≈50% programming instructions improved reasoning in Muennighoff et al.

Parameter-efficient finetuning lets large models be tuned on modest hardware.

Numbers: QLoRA enables fine-tuning a 65B model on a single 48GB GPU using 4-bit quantization
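
A back-of-the-envelope check of why 4-bit quantization makes the 65B / 48GB figure plausible. This is a rough sketch that counts only the frozen base weights; it ignores quantization constants, LoRA adapter weights, activations, and optimizer state, so real usage is higher. The helper name `weight_memory_gb` is illustrative, not taken from the paper.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 65e9  # 65B-parameter model, as in the QLoRA figure above

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # 16-bit weights
int4_gb = weight_memory_gb(N_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {int4_gb:.1f} GB")
print(f"Weights fit in 48 GB before overheads: {int4_gb < 48}")
```

At 16 bits the weights alone exceed a single 48GB card, while at 4 bits they leave headroom for adapters and activations.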

LLM-based evaluators can match humans but show systematic biases.

Numbers: Evaluators show positional, length, and self-enhancement biases (multiple studies)

Specialized small evaluators can approach closed-source LLM performance.

Numbers: PandaLM trained on ~300K synthetic eval instructions achieved near GPT-3.5/GPT-4 parity on meta-evaluation

Results

Instruction count for alignment (IFS)

Value: ≈8K instructions to reach high IFS for LLaMA

High-quality instruction sufficiency

Value: ≈6K high-quality instructions can suffice

Memory-efficient fine-tuning

Value: Fine-tune 65B on single 48GB GPU

Evaluation-model training size

Value: PandaLM trained on ~300K synthetic eval instructions

Baseline: GPT-3.5/GPT-4

Programming-data share effect

Value: ≈50% programming instructions improved reasoning

Who Should Care

What To Try In 7 Days

Seed an instruction set from ShareGPT and popular QA sites for your domain

Fine-tune a base LLaMA using LoRA on a small high-quality instruction sample (≈5–10K)

Set up pairwise evaluation (human or GPT-4) and mitigate LLM-evaluator bias by randomizing order
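
The order-randomization step in the last item can be sketched as follows. This is a minimal illustration, not the survey's protocol: the `judge` callable and its `"first"` / `"second"` / `"tie"` return values are assumptions standing in for a human rater or an LLM call, and `compare_randomized` and `win_rate` are hypothetical helper names.

```python
import random
from typing import Callable, List, Tuple

# Assumed judge interface: (prompt, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def compare_randomized(judge: Judge, prompt: str, answer_a: str,
                       answer_b: str, rng: random.Random) -> str:
    """Return "A", "B", or "tie", presenting the two answers in random order."""
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    verdict = judge(prompt, first, second)
    if verdict == "tie":
        return "tie"
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    # Map the positional verdict back to the system that held that position.
    picked_first = verdict == "first"
    return "A" if picked_first != swapped else "B"


def win_rate(judge: Judge, examples: List[Tuple[str, str, str]],
             seed: int = 0) -> float:
    """Fraction of non-tie comparisons won by system A."""
    rng = random.Random(seed)
    results = [compare_randomized(judge, p, a, b, rng) for p, a, b in examples]
    decided = [r for r in results if r != "tie"]
    return sum(r == "A" for r in decided) / len(decided) if decided else 0.0


# Toy judge that always prefers whatever is shown first (pure positional bias).
def always_first(prompt: str, first: str, second: str) -> str:
    return "first"


examples = [("question", "answer from A", "answer from B")] * 1000
print(f"Win rate of A under a position-biased judge: {win_rate(always_first, examples):.2f}")
```

With a fixed presentation order, the position-biased toy judge would hand system A every comparison; randomizing the order pulls its win rate toward one half, so positional preference no longer masquerades as quality. Randomization averages the bias out across examples; judging each pair in both orders is a stricter alternative at twice the cost.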

Optimization Features

Token Efficiency

  • Specialized tokenizers for non-English (Chinese tokenizer example)

Infra Optimization

  • LoRA

Model Optimization

  • LoRA

System Optimization

  • Paged optimizers to handle memory spikes

Training Optimization

  • Early-stopping via IFS
  • RAFT sample selection
  • DPO and PRO ranking objectives
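
To make the DPO objective in the list above concrete, here is a minimal per-example sketch in plain Python. It assumes you already have summed token log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under both the trainable policy and a frozen reference model; the function name, argument names, and the default `beta=0.1` are illustrative choices. PRO, which ranks more than two responses, is not covered here.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))."""
    # How much more (or less) likely each response is under the policy than the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, so the loss is log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy has shifted probability toward the chosen response: the loss drops.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference and lowers the rejected one's, which is how the objective encodes preferences without a separate reward model or online sampling.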

Inference Optimization

  • Quantized backbone for lower memory

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Survey is English-biased; non-English alignment is under-explored
  • RLHF remains costly and unstable in practice
  • LLM-based evaluators show positional and self-enhancement bias
  • Mixing diverse instruction sources lacks clear best practices

When Not To Use

  • If you need step-by-step code for a new algorithm (this is a survey, not an implementation guide)
  • If your use case is a low-resource language without adapted tokenizers or data

Failure Modes

  • Overfitting when using parameter-efficient adapters on small datasets
  • Evaluator bias (positional, length, self-enhancement) leading to misleading scores
  • Semantic drift if excessive synthetic instructions change model behavior

Core Entities

Models

  • GPT-3
  • ChatGPT
  • GPT-4
  • LLaMA
  • Vicuna
  • Alpaca
  • WizardLM
  • WizardCoder
  • Orca
  • Phi-1
  • PandaLM

Metrics

  • Win Rate
  • Elo rating
  • Pairwise preference
  • BERTScore
  • Acceptability levels
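
For the Elo rating metric above, pairwise preference outcomes are commonly turned into ratings with the standard Elo update. The sketch below uses the textbook formula; the starting rating of 1000, `k=32`, and the model names in the battle list are illustrative assumptions, not values from the survey.

```python
from typing import Dict, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Update both ratings after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Hypothetical battle log: (model_a, model_b, score for model_a).
battles: List[Tuple[str, str, float]] = [
    ("model-x", "model-y", 1.0),
    ("model-y", "model-z", 0.5),
    ("model-x", "model-z", 1.0),
]

ratings: Dict[str, float] = {}
for a, b, score_a in battles:
    ra, rb = ratings.get(a, 1000.0), ratings.get(b, 1000.0)
    ratings[a], ratings[b] = update_elo(ra, rb, score_a)

print(sorted(ratings.items(), key=lambda kv: kv[1], reverse=True))
```

Sequential Elo depends on the order in which battles are processed, so with small battle logs it is worth shuffling the order several times and averaging the resulting ratings.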

Datasets

  • ShareGPT
  • Alpaca
  • Super-NaturalInstructions
  • databricks-dolly-15k
  • OpenAssistant
  • HumanEval
  • MMLU
  • GSM8K

Benchmarks

  • MMLU
  • GSM8K
  • HumanEval
  • MT-Bench
  • FLASK
  • AlpacaEval
  • Vicuna-80