Practical review of data, training, and evaluation methods to align LLMs with human preferences

July 24, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

This is a literature survey summarizing many practical methods; it is useful as a roadmap but does not provide new experimental evidence.

Citations: 54

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 40%

Authors

Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, Qun Liu

Why It Matters For Business

Aligning LLMs reduces risky outputs and increases usefulness; using parameter-efficient tuning cuts compute costs and enables faster iteration.

Summary TLDR

This survey summarizes how researchers collect instruction data, train LLMs to follow human preferences, and evaluate alignment. It covers supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), offline ranking-based and language-based alternatives to RLHF, and parameter-efficient tuning (LoRA/QLoRA). The paper also reviews closed- and open-set benchmarks, human and LLM-based evaluators, known evaluator biases, and open gaps such as non-English support and fine-grained instruction management.

Problem Statement

Large pretrained LLMs can produce fluent but misaligned outputs: they may ignore instructions, be biased, or hallucinate facts. Aligning them requires better training data, stable training methods that encode human preferences, and evaluation protocols that capture real-world behavior.

Main Contribution

Survey of instruction data sources: human benchmarks, crowd collections, and synthetic data from strong LLMs

Review of alignment training: SFT, RLHF, offline ranking, language-prefix methods, and parameter-efficient approaches

Key Findings

Small sets of high-quality instructions can suffice to produce alignment effects.

Numbers: LLaMA needs ~8K instructions (IFS); other work reports ~6K high-quality instructions

Practical Use: Prioritize a few thousand high-quality instructions over millions of noisy ones when resources are limited.

Evidence Ref: AlShikh et al. (IFS ≈8K); Zhou et al. (~6K)
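
A minimal sketch of the practical advice above, assuming each instruction record already carries a numeric quality score. The `quality` field, the `select_top_k` helper, and the sample records are hypothetical illustrations; the survey does not prescribe a particular scoring method.

```python
import heapq


def select_top_k(records, k, min_quality=0.0):
    """Return up to k records with the highest quality, best first.

    Assumes each record is a dict with a numeric 'quality' key
    (a hypothetical field, e.g. from human rating or a scoring model).
    """
    # Drop records that fall below the quality floor.
    eligible = [r for r in records if r["quality"] >= min_quality]
    # nlargest returns the k best items ordered from best to worst.
    return heapq.nlargest(k, eligible, key=lambda r: r["quality"])


pool = [
    {"instruction": "Summarize the paragraph.", "quality": 0.91},
    {"instruction": "asdf ???", "quality": 0.12},
    {"instruction": "Explain binary search.", "quality": 0.87},
    {"instruction": "Translate to French: hello", "quality": 0.65},
]

subset = select_top_k(pool, k=2, min_quality=0.5)
print([r["instruction"] for r in subset])
```

In practice `k` would be in the low thousands, in line with the figures reported above, and the score would come from whatever quality signal is available.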

Adding programming instructions can boost reasoning without hurting conversational skills.

Numbers: ≈50% programming instructions improved reasoning in Muennighoff et al.

Practical Use: Mix in a substantial share of coding/problem-solving examples to strengthen reasoning tasks.

Evidence Ref: Muennighoff et al. (2023)
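
One way to build such a mix is to sample from two pools at a fixed ratio. The sketch below is illustrative only: `mix_instructions`, the `source` tag, and the toy pools are assumptions, and the 0.5 default simply mirrors the ≈50% share reported above rather than a recommended setting.

```python
import random


def mix_instructions(general, code, total, code_share=0.5, seed=0):
    """Sample `total` examples so that about `code_share` of them are code.

    Samples without replacement, so each pool must be large enough;
    random.sample raises ValueError otherwise.
    """
    if not 0.0 <= code_share <= 1.0:
        raise ValueError("code_share must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    n_code = round(total * code_share)
    n_general = total - n_code
    mixed = rng.sample(code, n_code) + rng.sample(general, n_general)
    rng.shuffle(mixed)  # avoid all code examples arriving first
    return mixed


# Toy pools tagged by origin (hypothetical data).
general_pool = [{"source": "general", "id": i} for i in range(100)]
code_pool = [{"source": "code", "id": i} for i in range(40)]

train_mix = mix_instructions(general_pool, code_pool, total=8)
n_code_examples = sum(1 for ex in train_mix if ex["source"] == "code")
print(n_code_examples, "of", len(train_mix), "examples are code")
```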

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Instruction count for alignment (IFS) | ≈8K instructions to reach high IFS for LLaMA | - | - | AlShikh et al. | IFS classifier shows LLaMA needs ~8K instructions | AlShikh et al. (IFS)
High-quality instruction sufficiency | ≈6K high-quality instructions can suffice | - | - | Zhou et al. | Zhou et al. report ~6K high-quality instructions align models | Zhou et al.

What To Try In 7 Days

Seed an instruction set from ShareGPT and popular QA sites for your domain

Fine-tune a base LLaMA using LoRA on a small high-quality instruction sample (≈5–10K)

Set up pairwise evaluation (human or GPT-4) and mitigate LLM-evaluator bias by randomizing order
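
The last step above can be sketched as follows: each comparison shows the two answers in a random order, so a judge that favors one position does not systematically favor one system. The `judge` callable is a placeholder for a human rater or an LLM call, and `toy_judge`, `truth`, and the sample answers are hypothetical stand-ins.

```python
import random


def pairwise_win_rate(prompts, answers_a, answers_b, judge, seed=0):
    """Win rate of system A over system B, counting ties as half a win.

    judge(prompt, first, second) must return "first", "second", or "tie".
    The order in which A and B are shown is randomized per prompt.
    """
    rng = random.Random(seed)
    wins = ties = 0
    for prompt, a, b in zip(prompts, answers_a, answers_b):
        a_first = rng.random() < 0.5
        first, second = (a, b) if a_first else (b, a)
        verdict = judge(prompt, first, second)
        if verdict == "tie":
            ties += 1
        elif (verdict == "first") == a_first:
            # The judge picked whichever slot held system A's answer.
            wins += 1
    n = len(prompts)
    return (wins + 0.5 * ties) / n if n else 0.0


# Toy judge with known correct answers (stand-in for a human or GPT-4).
truth = {"2+2": "4", "3*3": "9", "10-7": "3"}


def toy_judge(prompt, first, second):
    ok_first, ok_second = first == truth[prompt], second == truth[prompt]
    if ok_first == ok_second:
        return "tie"
    return "first" if ok_first else "second"


prompts = list(truth)
answers_a = ["4", "9", "2"]  # one tie, one win, one loss for A
answers_b = ["4", "8", "3"]
rate = pairwise_win_rate(prompts, answers_a, answers_b, toy_judge)
print(f"A win rate: {rate:.2f}")
```

A stricter variant queries the judge twice per pair with the order swapped and keeps only consistent verdicts, at twice the evaluation cost.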

Optimization Features

Token Efficiency
Specialized tokenizers for non-English (Chinese tokenizer example)
Infra Optimization
LoRA
Model Optimization
LoRA
System Optimization
Paged optimizers to handle memory spikes
Training Optimization
Early stopping via IFS; RAFT sample selection; DPO and PRO ranking objectives
Inference Optimization
Quantized backbone for lower memory
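
To make the LoRA entries above concrete: LoRA keeps a pretrained weight matrix W frozen and trains two small matrices A (r × d_in) and B (d_out × r), so the layer computes W x + (alpha / r) · B A x. The sketch below uses plain Python lists to show the trainable-parameter saving and the forward pass; the function names and the toy matrices are illustrative assumptions, not code from the survey or from any library.

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters for full fine-tuning vs. a rank-`rank` LoRA adapter."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha):
    """Compute W x + (alpha / r) * B (A x); W is frozen, only A and B train."""
    r = len(A)                      # adapter rank = number of rows in A
    base = matvec(W, x)             # frozen pretrained path
    delta = matvec(B, matvec(A, x)) # low-rank update path
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")

# Toy check: with B initialized to zero (the usual LoRA initialization),
# the adapted layer reproduces the frozen layer exactly.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]          # r = 1, d_in = 2
B_zero = [[0.0], [0.0]]   # d_out = 2, r = 1
y0 = lora_forward([2.0, 3.0], W, A, B_zero, alpha=2.0)
y1 = lora_forward([2.0, 3.0], W, A, [[1.0], [2.0]], alpha=2.0)
print(y0, y1)
```

For a 4096 × 4096 projection at rank 8, the adapter trains 65,536 parameters against 16,777,216 for the full matrix, which is where the compute and memory savings come from.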

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Survey is English-biased; non-English alignment is under-explored

RLHF remains costly and unstable in practice

When Not To Use

If you need step-by-step code for a new algorithm (this is a survey, not an implementation guide)

If your use case is a low-resource language without adapted tokenizers or data

Failure Modes

Overfitting when using parameter-efficient adapters on small datasets

Evaluator bias (positional, length, self-preference) leading to misleading scores

Core Entities

Models

GPT-3, ChatGPT, GPT-4, LLaMA, Vicuna, Alpaca, WizardLM, WizardCoder, Orca, Phi-1, PandaLM

Metrics

Win Rate, Elo rating, Pairwise preference, BERTScore, Acceptability levels
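
For readers unfamiliar with the Elo rating listed above, the sketch below shows the standard update applied to pairwise model comparisons. The starting rating of 1000, the K-factor of 32, and the sample battles are conventional or made-up values chosen for illustration, not settings taken from the survey.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update_elo(r_a, r_b, score_a, k=32):
    """Return updated (r_a, r_b); score_a is 1 for an A win, 0.5 tie, 0 loss."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    # B's actual and expected scores are the complements of A's.
    new_b = r_b + k * ((1 - score_a) - (1 - e_a))
    return new_a, new_b


def rate_models(battles, initial=1000.0, k=32):
    """Run Elo over a sequence of (model_a, model_b, score_a) results."""
    ratings = {}
    for a, b, score_a in battles:
        r_a = ratings.setdefault(a, initial)
        r_b = ratings.setdefault(b, initial)
        ratings[a], ratings[b] = update_elo(r_a, r_b, score_a, k)
    return ratings


# Hypothetical battle log: (model_a, model_b, score for model_a).
battles = [("model-x", "model-y", 1), ("model-y", "model-z", 0.5), ("model-x", "model-z", 1)]
ratings = rate_models(battles)
print({name: round(r, 1) for name, r in ratings.items()})
```

Elo ratings depend on the order in which battles are processed, so reported leaderboards often average over shuffled orderings.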

Datasets

ShareGPT, Alpaca, Super-NaturalInstructions, databricks-dolly-15k, OpenAssistant, HumanEval, MMLU, GSM8K

Benchmarks

MMLU, GSM8K, HumanEval, MT-Bench, FLASK, AlpacaEval, Vicuna-80