Llama 2: open-release of 7B–70B pretrained models and RLHF‑tuned chat models competitive on human tests

July 18, 202310 min

Overview

Decision SnapshotNeeds Validation

The paper reports extensive human and automatic evaluations and provides released weights and code; results generalize on the tested prompt sets but require application-specific safety tuning before production.

Citations2,595

Evidence Strength0.80

Confidence0.85

Risk Signals13

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/6

Reproducibility

Status: Partial assets available

Open source: Partial

License: Custom Meta commercial license (see model page)

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 30%

Authors

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom

Links

Abstract / PDF / Code

Why It Matters For Business

Llama 2 provides openly available pretrained and RLHF‑tuned chat models that are competitive with closed models on many human-evaluated tasks, lowering the entry cost for companies that need high-quality chat AI while allowing customization and internal safety tuning.

Who Should Care

Summary TLDR

Meta releases Llama 2: pretrained transformer LLMs (7B, 13B, 34B, 70B) trained on ~2T tokens and fine-tuned chat variants (Llama 2‑Chat) using supervised fine-tuning + RLHF. The authors publish models and code and report: extensive human evaluations (~4k prompts) where Llama 2‑Chat is ahead of other open models and competitive with some closed models; safety tuning reduces toxic outputs to near 0% on automatic metrics; reward-model data totals ~1.4M human pairwise comparisons. Paper documents engineering choices (4k context, grouped-query attention), the fine-tuning pipeline (SFT → iterative reward modeling → Rejection Sampling + PPO), and safety practices (context distillation, red‑teaming,

Problem Statement

Open pretrained LLMs match base capabilities of closed models but are not tuned for safe, usable chat. The paper aims to close that gap by releasing pretrained Llama 2 models and describing a reproducible pipeline (SFT + RLHF, reward models, safety tuning) that yields chat models with strong helpfulness and safety on the authors' evaluations.

Main Contribution

Release of Llama 2 family: pretrained 7B, 13B, 34B (not released), 70B and Llama 2‑Chat tuned 7B/13B/70B for dialogue

Detailed, reproducible fine-tuning pipeline: curated SFT, large-scale human preference data (~1.4M comparisons), reward models, iterative RLHF (Rejection Sampling + PPO)

Key Findings

Llama 2 pretrained on ~2 trillion tokens; models range 7B–70B parameters.

Numbers2.0T tokens; sizes 7B,13B,34B,70B

Practical UseExpect base knowledge comparable to modern open models; plan for large compute footprints for reproduction.

Evidence RefSection 2.1; Table 1

Human preference dataset for reward modeling exceeds 1.4 million binary comparisons.

Numbers1,418,091 comparisons (Meta preference data)

Practical UseHigh-quality RLHF requires large, diverse human comparisons; budget annotation accordingly or adopt smaller proxy approaches.

Evidence RefTable 6

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Human helpfulness win-rate vs ChatGPT (70B Chat)win 36%; tie 31.5%; loss 32.5%ChatGPT gpt-3.5-turbo-0301~4,000 single+multi-turn promptsFigure 1; Section 3.4.2Fig.1
Accuracyavg 70.6% across test sets reportedSteamSHP-XL / Open Assistant / GPT-4Meta Helpfulness and open-source RM datasetsTable 7 (HelpfulnessRM avg 70.6%)Table 7

What To Try In 7 Days

Run the released 7B or 13B Llama 2‑Chat locally on a representative prompt set to measure gap vs your product.

Use the provided SFT+RM recipes to create a small reward-model with your domain prompts to bootstrap safer alignment.

Apply GAtt-style system-message augmentation in your multi-turn flows to improve instruction persistence.

Agent Features

Tool Use

Emergent zero-shot simple tool usage (model observed to call APIs/sequence tools without explicit to

Optimization Features

Token Efficiency
4k context window to cover longer dialogues and documents
Infra Optimization
Demonstrated RoCE (commodity RDMA) scales nearly as well as Infiniband to 2000 GPUs
Model Optimization
Grouped-Query Attention (GQA) to reduce KV cache memory for large context inferenceSwiGLU activations, RMSNorm as in Llama family
System Optimization
Rejection sampling then PPO pipeline to distill large-model capabilities into smaller models
Training Optimization
Up-sampling factual sources in pretrainingCosine LR schedules and warmups, AdamW
Inference Optimization
GQA enables higher throughput and lower KV memory; 8‑GPU hosting with tensor parallelismFSDP used for fast large-batch training; weight consolidation before generation to speedups

Reproducibility

Code AvailableYes
Data AvailableNo
Open Source StatusPartial
LicenseCustom Meta commercial license (see model page)

Risks & Boundaries

Limitations

Evaluations are English-heavy; non-English performance and safety are less tested

Human evaluations and reward models can be subjective and biased toward the authors' data and guidelines

When Not To Use

Direct deployment in high-stakes domains (medical, legal, critical infrastructure) without domain-specific safety tuning and human oversight

Non-English or low-resource language products without thorough testing

Failure Modes

Hallucinations and confident false statements on factual queries

False refusals (overly conservative behavior) on borderline benign prompts after safety tuning

Core Entities

Models

Llama 2Llama 2‑ChatLlama 2 (7B,13B,34B,70B)Llama 1VicunaFalconMPTGPT-3.5GPT-4PaLM

Metrics

Human win-rate (%)Violation percentage (safety)Pass@1 (code)AccuracyTruthful+Informative %Carbon (tCO2 eq)

Datasets

Meta preference dataset (1.4M comparisons)TruthfulQAToxiGenBOLDHumanEvalMBPPGSM8KMMLUBBHAGI Eval

Benchmarks

MMLUBBHHumanEvalGSM8KTruthfulQAToxiGenBOLDMATHNaturalQuestionsTriviaQA

Context Entities

Models

GPT-3.5 (gpt-3.5-turbo-0301)PaLM-BisonVicuna-13bVicuna-33bFalcon-40B-instructMPT-7b-chat

Datasets

Anthropic Helpful/HarmlessOpenAI SummarizeOpenAI WebGPTHuggingFace StackExchange preferencesStanford SHPSynthetic GPT-J preference data

Benchmarks

HellaSwagPIQABoolQCommonsenseQA