Llama 2: open-release of 7B–70B pretrained models and RLHF‑tuned chat models competitive on human tests

July 18, 202310 min

Overview

Production Readiness

0.7

Novelty Score

0.3

Cost Impact Score

0.6

Citation Count

2,595

Authors

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom

Links

Abstract / PDF

Why It Matters For Business

Llama 2 provides openly available pretrained and RLHF‑tuned chat models that are competitive with closed models on many human-evaluated tasks, lowering the entry cost for companies that need high-quality chat AI while allowing customization and internal safety tuning.

Summary TLDR

Meta releases Llama 2: pretrained transformer LLMs (7B, 13B, 34B, 70B) trained on ~2T tokens and fine-tuned chat variants (Llama 2‑Chat) using supervised fine-tuning + RLHF. The authors publish models and code and report: extensive human evaluations (~4k prompts) where Llama 2‑Chat is ahead of other open models and competitive with some closed models; safety tuning reduces toxic outputs to near 0% on automatic metrics; reward-model data totals ~1.4M human pairwise comparisons. Paper documents engineering choices (4k context, grouped-query attention), the fine-tuning pipeline (SFT → iterative reward modeling → Rejection Sampling + PPO), and safety practices (context distillation, red‑teaming,

Problem Statement

Open pretrained LLMs match base capabilities of closed models but are not tuned for safe, usable chat. The paper aims to close that gap by releasing pretrained Llama 2 models and describing a reproducible pipeline (SFT + RLHF, reward models, safety tuning) that yields chat models with strong helpfulness and safety on the authors' evaluations.

Main Contribution

Release of Llama 2 family: pretrained 7B, 13B, 34B (not released), 70B and Llama 2‑Chat tuned 7B/13B/70B for dialogue

Detailed, reproducible fine-tuning pipeline: curated SFT, large-scale human preference data (~1.4M comparisons), reward models, iterative RLHF (Rejection Sampling + PPO)

Practical safety workflow: targeted safety SFT, context distillation, safety reward model, extensive red‑teaming and quantitative safety evaluations

Engineering changes for scale: doubled context window (4k tokens) and grouped-query attention (GQA) for inference memory/latency gains

Analysis of model and dataset properties: benchmarks (MMLU, GSM8K, HumanEval), contamination checks, carbon footprint estimate

Key Findings

Llama 2 pretrained on ~2 trillion tokens; models range 7B–70B parameters.

Numbers2.0T tokens; sizes 7B,13B,34B,70B

Human preference dataset for reward modeling exceeds 1.4 million binary comparisons.

Numbers1,418,091 comparisons (Meta preference data)

Llama 2‑Chat shows competitive human helpfulness vs closed models on the authors' prompt set (4k prompts); 70B chat tied/beat ChatGPT on many examples.

NumbersLlama 2‑Chat 70B win 36% / tie 31.5% vs gpt‑3.5‑turbo‑0301 on ~4k prompts

Safety tuning (SFT + RLHF + context distillation) dramatically reduced toxic outputs in automatic metrics for chat models.

NumbersToxiGen toxic generation ~24.6% (pretrained 70B) → ~0.01% (Llama 2‑Chat 70B)

Cost and carbon: pretraining consumed ~3.3M GPU hours and ~539 tCO2eq (offset).

Numbers3,311,616 GPU hours; 539 tCO2 eq

Engineering changes (4k context, GQA) give measurable task gains and inference scaling benefits.

Numbers4k context improved long‑context benchmarks (see NarrativeQA, Qasper), GQA yields better throughput vs MQA/MHA

Results

Human helpfulness win-rate vs ChatGPT (70B Chat)

Valuewin 36%; tie 31.5%; loss 32.5%

BaselineChatGPT gpt-3.5-turbo-0301

Accuracy

Valueavg 70.6% across test sets reported

BaselineSteamSHP-XL / Open Assistant / GPT-4

Safety toxic generation rate (ToxiGen)

ValueLlama 2‑Chat 70B: 0.01% toxic generations

BaselinePretrained Llama 2 70B: 24.60%

Truthfulness (TruthfulQA true+informative)

ValueLlama 2‑Chat 70B: 64.14%

BaselinePretrained Llama 2 70B: 50.18%

Code generation (HumanEval pass@1)

ValueLlama 2 70B pass@1 = 29.9%

BaselineLlama 1 65B pass@1 = 23.7%

Pretraining compute & carbon

Value≈3.3M GPU hours; 539 tCO2 eq

Who Should Care

What To Try In 7 Days

Run the released 7B or 13B Llama 2‑Chat locally on a representative prompt set to measure gap vs your product.

Use the provided SFT+RM recipes to create a small reward-model with your domain prompts to bootstrap safer alignment.

Apply GAtt-style system-message augmentation in your multi-turn flows to improve instruction persistence.

Agent Features

Tool Use

  • Emergent zero-shot simple tool usage (model observed to call APIs/sequence tools without explicit to

Optimization Features

Token Efficiency

  • 4k context window to cover longer dialogues and documents

Infra Optimization

  • Demonstrated RoCE (commodity RDMA) scales nearly as well as Infiniband to 2000 GPUs

Model Optimization

  • Grouped-Query Attention (GQA) to reduce KV cache memory for large context inference
  • SwiGLU activations, RMSNorm as in Llama family

System Optimization

  • Rejection sampling then PPO pipeline to distill large-model capabilities into smaller models

Training Optimization

  • Up-sampling factual sources in pretraining
  • Cosine LR schedules and warmups, AdamW

Inference Optimization

  • GQA enables higher throughput and lower KV memory; 8‑GPU hosting with tensor parallelism
  • FSDP used for fast large-batch training; weight consolidation before generation to speedups

Reproducibility

License

  • Custom Meta commercial license (see model page)

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluations are English-heavy; non-English performance and safety are less tested
  • Human evaluations and reward models can be subjective and biased toward the authors' data and guidelines
  • Safety remains a long‑tail problem; benchmark improvements do not guarantee zero risk in deployment
  • Pretrained models are not safe-to-deploy without extensive downstream safety tuning
  • Some capabilities (34B release) were withheld pending additional red‑teaming

When Not To Use

  • Direct deployment in high-stakes domains (medical, legal, critical infrastructure) without domain-specific safety tuning and human oversight
  • Non-English or low-resource language products without thorough testing
  • Cases requiring provable factual guarantees or audited traceability of sources

Failure Modes

  • Hallucinations and confident false statements on factual queries
  • False refusals (overly conservative behavior) on borderline benign prompts after safety tuning
  • Reward‑model overfitting and distributional shift causing degraded alignment
  • Catastrophic forgetting when iterating on rejection-sampled datasets without preserving past data
  • Long‑tail adversarial prompts that bypass defenses

Core Entities

Models

  • Llama 2
  • Llama 2‑Chat
  • Llama 2 (7B,13B,34B,70B)
  • Llama 1
  • Vicuna
  • Falcon
  • MPT
  • GPT-3.5
  • GPT-4
  • PaLM

Metrics

  • Human win-rate (%)
  • Violation percentage (safety)
  • Pass@1 (code)
  • Accuracy
  • Truthful+Informative %
  • Carbon (tCO2 eq)

Datasets

  • Meta preference dataset (1.4M comparisons)
  • TruthfulQA
  • ToxiGen
  • BOLD
  • HumanEval
  • MBPP
  • GSM8K
  • MMLU
  • BBH
  • AGI Eval

Benchmarks

  • MMLU
  • BBH
  • HumanEval
  • GSM8K
  • TruthfulQA
  • ToxiGen
  • BOLD
  • MATH
  • NaturalQuestions
  • TriviaQA

Context Entities

Models

  • GPT-3.5 (gpt-3.5-turbo-0301)
  • PaLM-Bison
  • Vicuna-13b
  • Vicuna-33b
  • Falcon-40B-instruct
  • MPT-7b-chat

Datasets

  • Anthropic Helpful/Harmless
  • OpenAI Summarize
  • OpenAI WebGPT
  • HuggingFace StackExchange preferences
  • Stanford SHP
  • Synthetic GPT-J preference data

Benchmarks

  • HellaSwag
  • PIQA
  • BoolQ
  • CommonsenseQA