Llama 2: open-release of 7B–70B pretrained models and RLHF‑tuned chat models competitive on human tests

Overview

Decision SnapshotNeeds Validation

The paper reports extensive human and automatic evaluations and provides released weights and code; results generalize on the tested prompt sets but require application-specific safety tuning before production.

Citations2,595

Evidence Strength0.80

Confidence0.85

Risk Signals13

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/6

Reproducibility

Status: Partial assets available

Open source: Partial

License: Custom Meta commercial license (see model page)

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 30%

Authors

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom

Links

Abstract / PDF / Code

Why It Matters For Business

Llama 2 provides openly available pretrained and RLHF‑tuned chat models that are competitive with closed models on many human-evaluated tasks, lowering the entry cost for companies that need high-quality chat AI while allowing customization and internal safety tuning.

Who Should Care

CTO Product Manager ML Engineer Founder Engineering Lead Data Scientist

Summary TLDR

Meta releases Llama 2: pretrained transformer LLMs (7B, 13B, 34B, 70B) trained on ~2T tokens and fine-tuned chat variants (Llama 2‑Chat) using supervised fine-tuning + RLHF. The authors publish models and code and report: extensive human evaluations (~4k prompts) where Llama 2‑Chat is ahead of other open models and competitive with some closed models; safety tuning reduces toxic outputs to near 0% on automatic metrics; reward-model data totals ~1.4M human pairwise comparisons. Paper documents engineering choices (4k context, grouped-query attention), the fine-tuning pipeline (SFT → iterative reward modeling → Rejection Sampling + PPO), and safety practices (context distillation, red‑teaming,

Problem Statement

Open pretrained LLMs match base capabilities of closed models but are not tuned for safe, usable chat. The paper aims to close that gap by releasing pretrained Llama 2 models and describing a reproducible pipeline (SFT + RLHF, reward models, safety tuning) that yields chat models with strong helpfulness and safety on the authors' evaluations.

Main Contribution

Release of Llama 2 family: pretrained 7B, 13B, 34B (not released), 70B and Llama 2‑Chat tuned 7B/13B/70B for dialogue

Detailed, reproducible fine-tuning pipeline: curated SFT, large-scale human preference data (~1.4M comparisons), reward models, iterative RLHF (Rejection Sampling + PPO)

Key Findings

Llama 2 pretrained on ~2 trillion tokens; models range 7B–70B parameters.

Numbers2.0T tokens; sizes 7B,13B,34B,70B

Practical UseExpect base knowledge comparable to modern open models; plan for large compute footprints for reproduction.

Evidence RefSection 2.1; Table 1

Human preference dataset for reward modeling exceeds 1.4 million binary comparisons.

Numbers1,418,091 comparisons (Meta preference data)

Practical UseHigh-quality RLHF requires large, diverse human comparisons; budget annotation accordingly or adopt smaller proxy approaches.

Evidence RefTable 6

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Human helpfulness win-rate vs ChatGPT (70B Chat)	win 36%; tie 31.5%; loss 32.5%	ChatGPT gpt-3.5-turbo-0301	—	~4,000 single+multi-turn prompts	Figure 1; Section 3.4.2	Fig.1
Accuracy	avg 70.6% across test sets reported	SteamSHP-XL / Open Assistant / GPT-4	—	Meta Helpfulness and open-source RM datasets	Table 7 (HelpfulnessRM avg 70.6%)	Table 7

What To Try In 7 Days

Run the released 7B or 13B Llama 2‑Chat locally on a representative prompt set to measure gap vs your product.

Use the provided SFT+RM recipes to create a small reward-model with your domain prompts to bootstrap safer alignment.

Apply GAtt-style system-message augmentation in your multi-turn flows to improve instruction persistence.

Agent Features

Tool Use

Emergent zero-shot simple tool usage (model observed to call APIs/sequence tools without explicit to

Optimization Features

Token Efficiency

4k context window to cover longer dialogues and documents

Infra Optimization

Demonstrated RoCE (commodity RDMA) scales nearly as well as Infiniband to 2000 GPUs

Model Optimization

Grouped-Query Attention (GQA) to reduce KV cache memory for large context inferenceSwiGLU activations, RMSNorm as in Llama family

System Optimization

Rejection sampling then PPO pipeline to distill large-model capabilities into smaller models

Training Optimization

Up-sampling factual sources in pretrainingCosine LR schedules and warmups, AdamW

Inference Optimization

GQA enables higher throughput and lower KV memory; 8‑GPU hosting with tensor parallelismFSDP used for fast large-batch training; weight consolidation before generation to speedups

Reproducibility

Code AvailableYes

Data AvailableNo

Open Source StatusPartial

LicenseCustom Meta commercial license (see model page)

Code URLs

https://github.com/facebookresearch/llama https://ai.meta.com/resources/models-and-libraries/llama/

Risks & Boundaries

Limitations

Evaluations are English-heavy; non-English performance and safety are less tested

Human evaluations and reward models can be subjective and biased toward the authors' data and guidelines

When Not To Use

Direct deployment in high-stakes domains (medical, legal, critical infrastructure) without domain-specific safety tuning and human oversight

Non-English or low-resource language products without thorough testing

Failure Modes

Hallucinations and confident false statements on factual queries

False refusals (overly conservative behavior) on borderline benign prompts after safety tuning

Core Entities

Models

Llama 2Llama 2‑ChatLlama 2 (7B,13B,34B,70B)Llama 1VicunaFalconMPTGPT-3.5GPT-4PaLM

Metrics

Human win-rate (%)Violation percentage (safety)Pass@1 (code)AccuracyTruthful+Informative %Carbon (tCO2 eq)

Datasets

Meta preference dataset (1.4M comparisons)TruthfulQAToxiGenBOLDHumanEvalMBPPGSM8KMMLUBBHAGI Eval

Benchmarks

MMLUBBHHumanEvalGSM8KTruthfulQAToxiGenBOLDMATHNaturalQuestionsTriviaQA

Context Entities

Models

GPT-3.5 (gpt-3.5-turbo-0301)PaLM-BisonVicuna-13bVicuna-33bFalcon-40B-instructMPT-7b-chat

Datasets

Anthropic Helpful/HarmlessOpenAI SummarizeOpenAI WebGPTHuggingFace StackExchange preferencesStanford SHPSynthetic GPT-J preference data

Benchmarks

HellaSwagPIQABoolQCommonsenseQA

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Llama 2 pretrained on ~2 trillion tokens; models range 7B–70B parameters.

Human preference dataset for reward modeling exceeds 1.4 million binary comparisons.

Results

What To Try In 7 Days

Agent Features

Optimization Features

Reproducibility

Code URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

Context Entities

Models

Datasets

Benchmarks

You May Also Want to Read

Hallucinations in LLMs are diverse, theoretically inevitable, and must be managed with grounding and human oversight

Key finding

ThaiSafetyBench: 1,954 Thai malicious prompts reveal cultural blind spots in LLM safety

Key finding

SciIG: a benchmark that asks LLMs to draft research-paper introductions from title, abstract, and related work

Key finding

PersonaLens: a large benchmark and LLM-based user+judge agents to measure personalization in task-oriented assistants

Key finding

Use simple entropy-based reweighting to make cheap model judges match human preferences.

Key finding