TÜLU 2: a public suite of finetuned LLaMA-2 and Code-LLaMA models, a new instruction-data mix, and large-scale DPO at 70B

Overview

Decision Snapshot: Ready For Pilot

The paper runs broad, reproducible benchmarks and releases artifacts; claims about DPO and dataset gains are supported across multiple evaluations but remain bounded by the listed benchmarks and known dataset contamination caveats.

Citations: 8

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 40%

Authors

Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A. Smith, Iz Beltagy, Hannaneh Hajishirzi

Why It Matters For Business

TÜLU 2 provides high-quality, open instruction-tuned models and data that approach proprietary baselines on many tasks. DPO improves user-facing outputs and is feasible at 70B without private infra, while CODE TÜLU 2 gives a fast route to strong code models.

Who Should Care

ML Engineer, Product Manager, Founder, Engineering Lead

Summary TLDR

TÜLU 2 is a public release of instruction-finetuned LLaMA-2 and CODE LLaMA models plus a new instruction-data mixture (TÜLU-V2-mix). Key takeaways: the V2 mix improves open-ended and coding tasks over the prior mix; Direct Preference Optimization (DPO) scales stably to 70B and raises open-ended generation quality; QLoRA (parameter-efficient finetuning) lags on long-form generation; CODE TÜLU 2 substantially boosts code ability but can reduce general conversational performance. All models, data, and code are released.

Problem Statement

Open instruction-tuned models evolve rapidly (better bases, distilled datasets, new finetuning methods). The paper asks: what combination of new base models, datasets, and training recipes gives the best open instruction-following models, and how do pragmatic choices (DPO, QLoRA, CODE pretraining) trade performance across tasks?

Main Contribution

TÜLU-V2-mix: a curated instruction dataset mixture (326k samples) combining high-quality human and distilled data.

TÜLU 2 model suite: LLaMA-2-based instruction-tuned models at 7B/13B/70B and CODE LLAMA variants (7B/13B/34B) finetuned on V2 mix.

Key Findings

The V2 data mixture improves average downstream performance over the prior V1 mixture.

Numbers: V2 > V1 by ~8% avg (paper intro)

Practical Use: Use the TÜLU-V2-mix instead of the V1 mix to boost overall instruction-following performance, especially for open-ended tasks.

Evidence Ref: Introduction, Table 2

DPO training raises open-ended generation quality (AlpacaEval) across sizes.

Numbers: AlpacaEval +13% avg across sizes (paper intro); TÜLU 2+DPO 70B 95.1% vs TÜLU 2 70B 86.6% (∆ +8.5)

Practical Use: If you have preference-ranked data, apply DPO to improve user-facing responses and win rate in automatic, model-judged preference evaluations.

Evidence Ref: Introduction, Table 1, Table 4
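
For readers unfamiliar with DPO, the per-pair objective can be sketched in a few lines. This is a minimal illustration of the standard DPO loss, not the paper's training code: the inputs are assumed to be summed token log-probabilities of the chosen and rejected responses under the policy and a frozen reference model, and the `beta` default is a placeholder rather than a value taken from this summary.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities.

    beta is a hypothetical default; tune it for your own setup.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# The loss falls below log(2) once the policy prefers the chosen response
# more strongly than the reference model does.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

In practice the loss is averaged over a batch of preference pairs and only the policy receives gradients.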

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
AlpacaEval win-rate	TÜLU 2+DPO 70B: 95.1% vs TÜLU 2 70B: 86.6% (∆ +8.5)	TÜLU 2 70B	+8.5	AlpacaEval (GPT-4 judge)	Table 1, Table 4	Table 4
MT-Bench average score	TÜLU 2+DPO 70B: 7.89 vs TÜLU 2 70B: 7.49 (∆ +0.40)	TÜLU 2 70B	+0.40	MT-Bench (GPT-4 judge)	Table 4, D Full MT-Bench Results	Table 4
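
The reported deltas can be checked directly from the table values. The short script below recomputes them and also prints the relative change; the dictionary layout and key names are illustrative choices, not something the paper defines.

```python
# Scores copied from the Results table above.
results = {
    "AlpacaEval win-rate (%)": {"dpo": 95.1, "sft": 86.6, "reported_delta": 8.5},
    "MT-Bench average score": {"dpo": 7.89, "sft": 7.49, "reported_delta": 0.40},
}

def delta(row):
    """Absolute gain of the DPO model over the SFT-only model."""
    return round(row["dpo"] - row["sft"], 2)

for name, row in results.items():
    d = delta(row)
    relative = 100 * d / row["sft"]  # gain relative to the SFT baseline
    matches = d == row["reported_delta"]
    print(f"{name}: delta={d:+.2f}, relative={relative:+.1f}%, matches table={matches}")
```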

What To Try In 7 Days

Swap your instruction finetune data to the TÜLU-V2-mix and retune a smaller model to measure gains.

If you have preference pairs, try DPO on a mid-size model to see immediate AlpacaEval/MT-Bench boosts.

Test QLoRA only if compute-limited; validate on your open-ended generation tasks before deploying.
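
If your preference data arrives as several scored completions per prompt rather than ready-made pairs, one simple way to prepare it for a DPO trial is to pair the highest-scored completion with the lowest-scored one. This is a sketch under assumptions: the `(text, score)` input layout and the `prompt`/`chosen`/`rejected` keys are hypothetical, and this is not necessarily how the paper built its pairs.

```python
def to_preference_pair(prompt, scored_completions):
    """Turn [(text, score), ...] into one chosen/rejected pair.

    Returns None when there are fewer than two completions or when the
    best and worst scores tie, since such items carry no preference signal.
    """
    if len(scored_completions) < 2:
        return None
    ranked = sorted(scored_completions, key=lambda item: item[1], reverse=True)
    best, worst = ranked[0], ranked[-1]
    if best[1] == worst[1]:
        return None
    return {"prompt": prompt, "chosen": best[0], "rejected": worst[0]}

pair = to_preference_pair(
    "Explain DPO in one sentence.",
    [("A short, accurate answer.", 4.5), ("An off-topic answer.", 1.0)],
)
print(pair)
```

Other pairing strategies (for example, best versus a random lower-scored completion) are equally plausible; compare them on your own evaluation set.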

Optimization Features

Infra Optimization

TPU v3 pods (256/512 chips used)

Model Optimization

LoRA

System Optimization

Extended context length to 8,192 tokens

Training Optimization

DPO (Direct Preference Optimization), SFT

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/allenai/open-instruct
https://github.com/hamishivi/EasyLM

Data URLs

https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture
https://huggingface.co/collections/allenai/tulu-v2-suite-6551b56e743e6349aab45101

Risks & Boundaries

Limitations

DPO and SFT mixes are majority English; DPO can reduce multilingual performance without multilingual data.

QLoRA underperforms for open-ended long-form generation and may not be a drop-in replacement for full finetuning.

When Not To Use

Avoid DPO if multilingual outputs are required and multilingual preference data is missing.

Avoid QLoRA for products that rely on long, open-ended, high-quality generations.

Failure Modes

DPO increases verbosity which may harm concise-answer use cases.

Training-data contamination can bias evaluation (paper omits some comparisons when contamination exists).
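
Before trusting a benchmark number on your own finetune, a rough contamination check is to look for long word n-grams shared between training and evaluation text. The sketch below is one common heuristic, not the procedure used in the paper; the function names and the default of 8-word n-grams are assumptions chosen for illustration.

```python
def ngrams(text, n=8):
    """Set of lowercase whitespace-token n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(train_texts, eval_texts, n=8):
    """Fraction of eval examples sharing at least one n-gram with training data."""
    train_grams = set()
    for text in train_texts:
        train_grams |= ngrams(text, n)
    if not eval_texts:
        return 0.0
    flagged = sum(1 for text in eval_texts if ngrams(text, n) & train_grams)
    return flagged / len(eval_texts)

train = ["the quick brown fox jumps over the lazy dog today"]
evals = [
    "Is it true that the quick brown fox jumps over the lazy dog",
    "completely unrelated question about math",
]
print(contamination_rate(train, evals))
```

Exact n-gram matching misses paraphrased overlap, so treat a low rate as weak evidence of cleanliness rather than proof.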

Core Entities

Models

TÜLU 2, TÜLU 2+DPO, CODE TÜLU 2, LLAMA-2, CODE LLAMA, LoRA, DPO

Metrics

Accuracy, Pass@10, AlpacaEval win-rate, MT-Bench score, ToxiGen % toxic, TruthfulQA % informative+truthful

Datasets

TÜLU-V2-mix, TÜLU-V1-mix, ShareGPT, FLAN v2, WizardLM Evol-Instruct V2, Open-Orca, LIMA, GPT4-Alpaca, Code-Alpaca, UltraFeedback, Science literature mix

Benchmarks

MMLU, GSM8k, BBH, TydiQA, Codex-Eval (HumanEval), AlpacaEval, MT-Bench, ToxiGen, TruthfulQA

Context Entities

Models

GPT-4, GPT-3.5-turbo, Zephyr-Beta, Xwin-LM, LLAMA-2-chat

Metrics

Average (naive) across tasks, Per-benchmark deltas

Datasets

UltraChat, OpenAssistant, ShareGPT processed

Benchmarks

AlpacaEval leaderboard, MT-Bench leaderboard
