TÜLU 2: a public suite of finetuned LLaMA-2 and Code-LLaMA models, a new instruction-data mix, and large-scale DPO at 70B

November 17, 2023 | 8 min

Overview

Decision Snapshot: Ready For Pilot

The paper runs broad, reproducible benchmarks and releases artifacts; claims about DPO and dataset gains are supported across multiple evaluations but remain bounded by the listed benchmarks and known dataset contamination caveats.

Citations: 8

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 40%

Authors

Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A. Smith, Iz Beltagy, Hannaneh Hajishirzi

Links

Abstract / PDF / Code / Data

Why It Matters For Business

TÜLU 2 provides high-quality, open instruction-tuned models and data that approach proprietary baselines on many tasks. DPO improves user-facing outputs and is feasible at 70B without private infrastructure, while CODE TÜLU 2 gives a fast route to strong code models.

Summary TLDR

TÜLU 2 is a public release of instruction-finetuned LLaMA-2 and CODE LLaMA models plus a new instruction-data mixture (TÜLU-V2-mix). Key takeaways: the V2 mix improves open-ended and coding tasks over the prior mix; Direct Preference Optimization (DPO) scales stably to 70B and raises open-ended generation quality; QLoRA (parameter-efficient finetuning) lags on long-form generation; CODE TÜLU 2 substantially boosts code ability but can reduce general conversational performance. All models, data, and code are released.

Problem Statement

Open instruction-tuned models evolve rapidly (better bases, distilled datasets, new finetuning methods). The paper asks: what combination of new base models, datasets, and training recipes gives the best open instruction-following models, and how do pragmatic choices (DPO, QLoRA, CODE LLAMA bases) trade off performance across tasks?

Main Contribution

TÜLU-V2-mix: a curated instruction dataset mixture (326k samples) combining high-quality human and distilled data.

TÜLU 2 model suite: LLaMA-2-based instruction-tuned models at 7B/13B/70B and CODE LLAMA variants (7B/13B/34B) finetuned on V2 mix.

Key Findings

The V2 data mixture improves average downstream performance over the prior V1 mixture.

Numbers: V2 > V1 by ~8% avg (paper intro)

Practical Use: Use the TÜLU-V2-mix instead of the V1 mix to boost overall instruction-following performance, especially for open-ended tasks.

Evidence Ref: Introduction, Table 2

DPO training raises open-ended generation quality (AlpacaEval) across sizes.

Numbers: AlpacaEval +13% avg across sizes (paper intro); TÜLU 2+DPO 70B 95.1% vs TÜLU 2 70B 86.6% (∆ +8.5 points)

Practical Use: If you have preference-ranked data, apply DPO to improve user-facing responses and win-rate in automatic preference evaluations such as AlpacaEval.

Evidence Ref: Introduction, Table 1, Table 4
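
For readers unfamiliar with DPO, the per-pair objective can be sketched in a few lines. This is a minimal illustration of the standard DPO loss, not the paper's training code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for the example, and the inputs are assumed to be summed log-probabilities of each full response under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities.

    All names and the default beta are illustrative assumptions.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # Scaled preference gap between the chosen and rejected responses.
    logits = beta * (chosen_margin - rejected_margin)

    # loss = -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


if __name__ == "__main__":
    # The policy has shifted probability toward the chosen response,
    # so the loss falls below the log(2) value of an unchanged policy.
    print(dpo_loss(-10.0, -20.0, -15.0, -15.0))
```

The loss equals log(2) when the policy matches the reference and shrinks as the policy raises the chosen response relative to the rejected one, which is why only preference pairs, and no separate reward model, are needed.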

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
AlpacaEval win-rate | TÜLU 2+DPO 70B: 95.1% vs TÜLU 2 70B: 86.6% (∆ +8.5) | TÜLU 2 70B | +8.5 | AlpacaEval (GPT-4 judge) | Table 1, Table 4 | Table 4
MT-Bench average score | TÜLU 2+DPO 70B: 7.89 vs TÜLU 2 70B: 7.49 (∆ +0.40) | TÜLU 2 70B | +0.40 | MT-Bench (GPT-4 judge) | Table 4, D Full MT-Bench Results | Table 4
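
The deltas in the results above are absolute differences (percentage points for AlpacaEval, score points for MT-Bench), not relative improvements. The short sketch below recomputes them from the reported values and also derives the relative change; the helper name `deltas` and the rounding choices are assumptions made for illustration.

```python
# Reported 70B results: (TÜLU 2+DPO, TÜLU 2 baseline).
results = {
    "AlpacaEval win-rate (%)": (95.1, 86.6),
    "MT-Bench average score": (7.89, 7.49),
}


def deltas(dpo, sft):
    """Return (absolute delta, relative delta in percent of the baseline)."""
    absolute = round(dpo - sft, 2)
    relative_pct = round(100.0 * (dpo - sft) / sft, 1)
    return absolute, relative_pct


if __name__ == "__main__":
    for metric, (dpo, sft) in results.items():
        absolute, relative_pct = deltas(dpo, sft)
        print(f"{metric}: absolute {absolute:+}, relative {relative_pct:+}%")
```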

What To Try In 7 Days

Swap your instruction finetune data to the TÜLU-V2-mix and retune a smaller model to measure gains.

If you have preference pairs, try DPO on a mid-size model to see immediate AlpacaEval/MT-Bench boosts.

Test QLoRA only if compute-limited; validate on your open-ended generation tasks before deploying.
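
When validating any of these changes on your own prompts, a pairwise win-rate in the spirit of AlpacaEval is a simple first metric: a judge compares the candidate output with a reference output for each prompt, and the win-rate is the share of comparisons the candidate wins. The sketch below assumes the verdicts have already been collected as the strings `"win"`, `"loss"`, or `"tie"`; the function name and the choice to count a tie as half a win are assumptions, not something the paper specifies.

```python
def win_rate(verdicts):
    """Percentage of pairwise comparisons won by the candidate model.

    `verdicts` is a list of "win" / "loss" / "tie" strings, one per prompt.
    Counting ties as half a win is an illustrative assumption.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")

    score = 0.0
    for verdict in verdicts:
        if verdict == "win":
            score += 1.0
        elif verdict == "tie":
            score += 0.5
        elif verdict != "loss":
            raise ValueError(f"unknown verdict: {verdict!r}")
    return 100.0 * score / len(verdicts)


if __name__ == "__main__":
    print(win_rate(["win", "win", "loss", "tie"]))
```

Running the same prompt set against the full-finetuned and QLoRA variants gives a like-for-like comparison before committing to either recipe.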

Optimization Features

Infra Optimization
TPU v3 pods (256/512 chips used)
Model Optimization
LoRA
System Optimization
Extended context length to 8,192 tokens
Training Optimization
DPO (Direct Preference Optimization), SFT

Risks & Boundaries

Limitations

DPO and SFT mixes are majority English; DPO can reduce multilingual performance without multilingual data.

QLoRA underperforms for open-ended long-form generation and may not be a drop-in replacement for full finetuning.

When Not To Use

Avoid DPO if multilingual outputs are required and multilingual preference data is missing.

Avoid QLoRA for products that rely on long, open-ended, high-quality generations.

Failure Modes

DPO increases verbosity, which may harm concise-answer use cases.

Training-data contamination can bias evaluation (paper omits some comparisons when contamination exists).
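
A quick way to screen your own evaluation set for this kind of contamination is an n-gram overlap check against the training data. The sketch below is a generic heuristic, not the procedure used in the paper; the function names, whitespace tokenization, and the default window of 8 tokens are all assumptions.

```python
def ngrams(text, n):
    """Set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_contaminated(eval_examples, train_examples, n=8):
    """Return eval examples that share at least one n-gram with training data."""
    train_grams = set()
    for example in train_examples:
        train_grams |= ngrams(example, n)
    return [ex for ex in eval_examples if ngrams(ex, n) & train_grams]


if __name__ == "__main__":
    train = ["the quick brown fox jumps over the lazy dog today"]
    evals = [
        "he said the quick brown fox jumps over the lazy dog",
        "a completely different sentence with nothing shared at all here",
    ]
    print(flag_contaminated(evals, train))
```

Flagged examples are candidates for manual review or exclusion; exact n-gram matching misses paraphrased overlap, so a clean result is not proof of no contamination.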

Core Entities

Models

TÜLU 2, TÜLU 2+DPO, CODE TÜLU 2, LLAMA-2, CODE LLAMA, LoRA, DPO

Metrics

Accuracy, Pass@10, AlpacaEval win-rate, MT-Bench score, ToxiGen % toxic, TruthfulQA % informative+truthful

Datasets

TÜLU-V2-mix, TÜLU-V1-mix, ShareGPT, FLAN v2, WizardLM Evol-Instruct V2, Open-Orca, LIMA, GPT4-Alpaca, Code-Alpaca, UltraFeedback, Science literature mix

Benchmarks

MMLU, GSM8k, BBH, TydiQA, Codex-Eval (HumanEval), AlpacaEval, MT-Bench, ToxiGen, TruthfulQA

Context Entities

Models

GPT-4, GPT-3.5-turbo, Zephyr-Beta, Xwin-LM, LLAMA-2-chat

Metrics

Average (naive) across tasks, Per-benchmark deltas

Datasets

UltraChat, OpenAssistant, ShareGPT processed

Benchmarks

AlpacaEval leaderboard, MT-Bench leaderboard