TÜLU 2: a public suite of finetuned LLaMA-2 and Code-LLaMA models, a new instruction-data mix, and large-scale DPO at 70B

November 17, 2023 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

8

Authors

Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A. Smith, Iz Beltagy, Hannaneh Hajishirzi

Links

Abstract / PDF

Why It Matters For Business

TÜLU 2 provides high-quality, open instruction-tuned models and data that approach proprietary baselines on many tasks. DPO improves user-facing outputs and is feasible at 70B without private infrastructure, and CODE TÜLU 2 gives a fast route to strong code models.

Summary TLDR

TÜLU 2 is a public release of instruction-finetuned LLaMA-2 and CODE LLaMA models plus a new instruction-data mixture (TÜLU-V2-mix). Key takeaways: the V2 mix improves open-ended and coding tasks over the prior mix; Direct Preference Optimization (DPO) scales stably to 70B and raises open-ended generation quality; QLoRA (parameter-efficient finetuning) lags on long-form generation; CODE TÜLU 2 substantially boosts code ability but can reduce general conversational performance. All models, data, and code are released.

Problem Statement

Open instruction-tuned models evolve rapidly (better bases, distilled datasets, new finetuning methods). The paper asks: what combination of new base models, datasets, and training recipes gives the best open instruction-following models, and how do pragmatic choices (DPO, QLoRA, CODE pretraining) trade performance across tasks?

Main Contribution

TÜLU-V2-mix: a curated instruction dataset mixture (326k samples) combining high-quality human and distilled data.

TÜLU 2 model suite: LLaMA-2-based instruction-tuned models at 7B/13B/70B and CODE LLAMA variants (7B/13B/34B) finetuned on V2 mix.

DPO at scale: DPO (direct preference optimization) applied stably to 70B models (TÜLU 2+DPO 70B) with improved open-ended generation.

Empirical study of QLoRA vs full finetuning, and CODE LLAMA finetuning trade-offs across benchmarks.

Public release of weights, datasets, and finetuning/evaluation code.
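The DPO contribution listed above trains directly on preference pairs (a chosen and a rejected response to the same prompt). What follows is a minimal sketch of the per-pair loss, assuming the standard DPO formulation; this page does not give the formula. The function names, the summed sequence log-probabilities passed in, and the default beta of 0.1 are illustrative assumptions, not values taken from this page.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed illustrative value
) -> float:
    """Per-pair DPO loss from summed sequence log-probabilities.

    The policy is rewarded for raising the chosen response's likelihood
    relative to a frozen reference model more than the rejected one's.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# When the policy equals the reference, both margins are zero
# and the loss is log(2).
start_loss = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Shifting probability mass toward the chosen response lowers the loss.
better_loss = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

In a real training loop the log-probabilities would come from the model being trained and a frozen copy of the SFT model, and the loss would be averaged over a batch.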

Key Findings

The V2 data mixture improves average downstream performance over the prior V1 mixture.

Numbers: V2 > V1 by ~8% avg (paper intro)

DPO training raises open-ended generation quality (AlpacaEval) across sizes.

Numbers: AlpacaEval +13% avg across sizes (paper intro); TÜLU 2+DPO 70B 95.1% vs TÜLU 2 70B 86.6% (∆ +8.5)

DPO training scales stably to 70B and yields the best open-weight MT-Bench score reported.

Numbers: MT-Bench 7.89 for TÜLU 2+DPO 70B (best open model in paper)

QLoRA (quantized LoRA) underperforms full finetuning on long-form/open-ended generation.

Numbers: AlpacaEval gap up to ~20% worse for QLoRA vs full finetuning (paper intro); overall avg gap shrinks from ~10% to ~3% with

CODE TÜLU 2 greatly improves coding performance but reduces some non-code capabilities.

Numbers: Codex-Eval boost reported as ~70% avg; AlpacaEval drops ~20% (paper intro; Table 6 shows large Codex-Eval gains)

DPO finetuning can reduce multilingual performance.

Numbers: TydiQA drop example: −17.8 points for 70B (Table 3 ∆)

Results

AlpacaEval win-rate

Value: TÜLU 2+DPO 70B: 95.1% vs TÜLU 2 70B: 86.6% (∆ +8.5)

Baseline: TÜLU 2 70B

MT-Bench average score

Value: TÜLU 2+DPO 70B: 7.89 vs TÜLU 2 70B: 7.49 (∆ +0.40)

Baseline: TÜLU 2 70B

TydiQA (multilingual) performance drop with DPO

Value: TÜLU 2 70B: 53.6 vs TÜLU 2+DPO 70B: 35.8 (∆ −17.8)

Baseline: TÜLU 2 70B

Codex-Eval (code)

Value: CODE TÜLU 2 (34B) Pass@10: 82.5 vs CODE LLAMA base (34B): 77.6 (∆ +4.9); large gains versus LLAMA-based TÜLU 2 at 7B

Baseline: CODE LLAMA base
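The Codex-Eval numbers above are Pass@10 scores. As a hedged illustration of what such a score measures, the sketch below uses the commonly used unbiased pass@k estimator for HumanEval-style benchmarks: given n sampled completions for a problem, c of which pass the unit tests, it estimates the probability that at least one of k samples passes. This page does not state how the scores were computed, so the estimator and the sample counts in the example are assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: completions sampled, c: completions that passed, k: budget.
    Returns 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 20 samples, 3 correct, budget of 10.
example = pass_at_k(n=20, c=3, k=10)

# A benchmark score is the mean of the per-problem estimates.
per_problem = [pass_at_k(20, c, 10) for c in (0, 3, 20)]
benchmark_score = sum(per_problem) / len(per_problem)
```

A benchmark-level Pass@10 is then the average of this quantity over all problems, usually reported as a percentage.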

Who Should Care

What To Try In 7 Days

Swap your instruction finetune data to the TÜLU-V2-mix and retune a smaller model to measure gains.

If you have preference pairs, try DPO on a mid-size model to see immediate AlpacaEval/MT-Bench boosts.

Test QLoRA only if compute-limited; validate on your open-ended generation tasks before deploying.
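The advice above to validate on your own tasks before deploying can be sketched as a per-benchmark delta check that flags any metric a new recipe makes worse. The scores below are the 70B numbers from the Results section (TÜLU 2 as baseline, TÜLU 2+DPO as candidate); the function names and the single shared tolerance are illustrative assumptions, and a real check would likely use a separate tolerance per benchmark because the scales differ.

```python
# 70B scores as reported in the Results section of this page.
BASELINE = {"AlpacaEval": 86.6, "MT-Bench": 7.49, "TydiQA": 53.6}
CANDIDATE = {"AlpacaEval": 95.1, "MT-Bench": 7.89, "TydiQA": 35.8}


def benchmark_deltas(baseline: dict, candidate: dict) -> dict:
    """Candidate minus baseline for every benchmark in the baseline."""
    return {
        name: round(candidate[name] - baseline[name], 2)
        for name in baseline
    }


def regressions(deltas: dict, tolerance: float = 0.0) -> list:
    """Benchmarks whose score dropped by more than the tolerance."""
    return [name for name, delta in deltas.items() if delta < -tolerance]


deltas = benchmark_deltas(BASELINE, CANDIDATE)
flagged = regressions(deltas)

for name, delta in deltas.items():
    marker = "REGRESSION" if name in flagged else "ok"
    print(f"{name:<12} {delta:+7.2f}  {marker}")
```

On these numbers the check flags only TydiQA, matching the multilingual drop the page reports for DPO.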

Optimization Features

Infra Optimization

  • TPU v3 pods (256/512 chips used)

Model Optimization

  • LoRA
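As a rough illustration of why LoRA-style methods are called parameter-efficient: instead of updating a full d_out x d_in weight matrix, LoRA trains two low-rank factors of shapes d_out x r and r x d_in. The sketch below only counts trainable parameters; the layer size and rank are hypothetical and are not taken from this page.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for the two low-rank factors B and A."""
    return rank * (d_in + d_out)


# Hypothetical square projection layer and rank, for illustration only.
D_IN, D_OUT, RANK = 4096, 4096, 64

full = full_finetune_params(D_IN, D_OUT)
lora = lora_params(D_IN, D_OUT, RANK)
fraction = lora / full

print(f"full: {full:,}  lora: {lora:,}  fraction: {fraction:.4f}")
```

The count says nothing about output quality; the page's findings on QLoRA versus full finetuning are about that separate question.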

System Optimization

  • Extended context length to 8,192 tokens

Training Optimization

  • DPO (direct preference optimization)
  • SFT

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • The DPO and SFT data mixes are mostly English; DPO can reduce multilingual performance when no multilingual preference data is included.
  • QLoRA underperforms for open-ended long-form generation and may not be a drop-in replacement for full finetuning.
  • CODE TÜLU 2 improves code but can degrade general generation and instruction behavior.

When Not To Use

  • Avoid DPO if multilingual outputs are required and multilingual preference data is missing.
  • Avoid QLoRA for products that rely on long, open-ended, high-quality generations.
  • Avoid CODE TÜLU 2 if your product needs the best general QA or reasoning across languages.

Failure Modes

  • DPO increases verbosity, which may harm concise-answer use cases.
  • Training-data contamination can bias evaluation (paper omits some comparisons when contamination exists).
  • Finetuning code-pretrained bases can shift capability trade-offs away from non-code tasks.

Core Entities

Models

  • TÜLU 2
  • TÜLU 2+DPO
  • CODE TÜLU 2
  • LLAMA-2
  • CODE LLAMA
  • LoRA
  • DPO

Metrics

  • Accuracy
  • Pass@10
  • AlpacaEval win-rate
  • MT-Bench score
  • ToxiGen % toxic
  • TruthfulQA % informative+truthful

Datasets

  • TÜLU-V2-mix
  • TÜLU-V1-mix
  • ShareGPT
  • FLAN v2
  • WizardLM Evol-Instruct V2
  • Open-Orca
  • LIMA
  • GPT4-Alpaca
  • Code-Alpaca
  • UltraFeedback
  • Science literature mix

Benchmarks

  • MMLU
  • GSM8k
  • BBH
  • TydiQA
  • Codex-Eval (HumanEval)
  • AlpacaEval
  • MT-Bench
  • ToxiGen
  • TruthfulQA

Context Entities

Models

  • GPT-4
  • GPT-3.5-turbo
  • Zephyr-Beta
  • Xwin-LM
  • LLAMA-2-chat

Metrics

  • Average (naive) across tasks
  • Per-benchmark deltas

Datasets

  • UltraChat
  • OpenAssistant
  • ShareGPT processed

Benchmarks

  • AlpacaEval leaderboard
  • MT-Bench leaderboard