Overview
Production Readiness
0.7
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
8
Why It Matters For Business
TÜLU 2 provides high-quality, open instruction-tuned models and data that approach proprietary baselines on many tasks. DPO improves user-facing outputs and is feasible at 70B without private infrastructure, and CODE TÜLU 2 offers a fast route to strong code models.
Summary TLDR
TÜLU 2 is a public release of instruction-finetuned LLaMA-2 and CODE LLaMA models plus a new instruction-data mixture (TÜLU-V2-mix). Key takeaways: the V2 mix improves open-ended and coding tasks over the prior mix; Direct Preference Optimization (DPO) scales stably to 70B and raises open-ended generation quality; QLoRA (parameter-efficient finetuning) lags on long-form generation; CODE TÜLU 2 substantially boosts code ability but can reduce general conversational performance. All models, data, and code are released.
Problem Statement
Open instruction-tuned models evolve rapidly (better bases, distilled datasets, new finetuning methods). The paper asks: what combination of new base models, datasets, and training recipes gives the best open instruction-following models, and how do pragmatic choices (DPO, QLoRA, code-pretrained bases) trade off performance across tasks?
Main Contribution
TÜLU-V2-mix: a curated instruction dataset mixture (326k samples) combining high-quality human and distilled data.
TÜLU 2 model suite: LLaMA-2-based instruction-tuned models at 7B/13B/70B, plus CODE LLaMA variants (7B/13B/34B), all finetuned on the V2 mix.
DPO at scale: DPO (direct preference optimization) applied stably to 70B models (TÜLU 2+DPO 70B) with improved open-ended generation.
Empirical study of QLoRA vs full finetuning, and CODE LLAMA finetuning trade-offs across benchmarks.
Public release of weights, datasets, and finetuning/evaluation code.
Key Findings
The V2 data mixture improves average downstream performance over the prior V1 mixture.
DPO training raises open-ended generation quality (AlpacaEval) across sizes.
DPO training scales stably to 70B and yields the best MT-Bench score among the open-weight models the paper compares against.
QLoRA (quantized LoRA) underperforms full finetuning on long-form/open-ended generation.
CODE TÜLU 2 greatly improves coding performance but reduces some non-code capabilities.
DPO finetuning can reduce multilingual performance.
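Since several of the findings above hinge on DPO, it may help to see the objective written out. The following is a minimal sketch of the standard per-pair DPO loss, not the paper's training code. The function names, the example log-probabilities, and the default `beta=0.1` are assumptions chosen for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; tune for your own setup
) -> float:
    """Per-pair DPO loss from summed sequence log-probabilities.

    Each argument is log p(response | prompt) under either the policy being
    trained or the frozen reference (SFT) model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The loss falls as the policy prefers the chosen response more strongly
    # than the reference does, relative to the rejected response.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)
```

When the policy equals the reference, the margin is zero and the loss is log 2. Any shift of probability toward the chosen response lowers it from there.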
Results
AlpacaEval win-rate
MT-Bench average score
TydiQA (multilingual) performance drop with DPO
Codex-Eval (code)
Who Should Care
What To Try In 7 Days
Swap your instruction-finetuning data for TÜLU-V2-mix and retrain a smaller model to measure gains.
If you have preference pairs, try DPO on a mid-size model and check whether AlpacaEval/MT-Bench scores improve.
Test QLoRA only if compute-limited; validate on your open-ended generation tasks before deploying.
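To make "measure gains" concrete, one option is to compare a baseline run and a candidate run benchmark by benchmark instead of relying on a single average. The sketch below assumes every score is higher-is-better (so a metric such as ToxiGen % toxic would need to be inverted first). The function name and the scores are placeholders, not results from the paper.

```python
from statistics import mean


def compare_runs(baseline: dict, candidate: dict):
    """Return per-benchmark deltas, their naive average, and any regressions.

    Both inputs map benchmark name -> score; only shared benchmarks are used.
    """
    shared = sorted(set(baseline) & set(candidate))
    if not shared:
        raise ValueError("no benchmarks in common")
    deltas = {name: candidate[name] - baseline[name] for name in shared}
    regressions = [name for name in shared if deltas[name] < 0]
    return deltas, mean(deltas.values()), regressions


# Placeholder scores for illustration only (not taken from the paper).
baseline_scores = {"MMLU": 50.0, "GSM8k": 30.0, "AlpacaEval": 70.0}
candidate_scores = {"MMLU": 51.0, "GSM8k": 28.0, "AlpacaEval": 77.0}

deltas, avg_delta, regressions = compare_runs(baseline_scores, candidate_scores)
```

Reporting the regressions next to the average guards against the trade-offs noted in this summary, where a gain on open-ended generation can hide a drop on another task.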
Optimization Features
Infra Optimization
- TPU v3 pods (256/512 chips used)
Model Optimization
- LoRA
System Optimization
- Extended context length to 8,192 tokens
Training Optimization
- DPO (direct preference optimization)
- SFT
Reproducibility
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- DPO and SFT mixes are majority English; DPO can reduce multilingual performance without multilingual data.
- QLoRA underperforms for open-ended long-form generation and may not be a drop-in replacement for full finetuning.
- CODE TÜLU 2 improves code but can degrade general generation and instruction behavior.
When Not To Use
- Avoid DPO if multilingual outputs are required and multilingual preference data is missing.
- Avoid QLoRA for products that rely on long, open-ended, high-quality generations.
- Avoid a CODE TÜLU 2 base if your product needs the best general QA or reasoning across languages.
Failure Modes
- DPO increases verbosity, which may harm concise-answer use cases.
- Training-data contamination can bias evaluation (the paper omits some comparisons where contamination exists).
- Finetuning code-pretrained bases can shift capability trade-offs away from non-code tasks.
Core Entities
Models
- TÜLU 2
- TÜLU 2+DPO
- CODE TÜLU 2
- LLAMA-2
- CODE LLAMA
- LoRA
- DPO
Metrics
- Accuracy
- Pass@10
- AlpacaEval win-rate
- MT-Bench score
- ToxiGen % toxic
- TruthfulQA % informative+truthful
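For readers unfamiliar with Pass@10, the widely used unbiased pass@k estimator (introduced with HumanEval) can be sketched as follows. Whether the paper's evaluation code matches this line for line is an assumption; `n` (samples generated per problem) and `c` (samples that pass the unit tests) are illustrative parameter names.

```python
from math import comb  # requires Python 3.8+


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given that c of n generated samples passed the tests."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k contains a pass.
        return 1.0
    # 1 - P(all k drawn samples come from the n - c failing ones).
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

The benchmark-level score is the mean of the per-problem estimates, which is what `mean_pass_at_k` computes.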
Datasets
- TÜLU-V2-mix
- TÜLU-V1-mix
- ShareGPT
- FLAN v2
- WizardLM Evol-Instruct V2
- Open-Orca
- LIMA
- GPT4-Alpaca
- Code-Alpaca
- UltraFeedback
- Science literature mix
Benchmarks
- MMLU
- GSM8k
- BBH
- TydiQA
- Codex-Eval (HumanEval)
- AlpacaEval
- MT-Bench
- ToxiGen
- TruthfulQA
Context Entities
Models
- GPT-4
- GPT-3.5-turbo
- Zephyr-Beta
- Xwin-LM
- LLAMA-2-chat
Metrics
- Average (naive) across tasks
- Per-benchmark deltas
Datasets
- UltraChat
- OpenAssistant
- ShareGPT processed
Benchmarks
- AlpacaEval leaderboard
- MT-Bench leaderboard

