Two Dutch-tuned Llama 2 models, translated instruction datasets, and a Dutch leaderboard to jumpstart Dutch LLM work

December 20, 20237 min

Overview

Decision SnapshotNeeds Validation

The paper provides usable models and data but models were trained with limited compute and evaluated on automatically translated benchmarks, so results are practical but preliminary.

Citations2

Evidence Strength0.60

Confidence0.80

Risk Signals11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 45%

Novelty: 25%

Authors

Bram Vanroy

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you need Dutch-capable LLMs quickly, this work gives deployable models, translated instruction datasets and quantised weights so teams can iterate without building corpora from scratch.

Who Should Care

Summary TLDR

This paper releases two Llama 2 13B finetunes for Dutch (a text-completion and a chat variant), four translated instruction/chat datasets (Dolly, Quora, Stack Overflow, Alpaca), translation scripts, quantised chat weights and a community leaderboard. Models were trained with limited compute (QLoRA, LoRA on a subset of layers) and evaluated on translated Dutch versions of ARC, HellaSwag, MMLU and TruthfulQA. Benchmarks show modest numeric gains from Dutch finetuning but clear practical improvement for conversational Dutch. All assets are published on Hugging Face.

Problem Statement

Dutch is underrepresented in LLMs: few pretrained Dutch models, scarce Dutch instruction/chat datasets and no dedicated generative-model leaderboards. This paper provides models, translated instruction datasets, translation code, and a Dutch-focused leaderboard to lower the entry bar for Dutch LLM work.

Main Contribution

Two Llama 2 13B finetuned models for Dutch: a causal text model and a chat (instruction-following) model; weights and quantised chat variants released.

Four translated instruction/chat datasets (Dolly-15k, Quora-chat, StackOverflow-chat, Alpaca-cleaned) plus scripts to translate datasets using local HF models or Azure OpenAI.

Key Findings

Two Dutch-tuned Llama 2 13B models were released: a text-completion model and a chat model.

NumbersFinetune compute: 120 GPU hours (text), ~55 GPU hours (chat)

Practical UseYou can test and deploy Dutch-capable 13B models today; expect modest resource needs to replicate small finetunes but do not expect full pretrained-level quality.

Evidence RefSec. 3; Sec. 5; model cards on Hugging Face

Four instruction/chat datasets were translated to Dutch and published.

NumbersSizes: Dolly 15k, Quora ~55k, StackOverflow ~56.9k, Alpaca ~51.7k

Practical UseUse these translated datasets as a ready baseline for instruction-tuning or to bootstrap Dutch chat datasets; verify translations before high-stakes use.

Evidence RefSec. 2; dataset links on Hugging Face

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Avg. leaderboard score0.44zephyr-7b-beta 0.49-0.05 vs topTable 1 (ARC, HellaSwag, MMLU, TruthfulQA)Llama2-13b-ft-mc4_nl_cleaned_tiny avg 0.44 (Table 1)Table 1
TruthfulQA (0-shot)0.44llama-2-13b-chat-hf 0.43+0.01TruthfulQA (translated to Dutch)Llama-2-13b-chat-dutch scored 0.44 (Table 1)Table 1

What To Try In 7 Days

Download the chat-dutch quantised model and run a short user chat test for Dutch UX.

Use the translated Dolly/Alpaca datasets to fine-tune or instruction-tune a small model for a domain-specific Dutch assistant.

Benchmark your model on the provided leaderboard harness to compare against public Dutch and multilingual models.

Agent Features

Architectures
decoder-only

Optimization Features

Model Optimization
4-bit quantizationLoRA
System Optimization
Fine-tune for full 4096 context coverage
Training Optimization
LoRAFlashAttention
Inference Optimization
Quantised chat weights for efficient deployment

Reproducibility

Risks & Boundaries

Limitations

All translated instruction datasets were auto-translated with gpt-3.5-turbo and not manually validated.

Finetuning used parameter-efficient LoRA on only q_proj and v_proj layers, not full finetune.

When Not To Use

For high-stakes Dutch generation where fully validated fluency and factuality are required.

If you need the absolute best benchmark scores across reasoning and multi-domain factual tasks.

Failure Modes

Model may switch back to English mid-conversation.

Hallucinate non-existing Dutch words or produce poor morphology and word order.

Core Entities

Models

llama2-13b-ft-mc4_nl_cleaned_tinyLlama-2-13b-chat-dutchllama-2-13b-chat-hfllama-2-13b-hfzephyr-7b-betageitje-7bgeitje-7b-chatmistral-7b-v0.1neural-chat-7b-v3-1orca-2-13borca-2-7bgpt2-medium-dutchgpt-neo-1.3b-dutch

Metrics

average_scoreAccuracy

Datasets

mc4_nl_cleaned_tinydolly-15k-dutchquora-chat-dutchstackoverflow-chat-dutchalpaca-cleaned-dutch

Benchmarks

ARCHellaSwagMMLUTruthfulQA