Two Dutch-tuned Llama 2 models, translated instruction datasets, and a Dutch leaderboard to jumpstart Dutch LLM work

Overview

Decision Snapshot: Needs Validation

The paper provides usable models and data, but the models were trained with limited compute and evaluated on automatically translated benchmarks, so the results are practical but preliminary.

Citations: 2

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 45%

Novelty: 25%

Authors

Bram Vanroy

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you need Dutch-capable LLMs quickly, this work gives deployable models, translated instruction datasets and quantised weights so teams can iterate without building corpora from scratch.

Who Should Care

ML Engineer, Product Manager, Founder

Summary TLDR

This paper releases two Llama 2 13B finetunes for Dutch (a text-completion and a chat variant), four translated instruction/chat datasets (Dolly, Quora, Stack Overflow, Alpaca), translation scripts, quantised chat weights and a community leaderboard. Models were trained with limited compute (QLoRA, LoRA on a subset of layers) and evaluated on translated Dutch versions of ARC, HellaSwag, MMLU and TruthfulQA. Benchmarks show modest numeric gains from Dutch finetuning but clear practical improvement for conversational Dutch. All assets are published on Hugging Face.

Problem Statement

Dutch is underrepresented in LLMs: few pretrained Dutch models, scarce Dutch instruction/chat datasets and no dedicated generative-model leaderboards. This paper provides models, translated instruction datasets, translation code, and a Dutch-focused leaderboard to lower the entry bar for Dutch LLM work.

Main Contribution

Two Llama 2 13B finetuned models for Dutch: a causal text model and a chat (instruction-following) model; weights and quantised chat variants released.

Four translated instruction/chat datasets (Dolly-15k, Quora-chat, StackOverflow-chat, Alpaca-cleaned) plus scripts to translate datasets using local HF models or Azure OpenAI.

Key Findings

Two Dutch-tuned Llama 2 13B models were released: a text-completion model and a chat model.

Numbers: Finetune compute of 120 GPU hours (text), ~55 GPU hours (chat)

Practical Use: You can test and deploy Dutch-capable 13B models today; expect modest resource needs to replicate small finetunes, but do not expect full pretrained-level quality.

Evidence Ref: Sec. 3; Sec. 5; model cards on Hugging Face

Four instruction/chat datasets were translated to Dutch and published.

Numbers: Sizes are Dolly 15k, Quora ~55k, StackOverflow ~56.9k, Alpaca ~51.7k

Practical Use: Use these translated datasets as a ready baseline for instruction-tuning or to bootstrap Dutch chat datasets; verify translations before high-stakes use.

Evidence Ref: Sec. 2; dataset links on Hugging Face
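
A minimal sketch of how the four translated datasets might be pulled for inspection, assuming the Hugging Face `datasets` library is installed. The repository IDs are read off the dataset URLs listed under Data URLs; the helper names `load_dutch_dataset` and `preview` are hypothetical. Split and column names are not stated in this summary, so the sketch prints whatever the Hub returns rather than assuming a schema.

```python
# Repository IDs taken from the dataset URLs listed in this summary.
DUTCH_DATASETS = {
    "dolly": "BramVanroy/dolly-15k-dutch",
    "quora": "BramVanroy/quora-chat-dutch",
    "stackoverflow": "BramVanroy/stackoverflow-chat-dutch",
    "alpaca": "BramVanroy/alpaca-cleaned-dutch",
}


def load_dutch_dataset(name: str):
    """Download one translated dataset from the Hugging Face Hub."""
    # Imported lazily so the mapping above is usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DUTCH_DATASETS[name])


def preview(name: str, n: int = 3) -> None:
    """Print split names, column names, sizes and the first n rows."""
    dataset_dict = load_dutch_dataset(name)
    for split_name, split in dataset_dict.items():
        print(split_name, split.column_names, len(split))
        for row in split.select(range(min(n, len(split)))):
            print(row)
```

Calling `preview("dolly")` downloads the data, so it needs network access. Reading a handful of rows by hand is a cheap first check on translation quality before any finetuning run.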

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Avg. leaderboard score	0.44	zephyr-7b-beta 0.49	-0.05 vs top	Table 1 (ARC, HellaSwag, MMLU, TruthfulQA)	Llama2-13b-ft-mc4_nl_cleaned_tiny avg 0.44 (Table 1)	Table 1
TruthfulQA (0-shot)	0.44	llama-2-13b-chat-hf 0.43	+0.01	TruthfulQA (translated to Dutch)	Llama-2-13b-chat-dutch scored 0.44 (Table 1)	Table 1
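
The Delta column is simply score minus baseline. The short sketch below recomputes it from the two rows above; the variable and function names are illustrative only, and the values are copied from the table.

```python
# (metric, model score, baseline name, baseline score), copied from the table above.
results = [
    ("Avg. leaderboard score", 0.44, "zephyr-7b-beta", 0.49),
    ("TruthfulQA (0-shot)", 0.44, "llama-2-13b-chat-hf", 0.43),
]


def delta(score: float, baseline: float) -> float:
    """Score minus baseline, rounded to the two decimals the table reports."""
    return round(score - baseline, 2)


for metric, score, baseline_name, baseline in results:
    print(f"{metric}: {delta(score, baseline):+.2f} vs {baseline_name}")
```

Both differences are within a few hundredths, which is why the summary treats the benchmark gains as modest.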

What To Try In 7 Days

Download the chat-dutch quantised model and run a short user chat test for Dutch UX.

Use the translated Dolly/Alpaca datasets to fine-tune or instruction-tune a small model for a domain-specific Dutch assistant.

Benchmark your model on the provided leaderboard harness to compare against public Dutch and multilingual models.

Agent Features

Architectures

decoder-only

Optimization Features

Model Optimization

4-bit quantization, LoRA

System Optimization

Fine-tune for full 4096 context coverage

Training Optimization

LoRA, FlashAttention

Inference Optimization

Quantised chat weights for efficient deployment
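
To give a rough sense of why quantised weights matter for deployment, the sketch below estimates weight memory for a 13B-parameter model at 16-bit and 4-bit precision. It is a back-of-the-envelope assumption: it uses the nominal 13 billion parameter count and ignores quantisation metadata (scales, zero points), activations and the KV cache, so real memory use will be higher. The function name is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 13e9  # nominal size of a Llama 2 13B model (assumption: exact count differs slightly)

fp16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{int4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the 16-bit footprint.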

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/BramVanroy/dutch-instruction-datasets

https://huggingface.co/BramVanroy/finetuned-llms-for-dutch-64f99dddbb86c0fa80846f89

https://huggingface.co/datasets/BramVanroy/dutch_chat_datasets

Data URLs

https://huggingface.co/datasets/BramVanroy/dolly-15k-dutch

https://huggingface.co/datasets/BramVanroy/quora-chat-dutch

https://huggingface.co/datasets/BramVanroy/stackoverflow-chat-dutch

https://huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch

https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned

Risks & Boundaries

Limitations

All translated instruction datasets were auto-translated with gpt-3.5-turbo and not manually validated.

Finetuning used parameter-efficient LoRA on only the q_proj and v_proj layers rather than a full finetune.
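
To illustrate how small a q_proj/v_proj-only LoRA is relative to the full model, the sketch below counts trainable adapter parameters. The hidden size (5120) and layer count (40) come from the public Llama 2 13B configuration, not from this summary, and the LoRA rank is not stated here, so the ranks used are purely hypothetical.

```python
HIDDEN_SIZE = 5120   # Llama 2 13B hidden dimension (from the public model config)
NUM_LAYERS = 40      # Llama 2 13B decoder layers (from the public model config)
TARGET_MODULES = ("q_proj", "v_proj")  # the only adapted modules, per the limitation above


def lora_trainable_params(rank: int) -> int:
    """Count LoRA parameters when each d x d projection gets A (rank x d) and B (d x rank)."""
    per_module = rank * (HIDDEN_SIZE + HIDDEN_SIZE)
    return per_module * len(TARGET_MODULES) * NUM_LAYERS


for rank in (8, 16, 64):  # hypothetical ranks; the summary does not give the real one
    n = lora_trainable_params(rank)
    print(f"rank={rank}: {n:,} trainable parameters ({n / 13e9:.3%} of ~13B)")
```

Even at the larger hypothetical ranks, the adapters stay well under one percent of the base model, which is consistent with the limited compute budget reported and also explains why this is not equivalent to a full finetune.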

When Not To Use

For high-stakes Dutch generation where fully validated fluency and factuality are required.

If you need the absolute best benchmark scores across reasoning and multi-domain factual tasks.

Failure Modes

Model may switch back to English mid-conversation.

Model may hallucinate non-existent Dutch words or produce poor morphology and word order.

Core Entities

Models

llama2-13b-ft-mc4_nl_cleaned_tiny, Llama-2-13b-chat-dutch, llama-2-13b-chat-hf, llama-2-13b-hf, zephyr-7b-beta, geitje-7b, geitje-7b-chat, mistral-7b-v0.1, neural-chat-7b-v3-1, orca-2-13b, orca-2-7b, gpt2-medium-dutch, gpt-neo-1.3b-dutch

Metrics

average_score, Accuracy

Datasets

mc4_nl_cleaned_tiny, dolly-15k-dutch, quora-chat-dutch, stackoverflow-chat-dutch, alpaca-cleaned-dutch

Benchmarks

ARC, HellaSwag, MMLU, TruthfulQA

