Overview
The paper provides usable models and data, but the models were trained with limited compute and evaluated on automatically translated benchmarks, so the results are practical but preliminary.
Citations: 2
Evidence Strength: 0.60
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 45%
Novelty: 25%
Why It Matters For Business
If you need Dutch-capable LLMs quickly, this work gives deployable models, translated instruction datasets and quantised weights so teams can iterate without building corpora from scratch.
Summary TLDR
This paper releases two Llama 2 13B finetunes for Dutch (a text-completion and a chat variant), four translated instruction/chat datasets (Dolly, Quora, Stack Overflow, Alpaca), translation scripts, quantised chat weights and a community leaderboard. Models were trained with limited compute (QLoRA, LoRA on a subset of layers) and evaluated on translated Dutch versions of ARC, HellaSwag, MMLU and TruthfulQA. Benchmarks show modest numeric gains from Dutch finetuning but clear practical improvement for conversational Dutch. All assets are published on Hugging Face.
Problem Statement
Dutch is underrepresented in LLMs: few pretrained Dutch models, scarce Dutch instruction/chat datasets and no dedicated generative-model leaderboards. This paper provides models, translated instruction datasets, translation code, and a Dutch-focused leaderboard to lower the entry bar for Dutch LLM work.
Main Contribution
Two Llama 2 13B finetuned models for Dutch: a causal text model and a chat (instruction-following) model; weights and quantised chat variants released.
Four translated instruction/chat datasets (Dolly-15k, Quora-chat, StackOverflow-chat, Alpaca-cleaned) plus scripts to translate datasets using local HF models or Azure OpenAI.
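The translation step described above can be pictured as a small loop with a pluggable backend. The following is a minimal sketch, not the paper's released scripts: `translate_records`, `fake_translate`, and the field names are hypothetical, and a real `translate_fn` would wrap a local Hugging Face model or an Azure OpenAI call.

```python
from typing import Callable, Dict, Iterable, List


def translate_records(
    records: Iterable[Dict[str, str]],
    translate_fn: Callable[[str], str],
    fields: List[str],
) -> List[Dict[str, str]]:
    """Return copies of `records` with the named text fields translated."""
    translated = []
    for record in records:
        new_record = dict(record)  # untranslated fields (ids, categories) stay as-is
        for field in fields:
            text = record.get(field, "")
            # Skip empty fields so the backend is not called needlessly.
            if text.strip():
                new_record[field] = translate_fn(text)
        translated.append(new_record)
    return translated


def fake_translate(text: str) -> str:
    """Placeholder backend (hypothetical): tags the text instead of translating it."""
    return f"[nl] {text}"


sample = [{"instruction": "Name three fruits.", "context": "", "category": "open_qa"}]
result = translate_records(sample, fake_translate, fields=["instruction", "context"])
print(result)
```

Keeping the backend behind a single callable lets the same loop run against either translation route, and copying each record leaves the source dataset untouched for later comparison.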
Key Findings
Two Dutch-tuned Llama 2 13B models were released: a text-completion model and a chat model.
Four instruction/chat datasets were translated to Dutch and published.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Avg. leaderboard score | 0.44 | zephyr-7b-beta 0.49 | -0.05 vs top | Table 1 (ARC, HellaSwag, MMLU, TruthfulQA) | Llama2-13b-ft-mc4_nl_cleaned_tiny avg 0.44 (Table 1) | Table 1 |
| TruthfulQA (0-shot) | 0.44 | llama-2-13b-chat-hf 0.43 | +0.01 | TruthfulQA (translated to Dutch) | Llama-2-13b-chat-dutch scored 0.44 (Table 1) | Table 1 |
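
The Delta column is the value minus the baseline. A minimal sketch that recomputes it from the two rows above; the `rows` dictionary and the `delta` helper are illustrative names, and the numbers are copied from the table.

```python
# Scores copied from the Results table: metric -> (value, baseline).
rows = {
    "Avg. leaderboard score": (0.44, 0.49),
    "TruthfulQA (0-shot)": (0.44, 0.43),
}


def delta(value: float, baseline: float) -> float:
    """Signed difference, rounded to the two decimals the table reports."""
    return round(value - baseline, 2)


deltas = {metric: delta(value, baseline) for metric, (value, baseline) in rows.items()}
for metric, d in deltas.items():
    print(f"{metric}: {d:+.2f}")
```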
What To Try In 7 Days
Download the chat-dutch quantised model and run a short user chat test for Dutch UX.
Use the translated Dolly/Alpaca datasets to fine-tune or instruction-tune a small model for a domain-specific Dutch assistant.
Benchmark your model on the provided leaderboard harness to compare against public Dutch and multilingual models.
Risks & Boundaries
Limitations
All translated instruction datasets were auto-translated with gpt-3.5-turbo and not manually validated.
Finetuning used parameter-efficient LoRA on only the q_proj and v_proj layers rather than full finetuning.
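To give a sense of how small a q_proj/v_proj-only adapter is, here is a rough parameter count. It is a sketch under stated assumptions: the hidden size (5120) and layer count (40) are the published Llama 2 13B dimensions, but the LoRA rank is not given in this summary, so `ASSUMED_RANK` is purely illustrative.

```python
HIDDEN_SIZE = 5120               # Llama 2 13B hidden dimension
NUM_LAYERS = 40                  # Llama 2 13B transformer blocks
TARGETS = ("q_proj", "v_proj")   # the projections named in the limitation above


def lora_params(rank: int) -> int:
    """Trainable parameters added by LoRA adapters of the given rank.

    Each adapted d_out x d_in weight gets two low-rank factors,
    A (rank x d_in) and B (d_out x rank): rank * (d_in + d_out) parameters.
    """
    per_matrix = rank * (HIDDEN_SIZE + HIDDEN_SIZE)
    return per_matrix * len(TARGETS) * NUM_LAYERS


ASSUMED_RANK = 8  # assumption for illustration; the rank used is not stated here
print(lora_params(ASSUMED_RANK))
```

At the assumed rank this comes to about 6.6 million trainable parameters, roughly 0.05% of the 13 billion base weights, which is why such a setup fits limited compute and also why it may adapt the model less than a full finetune would.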
When Not To Use
For high-stakes Dutch generation where fully validated fluency and factuality are required.
If you need the absolute best benchmark scores across reasoning and multi-domain factual tasks.
Failure Modes
Model may switch back to English mid-conversation.
Model may hallucinate nonexistent Dutch words or produce poor morphology and word order.
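The language-switching failure can be screened for during a short chat test with a crude heuristic. The sketch below is not a real language identifier: the two word lists are hand-picked assumptions and `looks_english` is a hypothetical helper, so flagged replies should still be read by a Dutch speaker.

```python
import re

# Tiny hand-picked function-word lists (an assumption, not a validated lexicon).
# Words shared by both languages, such as "of", "is", "in", "was", are left out.
DUTCH_WORDS = {"de", "het", "een", "en", "van", "ik", "je", "niet",
               "dat", "op", "voor", "met", "zijn", "maar", "ook"}
ENGLISH_WORDS = {"the", "and", "you", "not", "that", "for", "with",
                 "are", "but", "this", "have", "it", "to"}


def looks_english(reply: str) -> bool:
    """Return True when English function words outnumber Dutch ones."""
    tokens = re.findall(r"\w+", reply.lower())
    dutch = sum(token in DUTCH_WORDS for token in tokens)
    english = sum(token in ENGLISH_WORDS for token in tokens)
    return english > dutch


replies = [
    "Ik kan je helpen met het schrijven van een brief.",
    "Sure, I can help you with that and write the letter for you.",
]
flags = [looks_english(reply) for reply in replies]
print(flags)
```

Running this over every model turn in a test conversation gives a quick count of suspected English replies. It does not catch invented words or poor word order, which need manual review.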

