Overview
Production Readiness
0.45
Novelty Score
0.25
Cost Impact Score
0.4
Citation Count
2
Why It Matters For Business
If you need Dutch-capable LLMs quickly, this work gives deployable models, translated instruction datasets and quantised weights so teams can iterate without building corpora from scratch.
Summary TLDR
This paper releases two Llama 2 13B finetunes for Dutch (a text-completion and a chat variant), four translated instruction/chat datasets (Dolly, Quora, Stack Overflow, Alpaca), translation scripts, quantised chat weights and a community leaderboard. Models were trained with limited compute (QLoRA, LoRA on a subset of layers) and evaluated on translated Dutch versions of ARC, HellaSwag, MMLU and TruthfulQA. Benchmarks show modest numeric gains from Dutch finetuning but clear practical improvement for conversational Dutch. All assets are published on Hugging Face.
Problem Statement
Dutch is underrepresented in LLMs: few pretrained Dutch models, scarce Dutch instruction/chat datasets and no dedicated generative-model leaderboards. This paper provides models, translated instruction datasets, translation code, and a Dutch-focused leaderboard to lower the entry bar for Dutch LLM work.
Main Contribution
Two Llama 2 13B finetuned models for Dutch: a causal text model and a chat (instruction-following) model; weights and quantised chat variants released.
Four translated instruction/chat datasets (Dolly-15k, Quora-chat, StackOverflow-chat, Alpaca-cleaned) plus scripts to translate datasets using local HF models or Azure OpenAI.
A public, continuously updated Dutch generative-model leaderboard with benchmark results and model metadata.
Practical notes on training with limited compute (QLoRA / LoRA, flash attention) and evaluation caveats for translated benchmarks.
Key Findings
Two Dutch-tuned Llama 2 13B models were released: a text-completion model and a chat model.
Four instruction/chat datasets were translated to Dutch and published.
On translated Dutch benchmarks, Mistral-based models (e.g., Zephyr) outperform Llama 2 based models.
Finetuning Llama 2 on Dutch changed benchmark scores only slightly but improved conversational behavior in Dutch.
Small Dutch-only GPT models with short context windows perform worse on few/multi-shot tasks.
Results
Avg. leaderboard score
TruthfulQA (0-shot)
ARC (25-shot)
Who Should Care
What To Try In 7 Days
Download the chat-dutch quantised model and run a short user chat test for Dutch UX.
Use the translated Dolly/Alpaca datasets to fine-tune or instruction-tune a small model for a domain-specific Dutch assistant.
Benchmark your model on the provided leaderboard harness to compare against public Dutch and multilingual models.
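As a starting point for the dataset step above, the sketch below turns one instruction record into a single training string. The field names (`instruction`, `context`, `response`) follow the original English Dolly-15k schema and are an assumption for the Dutch translation. The prompt template, the split name and both helper names are hypothetical, so check the dataset card before relying on them.

```python
from typing import Dict, Iterator


def format_record(record: Dict[str, str]) -> str:
    """Build one prompt/response training string from a Dolly-style record.

    Assumes keys 'instruction', 'context' (may be empty) and 'response'.
    The Dutch section labels are an illustrative template, not the paper's.
    """
    parts = [f"### Instructie:\n{record['instruction'].strip()}"]
    context = (record.get("context") or "").strip()
    if context:
        parts.append(f"### Context:\n{context}")
    parts.append(f"### Antwoord:\n{record['response'].strip()}")
    return "\n\n".join(parts)


def iter_training_texts(dataset_id: str = "BramVanroy/dolly-15k-dutch") -> Iterator[str]:
    """Yield formatted texts; needs the `datasets` package and network access."""
    from datasets import load_dataset  # lazy import so format_record works offline

    # The split name "train" is an assumption; the published splits may differ.
    for record in load_dataset(dataset_id, split="train"):
        yield format_record(record)


example = {
    "instruction": "Wat is de hoofdstad van Nederland?",
    "context": "",
    "response": "Amsterdam.",
}
print(format_record(example))
```

Keeping the formatting in a pure function makes it easy to unit-test the template before spending GPU time on a finetuning run.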
Agent Features
Architectures
- decoder-only
Optimization Features
Model Optimization
- 4-bit quantization
- LoRA
System Optimization
- Fine-tune for full 4096 context coverage
Training Optimization
- LoRA
- FlashAttention
Inference Optimization
- Quantised chat weights for efficient deployment
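To give a feel for why quantised weights matter for deployment, here is a back-of-the-envelope estimate of weight memory at different precisions. It assumes a nominal 13 billion parameters and counts the weights only. Quantisation constants, activations and the KV cache are ignored, so real usage will be higher.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 13e9  # nominal size of a "13B" model; the exact count differs slightly

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(N_PARAMS, bits):5.1f} GiB")
```

Under these assumptions the weights shrink from roughly 24 GiB at fp16 to roughly 6 GiB at 4 bits, which is the difference between needing a data-centre GPU and fitting on a single consumer card.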
Reproducibility
Code Urls
Data Urls
- https://huggingface.co/datasets/BramVanroy/dolly-15k-dutch
- https://huggingface.co/datasets/BramVanroy/quora-chat-dutch
- https://huggingface.co/datasets/BramVanroy/stackoverflow-chat-dutch
- https://huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch
- https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- All translated instruction datasets were auto-translated with gpt-3.5-turbo and not manually validated.
- Finetuning used parameter-efficient LoRA on only the q_proj and v_proj layers, not a full finetune.
- Benchmarks are automatic translations; translationese may bias results toward English-trained models.
- No statistical significance or confidence intervals reported for benchmark comparisons.
- Training was compute-limited (120 and ~55 GPU hours), so improvements may be conservative.
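The q_proj/v_proj limitation can be made concrete by counting how many parameters such an adapter trains. The sketch below uses the Llama 2 13B shape (hidden size 5120, 40 layers, square query and value projections). The LoRA rank is not stated in this summary, so the ranks shown are hypothetical.

```python
HIDDEN_SIZE = 5120   # Llama 2 13B hidden dimension
NUM_LAYERS = 40      # Llama 2 13B transformer blocks
TARGET_MODULES = ("q_proj", "v_proj")  # the layers named in the limitation above


def lora_trainable_params(rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) per adapted matrix."""
    per_module = rank * (HIDDEN_SIZE + HIDDEN_SIZE)  # square 5120 x 5120 projections
    return per_module * len(TARGET_MODULES) * NUM_LAYERS


for rank in (8, 16, 64):  # hypothetical ranks; the summary does not state one
    n = lora_trainable_params(rank)
    print(f"rank={rank:>2}: {n:,} trainable parameters ({n / 13e9:.3%} of ~13B)")
```

Even at rank 64 the adapter touches well under one percent of the model's parameters, which is consistent with the caveat that the reported improvements may be conservative.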
When Not To Use
- For high-stakes Dutch generation where fully validated fluency and factuality are required.
- If you need the absolute best benchmark scores across reasoning and multi-domain factual tasks.
Failure Modes
- Model may switch back to English mid-conversation.
- Model may hallucinate nonexistent Dutch words or produce poor morphology and word order.
- Performance drops on few/multi-shot tasks for models with small context windows (512 tokens).
- Translated benchmarks may overstate non-Dutch model strengths due to translationese.
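The context-window failure mode is largely a matter of arithmetic: a 25-shot ARC prompt has to fit 25 demonstrations plus the test question. The sketch below checks this for a 512-token and a 4096-token window. The per-example token count and the answer reserve are hypothetical values chosen for illustration, not measurements from the benchmark.

```python
def fits_in_context(n_shots: int, tokens_per_example: int,
                    context_window: int, reserve_for_answer: int = 16) -> bool:
    """True if n_shots demonstrations plus the test question fit in the window."""
    prompt_tokens = (n_shots + 1) * tokens_per_example
    return prompt_tokens + reserve_for_answer <= context_window


def max_shots(tokens_per_example: int, context_window: int,
              reserve_for_answer: int = 16) -> int:
    """Largest number of demonstrations that still leaves room for the question."""
    usable = context_window - reserve_for_answer
    return max(usable // tokens_per_example - 1, 0)


TOKENS_PER_EXAMPLE = 60  # hypothetical average length of one ARC-style item

for window in (512, 4096):
    print(window,
          fits_in_context(25, TOKENS_PER_EXAMPLE, window),
          max_shots(TOKENS_PER_EXAMPLE, window))
```

With these assumed lengths a 512-token model can see only a handful of demonstrations, so its few-shot score is not directly comparable with that of a model that sees all 25.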
Core Entities
Models
- llama2-13b-ft-mc4_nl_cleaned_tiny
- Llama-2-13b-chat-dutch
- llama-2-13b-chat-hf
- llama-2-13b-hf
- zephyr-7b-beta
- geitje-7b
- geitje-7b-chat
- mistral-7b-v0.1
- neural-chat-7b-v3-1
- orca-2-13b
- orca-2-7b
- gpt2-medium-dutch
- gpt-neo-1.3b-dutch
Metrics
- average_score
- Accuracy
Datasets
- mc4_nl_cleaned_tiny
- dolly-15k-dutch
- quora-chat-dutch
- stackoverflow-chat-dutch
- alpaca-cleaned-dutch
Benchmarks
- ARC
- HellaSwag
- MMLU
- TruthfulQA
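A minimal sketch of how an average_score could be derived from the four benchmarks, assuming it is the unweighted mean of the per-benchmark accuracies. This summary does not state the leaderboard's formula, so treat that as an assumption. The numbers below are placeholders, not reported results.

```python
from statistics import mean
from typing import Dict

BENCHMARKS = ("ARC", "HellaSwag", "MMLU", "TruthfulQA")


def average_score(scores: Dict[str, float]) -> float:
    """Unweighted mean over the four benchmarks; fails loudly if one is missing."""
    missing = [name for name in BENCHMARKS if name not in scores]
    if missing:
        raise ValueError(f"missing benchmark scores: {missing}")
    return mean(scores[name] for name in BENCHMARKS)


# Placeholder accuracies for illustration only.
placeholder = {"ARC": 0.40, "HellaSwag": 0.50, "MMLU": 0.45, "TruthfulQA": 0.45}
print(round(average_score(placeholder), 4))
```

Refusing to average over a partial set of benchmarks avoids ranking a model on fewer tasks than its competitors.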

