Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
2
Why It Matters For Business
Open, compact Dutch LLMs let teams run fast, inexpensive inference and reproduce experiments; modern multilingual small models often beat older larger Dutch models, so try recent small multilingual options before costly full retraining.
Summary TLDR
Fietje is a family of openly released, Dutch-focused small language models (base, instruct, chat) built by continued pretraining of Phi-2 (≈2.78B parameters) on 28 billion cleaned Dutch tokens. The project prioritizes reproducibility: weights, datasets, configs, and evaluation code are public. Benchmarks show Fietje is competitive for its size, and instruction and chat tuning improve results markedly, but newer small multilingual models (e.g., Qwen 2.5, Phi-3.5) soon outperformed it. Use Fietje when you need a transparent, lightweight Dutch model you can reproduce and extend; re-evaluate if you need state-of-the-art Dutch performance today.
Problem Statement
Dutch-language users lack high-quality, open, compact LLMs and reproducible training pipelines. The paper adapts an English-centric small model (Phi-2) to Dutch through continued pretraining and post-training, producing an open, usable family of Dutch LLMs, and evaluates the models on multiple Dutch benchmarks.
Main Contribution
Created Fietje family (base, instruct, chat) by continued pretraining of Phi-2 on 28B Dutch tokens.
Open release: model weights, filtered datasets, training configs, and evaluation code on GitHub and Hugging Face.
Extensive zero-shot benchmark suite for Dutch (Global MMLU, ARC, DBRD, Dutch CoLA, XLWIC-NL) with confidence intervals and throughput measures.
Showed that instruction and preference tuning (SFT + DPO) substantially improve performance over the continued-pretrained base model.
Analysis comparing Fietje to contemporary small multilingual and Dutch-adapted models, highlighting tokenizer and data-mix impacts.
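The confidence intervals mentioned in the benchmark suite can be illustrated with a percentile bootstrap. This is only a minimal sketch: the summary does not say which interval method the authors used, so the bootstrap itself, the function name `bootstrap_ci`, and the toy data are all assumptions.

```python
import random

def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for mean accuracy.

    `correct` is a list of 0/1 flags, one per benchmark example.
    (Assumed method; the source does not specify how its CIs were computed.)
    """
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample the examples with replacement and record the mean.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Toy data: 70 correct answers out of 100 (illustrative, not from the paper).
flags = [1] * 70 + [0] * 30
low, high = bootstrap_ci(flags)
print(f"accuracy=0.70, 95% CI=({low:.3f}, {high:.3f})")
```

Overlapping intervals between two models are a hint that a ranking difference on that task may not be meaningful.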
Key Findings
Fietje was trained with continued pretraining on 28 billion Dutch tokens.
All three Fietje variants (base, instruct, chat) have ≈2.78B parameters; the tokenizer's wiki fertility is 2.05 tokens per word.
Fietje Chat runs fast: ~9501 tokens/sec on Dutch Wikipedia on an RTX 3090 (bfloat16, FlashAttention2).
Instruction and chat tuning improved Fietje over the base model; Fietje Chat outperformed larger Dutch models on some tasks.
New small multilingual models outperformed many Dutch-adapted larger models on benchmarks.
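The throughput and fertility figures above combine into a rough words-per-second estimate. The sketch below is plain arithmetic on the two reported numbers; the helper names are hypothetical, and it assumes the throughput and fertility were measured on comparable text.

```python
def words_per_second(tokens_per_second, fertility):
    """Convert token throughput to word throughput (fertility = tokens per word)."""
    return tokens_per_second / fertility

def seconds_to_process(n_words, tokens_per_second, fertility):
    """Estimated wall-clock time to process `n_words` words of text."""
    return n_words * fertility / tokens_per_second

# Reported figures: ~9501 tokens/sec and 2.05 tokens per word.
wps = words_per_second(9501, 2.05)
print(f"{wps:.0f} words/sec")
print(f"{seconds_to_process(1_000_000, 9501, 2.05):.1f} s per million words")
```

A tokenizer with lower fertility would raise the effective words-per-second rate at the same token throughput, which is why fertility matters for cost.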
Results
training data size: 28B Dutch tokens
model size: ≈2.78B parameters
wiki fertility (Fietje): 2.05 tokens per word
throughput (Fietje wiki tps): ~9501 tokens/sec
Global MMLU (top models)
DBRD (sentiment) top score
Who Should Care
What To Try In 7 Days
Reproduce Fietje training/eval via the provided GitHub to understand data filters and configs.
Benchmark a recent small multilingual model (Qwen 2.5 or Phi-3.5) on your Dutch tasks before investing in adaptation.
Measure tokenizer fertility on your Dutch data; consider tokenizer updates to cut token costs if fertility >1.6 tokens/word.
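Measuring fertility needs only a tokenizer and a text sample. The sketch below assumes fertility means tokens divided by whitespace-delimited words; the summary does not say how the paper counts words, and `toy_tokenize` is a hypothetical stand-in for a real tokenizer.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-delimited word."""
    n_words = sum(len(t.split()) for t in texts)
    n_tokens = sum(len(tokenize(t)) for t in texts)
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words

def toy_tokenize(text):
    """Hypothetical tokenizer: splits each word into chunks of at most 4 characters."""
    pieces = []
    for word in text.split():
        pieces.extend(word[i:i + 4] for i in range(0, len(word), 4))
    return pieces

sample = ["De kat zit op de mat", "Taalmodellen verwerken tekst"]
print(f"fertility = {fertility(sample, toy_tokenize):.2f} tokens/word")
```

To measure a real model, pass its tokenizer's encoding function as `tokenize` and use a representative sample of your own Dutch data.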
Agent Features
Architectures
- decoder-only Transformer
Optimization Features
Token Efficiency
- tokenizer fertility drives token cost; Fietje uses the Phi-2 tokenizer (2.05 tokens per word).
Infra Optimization
- training on 16 A100 80GB GPUs (reported) and benchmarks on RTX 3090
Model Optimization
- continued pretraining (efficient reuse of the base model)
- SFT
System Optimization
- benchmarks run with constrained decoding (Outlines) to avoid label hallucinations
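The idea behind constrained decoding for classification can be shown without any library. This sketch is not the Outlines API: it only illustrates restricting the answer to a fixed label set, and the function name, scores, and labels are hypothetical.

```python
def constrained_choice(label_scores, allowed_labels):
    """Return the highest-scoring label among `allowed_labels`.

    `label_scores` maps candidate output strings to model scores
    (for example log-probabilities); anything outside the label set is ignored.
    """
    candidates = {
        lab: label_scores[lab] for lab in allowed_labels if lab in label_scores
    }
    if not candidates:
        raise ValueError("no allowed label has a score")
    return max(candidates, key=candidates.get)

# An unconstrained model would answer "misschien", which is not a valid label.
scores = {"positief": -1.2, "negatief": -0.9, "misschien": -0.3}
print(constrained_choice(scores, ["positief", "negatief"]))
```

Real constrained decoding applies the restriction token by token during generation, but the effect on a classification task is the same: the output is always one of the valid labels.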
Training Optimization
- use of alignment-handbook configs for reproducible runs
- training in bfloat16 with FlashAttention2 enabled
Inference Optimization
- quantized model versions published on Hugging Face
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Started from Phi-2; newer base models quickly surpassed it, so adaptation lags behind new releases.
- Training data covers Dutch CulturaX and Wikipedia only; lacks math and code content found in later models.
- Instruction and preference datasets for Dutch are small and largely synthetic, limiting post-training quality.
- Benchmarks are zero-shot only and use a single prompt template per task, which may under- or overestimate some models.
- Machine-translated benchmarks (ARC, Global MMLU) can introduce translation biases ('translationese').
When Not To Use
- If you need state-of-the-art Dutch accuracy today; recent small multilingual models may outperform Fietje.
- For fluent Dutch text generation, unless you add further post-training on native conversational data.
- When high-quality domain-specific code or math reasoning is required; Fietje's pretraining mix lacks such content.
Failure Modes
- DPO preference tuning risks hallucinations or catastrophic forgetting if hyperparameters (beta) are mis-tuned.
- Poor tokenization increases cost and can slow processing (high fertility with English-centric tokenizer).
- Performance can vary by task: strong on knowledge/classification but weaker on word sense disambiguation (XLWIC-NL).
- Benchmarks using translated data may misrepresent true language fluency and cultural nuance.
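The role of beta in the DPO failure mode above is easier to see in the loss itself. Below is a minimal sketch of the standard DPO loss for a single preference pair; the log-probability values and the default `beta=0.1` are illustrative assumptions, not settings taken from the Fietje training configs.

```python
import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities.

    `pol_*` come from the model being tuned, `ref_*` from the frozen reference.
    """
    # How much more the policy prefers the chosen answer than the reference does.
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)), computed stably as softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Illustrative log-probabilities (assumed values).
for beta in (0.01, 0.1, 0.5):
    print(beta, round(dpo_loss(-10.0, -12.0, -11.0, -11.0, beta=beta), 4))
```

Beta scales how strongly the same margin is rewarded, so it controls how far the tuned model is pushed away from the reference; this is why a poorly chosen value can degrade behavior learned earlier.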
Core Entities
Models
- Fietje (fietje-2b, instruct, chat)
- Phi-2
- GEITje-7B-ultra
- Boreas-7B
- Qwen2.5-3B-Instruct
- Phi-3.5-mini-instruct
- Llama-3.2-3B-Instruct
- Tweety-7b-dutch-v24a
- Mistral-7B-Instruct-v0.1
Metrics
- weighted F1
- wiki fertility (tokens per word)
- tokens-per-second (wiki tps)
- processing time (wiki s)
- 95% confidence intervals
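Weighted F1, the first metric listed above, averages per-class F1 scores weighted by each class's share of the gold labels. The following is a minimal pure-Python sketch with toy labels; it is not the paper's evaluation code.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of `y_true`."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is at least n > 0.
        f1 = 2 * tp / (2 * tp + fp + fn)
        total += f1 * n
    return total / len(y_true)

# Toy sentiment labels (illustrative only).
gold = ["pos", "pos", "pos", "neg"]
pred = ["pos", "pos", "neg", "neg"]
print(f"weighted F1 = {weighted_f1(gold, pred):.3f}")
```

Weighting by support keeps a rare class from dominating the score, which matters on benchmarks with imbalanced labels.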
Datasets
- CulturaX (Dutch subset)
- Dutch Wikipedia (Nov 2023 dump)
- UltraChat 200K Dutch
- No Robots Dutch
- Belebele
- UltraFeedback Dutch Cleaned
- Orca DPO Pairs Dutch Cleaned
Benchmarks
- Global MMLU (Dutch)
- ARC (translated to Dutch)
- DBRD (Dutch Book Reviews)
- Dutch CoLA
- XLWIC-NL (XL-WiC Dutch)
Context Entities
Models
- Phi-3.5
- Qwen 2.5 family
- Mistral 7B family
- GEITje 7B
Metrics
- fertility comparisons
- median ranking across tasks
Datasets
- mC4 (subset used by Tweety)
- SONAR-500 (quality baseline)
Benchmarks
- ScandEval (referenced)
- MMLU original (English)

