Fietje: open, compact Dutch LLM (2.8B) trained on 28B Dutch tokens with full reproducibility

December 19, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

2

Authors

Bram Vanroy

Links

Abstract / PDF

Why It Matters For Business

Open, compact Dutch LLMs let teams run fast, inexpensive inference and reproduce experiments. Recent small multilingual models often beat older, larger Dutch-adapted models, so benchmark those first before investing in costly language-specific adaptation.

Summary TLDR

Fietje is a family of openly released, Dutch-focused small language models (base, instruct, chat) built by continued pretraining of Phi-2 (≈2.78B parameters) on 28 billion cleaned Dutch tokens. The project prioritizes reproducibility: weights, datasets, configs, and evaluation code are public. Benchmarks show Fietje is competitive for its size, and instruction and chat tuning improve results markedly, but newer small multilingual models (e.g., Qwen2.5, Phi-3.5) soon outperformed it. Use Fietje when you need a transparent, lightweight Dutch model you can reproduce and extend; re-evaluate if you need state-of-the-art Dutch performance today.

Problem Statement

Dutch-language users lack high-quality, open, compact LLMs and reproducible pipelines. The paper adapts an English-centric small model (Phi-2) to Dutch through continued pretraining and post-training, producing an open, usable Dutch LLM family that it evaluates on multiple Dutch benchmarks.

Main Contribution

Created the Fietje family (base, instruct, chat) by continued pretraining of Phi-2 on 28B Dutch tokens.

Open release: model weights, filtered datasets, training configs, and evaluation code on GitHub and Hugging Face.

Extensive zero-shot benchmark suite for Dutch (Global MMLU, ARC, DBRD, Dutch CoLA, XLWIC-NL) with confidence intervals and throughput measures.

Showed that instruction and preference tuning (SFT + DPO) substantially improve performance over the base model obtained from continued pretraining.

Analysis comparing Fietje to contemporary small multilingual and Dutch-adapted models, highlighting tokenizer and data-mix impacts.

Key Findings

Fietje was built by continued pretraining on 28 billion Dutch tokens.

Numbers: 28B Dutch tokens

All three Fietje variants (base, instruct, chat) have ≈2.78B parameters; the tokenizer's fertility on Dutch Wikipedia is 2.05 tokens per word.

Numbers: 2.78B params; wiki fertility = 2.05

Fietje Chat runs fast: ~9501 tokens/sec on Dutch Wikipedia on an RTX 3090 (bfloat16, FlashAttention2).

Numbers: wiki tps = 9501.41 ± 0.66

Instruction and chat tuning improved Fietje vs base; Fietje Chat outperformed larger Dutch models on some tasks.

Numbers: Fietje-2b-chat Global MMLU = 26.36 ± 0.25; median rank improved vs. base

New small multilingual models outperformed many Dutch-adapted larger models on benchmarks.

Numbers: Qwen2.5 Global MMLU = 50.33 ± 0.14; Phi-3.5 = 48.34 ± 0.10

Results

training data size

Value: 28B tokens (Dutch CulturaX + Wikipedia subset)

model size

Value: ≈2.78B parameters (Phi-2 base)

wiki fertility (Fietje)

Value: 2.05 tokens/word

Baseline: Tweety 1.41 tokens/word

throughput (Fietje wiki tps)

Value: 9501.41 ± 0.66 tokens/sec

Baseline: GEITje-7B-ultra 4035.27 ± 0.64 tokens/sec

Global MMLU (top models)

Value: Qwen2.5 = 50.33 ± 0.14; Phi-3.5 = 48.34 ± 0.10; Fietje-chat = 26.36 ± 0.25

DBRD (sentiment) top score

Value: Boreas-7B-chat = 94.38 ± 0.27

Baseline: state-of-the-art fine-tuned encoder = 95.14 F1 (Delobelle et al. 2020)
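
The throughput and fertility figures reported above can be turned into relative comparisons with simple arithmetic. The sketch below uses only numbers from this summary; the variable names are illustrative. One caveat: the summary does not give GEITje-7B-ultra's fertility, so the throughput ratio compares tokens per second, not words per second.

```python
# Figures as reported in the Results above.
fietje_tps = 9501.41       # Fietje, tokens/sec on Dutch Wikipedia
geitje_tps = 4035.27       # GEITje-7B-ultra, same measurement
fietje_fertility = 2.05    # tokens per word (Phi-2 tokenizer)
tweety_fertility = 1.41    # tokens per word (Tweety baseline)

# How many times more tokens per second Fietje processes.
token_speedup = fietje_tps / geitje_tps

# How many times more tokens Fietje's tokenizer needs for the same text.
token_overhead = fietje_fertility / tweety_fertility

# Words per second = tokens per second / tokens per word.
fietje_wps = fietje_tps / fietje_fertility

print(f"token throughput ratio: {token_speedup:.2f}x")
print(f"tokenizer overhead vs Tweety: {token_overhead:.2f}x")
print(f"Fietje words/sec: {fietje_wps:.0f}")
```

Read together, the two ratios show that a high token throughput is partly offset when the tokenizer needs more tokens for the same Dutch text.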

Who Should Care

What To Try In 7 Days

Reproduce Fietje's training and evaluation using the provided GitHub repository to understand the data filters and configs.

Benchmark a recent small multilingual model (Qwen2.5 or Phi-3.5) on your Dutch tasks before investing in adaptation.

Measure tokenizer fertility on your Dutch data; consider tokenizer updates to cut token costs if fertility >1.6 tokens/word.
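
Fertility is the number of tokens divided by the number of words, so it can be measured on your own data in a few lines. The following is a minimal sketch, assuming whitespace-separated words (the paper's exact word-splitting rule may differ). `toy_tokenize` is a hypothetical stand-in so the example runs without downloading a model.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List]) -> float:
    """Average number of tokens per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(tokenize(text))
        n_words += len(text.split())
    if n_words == 0:
        raise ValueError("no words found in the input texts")
    return n_tokens / n_words


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: cuts each word into 4-character pieces."""
    pieces = []
    for word in text.split():
        pieces.extend(word[i:i + 4] for i in range(0, len(word), 4))
    return pieces


sample = ["De fiets staat in de schuur", "Taalmodellen verwerken tekst"]
score = fertility(sample, toy_tokenize)
print(f"fertility: {score:.2f} tokens/word")
if score > 1.6:  # threshold suggested in the list above
    print("consider a tokenizer better suited to Dutch")
```

To measure a real model, replace `toy_tokenize` with the tokenizer under test, for example the `tokenize` method of a Hugging Face tokenizer, and run it on a representative sample of your own Dutch text.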

Agent Features

Architectures

  • decoder-only Transformer

Optimization Features

Token Efficiency

  • tokenizer fertility drives token cost; Fietje keeps the Phi-2 tokenizer (2.05 tokens/word).

Infra Optimization

  • training on 16 A100 80GB GPUs (reported) and benchmarks on RTX 3090

Model Optimization

  • continued pretraining (efficient reuse of the base model)
  • SFT

System Optimization

  • benchmarks run with constrained decoding (Outlines) to avoid label hallucinations
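
The idea behind constrained decoding for classification can be illustrated without any library: the model may only answer with one of the allowed labels, so it cannot invent a label that does not exist. The sketch below is not the Outlines API. `pick_label` and `toy_score` are hypothetical names, and a real scorer would return the model's log-probability of each label given the prompt.

```python
from typing import Callable, Sequence


def pick_label(prompt: str, labels: Sequence[str],
               score_fn: Callable[[str, str], float]) -> str:
    """Return the allowed label with the highest score for this prompt."""
    if not labels:
        raise ValueError("labels must not be empty")
    return max(labels, key=lambda label: score_fn(prompt, label))


def toy_score(prompt: str, label: str) -> float:
    """Hypothetical scorer: counts cue words instead of using model log-probabilities."""
    cues = {"positief": ["mooi", "prachtig"], "negatief": ["saai", "slecht"]}
    return float(sum(prompt.lower().count(cue) for cue in cues[label]))


review = "Een prachtig boek, mooi geschreven."
print(pick_label(review, ["positief", "negatief"], toy_score))
```

Because the answer is always drawn from `labels`, every prediction can be scored, which keeps models that tend to ramble from being penalized for formatting alone.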

Training Optimization

  • use of alignment-handbook configs for reproducible runs
  • training in bfloat16 with FlashAttention2 enabled

Inference Optimization

  • quantized model versions published on Hugging Face

Reproducibility

Code Available

  • yes

Data Available

  • yes

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Started from Phi-2; newer base models quickly surpassed it, so adaptation lags behind new releases.
  • Training data covers Dutch CulturaX and Wikipedia only; lacks math and code content found in later models.
  • Instruction and preference datasets for Dutch are small and largely synthetic, limiting post-training quality.
  • Benchmarks are zero-shot only and use a single prompt template per task, which may under- or overestimate some models.
  • Machine-translated benchmarks (ARC, Global MMLU) can introduce translation biases ('translationese').

When Not To Use

  • If you need state-of-the-art Dutch accuracy today; recent small multilingual models may outperform Fietje.
  • For fluent Dutch text generation, unless you first post-train further on native conversational data.
  • When high-quality domain-specific code/math reasoning is required; Fietje lacks such pretraining mix.

Failure Modes

  • DPO preference tuning risks hallucinations or catastrophic forgetting if hyperparameters (beta) are mis-tuned.
  • Poor tokenization increases cost and can slow processing (high fertility with English-centric tokenizer).
  • Performance can vary by task: strong on knowledge/classification but weaker on word sense disambiguation (XLWIC).
  • Benchmarks using translated data may misrepresent true language fluency and cultural nuance.
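
To see where beta enters DPO, here is a minimal sketch of the standard DPO loss for a single preference pair, computed from sequence log-probabilities. It is the textbook objective, not Fietje's training code, and the summary does not report which beta values were used. The function and argument names are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float) -> float:
    """DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); split by sign for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Same log-probabilities, different beta: the loss changes with beta alone.
for beta in (0.01, 0.1, 0.5):
    loss = dpo_loss(-10.0, -14.0, -12.0, -12.0, beta)
    print(f"beta={beta}: loss={loss:.4f}")
```

Beta scales the log-probability margin between the chosen and rejected responses relative to the reference model, which is why a poorly chosen value can change training behavior noticeably.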

Core Entities

Models

  • Fietje (fietje-2b, instruct, chat)
  • Phi-2
  • GEITje-7B-ultra
  • Boreas-7B
  • Qwen2.5-3B-Instruct
  • Phi-3.5-mini-instruct
  • Llama-3.2-3B-Instruct
  • Tweety-7b-dutch-v24a
  • Mistral-7B-Instruct-v0.1

Metrics

  • weighted F1
  • wiki fertility (tokens per word)
  • tokens-per-second (wiki tps)
  • processing time (wiki s)
  • 95% confidence intervals
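
Weighted F1 averages the per-class F1 scores, weighting each class by how often it occurs in the gold labels. The following is a minimal pure-Python sketch of that standard definition; it is not the paper's evaluation code, and the function name is illustrative.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Per-class F1 averaged with weights equal to each class's gold count."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    support = Counter(y_true)        # gold count per class
    predicted = Counter(y_pred)      # predicted count per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = 0.0
    for label, n_true in support.items():
        # F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (predicted count + gold count)
        f1 = 2 * true_pos[label] / (predicted[label] + n_true)
        total += n_true * f1
    return total / len(y_true)


gold = ["neg", "neg", "pos", "pos"]
pred = ["neg", "pos", "pos", "pos"]
print(f"weighted F1: {weighted_f1(gold, pred):.4f}")
```

Weighting by class frequency matters on imbalanced tasks, where a plain average over classes would let a rare class dominate the score.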

Datasets

  • CulturaX (Dutch subset)
  • Dutch Wikipedia (Nov 2023 dump)
  • UltraChat 200K Dutch
  • No Robots Dutch
  • Belebele
  • UltraFeedback Dutch Cleaned
  • Orca DPO Pairs Dutch Cleaned

Benchmarks

  • Global MMLU (Dutch)
  • ARC (translated to Dutch)
  • DBRD (Dutch Book Reviews)
  • Dutch CoLA
  • XLWIC-NL (XL-WiC Dutch)

Context Entities

Models

  • Phi-3.5
  • Qwen 2.5 family
  • Mistral 7B family
  • GEITje 7B

Metrics

  • fertility comparisons
  • median ranking across tasks

Datasets

  • mC4 (subset used by Tweety)
  • SONAR-500 (quality baseline)

Benchmarks

  • ScandEval (referenced)
  • MMLU original (English)