Two Dutch-tuned Llama 2 models, translated instruction datasets, and a Dutch leaderboard to jumpstart Dutch LLM work

December 20, 2023 · 7 min

Overview

Production Readiness

0.45

Novelty Score

0.25

Cost Impact Score

0.4

Citation Count

2

Authors

Bram Vanroy

Links

Abstract / PDF

Why It Matters For Business

If you need Dutch-capable LLMs quickly, this work gives deployable models, translated instruction datasets and quantised weights so teams can iterate without building corpora from scratch.

Summary TLDR

This paper releases two Llama 2 13B finetunes for Dutch (a text-completion and a chat variant), four translated instruction/chat datasets (Dolly, Quora, Stack Overflow, Alpaca), translation scripts, quantised chat weights and a community leaderboard. Models were trained with limited compute (QLoRA, LoRA on a subset of layers) and evaluated on translated Dutch versions of ARC, HellaSwag, MMLU and TruthfulQA. Benchmarks show modest numeric gains from Dutch finetuning but clear practical improvement for conversational Dutch. All assets are published on Hugging Face.

Problem Statement

Dutch is underrepresented in LLMs: few pretrained Dutch models, scarce Dutch instruction/chat datasets and no dedicated generative-model leaderboards. This paper provides models, translated instruction datasets, translation code, and a Dutch-focused leaderboard to lower the entry bar for Dutch LLM work.

Main Contribution

Two Llama 2 13B finetuned models for Dutch: a causal text model and a chat (instruction-following) model; weights and quantised chat variants released.

Four translated instruction/chat datasets (Dolly-15k, Quora-chat, StackOverflow-chat, Alpaca-cleaned) plus scripts to translate datasets using local HF models or Azure OpenAI.

A public, continuously updated Dutch generative-model leaderboard with benchmark results and model metadata.

Practical notes on training with limited compute (QLoRA / LoRA, flash attention) and evaluation caveats for translated benchmarks.
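To give a sense of why LoRA fits a limited compute budget, the sketch below counts the trainable parameters added when only the q_proj and v_proj projections of Llama 2 13B are adapted (the setup named in the limitations). The model dimensions come from the public Llama 2 13B configuration. The rank values are assumptions for illustration, since this summary does not state the rank that was used.

```python
# Llama 2 13B dimensions, taken from the public model configuration.
HIDDEN_SIZE = 5120
NUM_LAYERS = 40


def lora_trainable_params(rank: int, target_modules: int = 2) -> int:
    """Count LoRA parameters when adapting square (hidden x hidden) projections.

    Each adapted matrix gets two low-rank factors:
    A with shape (rank, hidden) and B with shape (hidden, rank).
    """
    per_matrix = rank * (HIDDEN_SIZE + HIDDEN_SIZE)
    return per_matrix * target_modules * NUM_LAYERS


# Rank values are hypothetical; the summary does not report the rank used.
for rank in (8, 16, 64):
    n = lora_trainable_params(rank)
    print(f"rank={rank:>2}: {n:,} trainable parameters ({n / 13e9:.4%} of ~13B)")
```

Even at the largest rank shown, the adapter stays well under one percent of the base model's parameters, which is the main reason this style of finetuning is feasible in roughly a hundred GPU hours.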

Key Findings

Two Dutch-tuned Llama 2 13B models were released: a text-completion model and a chat model.

Numbers: Finetune compute 120 GPU hours (text), ~55 GPU hours (chat)

Four instruction/chat datasets were translated to Dutch and published.

Numbers: Dataset sizes Dolly 15k, Quora ~55k, StackOverflow ~56.9k, Alpaca ~51.7k

On translated Dutch benchmarks, Mistral-based models (e.g., Zephyr) outperform Llama 2 based models.

Numbers: Zephyr avg 0.49 vs Llama2 13B variants ~0.43–0.44 (Table 1)

Finetuning Llama 2 on Dutch slightly changed benchmark scores but improved conversational behavior.

Numbers: TruthfulQA chat-dutch 0.44 vs base chat-hf 0.43; avg changes small (~0.01–0.02)

Small Dutch-only GPT models with short context windows perform worse on few/multi-shot tasks.

Numbers: gpt2-medium-dutch avg 0.30 vs Llama2-based ~0.44; context 512 vs 4096/8192 tokens
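The link between context length and few-shot performance can be made concrete with a little arithmetic. The sketch below estimates how many in-context examples fit in a prompt before the final question. The per-example and per-question token counts are hypothetical values chosen for illustration, not measurements from the benchmarks.

```python
def max_shots(context_window: int, tokens_per_example: int,
              question_tokens: int, answer_budget: int = 1) -> int:
    """Return how many few-shot examples fit in front of the final question."""
    available = context_window - question_tokens - answer_budget
    return max(0, available // tokens_per_example)


# Hypothetical token counts, for illustration only.
TOKENS_PER_EXAMPLE = 80
QUESTION_TOKENS = 80

for window in (512, 4096):
    shots = max_shots(window, TOKENS_PER_EXAMPLE, QUESTION_TOKENS)
    print(f"context={window}: room for {shots} examples (ARC is run 25-shot)")
```

Under these assumptions a 512-token model cannot hold a 25-shot prompt, so the prompt has to be truncated, while a 4096-token model has room to spare.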

Results

Avg. leaderboard score

Value: 0.44

Baseline: zephyr-7b-beta 0.49

TruthfulQA (0-shot)

Value: 0.44

Baseline: llama-2-13b-chat-hf 0.43

ARC (25-shot)

Value: 0.40

Baseline: zephyr-7b-beta 0.43
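For a quick read of these results, the following sketch computes the absolute and relative gap between each reported value and its baseline. The numbers are copied from the results above; the helper name `gap` is introduced here for illustration only.

```python
# (metric, reported value, baseline model, baseline score), copied from the results.
results = [
    ("Avg. leaderboard score", 0.44, "zephyr-7b-beta", 0.49),
    ("TruthfulQA (0-shot)", 0.44, "llama-2-13b-chat-hf", 0.43),
    ("ARC (25-shot)", 0.40, "zephyr-7b-beta", 0.43),
]


def gap(value: float, baseline: float) -> tuple[float, float]:
    """Return (absolute gap, gap relative to the baseline), both rounded."""
    absolute = round(value - baseline, 2)
    relative = round((value - baseline) / baseline, 3)
    return absolute, relative


for metric, value, model, baseline in results:
    absolute, relative = gap(value, baseline)
    print(f"{metric}: {absolute:+.2f} vs {model} ({relative:+.1%})")
```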

Who Should Care

What To Try In 7 Days

Download the chat-dutch quantised model and run a short user chat test for Dutch UX.

Use the translated Dolly/Alpaca datasets to fine-tune or instruction-tune a small model for a domain-specific Dutch assistant.

Benchmark your model on the provided leaderboard harness to compare against public Dutch and multilingual models.

Agent Features

Architectures

  • decoder-only

Optimization Features

Model Optimization

  • 4-bit quantization
  • LoRA

System Optimization

  • Fine-tune for full 4096 context coverage

Training Optimization

  • LoRA
  • FlashAttention

Inference Optimization

  • Quantised chat weights for efficient deployment

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • All translated instruction datasets were auto-translated with gpt-3.5-turbo and not manually validated.
  • Finetuning used parameter-efficient LoRA on only the q_proj and v_proj layers, not a full finetune.
  • Benchmarks are automatic translations; translationese may bias results toward English-trained models.
  • No statistical significance or confidence intervals reported for benchmark comparisons.
  • Training was compute-limited (120 and ~55 GPU hours), so improvements may be conservative.
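To illustrate why the missing confidence intervals matter, the sketch below applies the normal approximation for a proportion to the reported TruthfulQA scores. It treats the score as a simple accuracy over independent items, which is a simplifying assumption, and `N_ITEMS` is a hypothetical benchmark size rather than a figure from the paper.

```python
import math


def accuracy_ci_half_width(accuracy: float, n_items: int, z: float = 1.96) -> float:
    """Normal-approximation half-width of a confidence interval for an accuracy.

    z = 1.96 corresponds to a 95% interval.
    """
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_items)
    return z * standard_error


# Hypothetical benchmark size, chosen only for illustration.
N_ITEMS = 800
# Reported TruthfulQA scores: chat-dutch 0.44 vs base chat-hf 0.43.
reported_gap = 0.44 - 0.43

half_width = accuracy_ci_half_width(0.44, N_ITEMS)
print(f"95% CI half-width at n={N_ITEMS}: +/-{half_width:.3f}")
print("gap exceeds the interval half-width:", reported_gap > half_width)
```

Under these assumptions the interval is several times wider than the reported gap, so differences of 0.01 to 0.02 between models should not be read as a ranking.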

When Not To Use

  • For high-stakes Dutch generation where fully validated fluency and factuality are required.
  • If you need the absolute best benchmark scores across reasoning and multi-domain factual tasks.

Failure Modes

  • Model may switch back to English mid-conversation.
  • Model may hallucinate nonexistent Dutch words or produce poor morphology and word order.
  • Performance drops on few/multi-shot tasks for models with small context windows (512 tokens).
  • Translated benchmarks may overstate non-Dutch model strengths due to translationese.

Core Entities

Models

  • llama2-13b-ft-mc4_nl_cleaned_tiny
  • Llama-2-13b-chat-dutch
  • llama-2-13b-chat-hf
  • llama-2-13b-hf
  • zephyr-7b-beta
  • geitje-7b
  • geitje-7b-chat
  • mistral-7b-v0.1
  • neural-chat-7b-v3-1
  • orca-2-13b
  • orca-2-7b
  • gpt2-medium-dutch
  • gpt-neo-1.3b-dutch

Metrics

  • average_score
  • Accuracy

Datasets

  • mc4_nl_cleaned_tiny
  • dolly-15k-dutch
  • quora-chat-dutch
  • stackoverflow-chat-dutch
  • alpaca-cleaned-dutch

Benchmarks

  • ARC
  • HellaSwag
  • MMLU
  • TruthfulQA