Hamza: Turkish LLMs, adaptation vs from‑scratch, plus new Turkish benchmarks

May 7, 20247 min

Overview

Production Readiness

0.4

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

2

Authors

Emre Can Acikgoz, Mete Erdogan, Deniz Yuret

Links

Abstract / PDF

Why It Matters For Business

If you need Turkish NLP quickly, adapting a strong English model (Mistral) gives better QA accuracy for a given compute budget than training medium-sized models from scratch; validated benchmarks let you measure real gains.

Summary TLDR

The authors release the Hamza family of Turkish decoder LLMs (124M–1.3B params), build two validated Turkish benchmarks (ARC-TR, TruthfulQA-TR), and compare two routes for non-English LLMs: (A) adapt strong English-pretrained models with Turkish continued pretraining (using LoRA) and (B) train from scratch on Turkish data. Results show adapting a strong base (Mistral-7B) gives better Turkish QA accuracy under limited compute, but continued pretraining causes measurable catastrophic forgetting on English tasks. Instruction tuning with a Turkish Self-Instruct dataset yields modest gains. All code, checkpoints and datasets are released.

Problem Statement

Building good LLMs for under-served languages faces data scarcity, limited compute, missing benchmarks, and risks that adapting English models will erase prior knowledge. The paper tests practical strategies for Turkish: continued pretraining of English models vs training from scratch, and creates validated Turkish evaluation sets.

Main Contribution

Released Hamza LLM family (124M to 1.3B params) trained on Turkish data and published checkpoints and configs.

Compared two engineering paths: continued pretraining/adaptation of English base models (Mistral-7B, GPT2-xl) vs training from scratch (Hamza series).

Created and human-validated Turkish benchmarks TruthfulQA-TR and ARC-TR and launched a Turkish leaderboard.

Built a Turkish Self-Instruct instruction‑tuning dataset (≈50.8k samples) and measured the SFT impact.

Key Findings

Adapting a strong English base (Mistral-7B) to Turkish outperforms training Hamza models from scratch under the same resource constraints.

NumbersHamza Mistral avg accuracy 43.12 vs Hamza-xl 35.28 (ARC-TR & TruthfulQA-TR, Table 5)

Continued pretraining on Turkish causes catastrophic forgetting of English abilities.

NumbersMistral-7B ARC 61.52 → Hamza Mistral (5GB) ARC 45.90 (drop ≈15.6 points, Table in Section 5.3)

Supervised instruction‑tuning with the Turkish Self-Instruct dataset yields modest downstream gains.

NumbersHamza-xlarge avg 35.28 → +SFT avg 37.14 (+1.86 points, Table 6)

The authors provide validated Turkish benchmarks but translation reliability varies.

NumbersARC Fleiss' Kappa 0.44; TruthfulQA Fleiss' Kappa 0.17 (annotation tables G)

Results

Accuracy

ValueHamza Mistral 39.85

BaselineHamza-xl 28.24

Accuracy

ValueHamza Mistral 46.40

BaselineHamza-xl 42.33

BPC (trnews-64)

ValueHamza-xlarge 0.754

BaselineKanarya-2b 0.724 (best shown)

Accuracy

ValueHamza-xlarge + SFT avg 37.14

BaselineHamza-xlarge avg 35.28

Who Should Care

What To Try In 7 Days

Run continued pretraining on a strong English base with a small Turkish split using LoRA adapters.

Evaluate with ARC-TR and TruthfulQA-TR and inspect translations flagged by annotators.

Apply supervised instruction tuning using the released 50.8k Turkish IT dataset and measure modest QA gains.

Optimization Features

Token Efficiency

  • Evaluation uses Bits-Per-Character to normalize tokenizer differences

Infra Optimization

  • Training on 8× A100 (80GB) GPUs for main Hamza models

Model Optimization

  • LoRA
  • flash-attention for faster training

System Optimization

  • Batch-size and lr scaled per model; 1024 token context

Training Optimization

  • AdamW optimizer with cosine lr schedule
  • fp16 mixed precision
  • LoRA

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Model sizes stop at 1.3B; larger (>7B) models likely needed to match English SOTA.
  • Continued pretraining on Turkish causes catastrophic forgetting of English skills.
  • TruthfulQA-TR translation reliability is low (Fleiss' Kappa 0.17); some items need rework.
  • Pretraining data diversity is limited and authors recommend far more Turkish tokens for parity.

When Not To Use

  • If you require state-of-the-art cross-lingual reasoning or English performance retention without mitigation.
  • When production-grade safety and robustness need extensive, high-quality multilingual data.

Failure Modes

  • Catastrophic forgetting of prior-language capabilities after continued pretraining.
  • Overfitting to web-scraped content leading to biased outputs.
  • Mid-generation language switching when adapting monolingual outputs (English→Turkish).

Core Entities

Models

  • Hamza-small
  • Hamza-medium
  • Hamza-large
  • Hamza-xl
  • Hamza Mistral
  • Hamza GPT2-xl

Metrics

  • Accuracy
  • Bits-Per-Character (BPC)
  • BPC (trnews-64)

Datasets

  • CulturaX (Turkish split)
  • Self-Instruct Turkish IT (50.8k samples)
  • TruthfulQA-TR
  • ARC-TR
  • trnews-64

Benchmarks

  • ARC-TR
  • TruthfulQA-TR
  • trnews-64 (BPC)

Context Entities

Models

  • Mistral-7B
  • GPT2-xl
  • Gemma 7B
  • Kanarya-2b
  • LLaMA2
  • Trendyol-7b

Metrics

  • Perplexity
  • Negative Log-Likelihood (NLL)

Datasets

  • mC4
  • OSCAR
  • Common Crawl (as part of mC4/OSCAR)