Hamza: Turkish LLMs, adaptation vs from‑scratch, plus new Turkish benchmarks

May 7, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The experiments and released artifacts are practical and reproducible, but model sizes are small relative to state-of-the-art and some benchmark translations have limited inter-annotator agreement.

Citations: 2

Evidence Strength: 0.60

Confidence: 0.82

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 40%

Authors

Emre Can Acikgoz, Mete Erdogan, Deniz Yuret

Why It Matters For Business

If you need Turkish NLP quickly, adapting a strong English model (Mistral) gives better QA accuracy for a given compute budget than training medium-sized models from scratch; validated benchmarks let you measure real gains.

Who Should Care

Summary TLDR

The authors release the Hamza family of Turkish decoder LLMs (124M–1.3B params), build two validated Turkish benchmarks (ARC-TR, TruthfulQA-TR), and compare two routes for non-English LLMs: (A) adapt strong English-pretrained models with Turkish continued pretraining (using LoRA) and (B) train from scratch on Turkish data. Results show adapting a strong base (Mistral-7B) gives better Turkish QA accuracy under limited compute, but continued pretraining causes measurable catastrophic forgetting on English tasks. Instruction tuning with a Turkish Self-Instruct dataset yields modest gains. All code, checkpoints and datasets are released.

Problem Statement

Building good LLMs for under-served languages faces data scarcity, limited compute, missing benchmarks, and risks that adapting English models will erase prior knowledge. The paper tests practical strategies for Turkish: continued pretraining of English models vs training from scratch, and creates validated Turkish evaluation sets.

Main Contribution

Released Hamza LLM family (124M to 1.3B params) trained on Turkish data and published checkpoints and configs.

Compared two engineering paths: continued pretraining/adaptation of English base models (Mistral-7B, GPT2-xl) vs training from scratch (Hamza series).

Key Findings

Adapting a strong English base (Mistral-7B) to Turkish outperforms training Hamza models from scratch under the same resource constraints.

Numbers: Hamza Mistral avg accuracy 43.12 vs Hamza-xl 35.28 (ARC-TR & TruthfulQA-TR, Table 5)

Practical Use: If you have limited tokens/GPUs, adapt a high-quality English base (use LoRA/continued pretraining) rather than training a medium-size model from scratch.

Evidence Ref: Table 5
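To make the compute argument for LoRA concrete, the sketch below counts trainable parameters for a single weight matrix under full fine-tuning versus a rank-r LoRA update of the form W + (alpha/r)·B·A. The 4096×4096 shape and rank 16 are illustrative assumptions, not settings reported in the paper.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA update B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in);
    the original matrix W stays frozen.
    """
    return rank * (d_out + d_in)


# Assumed, illustrative values (not taken from the paper).
d_out, d_in, rank = 4096, 4096, 16

full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
ratio = lora / full

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {rank}): {lora:,} trainable parameters")
print(f"LoRA trains {ratio:.2%} of this matrix's parameters")
```

Under these assumed values, the adapter trains under 1% of the matrix, which is the kind of saving that makes adapting a 7B base feasible on a small budget.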

Continued pretraining on Turkish causes catastrophic forgetting of English abilities.

Numbers: Mistral-7B ARC 61.52 → Hamza Mistral (5GB) ARC 45.90 (drop ≈15.6 points, Table in Section 5.3)

Practical Use: When adapting, mix some English data or use adapter methods to reduce forgetting; monitor original-language benchmarks during adaptation.

Evidence Ref: Section 5.3 ablation table
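One way to act on the "mix some English data" suggestion is to interleave English documents into the Turkish pretraining stream at a fixed rate. The sketch below is a hypothetical helper, not the authors' pipeline; the function name `mix_streams` and the 10% English fraction in the usage lines are assumptions chosen for illustration.

```python
import random
from typing import Iterable, Iterator


def mix_streams(
    turkish: Iterable[str],
    english: Iterable[str],
    english_fraction: float,
    seed: int = 0,
) -> Iterator[str]:
    """Yield documents, drawing from `english` with probability
    `english_fraction` and from `turkish` otherwise.

    Stops as soon as the stream selected for the next draw is exhausted.
    """
    if not 0.0 <= english_fraction <= 1.0:
        raise ValueError("english_fraction must be in [0, 1]")
    rng = random.Random(seed)  # seeded for reproducible mixing
    tr_iter, en_iter = iter(turkish), iter(english)
    while True:
        source = en_iter if rng.random() < english_fraction else tr_iter
        try:
            yield next(source)
        except StopIteration:
            return


# Usage with toy documents; 0.1 is an assumed replay rate.
tr_docs = [f"tr-{i}" for i in range(20)]
en_docs = [f"en-{i}" for i in range(20)]
mixed = list(mix_streams(tr_docs, en_docs, english_fraction=0.1))
print(mixed[:10])
```

Whatever ratio is chosen, tracking an English benchmark such as ARC during adaptation shows whether the replay rate is actually limiting the drop.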

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | Hamza Mistral 39.85 | Hamza-xl 28.24 | +11.61 | ARC-TR (25-shot) | Table 4; Table 5 | Tables 4–5
Accuracy | Hamza Mistral 46.40 | Hamza-xl 42.33 | +4.07 | TruthfulQA-TR (6-shot) | Table 4; Table 5 | Tables 4–5
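The deltas and averages quoted in this summary can be checked with a few lines of arithmetic. The sketch below uses only the accuracy figures listed above; the helper names are illustrative.

```python
# Accuracy figures as listed in the results above.
scores = {
    "ARC-TR": {"Hamza Mistral": 39.85, "Hamza-xl": 28.24},
    "TruthfulQA-TR": {"Hamza Mistral": 46.40, "Hamza-xl": 42.33},
}


def delta(task: str, model: str, baseline: str) -> float:
    """Accuracy gain of `model` over `baseline` on one task, in points."""
    return round(scores[task][model] - scores[task][baseline], 2)


def mean_accuracy(model: str) -> float:
    """Unweighted mean accuracy of `model` across all listed tasks."""
    values = [task_scores[model] for task_scores in scores.values()]
    return sum(values) / len(values)


for task in scores:
    print(task, "delta:", delta(task, "Hamza Mistral", "Hamza-xl"))

for model in ("Hamza Mistral", "Hamza-xl"):
    print(model, "mean accuracy:", mean_accuracy(model))
```

The unrounded means of the two-decimal task scores are 43.125 and 35.285, which agree with the reported averages of 43.12 and 35.28 up to rounding of the inputs.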

What To Try In 7 Days

Run continued pretraining on a strong English base with a small Turkish split using LoRA adapters.

Evaluate with ARC-TR and TruthfulQA-TR and inspect translations flagged by annotators.

Apply supervised instruction tuning using the released 50.8k Turkish IT dataset and measure modest QA gains.
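For the evaluation step, a k-shot multiple-choice prompt and an accuracy score can be sketched as below. The prompt layout ("Soru:" for question, "Cevap:" for answer), the `Example` class, and the helper names are assumptions for illustration; the paper's exact evaluation harness and prompt format are not described in this summary.

```python
from dataclasses import dataclass
from typing import List, Sequence

LABELS = "ABCDEFGH"


@dataclass
class Example:
    question: str
    choices: List[str]
    answer: str  # gold label, e.g. "A"


def format_example(ex: Example, include_answer: bool) -> str:
    """Render one question; leave the answer slot open for the query."""
    lines = [f"Soru: {ex.question}"]
    lines += [f"{LABELS[i]}. {choice}" for i, choice in enumerate(ex.choices)]
    lines.append(f"Cevap: {ex.answer}" if include_answer else "Cevap:")
    return "\n".join(lines)


def build_prompt(shots: Sequence[Example], query: Example, k: int) -> str:
    """Concatenate the first k solved demonstrations and the open query."""
    demos = [format_example(s, include_answer=True) for s in shots[:k]]
    return "\n\n".join(demos + [format_example(query, include_answer=False)])


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Percentage of predictions that match the gold labels exactly."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and aligned")
    return 100.0 * sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Toy usage with a single demonstration.
shot = Example("Su kaç derecede kaynar?", ["50", "100"], "B")
query = Example("Buz kaç derecede erir?", ["0", "100"], "A")
print(build_prompt([shot], query, k=1))
```

For the settings listed in the results, k would be 25 for ARC-TR and 6 for TruthfulQA-TR.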

Optimization Features

Token Efficiency: Evaluation uses Bits-Per-Character to normalize tokenizer differences
Infra Optimization: Training on 8× A100 (80GB) GPUs for main Hamza models
Model Optimization: LoRA; flash-attention for faster training
System Optimization: Batch size and lr scaled per model; 1024-token context
Training Optimization: AdamW optimizer with cosine lr schedule; fp16 mixed precision; LoRA
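Bits-Per-Character divides the total negative log-likelihood of a text by its character count rather than its token count, which is why it is comparable across tokenizers. The sketch below shows the standard conversion from a summed NLL in nats; the function name and the toy numbers are illustrative, not values from the paper.

```python
import math


def bits_per_character(total_nll_nats: float, num_characters: int) -> float:
    """Convert a summed negative log-likelihood (in nats) over a text
    into bits per character of that text."""
    if num_characters <= 0:
        raise ValueError("num_characters must be positive")
    return total_nll_nats / (num_characters * math.log(2))


# Toy example: two tokenizers split the same 1,000-character text into
# different numbers of tokens. If the models assign the same total NLL,
# BPC is identical, while the per-token loss is not.
total_nll = 900.0        # assumed summed NLL in nats
num_chars = 1000
tokens_a, tokens_b = 250, 400

print("BPC:", bits_per_character(total_nll, num_chars))
print("per-token NLL, tokenizer A:", total_nll / tokens_a)
print("per-token NLL, tokenizer B:", total_nll / tokens_b)
```

Per-token perplexity would rank these two hypothetical models differently purely because of segmentation, which is the distortion BPC avoids.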

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

From-scratch Hamza models stop at 1.3B parameters; larger (>7B) models are likely needed to match English SOTA.

Continued pretraining on Turkish causes catastrophic forgetting of English skills.

When Not To Use

If you require state-of-the-art cross-lingual reasoning or English performance retention without mitigation.

If you need production-grade safety and robustness, which require extensive, high-quality multilingual data.

Failure Modes

Catastrophic forgetting of prior-language capabilities after continued pretraining.

Overfitting to web-scraped content leading to biased outputs.

Core Entities

Models

Hamza-small, Hamza-medium, Hamza-large, Hamza-xl, Hamza Mistral, Hamza GPT2-xl

Metrics

Accuracy, Bits-Per-Character (BPC), BPC (trnews-64)

Datasets

CulturaX (Turkish split), Self-Instruct Turkish IT (50.8k samples), TruthfulQA-TR, ARC-TR, trnews-64

Benchmarks

ARC-TR, TruthfulQA-TR, trnews-64 (BPC)

Context Entities

Models

Mistral-7B, GPT2-xl, Gemma 7B, Kanarya-2b, LLaMA2, Trendyol-7b

Metrics

Perplexity, Negative Log-Likelihood (NLL)

Datasets

mC4, OSCAR, Common Crawl (as part of mC4/OSCAR)