Overview
The experiments and released artifacts are practical and reproducible, but the models are small (124M to 1.3B parameters) relative to the state of the art, and some benchmark translations have limited inter-annotator agreement.
Citations: 2
Evidence Strength: 0.60
Confidence: 0.82
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 40%
Novelty: 40%
Why It Matters For Business
If you need Turkish NLP quickly, adapting a strong English model (Mistral-7B) gives better QA accuracy for a given compute budget than training medium-sized models from scratch, and the validated benchmarks let you measure real gains.
Summary TLDR
The authors release the Hamza family of Turkish decoder LLMs (124M–1.3B params), build two validated Turkish benchmarks (ARC-TR, TruthfulQA-TR), and compare two routes for non-English LLMs: (A) adapt strong English-pretrained models with Turkish continued pretraining (using LoRA) and (B) train from scratch on Turkish data. Results show adapting a strong base (Mistral-7B) gives better Turkish QA accuracy under limited compute, but continued pretraining causes measurable catastrophic forgetting on English tasks. Instruction tuning with a Turkish Self-Instruct dataset yields modest gains. All code, checkpoints and datasets are released.
Problem Statement
Building good LLMs for under-served languages faces data scarcity, limited compute, missing benchmarks, and risks that adapting English models will erase prior knowledge. The paper tests practical strategies for Turkish: continued pretraining of English models vs training from scratch, and creates validated Turkish evaluation sets.
Main Contribution
Released Hamza LLM family (124M to 1.3B params) trained on Turkish data and published checkpoints and configs.
Compared two engineering paths: continued pretraining/adaptation of English base models (Mistral-7B, GPT2-xl) vs training from scratch (Hamza series).
Key Findings
Adapting a strong English base (Mistral-7B) to Turkish outperforms training Hamza models from scratch under the same resource constraints.
Continued pretraining on Turkish causes catastrophic forgetting of English abilities.
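One simple way to quantify the forgetting described above is to score the same English benchmark before and after Turkish continued pretraining and report the drop. The sketch below assumes accuracy on a 0 to 100 scale. The function name and the example scores are hypothetical placeholders, not results from the paper.

```python
def forgetting(before: float, after: float) -> dict:
    """Summarize the accuracy lost on an English benchmark after adaptation.

    before: accuracy of the English base model (0-100).
    after:  accuracy of the same model after Turkish continued pretraining.
    """
    drop = before - after
    # Guard against division by zero when the base score is 0.
    relative = drop / before if before else 0.0
    return {
        "absolute_drop": round(drop, 2),
        "relative_drop": round(relative, 4),
    }


# Placeholder scores for illustration only; they are NOT taken from the paper.
report = forgetting(before=60.0, after=45.0)
print(report)
```

A positive `absolute_drop` indicates forgetting, while a value near zero suggests the English ability was retained.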
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | Hamza Mistral 39.85 | Hamza-xl 28.24 | +11.61 | ARC-TR (25-shot) | Table 4; Table 5 | Tables 4–5 |
| Accuracy | Hamza Mistral 46.40 | Hamza-xl 42.33 | +4.07 | TruthfulQA-TR (6-shot) | Table 4; Table 5 | Tables 4–5 |
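The Delta column is the plain difference between the two accuracy figures in each row. A minimal sketch that recomputes it from the values in the table follows; the variable and function names are illustrative and do not come from the paper's code.

```python
# Accuracy figures copied from the table above:
# (adapted model "Hamza Mistral", from-scratch baseline "Hamza-xl").
results = {
    "ARC-TR (25-shot)": (39.85, 28.24),
    "TruthfulQA-TR (6-shot)": (46.40, 42.33),
}


def accuracy_delta(value: float, baseline: float) -> float:
    """Return value minus baseline, rounded to two decimals like the table."""
    return round(value - baseline, 2)


deltas = {name: accuracy_delta(v, b) for name, (v, b) in results.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```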
What To Try In 7 Days
Run continued pretraining on a strong English base with a small Turkish split using LoRA adapters.
Evaluate with ARC-TR and TruthfulQA-TR and inspect translations flagged by annotators.
Apply supervised instruction tuning using the released 50.8k-example Turkish instruction-tuning (IT) dataset and check whether QA accuracy improves; the paper reports only modest gains.
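To see why LoRA adapters suit a limited compute budget, it helps to count trainable parameters. LoRA freezes a weight matrix of shape d_out × d_in and trains a low-rank update B·A, where B has shape d_out × r and A has shape r × d_in. The sketch below is illustrative arithmetic only: the layer dimensions (4096 × 4096) and the rank (8) are assumed values, not the configuration reported in the paper.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA update B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in).
    """
    return rank * (d_out + d_in)


# Assumed example values, chosen for illustration only.
d_out, d_in, rank = 4096, 4096, 8

full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
ratio = lora / full

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {rank}): {lora:,} trainable parameters")
print(f"LoRA trains {ratio:.2%} of the layer's parameters")
```

Under these assumed values the adapter trains well under 1% of the layer's parameters, which is the reason this route fits a small compute budget.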
Risks & Boundaries
Limitations
Model sizes stop at 1.3B parameters; larger (>7B) models are likely needed to match English SOTA.
Continued pretraining on Turkish causes catastrophic forgetting of English skills.
When Not To Use
When you require state-of-the-art cross-lingual reasoning, or retention of English performance without mitigation.
When you need production-grade safety and robustness, which call for extensive, high-quality multilingual data.
Failure Modes
Catastrophic forgetting of prior-language capabilities after continued pretraining.
Overfitting to web-scraped content leading to biased outputs.

