Overview
Production Readiness
0.4
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
2
Why It Matters For Business
If you need Turkish NLP quickly, adapting a strong English model (Mistral) gives better QA accuracy for a given compute budget than training medium-sized models from scratch; validated benchmarks let you measure real gains.
Summary TLDR
The authors release the Hamza family of Turkish decoder LLMs (124M–1.3B params), build two validated Turkish benchmarks (ARC-TR, TruthfulQA-TR), and compare two routes for non-English LLMs: (A) adapt strong English-pretrained models with Turkish continued pretraining (using LoRA) and (B) train from scratch on Turkish data. Results show adapting a strong base (Mistral-7B) gives better Turkish QA accuracy under limited compute, but continued pretraining causes measurable catastrophic forgetting on English tasks. Instruction tuning with a Turkish Self-Instruct dataset yields modest gains. All code, checkpoints and datasets are released.
Problem Statement
Building good LLMs for under-served languages faces data scarcity, limited compute, missing benchmarks, and risks that adapting English models will erase prior knowledge. The paper tests practical strategies for Turkish: continued pretraining of English models vs training from scratch, and creates validated Turkish evaluation sets.
Main Contribution
Released Hamza LLM family (124M to 1.3B params) trained on Turkish data and published checkpoints and configs.
Compared two engineering paths: continued pretraining/adaptation of English base models (Mistral-7B, GPT2-xl) vs training from scratch (Hamza series).
Created and human-validated Turkish benchmarks TruthfulQA-TR and ARC-TR and launched a Turkish leaderboard.
Built a Turkish Self-Instruct instruction‑tuning dataset (≈50.8k samples) and measured the SFT impact.
Key Findings
Adapting a strong English base (Mistral-7B) to Turkish outperforms training Hamza models from scratch under the same resource constraints.
Continued pretraining on Turkish causes catastrophic forgetting of English abilities.
Supervised instruction‑tuning with the Turkish Self-Instruct dataset yields modest downstream gains.
The authors provide validated Turkish benchmarks but translation reliability varies.
Results
Accuracy
Accuracy
BPC (trnews-64)
Accuracy
Who Should Care
What To Try In 7 Days
Run continued pretraining on a strong English base with a small Turkish split using LoRA adapters.
Evaluate with ARC-TR and TruthfulQA-TR and inspect translations flagged by annotators.
Apply supervised instruction tuning using the released 50.8k Turkish IT dataset and measure modest QA gains.
Optimization Features
Token Efficiency
- Evaluation uses Bits-Per-Character to normalize tokenizer differences
Infra Optimization
- Training on 8× A100 (80GB) GPUs for main Hamza models
Model Optimization
- LoRA
- flash-attention for faster training
System Optimization
- Batch-size and lr scaled per model; 1024 token context
Training Optimization
- AdamW optimizer with cosine lr schedule
- fp16 mixed precision
- LoRA
Reproducibility
Data Urls
- https://github.com/emrecanacikgoz/Bridging-the-Bosphorus (paper page)
- https://huggingface.co/datasets/uonlp/CulturaX
- DeepL API used for initial translation (not a dataset URL)
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Model sizes stop at 1.3B; larger (>7B) models likely needed to match English SOTA.
- Continued pretraining on Turkish causes catastrophic forgetting of English skills.
- TruthfulQA-TR translation reliability is low (Fleiss' Kappa 0.17); some items need rework.
- Pretraining data diversity is limited and authors recommend far more Turkish tokens for parity.
When Not To Use
- If you require state-of-the-art cross-lingual reasoning or English performance retention without mitigation.
- When production-grade safety and robustness need extensive, high-quality multilingual data.
Failure Modes
- Catastrophic forgetting of prior-language capabilities after continued pretraining.
- Overfitting to web-scraped content leading to biased outputs.
- Mid-generation language switching when adapting monolingual outputs (English→Turkish).
Core Entities
Models
- Hamza-small
- Hamza-medium
- Hamza-large
- Hamza-xl
- Hamza Mistral
- Hamza GPT2-xl
Metrics
- Accuracy
- Bits-Per-Character (BPC)
- BPC (trnews-64)
Datasets
- CulturaX (Turkish split)
- Self-Instruct Turkish IT (50.8k samples)
- TruthfulQA-TR
- ARC-TR
- trnews-64
Benchmarks
- ARC-TR
- TruthfulQA-TR
- trnews-64 (BPC)
Context Entities
Models
- Mistral-7B
- GPT2-xl
- Gemma 7B
- Kanarya-2b
- LLaMA2
- Trendyol-7b
Metrics
- Perplexity
- Negative Log-Likelihood (NLL)
Datasets
- mC4
- OSCAR
- Common Crawl (as part of mC4/OSCAR)

