Overview
Production Readiness
0.4
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Targeted continual pretraining plus LoRA fine-tuning can give large in-domain translation gains with modest compute, enabling localized Urdu services without training from scratch.
Summary TLDR
This paper builds UrduLLaMA 1.0 by continually pretraining Llama-3.1-8B-Instruct on 128M curated Urdu tokens and then fine-tuning with LoRA on 41k Urdu instructions plus ~50k English–Urdu sentence pairs. On three translation test sets, UrduLLaMA improves BLEU over the base Llama model, especially in-domain, though a large multilingual translation model (seamless-m4t-v2-large) still leads on some general datasets. The work shows practical gains from targeted adaptation with limited compute but is constrained by its token budget, narrow evaluation, and lack of detoxification.
Problem Statement
Open LLMs underperform on low-resource languages like Urdu because training corpora lack sufficient, clean Urdu data and language-specific preprocessing. The paper asks whether modest continual pretraining plus targeted fine-tuning (using LoRA) can improve Urdu translation and instruction following with limited compute.
Main Contribution
Curated and preprocessed a 1.14B-token Urdu dataset (after filtering/deduplication) and used 128M tokens for continual pretraining.
Continually pretrained Llama-3.1-8B-Instruct on 128M Urdu tokens, then applied LoRA-based instruction tuning (41k instructions) and MT fine-tuning (~50k en-ur pairs).
Evaluated translation quality with BLEU on three test sets and with a blind human evaluation by two native linguists; reported clear in-domain BLEU gains over the base model.
Key Findings
UrduLLaMA 1.0 raises in-house MT BLEU from 10.87 to 28.01.
On general-domain test sets the gains are smaller, and seamless-m4t-v2-large still leads on some of them.
Human judgments (300 sentences) favored seamless-m4t-v2-large overall; UrduLLaMA improved versus the base model on some sets.
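The BLEU figures above are corpus-level scores. As a rough illustration of what the metric measures, here is a minimal, simplified corpus BLEU in Python. It assumes whitespace tokenization, one reference per sentence, uniform n-gram weights, and no smoothing. It is a sketch, not the scoring tool used in the paper, so its scores are not comparable with the reported 10.87 and 28.01.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU on a 0-100 scale.

    Assumptions: whitespace tokenization, one reference per hypothesis,
    uniform n-gram weights, no smoothing.
    """
    matched = [0] * max_n   # clipped n-gram matches per order
    total = [0] * max_n     # hypothesis n-gram counts per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matched[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            total[n - 1] += sum(h_counts.values())

    if hyp_len == 0 or min(matched) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0

    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty punishes hypotheses shorter than the references.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)
```

For scale, the reported in-house gain from 10.87 to 28.01 is 17.14 BLEU points, about 2.6 times the base model's score.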
Results
Metrics reported: BLEU on each of the three test sets, and human preference (count).
Who Should Care
What To Try In 7 Days
Collect a small, domain-focused Urdu corpus and run the paper's preprocessing (language filtering, normalization, dedup).
Apply LoRA to an open LLaMA-style 7–8B checkpoint for instruction tuning using ~10k–50k translated/task examples.
Fine-tune on a modest in-domain parallel set and evaluate with BLEU plus a 100-sentence blind human check.
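The preprocessing steps named above (language filtering, normalization, deduplication) can be sketched with the Python standard library alone. The function names, the 0.6 threshold, and the script-ratio heuristic are all assumptions made for illustration, not the paper's pipeline. A script-ratio test cannot tell Urdu apart from other Arabic-script languages such as Arabic or Persian, so a real pipeline would need a proper language identifier.

```python
import hashlib
import re
import unicodedata


def normalize(text):
    """NFC-normalize Unicode and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def arabic_script_ratio(text):
    """Fraction of alphabetic characters in the main Arabic block (U+0600-U+06FF)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_block = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return in_block / len(letters)


def preprocess(docs, min_ratio=0.6):
    """Normalize, filter by script ratio, and drop exact duplicates.

    min_ratio is a hypothetical threshold chosen for illustration.
    """
    seen = set()
    kept = []
    for doc in docs:
        text = normalize(doc)
        if arabic_script_ratio(text) < min_ratio:
            continue  # mostly non-Arabic-script text: treat as not Urdu
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(text)
    return kept
```

Hashing the normalized text only catches exact duplicates; near-duplicate detection (for example MinHash) would be a separate step.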
Optimization Features
Token Efficiency
- Pretrained on a 128M-token subset (resource-constrained setup)
Infra Optimization
- LoRA
Model Optimization
- LoRA
- Full fine-tuning with activation checkpointing and memory-efficient FSDP wrapping
System Optimization
- Activation offloading and checkpointing to fit the model on limited GPUs
Training Optimization
- Used 128M tokens for continual pretraining to limit compute
- LoRA
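To give a sense of why LoRA keeps training cheap, the sketch below counts trainable parameters for a single adapted weight matrix. The 4096 x 4096 shape corresponds to an attention query projection in Llama-3.1-8B; the rank of 16 is an assumed value for illustration, since this summary does not state the rank used.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable parameter counts for one weight matrix.

    LoRA freezes the d_out x d_in weight W and trains two low-rank factors,
    B (d_out x rank) and A (rank x d_in), so the learned update is B @ A.
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# 4096 x 4096 projection; rank=16 is a hypothetical choice for illustration.
full, lora = lora_param_counts(4096, 4096, rank=16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

Under these assumptions the adapter trains 131,072 parameters against 16,777,216 in the frozen matrix, which is 1/128 of the full count (about 0.78%).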
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Continual pretraining used only 128M tokens due to compute limits; coverage is incomplete.
- Detoxification was not applied, so the model can produce harmful or offensive outputs.
- Evaluation focused on translation and a small human sample; other capabilities untested.
- In-house MT data is private; results may favor models tuned on similar domains.
When Not To Use
- For safety-critical or moderated deployments without detox controls.
- As a drop-in replacement for general-purpose multilingual translation where broad domain coverage is needed.
- When legal/privacy guarantees require fully public, auditable training data.
Failure Modes
- Generates offensive or harmful content because detox was not applied.
- Underperforms on out-of-domain or culturally nuanced Urdu content due to limited pretraining coverage.
- Possible memorization of web-scraped content if deduplication missed cases.
Core Entities
Models
- UrduLLaMA 1.0
- Llama-3.1-8B-Instruct
- seamless-m4t-v2-large
- opus-mt-en-ur
Metrics
- BLEU
- Human preference counts
Datasets
- UrduLLaMA curated dataset (1.14B tokens after processing)
- In-house MT corpus (62,970 entries; 50,376 train)
- TICO-19
- Tatoeba Challenge
- CC-100 (Urdu)
- OSCAR (Urdu)
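The dataset figures above imply two ratios worth making explicit: the share of the in-house MT corpus used for training, and the share of the curated corpus actually used for continual pretraining. The check below uses only numbers quoted in this summary; how the held-out entries are divided between validation and test is not stated here.

```python
# Figures quoted in this summary.
mt_total, mt_train = 62_970, 50_376
curated_tokens, pretrain_tokens = 1.14e9, 128e6

train_share = mt_train / mt_total                  # fraction of MT entries used for training
held_out = mt_total - mt_train                     # entries not used for training
pretrain_share = pretrain_tokens / curated_tokens  # fraction of curated tokens used

print(f"MT train share: {train_share:.1%}, held out: {held_out:,}")
print(f"Pretraining used {pretrain_share:.1%} of the curated tokens")
```

The training split is 80% of the MT corpus, leaving 12,594 entries held out, and continual pretraining used roughly 11% of the 1.14B curated tokens.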
Benchmarks
- In-house MT test split
- TICO-19
- Tatoeba Challenge
Context Entities
Models
- LLaMA 3.1 family
- seamless-m4t-v2-large
- opus-mt
Metrics
- BLEU
Datasets
- Alpaca (translated Urdu subset)
- Dolly (translated Urdu subset)
- XLSum Urdu

