UrduLLaMA 1.0: fine-tuning LLaMA-3.1 for Urdu with 128M tokens and LoRA

February 24, 2025 · 6 min read

Overview

Decision Snapshot: Needs Validation

Demonstrates practical gains from modest continual pretraining plus LoRA on in-domain Urdu translation, but results are limited by small pretraining budget, narrow test scope, and missing detox; larger or more diverse data could change outcomes.

Citations: 0

Evidence Strength: 0.65

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 50%

Authors

Layba Fiaz, Munief Hassan Tahir, Sana Shams, Sarmad Hussain

Why It Matters For Business

Targeted continual pretraining plus LoRA fine-tuning can give large in-domain translation gains with modest compute, enabling localized Urdu services without training from scratch.

Summary TLDR

This paper builds UrduLLaMA 1.0 by continually pretraining Llama-3.1-8B-Instruct on 128M curated Urdu tokens and then fine-tuning with LoRA on 41k Urdu instructions plus ~50k English–Urdu sentence pairs. On three translation test sets, UrduLLaMA improves BLEU over the base Llama model, especially in-domain, though a large multilingual translation model (seamless-m4t-v2-large) still leads on some general datasets. The work shows practical gains from targeted adaptation with limited compute but is constrained by its small token budget, narrow evaluation, and the lack of detoxification.

Problem Statement

Open LLMs underperform on low-resource languages like Urdu because training corpora lack sufficient, clean Urdu data and language-specific preprocessing. The paper asks whether modest continual pretraining plus targeted fine-tuning (using LoRA) can improve Urdu translation and instruction following with limited compute.

Main Contribution

Curated and preprocessed a 1.14B-token Urdu dataset (after filtering/deduplication) and used 128M tokens for continual pretraining.

Continual pretraining of LLaMA-3.1-8B-Instruct on Urdu (128M tokens) followed by LoRA-based instruction tuning (41k instructions) and MT fine-tuning (~50k en-ur pairs).
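
To give a feel for why the LoRA stage is cheap, the sketch below counts the trainable parameters LoRA would add to the attention projections of an 8B Llama-style model. The rank (16) and the choice of target modules are assumptions for illustration only; this summary does not state the paper's LoRA hyperparameters. The projection shapes follow the public Llama-3.1-8B configuration and should be treated as assumptions if your checkpoint differs.

```python
# Hypothetical LoRA settings: the paper's rank and target modules are not given here.
LORA_RANK = 16
NUM_LAYERS = 32  # decoder layers in an 8B Llama-style model (assumed)

# (out_features, in_features) for each attention projection (assumed shapes).
PROJ_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (1024, 4096),
    "v_proj": (1024, 4096),
    "o_proj": (4096, 4096),
}


def lora_params(out_features, in_features, rank):
    """LoRA trains A (rank x in) and B (out x rank) beside a frozen weight."""
    return rank * (in_features + out_features)


def total_lora_params(targets, rank=LORA_RANK, layers=NUM_LAYERS):
    """Sum the LoRA parameters over the chosen projections in every layer."""
    per_layer = sum(lora_params(*PROJ_SHAPES[name], rank) for name in targets)
    return per_layer * layers


if __name__ == "__main__":
    n = total_lora_params(["q_proj", "v_proj"])
    print(f"trainable LoRA parameters: {n:,}")
```

Under these assumed settings the adapters hold a few million parameters, well under 1% of the 8B base weights, which is what makes the instruction-tuning and MT stages feasible on limited hardware.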

Key Findings

UrduLLaMA 1.0 raises in-house MT BLEU from 10.87 to 28.01.

Numbers: BLEU 28.01 vs 10.87 (Table 6)

Practical Use: Fine-tuning a general LLaMA model on domain-specific Urdu data raised in-domain BLEU about 2.6x (10.87 to 28.01); apply targeted domain data to boost in-domain translation quickly.

Evidence Ref: Table 6

On general-domain test sets the gains are smaller and a large multilingual model can still win.

Numbers: TICO-19 BLEU 13.12 (UrduLLaMA) vs 19.22 (seamless-m4t-v2-large)

Practical Use: Continual pretraining + LoRA helps, but for broad general-domain translation consider large multilingual translation models or add more diverse pretraining data.

Evidence Ref: Table 6
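
Since every finding above is stated in BLEU, a minimal sketch of how the metric is computed may help when reading the numbers. This is a simplified corpus-level BLEU (one reference per sentence, whitespace tokenization, no smoothing), not the paper's evaluation script; the function names are illustrative, and a maintained tool such as sacreBLEU is the better choice for reportable scores.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU on a 0-100 scale (single reference, no smoothing)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes hypotheses shorter than the references.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    hyp = ["the cat sat on the mat"]
    ref = ["the cat sat on a mat"]
    print(f"BLEU: {corpus_bleu(hyp, ref):.2f}")
```

Because the score depends on tokenization and smoothing choices, BLEU values are only comparable when produced by the same scorer on the same test set, as in the paper's Table 6.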

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| BLEU | 28.01 | Llama-3.1-8B-Instruct 10.87 | +17.14 | In-house MT test split | UrduLLaMA 28.01 vs Llama-3.1-8B-Instruct 10.87 (Table 6) | Table 6 |
| BLEU | 13.12 | Llama-3.1-8B-Instruct 10.04 | +3.08 | TICO-19 | UrduLLaMA 13.12 vs Llama-3.1-8B-Instruct 10.04; seamless-m4t-v2-large 19.22 (Table 6) | Table 6 |
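
The deltas in the results above can be checked with a few lines of arithmetic. The sketch below uses only the BLEU values reported in this summary; the dictionary layout and function name are illustrative.

```python
# BLEU scores as reported in this summary (Table 6 of the paper).
RESULTS = {
    "In-house MT test split": {"urdullama": 28.01, "base": 10.87},
    "TICO-19": {"urdullama": 13.12, "base": 10.04, "seamless": 19.22},
}


def summarize(scores):
    """Return (absolute BLEU gain, ratio) of UrduLLaMA over the base model."""
    delta = round(scores["urdullama"] - scores["base"], 2)
    ratio = round(scores["urdullama"] / scores["base"], 2)
    return delta, ratio


if __name__ == "__main__":
    for dataset, scores in RESULTS.items():
        delta, ratio = summarize(scores)
        print(f"{dataset}: +{delta:.2f} BLEU ({ratio:.2f}x base)")
    tico = RESULTS["TICO-19"]
    gap = round(tico["seamless"] - tico["urdullama"], 2)
    print(f"TICO-19 gap to seamless-m4t-v2-large: {gap:.2f} BLEU")
```

The in-domain gain works out to about 2.58x the base score, while on TICO-19 the gain is about 1.31x and UrduLLaMA still trails seamless-m4t-v2-large by 6.10 BLEU.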

What To Try In 7 Days

Collect a small, domain-focused Urdu corpus and run the paper's preprocessing (language filtering, normalization, dedup).

Apply LoRA to an open LLaMA-style 7–8B checkpoint for instruction tuning using ~10k–50k translated/task examples.

Fine-tune on a modest in-domain parallel set and evaluate with BLEU plus a 100-sentence blind human check.
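
The preprocessing step in the first item can be sketched with the standard library alone. This is an illustrative pipeline, not the paper's: the thresholds, the script-ratio heuristic, and all function names are assumptions. Note that a script check cannot tell Urdu from Arabic or Persian, so a real pipeline would likely add a proper language identifier.

```python
import hashlib
import re
import unicodedata

# Unicode blocks for Arabic script, which Urdu uses (heuristic, assumed sufficient here).
ARABIC_SCRIPT = re.compile(r"[\u0600-\u06FF\u0750-\u077F\uFB50-\uFDFF\uFE70-\uFEFF]")


def normalize(text):
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def urdu_script_ratio(text):
    """Fraction of alphabetic characters that fall in Arabic-script blocks."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if ARABIC_SCRIPT.match(ch)) / len(letters)


def clean_corpus(lines, min_ratio=0.8, min_chars=10):
    """Normalize, filter by script ratio and length, then drop exact duplicates."""
    seen = set()
    kept = []
    for line in lines:
        text = normalize(line)
        if len(text) < min_chars or urdu_script_ratio(text) < min_ratio:
            continue
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(text)
    return kept


if __name__ == "__main__":
    sample = [
        "یہ ایک اردو جملہ ہے۔",
        "یہ  ایک اردو جملہ ہے۔ ",  # duplicate once whitespace is collapsed
        "This is an English sentence.",
    ]
    print(clean_corpus(sample))
```

Exact-match hashing only catches identical lines; near-duplicate detection (for example MinHash) would be a natural next step for web-scale data.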

Optimization Features

Token Efficiency: Pretrained on a 128M-token subset (resource-constrained setup)
Infra Optimization: LoRA
Model Optimization: LoRA; full fine-tuning with activation checkpointing and FSDP memory wrap
System Optimization: Activation offloading and checkpointing to fit model on limited GPUs
Training Optimization: Used 128M tokens for continual pretraining to limit compute; LoRA

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Continual pretraining used only 128M of the 1.14B curated tokens (about 11%) due to compute limits, so coverage is incomplete.

Detoxification was not applied, so the model can produce harmful or offensive outputs.

When Not To Use

For safety-critical or moderated deployments, unless separate detox controls are added.

As a drop-in replacement for general-purpose multilingual translation where broad domain coverage is needed.

Failure Modes

Generates offensive or harmful content because detox was not applied.

Underperforms on out-of-domain or culturally nuanced Urdu content due to limited pretraining coverage.

Core Entities

Models

UrduLLaMA 1.0, Llama-3.1-8B-Instruct, seamless-m4t-v2-large, opus-mt-en-ur

Metrics

BLEU, Human preference counts

Datasets

UrduLLaMA curated dataset (1.14B tokens after processing), In-house MT corpus (62,970 entries; 50,376 train), TICO-19, Tatoeba Challenge, CC-100 (Urdu), OSCAR (Urdu)

Benchmarks

In-house MT test split, TICO-19, Tatoeba Challenge

Context Entities

Models

LLaMA 3.1 family, seamless-m4t-v2-large, opus-mt

Metrics

BLEU

Datasets

Alpaca (translated Urdu subset), Dolly (translated Urdu subset), XLSum Urdu