UrduLLaMA 1.0: fine-tuning LLaMA-3.1 for Urdu with 128M tokens and LoRA

Overview

Decision Snapshot: Needs Validation

Demonstrates practical gains from modest continual pretraining plus LoRA on in-domain Urdu translation, but the results are limited by a small pretraining budget, a narrow test scope, and missing detoxification; larger or more diverse data could change the outcomes.

Citations: 0

Evidence Strength: 0.65

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 50%

Authors

Layba Fiaz, Munief Hassan Tahir, Sana Shams, Sarmad Hussain

Why It Matters For Business

Targeted continual pretraining plus LoRA fine-tuning can give large in-domain translation gains with modest compute, enabling localized Urdu services without training from scratch.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

This paper builds UrduLLaMA 1.0 by continually pretraining Llama-3.1-8B-Instruct on 128M curated Urdu tokens and then fine-tuning with LoRA on 41k Urdu instructions plus ~50k English–Urdu sentence pairs. On three translation test sets, UrduLLaMA improves BLEU over the base Llama-3.1-8B-Instruct model, especially in-domain, though a large multilingual translation model (seamless-m4t-v2-large) still leads on some general datasets. The work shows practical gains from targeted adaptation with limited compute but is constrained by its token budget, narrow evaluation, and lack of detoxification.

Problem Statement

Open LLMs underperform on low-resource languages like Urdu because training corpora lack sufficient, clean Urdu data and language-specific preprocessing. The paper asks whether modest continual pretraining plus targeted fine-tuning (using LoRA) can improve Urdu translation and instruction following with limited compute.

Main Contribution

Curated and preprocessed a 1.14B-token Urdu dataset (after filtering/deduplication) and used 128M tokens for continual pretraining.

Continual pretraining of LLaMA-3.1-8B-Instruct on Urdu (128M tokens) followed by LoRA-based instruction tuning (41k instructions) and MT fine-tuning (~50k en-ur pairs).
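As a rough illustration of why LoRA keeps the fine-tuning stages cheap, the sketch below counts the parameters a single LoRA adapter adds to one weight matrix. LoRA freezes the original d_out x d_in weight and trains two low-rank factors, so the adapter adds rank * (d_in + d_out) parameters. The rank of 16 and the 4096 x 4096 matrix shape are illustrative assumptions; this summary does not state the rank or target modules the authors used.

```python
def lora_added_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter.

    The frozen weight W is (d_out x d_in). LoRA trains A (rank x d_in)
    and B (d_out x rank), so the adapter holds rank * (d_in + d_out) values.
    """
    return rank * (d_in + d_out)


def full_matrix_params(d_in: int, d_out: int) -> int:
    """Parameters in the original dense weight matrix."""
    return d_in * d_out


# Illustrative values only: rank and matrix shape are assumptions,
# not settings reported in this summary.
D_IN = D_OUT = 4096
RANK = 16

added = lora_added_params(D_IN, D_OUT, RANK)
full = full_matrix_params(D_IN, D_OUT)
fraction = added / full

print(f"adapter parameters: {added}")        # 131072
print(f"dense parameters:   {full}")         # 16777216
print(f"trainable fraction: {fraction:.2%}")  # 0.78%
```

Under these assumed values the adapter trains under 1% of the parameters of the matrix it modifies, which is the property that makes instruction tuning and MT fine-tuning feasible on limited GPUs.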

Key Findings

UrduLLaMA 1.0 raises in-house MT BLEU from 10.87 to 28.01.

Numbers: BLEU 28.01 vs 10.87 (Table 6)

Practical Use: Fine-tuning a general LLaMA model on domain-specific Urdu data raised BLEU about 2.6x here (10.87 to 28.01); apply targeted domain data to boost in-domain translation quickly.

Evidence Ref: Table 6

On general-domain test sets the gains are smaller and a large multilingual model can still win.

Numbers: TICO-19 BLEU 13.12 (UrduLLaMA) vs 19.22 (seamless-m4t-v2-large)

Practical Use: Continual pretraining + LoRA helps, but for broad general-domain translation consider large multilingual translation models or add more diverse pretraining data.

Evidence Ref: Table 6
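For readers less familiar with the metric behind these findings, here is a minimal sketch of corpus-level BLEU: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and scaled by a brevity penalty. It assumes whitespace tokenization, a single reference per sentence, and no smoothing. The summary does not say which BLEU implementation or tokenizer the paper used, so scores from this sketch are not comparable to the reported numbers. The English sentences are placeholders.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU on a 0-100 scale (one reference per hypothesis)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any n-gram order with zero matches gives BLEU 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


reference = ["the cat sat on a mat"]
print(corpus_bleu(["the cat sat on a mat"], reference))    # identical output
print(corpus_bleu(["the cat sat on the mat"], reference))  # one word differs
print(corpus_bleu(["completely unrelated words here"], reference))
```

For a real evaluation, an established implementation with a documented tokenizer is a safer choice than a hand-rolled scorer, because BLEU values shift noticeably with tokenization.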

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
BLEU	28.01	Llama-3.1-8B-Instruct 10.87	+17.14	In-house MT test split	UrduLLaMA 28.01 vs Llama-3.1-8B-Instruct 10.87 (Table 6)	Table 6
BLEU	13.12	Llama-3.1-8B-Instruct 10.04	+3.08	TICO-19	UrduLLaMA 13.12 vs Llama-3.1-8B-Instruct 10.04; seamless-m4t-v2-large 19.22 (Table 6)	Table 6
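The Delta column can be reproduced with simple arithmetic. The sketch below recomputes the absolute and relative gains from the BLEU values in the table, along with the remaining gap to seamless-m4t-v2-large on TICO-19. The function name `gain` is a hypothetical helper introduced only for this illustration.

```python
def gain(system: float, baseline: float):
    """Return (absolute delta, ratio) of a system score over a baseline score."""
    return system - baseline, system / baseline


# BLEU values as reported in the table above (Table 6 of the paper).
in_house_delta, in_house_ratio = gain(28.01, 10.87)
tico_delta, tico_ratio = gain(13.12, 10.04)
# Gap between the strongest reported model and UrduLLaMA on TICO-19.
tico_gap_to_seamless = 19.22 - 13.12

print(f"In-house: {in_house_delta:+.2f} BLEU ({in_house_ratio:.2f}x)")  # +17.14 (2.58x)
print(f"TICO-19:  {tico_delta:+.2f} BLEU ({tico_ratio:.2f}x)")          # +3.08 (1.31x)
print(f"TICO-19 gap to seamless-m4t-v2-large: {tico_gap_to_seamless:.2f} BLEU")  # 6.10
```

The in-domain gain is about 2.6x the baseline, while the general-domain gain on TICO-19 is about 1.3x and still leaves UrduLLaMA roughly 6 BLEU behind the multilingual translation model.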

What To Try In 7 Days

Collect a small, domain-focused Urdu corpus and run the paper's preprocessing (language filtering, normalization, dedup).

Apply LoRA to an open LLaMA-style 7–8B checkpoint for instruction tuning using ~10k–50k translated/task examples.

Fine-tune on a modest in-domain parallel set and evaluate with BLEU plus a 100-sentence blind human check.
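The preprocessing step in the first item can be sketched as follows. This is a minimal stand-in, not the paper's pipeline: the summary names language filtering, normalization, and deduplication but gives no details, so the sketch assumes NFC normalization with whitespace collapsing, a script-ratio filter over the Unicode Arabic block, and exact-match deduplication by hash. The 0.8 threshold and all function names are assumptions.

```python
import hashlib
import re
import unicodedata

# Unicode Arabic block, which contains the core Urdu letters.
ARABIC_BLOCK_START, ARABIC_BLOCK_END = 0x0600, 0x06FF


def normalize(text: str) -> str:
    """Apply NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def arabic_script_ratio(text: str) -> float:
    """Share of alphabetic characters that fall in the Arabic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_block = sum(ARABIC_BLOCK_START <= ord(ch) <= ARABIC_BLOCK_END for ch in letters)
    return in_block / len(letters)


def preprocess(docs, min_ratio: float = 0.8):
    """Normalize, drop documents below the script threshold, remove exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        clean = normalize(doc)
        if not clean or arabic_script_ratio(clean) < min_ratio:
            continue
        digest = hashlib.sha256(clean.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(clean)
    return kept


docs = [
    "یہ ایک مثال ہے",        # Urdu sentence
    "یہ  ایک   مثال ہے ",     # same sentence with extra whitespace
    "This is English text",  # filtered out by the script ratio
]
print(preprocess(docs))
```

A script-ratio filter cannot separate Urdu from Arabic or Persian, which share the same block, so a production pipeline would likely need a trained language identifier and near-duplicate detection on top of this.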

Optimization Features

Token Efficiency

Pretrained on a 128M-token subset (resource-constrained setup)

Infra Optimization

LoRA

Model Optimization

LoRA

Full fine-tuning with activation checkpointing and FSDP memory wrap

System Optimization

Activation offloading and checkpointing to fit model on limited GPUs

Training Optimization

Used 128M tokens for continual pretraining to limit compute

LoRA

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Continual pretraining used only 128M tokens due to compute limits; coverage is incomplete.

Detoxification was not applied, so the model can produce harmful or offensive outputs.

When Not To Use

For safety-critical or moderated deployments without detox controls.

As a drop-in replacement for general-purpose multilingual translation where broad domain coverage is needed.

Failure Modes

Generates offensive or harmful content because detox was not applied.

Underperforms on out-of-domain or culturally nuanced Urdu content due to limited pretraining coverage.

Core Entities

Models

UrduLLaMA 1.0

Llama-3.1-8B-Instruct

seamless-m4t-v2-large

opus-mt-en-ur

Metrics

BLEU

Human preference counts

Datasets

UrduLLaMA curated dataset (1.14B tokens after processing)

In-house MT corpus (62,970 entries; 50,376 train)

TICO-19

Tatoeba Challenge

CC-100 (Urdu)

OSCAR (Urdu)

Benchmarks

In-house MT test split

TICO-19

Tatoeba Challenge

Context Entities

Models

LLaMA 3.1 family

seamless-m4t-v2-large

opus-mt

Metrics

BLEU

Datasets

Alpaca (translated Urdu subset)

Dolly (translated Urdu subset)

XLSum Urdu
