UrduLLaMA 1.0: fine-tuning LLaMA-3.1 for Urdu with 128M tokens and LoRA

February 24, 2025 · 6 min

Overview

Production Readiness

0.4

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Layba Fiaz, Munief Hassan Tahir, Sana Shams, Sarmad Hussain

Why It Matters For Business

Targeted continual pretraining plus LoRA fine-tuning can give large in-domain translation gains with modest compute, enabling localized Urdu services without training from scratch.

Summary TLDR

This paper builds UrduLLaMA 1.0 by continually pretraining Llama-3.1-8B-Instruct on 128M curated Urdu tokens and then fine-tuning with LoRA on 41k Urdu instructions plus ~50k English–Urdu sentence pairs. On three translation test sets, UrduLLaMA improves BLEU over the base Llama model, most of all in-domain (10.87 to 28.01), though a large multilingual translation model (seamless-m4t-v2-large) still leads on some general-domain datasets. The work shows practical gains from targeted adaptation with limited compute; its main limits are the small token budget, a narrow evaluation, and the lack of detoxification.

Problem Statement

Open LLMs underperform on low-resource languages like Urdu because training corpora lack sufficient, clean Urdu data and language-specific preprocessing. The paper asks whether modest continual pretraining plus targeted fine-tuning (using LoRA) can improve Urdu translation and instruction following with limited compute.

Main Contribution

Curated and preprocessed a 1.14B-token Urdu dataset (after filtering/deduplication) and used 128M tokens for continual pretraining.

Continual pretraining of LLaMA-3.1-8B-Instruct on Urdu (128M tokens) followed by LoRA-based instruction tuning (41k instructions) and MT fine-tuning (~50k en-ur pairs).

Evaluated translation quality with BLEU on three test sets and a blind human evaluation with two native linguists; reported clear in-domain BLEU gains over the base model.
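To make the "limited compute" argument for LoRA concrete, the following is a minimal sketch of how few trainable parameters a low-rank adapter adds compared with updating the dense weights. The layer shapes follow the public Llama-3.1-8B configuration; the LoRA ranks and the choice to adapt only the attention projections are assumptions for illustration, since this summary does not report the paper's LoRA settings.

```python
# Sketch: trainable-parameter count for LoRA under ASSUMED settings.
# Shapes follow the public Llama-3.1-8B configuration; the ranks and the set
# of adapted projections are hypothetical, not values from the paper.

HIDDEN = 4096      # model hidden size
KV_DIM = 1024      # 8 key/value heads x 128 head dim (grouped-query attention)
NUM_LAYERS = 32    # number of transformer blocks


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA learns B (d_out x r) and A (r x d_in) instead of a d_out x d_in update."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameter count of the dense weight matrix itself."""
    return d_in * d_out


# Hypothetical choice: adapt only the four attention projections.
ATTN_PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def count(rank: int) -> tuple[int, int]:
    """Return (LoRA params, dense params) summed over all layers."""
    lora = sum(lora_params(i, o, rank) for i, o in ATTN_PROJECTIONS.values())
    full = sum(full_params(i, o) for i, o in ATTN_PROJECTIONS.values())
    return lora * NUM_LAYERS, full * NUM_LAYERS


if __name__ == "__main__":
    for rank in (8, 16, 64):
        lora, full = count(rank)
        print(f"rank={rank}: {lora:,} LoRA params vs {full:,} dense ({lora / full:.2%})")
```

Under these assumptions the adapter stays well below 1% of the attention weights at small ranks, which is why the instruction-tuning and MT stages fit on far less hardware than the continual-pretraining stage.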

Key Findings

UrduLLaMA 1.0 raises in-house MT BLEU from 10.87 to 28.01.

Numbers: BLEU 28.01 vs 10.87 (Table 6)

On general-domain test sets the gains are smaller and a large multilingual model can still win.

Numbers: TICO-19 BLEU 13.12 (UrduLLaMA) vs 19.22 (seamless-m4t-v2-large)

Human judgments (300 sentences) favored seamless-m4t-v2-large overall; UrduLLaMA improved versus the base model on some sets.

Numbers: Human preference counts, e.g., TICO-19: UrduLLaMA 25.5 vs seamless-m4t-v2-large 58 (Table 7)

Results

BLEU

Value: 28.01

Baseline: Llama-3.1-8B-Instruct 10.87

BLEU

Value: 13.12

Baseline: Llama-3.1-8B-Instruct 10.04

BLEU

Value: 15.16

Baseline: Llama-3.1-8B-Instruct 12.49

Human preference (count)

Value: 23 / 25.5 / 24.5

Baseline: seamless-m4t-v2-large 25 / 58 / 62
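The size of each BLEU improvement over the base model is easier to compare as an absolute and a relative gain. The short sketch below computes both from the three BLEU rows listed above; the helper name `gains` is introduced here for illustration only.

```python
# (UrduLLaMA BLEU, base Llama-3.1-8B-Instruct BLEU), in the order listed above.
# The first pair is the in-house set and the second is TICO-19 (per Key
# Findings); this summary does not name the dataset for the third pair.
REPORTED = [(28.01, 10.87), (13.12, 10.04), (15.16, 12.49)]


def gains(tuned: float, base: float) -> tuple[float, float]:
    """Return (absolute BLEU gain, relative gain in percent)."""
    absolute = round(tuned - base, 2)
    relative = round(100 * (tuned - base) / base, 1)
    return absolute, relative


if __name__ == "__main__":
    for tuned, base in REPORTED:
        absolute, relative = gains(tuned, base)
        print(f"{base} -> {tuned}: +{absolute} BLEU ({relative}% relative)")
```

The in-domain row more than doubles the base score, while the other two rows improve by roughly a fifth to a third, which matches the summary's point that gains shrink outside the fine-tuning domain.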

What To Try In 7 Days

Collect a small, domain-focused Urdu corpus and run the paper's preprocessing (language filtering, normalization, dedup).
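The three preprocessing steps named above (language filtering, normalization, deduplication) could be prototyped roughly as follows. This is a minimal sketch, not the paper's pipeline: the Arabic-script ratio heuristic, the 0.7 threshold, the two character mappings, and all function names are assumptions. Note that a script-based filter cannot tell Urdu from Arabic or Persian, so a real pipeline would likely add a language-identification model.

```python
import hashlib
import re
import unicodedata

# Arabic-script Unicode blocks (Urdu is written in this script).
ARABIC_SCRIPT = re.compile(r"[\u0600-\u06FF\u0750-\u077F\uFB50-\uFDFF\uFE70-\uFEFF]")

# Assumed normalization: map Arabic yeh/kaf to the forms used in Urdu text.
CHAR_MAP = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})


def normalize(text: str) -> str:
    """NFKC-normalize, unify look-alike letters, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).translate(CHAR_MAP)
    return re.sub(r"\s+", " ", text).strip()


def arabic_script_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in Arabic-script blocks."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if ARABIC_SCRIPT.match(ch)) / len(letters)


def preprocess(docs: list[str], min_ratio: float = 0.7) -> list[str]:
    """Keep normalized, mostly-Arabic-script documents, dropping exact duplicates."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        clean = normalize(doc)
        if not clean or arabic_script_ratio(clean) < min_ratio:
            continue
        digest = hashlib.sha256(clean.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(clean)
    return kept


if __name__ == "__main__":
    sample = ["یہ ایک جملہ ہے", "یہ  ایک جملہ ہے ", "This is English only."]
    print(preprocess(sample))
```

Exact-hash deduplication only catches documents that are identical after normalization; near-duplicate detection (for example MinHash) would be a separate step.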

Apply LoRA to an open LLaMA-style 7–8B checkpoint for instruction tuning using ~10k–50k translated/task examples.

Fine-tune on a modest in-domain parallel set and evaluate with BLEU plus a 100-sentence blind human check.
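For the BLEU part of that evaluation, the sketch below shows what the metric computes: clipped n-gram precision up to 4-grams, combined by geometric mean and scaled by a brevity penalty. It is a simplified illustration that assumes whitespace tokenization, one reference per sentence, and no smoothing; scores meant for comparison with published numbers should come from a standard tool such as sacreBLEU, whose tokenization differs from this sketch.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: list[str], references: list[str], max_n: int = 4) -> float:
    """Simplified corpus-level BLEU (0-100), single reference, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    refs = ["a b c d e", "the cat sat on the mat"]
    hyps = ["a b c d f", "the cat sat on the mat"]
    print(f"BLEU = {corpus_bleu(hyps, refs):.2f}")
```

Because BLEU rewards surface overlap with a single reference, pairing it with the blind human check is a reasonable guard against a model that scores well only by matching in-domain phrasing.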

Optimization Features

Token Efficiency

  • Pretrained on a 128M-token subset (resource-constrained setup)

Infra Optimization

  • LoRA

Model Optimization

  • LoRA
  • Full fine-tuning with activation checkpointing and FSDP memory wrap

System Optimization

  • Activation offloading and checkpointing to fit model on limited GPUs

Training Optimization

  • Used 128M tokens for continual pretraining to limit compute
  • LoRA

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Continual pretraining used only 128M tokens due to compute limits; coverage is incomplete.
  • Detoxification was not applied, so the model can produce harmful or offensive outputs.
  • Evaluation focused on translation and a small human sample; other capabilities untested.
  • In-house MT data is private; results may favor models tuned on similar domains.

When Not To Use

  • For safety-critical or moderated deployments without detox controls.
  • As a drop-in replacement for general-purpose multilingual translation where broad domain coverage is needed.
  • When legal/privacy guarantees require fully public, auditable training data.

Failure Modes

  • Generates offensive or harmful content because detox was not applied.
  • Underperforms on out-of-domain or culturally nuanced Urdu content due to limited pretraining coverage.
  • Possible memorization of web-scraped content if deduplication missed cases.

Core Entities

Models

  • UrduLLaMA 1.0
  • Llama-3.1-8B-Instruct
  • seamless-m4t-v2-large
  • opus-mt-en-ur

Metrics

  • BLEU
  • Human preference counts

Datasets

  • UrduLLaMA curated dataset (1.14B tokens after processing)
  • In-house MT corpus (62,970 entries; 50,376 train)
  • TICO-19
  • Tatoeba Challenge
  • CC-100 (Urdu)
  • OSCAR (Urdu)

Benchmarks

  • In-house MT test split
  • TICO-19
  • Tatoeba Challenge

Context Entities

Models

  • LLaMA 3.1 family
  • seamless-m4t-v2-large
  • opus-mt

Metrics

  • BLEU

Datasets

  • Alpaca (translated Urdu subset)
  • Dolly (translated Urdu subset)
  • XLSum Urdu