Open-source Punjabi LLM suite + a quantum‑inspired hybrid retriever that improves retrieval and generation.

August 3, 2025 · 9 min

Overview

Production Readiness

0.7

Novelty Score

0.72

Cost Impact Score

0.6

Citation Count

0

Authors

Jaskaranjeet Singh, Rakesh Thakur

Why It Matters For Business

Tools that handle a local language well unlock use cases such as education, news, local QA, and civic services. In the paper's evaluations, a dedicated model combined with hybrid retrieval scores higher on accuracy and cultural fidelity than off-the-shelf multilingual models.

Summary TLDR

The authors build PunGPT2, a 124M-parameter Punjabi decoder model trained on a 35GB curated Punjabi corpus, and release three system variants: Pun-RAG (FAISS-based RAG), Pun-Instruct (QLoRA instruction-tuned), and Quantum-RAG, a hybrid retriever that fuses BM25, FAISS dense embeddings, and a quantum‑inspired kernel. They also release the PunjabiEval benchmark. In the paper's evaluations, Quantum-RAG raises Recall@10 by 7.4 points over FAISS alone and improves generation metrics (for example, +3.5 BLEU over mT5). Training ran on a single A100 40GB GPU in about 48 hours. The authors state that code, data, and weights are released.

Problem Statement

Punjabi is underrepresented in multilingual LLMs. Poor tokenization and a very small share of the training data lead to high perplexity and weak generation. The paper aims to provide a dedicated Punjabi model, retrieval grounding, and a benchmark.

Main Contribution

PunGPT2: first decoder-only Punjabi LLM trained on a 35GB curated Punjabi corpus.

Pun-RAG: a RAG pipeline for Punjabi built on a FAISS dense retriever.

Pun-Instruct: instruction-tuned PunGPT2 using QLoRA (memory-efficient 4-bit fine-tuning).

Quantum-RAG: hybrid retriever combining BM25, FAISS dense embeddings, and a quantum-inspired similarity kernel.

PunjabiEval: a new benchmark for translation, summarization, and cultural-fidelity evaluation, released alongside the open corpus.
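This summary does not spell out the quantum-inspired kernel used in Quantum-RAG. One common construction, shown here purely as an assumption, treats each L2-normalized embedding as a state vector with optional per-dimension phases and scores a pair by the fidelity |⟨ψ_q|ψ_d⟩|². The function names and the `phases` parameter are hypothetical and are not taken from the paper.

```python
import cmath
import math

def to_state(vec, phases=None):
    """Encode a real vector as a unit-norm complex 'state' (assumed encoding)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot encode a zero vector")
    if phases is None:
        phases = [0.0] * len(vec)
    if len(phases) != len(vec):
        raise ValueError("phases must match the vector length")
    return [(x / norm) * cmath.exp(1j * p) for x, p in zip(vec, phases)]

def fidelity_kernel(query_vec, doc_vec, phases=None):
    """Return |<psi_q|psi_d>|^2, a similarity in [0, 1].

    `phases` is a hypothetical per-dimension offset applied to the document side.
    """
    if len(query_vec) != len(doc_vec):
        raise ValueError("vectors must have the same length")
    psi_q = to_state(query_vec)
    psi_d = to_state(doc_vec, phases)
    inner = sum(a.conjugate() * b for a, b in zip(psi_q, psi_d))
    return abs(inner) ** 2
```

With all phases at zero this reduces to squared cosine similarity, so any extra signal in such a kernel would have to come from the phase offsets.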

Key Findings

A 35GB, 4.8M-document Punjabi corpus was assembled and used for training.

Numbers: 35.5GB corpus; ~4,800,000 documents; split 32GB train / 2GB val / 1GB test

PunGPT2 achieves much lower perplexity than multilingual baselines on evaluated data.

Numbers: Perplexity: PunGPT2 2.24 vs mT5 28.5 (on paper's test split)
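For readers who want to reproduce a perplexity figure, the standard definition is the exponential of the mean negative log-likelihood per token. The sketch below assumes natural-log token probabilities are already available from a model; the example values are made up for illustration.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Hypothetical log-probabilities for a four-token sequence.
example = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
print(perplexity(example))
```

Because the value is computed per token, it depends on the tokenizer, which is worth keeping in mind when comparing models with different vocabularies.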

Quantum-RAG improves retrieval and downstream quality over FAISS-only.

Numbers: Recall@10: Hybrid 70.1 vs FAISS 62.7 (+7.4); MRR 0.54 vs 0.48
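A minimal sketch of how Recall@k and MRR are typically computed may help when checking these numbers on your own data. It assumes Recall@k is the fraction of a query's relevant documents found in the top k, averaged over queries; the paper may use a different variant (for example, a hit rate). The function names are illustrative.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each query needs at least one relevant document")
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(runs, k=10):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    n = len(runs)
    recall = sum(recall_at_k(r, rel, k) for r, rel in runs) / n
    mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / n
    return recall, mrr
```

Running `evaluate` once per retriever on the same query set gives directly comparable Recall@10 and MRR values.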

Generation metrics and human-rated cultural fidelity improved compared to multilingual baselines.

Numbers: ROUGE-L: Quantum-RAG 40.1 vs mT5 33.2 (+6.9); BLEU +3.5 vs mT5 on PunjabiEval; Cultural fidelity 4.8/5 vs 3.9/5

QLoRA instruction tuning improves instruction-following and human-rated quality.

Numbers: Human ratings: fluency +0.3, adequacy +0.4, cultural fidelity +0.5 (5-pt Likert); inter-annotator κ=0.71

Models are trainable on a single high-memory GPU with modest wall time.

Numbers: Training completed on 1× A100 40GB in ~48 hours; processed ~7.5B tokens
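As a quick sanity check on these figures, the arithmetic below derives the implied training throughput and the tokens-per-parameter ratio. All three inputs come from this summary; they are approximate, so the outputs are rough estimates only.

```python
TOKENS = 7.5e9   # tokens processed (approximate, from the summary)
HOURS = 48       # wall-clock training time (approximate, from the summary)
PARAMS = 124e6   # PunGPT2 parameter count (from the summary)

tokens_per_second = TOKENS / (HOURS * 3600)
tokens_per_param = TOKENS / PARAMS

print(f"{tokens_per_second:,.0f} tokens/s")
print(f"{tokens_per_param:.1f} tokens per parameter")
```

This works out to roughly 43,400 tokens per second and about 60 tokens per parameter.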

Results

Recall@10 (retrieval)

Value: Hybrid (Quantum-RAG) 70.1

Baseline: FAISS only 62.7

BLEU (generation)

Value: Quantum-RAG vs mT5: +3.5 BLEU

Baseline: mT5

Perplexity (language model)

Value: PunGPT2 2.24

Baseline: mT5 28.5

ROUGE-L (summarization/generation)

Value: Quantum-RAG 40.1

Baseline: mT5 33.2

Human cultural fidelity (Likert 1-5)

Value: Quantum-RAG 4.8/5

Baseline: mT5 3.9/5

Who Should Care

What To Try In 7 Days

Run a FAISS + BM25 fusion prototype on your Punjabi corpus to check Recall@10 improvements.

Fine-tune a small decoder model or adapter with QLoRA for a target task (summarization or QA) on a single A100 or equivalent.

Evaluate cultural fidelity with a small native-speaker panel (10 people, ~100 prompts) to catch obvious gaps.
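For the BM25 + FAISS fusion prototype suggested above, a minimal sketch might look like the following. It implements Okapi BM25 in plain Python and fuses it with dense scores by min-max normalization and a weighted sum. The dense scores are assumed to come from your own FAISS index; the `alpha` weight, the transliterated toy tokens, and the function names are all illustrative, and this is not the paper's fusion rule.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document (non-empty corpus assumed)."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    doc_freq = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

def min_max(values):
    """Rescale scores to [0, 1]; constant inputs map to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def fuse(sparse, dense, alpha=0.5):
    """Weighted sum of normalized sparse and dense scores (alpha is a guess)."""
    s, d = min_max(sparse), min_max(dense)
    return [alpha * a + (1 - alpha) * b for a, b in zip(s, d)]

# Toy corpus with placeholder transliterated tokens.
docs = [["punjab", "di", "rajdhani"], ["kanak", "di", "fasal"], ["mausam", "aj"]]
sparse = bm25_scores(["punjab"], docs)
dense = [0.2, 0.9, 0.1]  # hypothetical similarities from a FAISS search
fused = fuse(sparse, dense)
```

Sweeping `alpha` on a held-out query set and tracking Recall@10 is a simple way to see whether the sparse signal adds anything over dense retrieval alone.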

Optimization Features

Token Efficiency

  • 1024-token context length; BPE subword vocab to keep OOV low

Model Optimization

  • QLoRA (LoRA adapters on a 4-bit quantized base model) for instruction tuning
  • BPE tokenizer tuned for Punjabi morphology (50k vocab, <2% OOV)
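To gauge how few weights a LoRA-style adapter trains, the sketch below counts added parameters for one adapted projection per layer. The layer count, hidden size, rank, and the choice to adapt the fused attention projection are assumptions based on a GPT-2-small-like shape; this summary does not confirm PunGPT2's exact architecture or adapter settings.

```python
def lora_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)

# Assumed GPT-2-small-like shape; not confirmed by the summary.
N_LAYERS = 12
HIDDEN = 768
RANK = 8  # hypothetical adapter rank

# Fused query/key/value projection maps HIDDEN -> 3 * HIDDEN.
per_layer = lora_params(HIDDEN, 3 * HIDDEN, RANK)
total = N_LAYERS * per_layer
fraction = total / 124e6  # relative to the 124M base parameters

print(per_layer, total, f"{fraction:.2%}")
```

Under these assumptions the adapter adds about 0.3M trainable parameters, well under 1% of the base model.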

System Optimization

  • Training on single A100 40GB (MIG config) in ~48 hours

Training Optimization

  • Mixed-precision (FP16) training
  • Gradient accumulation to enable large effective batch size
  • Checkpointing every 5,000 steps
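The gradient accumulation item can be illustrated without a deep learning framework. The sketch below uses a toy one-parameter regression to show that averaging gradients over equal-sized micro-batches reproduces the full-batch gradient, which is why accumulation yields a larger effective batch size. The data and function names are made up; a real training loop would do this inside PyTorch or a similar library.

```python
def mse_grad(w, batch):
    """Gradient of mean squared error for y ~ w * x over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, data, micro_batch_size):
    """Average the micro-batch gradients, as gradient accumulation does."""
    micro_batches = [data[i:i + micro_batch_size]
                     for i in range(0, len(data), micro_batch_size)]
    steps = len(micro_batches)
    total = 0.0
    for micro_batch in micro_batches:
        # Scale each micro-batch gradient by 1/steps before summing.
        total += mse_grad(w, micro_batch) / steps
    return total

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
full = mse_grad(0.5, data)
accum = accumulated_grad(0.5, data, micro_batch_size=2)
effective_batch = 2 * 2  # micro-batch size x accumulation steps
```

The equivalence holds exactly only when all micro-batches have the same size; with a ragged final micro-batch the average is slightly reweighted.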

Inference Optimization

  • Retrieval augmentation to reduce hallucination (Pun-RAG)
  • Hybrid score fusion to avoid expensive exhaustive reranking

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Quantum kernel is 'quantum‑inspired' math, not a physical quantum computer; gains hinge on learned phase offsets and may not generalize.
  • Evaluation is limited to the authors' PunjabiEval and selected baselines; external replication needed.
  • Human evaluation used 10 annotators, which is helpful but too few to capture full population variance.
  • Dataset sources include web and religious texts; potential domain and cultural biases may persist despite cleaning.

When Not To Use

  • If your target language has ample high-quality multilingual model support, building a dedicated stack may not be cost-effective.
  • If legal/privacy rules prevent releasing training data, the open-release advantages do not apply.
  • If you cannot validate cultural fidelity with native speakers, the risk of undetected cultural errors increases.

Failure Modes

  • Quantum kernel weights could overfit to the released knowledge base, so retrieval quality may drop when the corpus distribution shifts.
  • Hallucinations may persist for out-of-knowledge queries not covered by the retrieval index.
  • Cultural or religious content could be misrepresented despite higher average fidelity scores.

Core Entities

Models

  • PunGPT2
  • Pun-RAG
  • Pun-Instruct
  • Quantum-RAG
  • mT5
  • mBERT
  • MuRIL
  • BLOOM
  • LLaMA-2-7B

Metrics

  • Perplexity
  • ROUGE-L
  • BLEU
  • Recall@10
  • MRR
  • nDCG
  • Cultural fidelity (Likert)
  • Human eval κ

Datasets

  • 35GB Punjabi corpus (authors)
  • PunjabiEval (authors)
  • FLORES-200
  • IndicGenBench

Benchmarks

  • PunjabiEval
  • FLORES-200
  • IndicGenBench