Open-source Punjabi LLM suite + a quantum‑inspired hybrid retriever that improves retrieval and generation.

Overview

Decision Snapshot: Ready For Pilot

The paper releases models, gives concrete training details (hardware, time, tokens), and evaluates through multiple lenses (automatic and human). Retrieval gains are backed by ablations and retrieval metrics. Some claims rest solely on the authors' released artifacts and in-paper evaluations.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 72%

Authors

Jaskaranjeet Singh, Rakesh Thakur

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Tools that speak a local language well unlock use cases (education, news, local QA, civic services). A dedicated model + hybrid retrieval gives measurably better accuracy and cultural fit than off-the-shelf multilingual models.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

The authors build PunGPT2 (a 124M-parameter Punjabi decoder model) trained on a 35GB curated Punjabi corpus, and release three system variants: Pun-RAG (FAISS-based RAG), Pun-Instruct (QLoRA instruction-tuned), and Quantum-RAG — a hybrid retriever that fuses BM25, FAISS dense embeddings, and a quantum‑inspired kernel. They also release PunjabiEval. On the paper's evaluations Quantum-RAG raises Recall@10 by +7.4 points over FAISS and improves generation metrics (example: +3.5 BLEU vs mT5). Training used a single A100 40GB in ~48 hours. Code, data, and weights are claimed as released.

Problem Statement

Punjabi is underrepresented in multilingual LLMs. Poor tokenization and tiny training presence lead to high perplexity and weak generation. The paper aims to provide a dedicated Punjabi model, retrieval grounding, and a benchmark.

Main Contribution

PunGPT2: first decoder-only Punjabi LLM trained on a 35GB curated Punjabi corpus.

Pun-RAG: a FAISS dense retriever based RAG pipeline for Punjabi.

Key Findings

A 35GB, 4.8M-document Punjabi corpus was assembled and used for training.

Numbers: 35.5GB corpus, ~4,800,000 documents, 32GB train / 2GB val / 1GB test

Practical Use: If you need a Punjabi pretraining dataset, this corpus (released by the authors) can bootstrap models and downstream evaluation.

Evidence Ref: Section 3, Table 2

PunGPT2 achieves much lower perplexity than multilingual baselines on evaluated data.

Numbers: Perplexity: PunGPT2 2.24 vs mT5 28.5 (on the paper's test split)

Practical Use: A language-specific decoder trained on a sizable curated corpus can drastically improve language modeling quality for Punjabi; use a dedicated model rather than relying only on large multilingual checkpoints.

Evidence Ref: Table 6
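To make the perplexity figures above easier to interpret: perplexity is the exponential of the average per-token negative log-likelihood, so lower means the model finds the text less surprising. The sketch below shows that calculation in plain Python. The function name and the log-probability values are illustrative assumptions, not numbers or code from the paper. Per-token perplexity also depends on the tokenizer, so comparisons between models with different vocabularies are indicative rather than exact.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean of natural-log token probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Hypothetical values: a model that gives every token probability 0.5
# versus one that gives every token probability 0.05.
confident = [math.log(0.5)] * 4
uncertain = [math.log(0.05)] * 4

print(round(perplexity(confident), 2))
print(round(perplexity(uncertain), 2))
```

A model that spreads probability evenly over N choices at each step has perplexity N, which is why the first hypothetical model lands near 2 and the second near 20.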

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Recall@10 (retrieval)	Hybrid (Quantum-RAG) 70.1	FAISS only 62.7	+7.4	Paper retrieval test set / Punjabi knowledge base	Table 9 reports Recall@10 for BM25, FAISS, Quantum-only, and Hybrid.	Table 9
BLEU (generation)	Quantum-RAG vs mT5: +3.5 BLEU	mT5	+3.5 BLEU	PunjabiEval	Abstract and evaluation section state +3.5 BLEU over mT5 on PunjabiEval.	Abstract
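The Recall@10 delta in the table is a simple difference in percentage points (70.1 - 62.7 = 7.4). A minimal sketch of how such a number could be computed on your own data follows. It assumes Recall@k means the fraction of relevant documents found in the top k results, averaged over queries; this summary does not state the paper's exact definition. The document IDs and runs are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(runs, k=10):
    """Average Recall@k over (ranked_ids, relevant_ids) pairs, in percent."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in runs]
    return 100.0 * sum(scores) / len(scores)

# Hypothetical two-query runs for a dense-only and a hybrid retriever.
dense_runs = [(["d3", "d1", "d9"], {"d1"}), (["d4", "d8", "d2"], {"d7"})]
hybrid_runs = [(["d1", "d3", "d9"], {"d1"}), (["d7", "d4", "d8"], {"d7"})]

dense = mean_recall_at_k(dense_runs, k=2)
hybrid = mean_recall_at_k(hybrid_runs, k=2)
print(f"dense={dense:.1f} hybrid={hybrid:.1f} delta={hybrid - dense:+.1f}")
```

Reporting the delta alongside both absolute values, as the table does, keeps the baseline visible when comparing retrievers.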

What To Try In 7 Days

Run a FAISS + BM25 fusion prototype on your Punjabi corpus to check Recall@10 improvements.

Fine-tune a small decoder model or adapter with QLoRA for a target task (summarization or QA) on a single A100 or equivalent.

Evaluate cultural fidelity with a small native-speaker panel (10 people, ~100 prompts) to catch obvious gaps.
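For the first item above, a fusion prototype can start as a weighted sum of normalized BM25 and dense scores. The sketch below is one simple way to do that, not the paper's Quantum-RAG method: it leaves out the quantum-inspired kernel entirely, and the function names, the `alpha` weight, and the scores are assumptions for illustration. In a real prototype the two score dictionaries would come from your BM25 implementation and your FAISS index.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to [0, 1]; constant scores map to 0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def fuse(bm25_scores, dense_scores, alpha=0.5):
    """Rank docs by alpha * BM25 + (1 - alpha) * dense, after normalization.

    A document missing from one retriever's results scores 0 for that part.
    """
    b, d = min_max(bm25_scores), min_max(dense_scores)
    fused = {
        doc: alpha * b.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0)
        for doc in set(b) | set(d)
    }
    # Sort by fused score (descending), breaking ties by document ID.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))

# Hypothetical raw scores from each retriever for one query.
bm25 = {"d1": 12.0, "d2": 4.0, "d3": 8.0}
dense = {"d2": 0.90, "d3": 0.80, "d4": 0.10}

ranking = fuse(bm25, dense, alpha=0.4)
print(ranking)
```

Normalizing before mixing matters because BM25 scores are unbounded while cosine or inner-product similarities usually sit in a narrow range; without it, one retriever would dominate the sum. The `alpha` weight is worth tuning on a held-out query set before measuring Recall@10.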

Optimization Features

Token Efficiency

1024-token context length; BPE subword vocab to keep OOV low

Model Optimization

LoRA; BPE tokenizer tuned for Punjabi morphology (50k vocab, <2% OOV)

System Optimization

Training on single A100 40GB (MIG config) in ~48 hours

Training Optimization

Mixed-precision (FP16) training; gradient accumulation to enable a large effective batch size; checkpointing every 5,000 steps
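Gradient accumulation trades wall-clock time for memory: the optimizer steps once after several small forward/backward passes, so the effective batch is the micro-batch size times the number of accumulation steps. The arithmetic is sketched below. The 1024-token sequence length comes from the context length reported in this summary; the micro-batch size and accumulation steps are hypothetical, since the paper's actual values are not given here.

```python
def effective_batch(micro_batch, accum_steps, num_devices=1, seq_len=1024):
    """Return (sequences per optimizer step, tokens per optimizer step)."""
    sequences = micro_batch * accum_steps * num_devices
    return sequences, sequences * seq_len

# Hypothetical single-GPU setting: 8 sequences fit in memory at once,
# and gradients are accumulated over 16 passes before each update.
sequences, tokens = effective_batch(micro_batch=8, accum_steps=16)
print(sequences, tokens)
```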

Inference Optimization

Retrieval augmentation to reduce hallucination (Pun-RAG); hybrid score fusion to avoid expensive exhaustive reranking

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://arxiv.org/abs/2508.01918 https://arxiv.org/pdf/2508.01918v2

Data URLs

https://arxiv.org/abs/2508.01918 https://arxiv.org/pdf/2508.01918v2

Risks & Boundaries

Limitations

The quantum kernel is 'quantum-inspired' math and does not run on a physical quantum computer; its gains hinge on learned phase offsets and may not generalize.

Evaluation is limited to the authors' PunjabiEval and selected baselines; external replication needed.

When Not To Use

If your target language has ample high-quality multilingual model support, building a dedicated stack may not be cost-effective.

If legal/privacy rules prevent releasing training data, the open-release advantages do not apply.

Failure Modes

Quantum kernel weights could overfit to the released knowledge base and drop when corpus distribution shifts.

Hallucinations may persist for out-of-knowledge queries not covered by the retrieval index.

Core Entities

Models

PunGPT2, Pun-RAG, Pun-Instruct, Quantum-RAG, mT5, mBERT, MuRIL, BLOOM, LLaMA-2-7B

Metrics

Perplexity, ROUGE-L, BLEU, Recall@10, MRR, nDCG, Cultural fidelity (Likert), Human eval κ

Datasets

35GB Punjabi corpus (authors), PunjabiEval (authors), FLORES-200, IndicGenBench

Benchmarks

PunjabiEval, FLORES-200, IndicGenBench

