Swap LLaMa's tokenizer for a Russian Unigram vocab to improve Russian quality and cut training/inference cost

December 5, 2023 · 6 min

Overview

Decision Snapshot: Needs Validation

The approach uses known techniques (vocab substitution, embedding initialization). Empirical gains are consistent but measured for one base model, one language, and one short continued-training regime.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 40%

Authors

Mikhail Tikhomirov, Daniil Chernyshev

Why It Matters For Business

Replacing an English-focused tokenizer with a language-specific Unigram vocab can improve non-English accuracy and cut fine-tuning and inference costs, lowering time-to-market and cloud bills for localized LLM products.

Who Should Care

Summary TLDR

The authors adapt LLaMa-7b to Russian by replacing its tokenizer with target-language vocabularies (Unigram and BPE), reinitializing the embeddings and LM head, and training only those layers for one epoch on a 3.5B-word Russian corpus. The Unigram tokenizer preserves word roots better, yields consistent quality gains on Russian SuperGLUE (fine-tuned mean 0.704 vs 0.681 baseline; zero-shot Saiga mean 0.509 vs 0.445 baseline), and is preferred in human pairwise tests. It also reduces compute: fine-tuning time dropped from 27h to 20h (about 26% less wall-clock time, or ~35% faster), and generating a 15-sentence output took ~17s instead of ~27s (about 37% less time, or ~60% faster). Results are specific to LLaMa-7b, one epoch of embedding tuning, and the datasets reported.
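
The two ways of reading these timing figures, time saved versus speedup, can be checked with a few lines of arithmetic. The sketch below uses only the wall-clock numbers quoted in the summary; the function names are illustrative and not taken from the paper.

```python
def time_saved(before: float, after: float) -> float:
    """Fraction of wall-clock time removed: (before - after) / before."""
    return (before - after) / before


def speedup(before: float, after: float) -> float:
    """Relative throughput gain: before / after - 1."""
    return before / after - 1.0


# (label, time with the original tokenizer, time with the Unigram tokenizer)
measurements = [
    ("fine-tuning (h)", 27.0, 20.0),
    ("generation (s)", 27.0, 17.0),
]

for label, before, after in measurements:
    print(
        f"{label}: {time_saved(before, after):.0%} less time, "
        f"{speedup(before, after):.0%} faster"
    )
```

Run as-is, this reports roughly 26% less time (35% faster) for fine-tuning and 37% less time (59% faster) for generation, so the "~35%" and "~60%" figures read most naturally as speedups rather than as time saved.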

Problem Statement

Pretrained LLaMa uses an English-centered tokenizer that fragments Russian words. That harms both accuracy and efficiency for Russian downstream and instruction tasks. The paper asks whether swapping in a Russian tokenizer and reinitializing embeddings fixes quality and reduces compute.
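
One common way to quantify this kind of fragmentation is tokens per word (often called fertility). The sketch below is a minimal, tokenizer-agnostic version; the two toy tokenizers are hypothetical stand-ins and do not reproduce the LLaMa or Unigram vocabularies.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_words += len(text.split())
        n_tokens += len(tokenize(text))
    return n_tokens / n_words if n_words else 0.0


def char_level(text: str) -> List[str]:
    """Toy worst case: every non-space character becomes its own token."""
    return [ch for ch in text if not ch.isspace()]


def word_level(text: str) -> List[str]:
    """Toy best case: every word is a single token."""
    return text.split()


sample = ["кошка сидит на окне"]
print("char-level fertility:", fertility(sample, char_level))
print("word-level fertility:", fertility(sample, word_level))
```

With real tokenizers, the `tokenize` argument would be whatever callable turns a string into a list of tokens; a lower value on a held-out Russian sample means less fragmentation.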

Main Contribution

Compare BPE and Unigram tokenizers for Russian adaptation of LLaMa-7b.

Show that Unigram preserves morphological roots better and gives the highest downstream gains.

Key Findings

Unigram tokenization preserves word roots better than BPE.

Practical Use: Use Unigram to keep stems intact for inflected languages like Russian; this helps models understand morphology.

Evidence Ref: Fig. 2, Section V.A

Fine-tuned LLaMa with Unigram vocab improves Russian SuperGLUE mean score.

Numbers: mean 0.704 vs 0.681 (baseline)

Practical Use: Expect modest but consistent task gains by swapping to a language-specific Unigram vocab before fine-tuning.

Evidence Ref: Table I

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
RSG mean (LoRA fine-tune) | 0.704 (llama7b rulm unigram) | 0.681 (llama7b) | +0.023 | Russian SuperGLUE | Table I reports mean scores across 9 tasks | Table I
RSG mean (Saiga zero-shot) | 0.509 (saiga7b rulm unigram) | 0.445 (saiga7b) | +0.064 | Russian SuperGLUE (zero-shot after instruction tuning) | Table II zero-shot means | Table II

What To Try In 7 Days

Train a Unigram SentencePiece tokenizer on a representative Russian corpus.

Rebuild the embeddings and LM head for the new vocabulary, and initialize each new token from the original embeddings by averaging over the original tokens it overlaps with.

Train only embeddings and LM head (freeze rest) for 1 epoch and compare RSG or in-house tests and latency/memory.
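
The embedding-initialization step can be sketched as follows. This is a minimal pure-Python illustration under an assumption: "averaging overlap" is read here as averaging the old embeddings of the pieces the original tokenizer splits a new token into. The exact rule used in the paper is not spelled out in this summary, and every name below (`init_new_embeddings`, `old_tokenize`, the toy vocabularies) is hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def init_new_embeddings(
    new_vocab: Sequence[str],
    old_vocab: Dict[str, int],
    old_emb: Sequence[Vector],
    old_tokenize: Callable[[str], List[str]],
) -> List[Vector]:
    """Build one embedding row per new token from the old embedding table."""
    # Assumption: tokens with no usable old pieces fall back to the global mean.
    fallback = mean_vector(old_emb)
    new_emb: List[Vector] = []
    for token in new_vocab:
        if token in old_vocab:
            # Token exists in both vocabularies: copy its row unchanged.
            new_emb.append(list(old_emb[old_vocab[token]]))
            continue
        # Otherwise average the rows of the old-tokenizer pieces of this token.
        pieces = [old_emb[old_vocab[p]] for p in old_tokenize(token) if p in old_vocab]
        new_emb.append(mean_vector(pieces) if pieces else list(fallback))
    return new_emb


# Toy data: a 3-token old vocabulary with 2-dimensional embeddings,
# and a character-level function standing in for the old tokenizer.
old_vocab = {"a": 0, "b": 1, "c": 2}
old_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_emb = init_new_embeddings(["a", "ab", "zz"], old_vocab, old_emb, list)
print(new_emb)
```

In a real setup the same logic would run over the model's embedding matrix and LM head, after which only those two matrices would be left trainable for the one-epoch pass; that freezing step is not shown here.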

Optimization Features

Token Efficiency
Unigram preserves morphology with comparable token counts; fewer tokens -> faster generation
Infra Optimization
shorter wall-clock time reduces GPU hours
Model Optimization
vocabulary substitution
System Optimization
use FP16 weights and FlashAttention2 for speed
Training Optimization
freeze base model, train embeddings/LM head only; LoRA
Inference Optimization
reduced token count per text; lower memory footprint; faster sampling with ForceTokensLogitsProcessor

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Only embeddings and LM head were trained; full model effects unknown.

Continued pre-training lasted one epoch; authors note more training could change results.

When Not To Use

If the base model already has high pretraining coverage for the target language.

When you require end-to-end retraining of model weights rather than lightweight adaptation.

Failure Modes

Unigram may increase token length in edge cases and hurt some memory/latency trade-offs.

Embedding initialization by averaging could misrepresent rare tokens.
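
The first failure mode is cheap to screen for before committing to a new vocabulary. The sketch below lists texts where a candidate tokenizer produces more tokens than the current one; both tokenizers here are hypothetical toys, and the function name is illustrative.

```python
from typing import Callable, Iterable, List, Tuple

Tokenizer = Callable[[str], List[str]]


def length_regressions(
    texts: Iterable[str], old_tokenize: Tokenizer, new_tokenize: Tokenizer
) -> List[Tuple[str, int, int]]:
    """Return (text, old_len, new_len) for texts that get longer under the new tokenizer."""
    regressions = []
    for text in texts:
        old_len = len(old_tokenize(text))
        new_len = len(new_tokenize(text))
        if new_len > old_len:
            regressions.append((text, old_len, new_len))
    return regressions


def toy_old(text: str) -> List[str]:
    """Stand-in for the current tokenizer: one token per word."""
    return text.split()


def toy_new(text: str) -> List[str]:
    """Stand-in for a Russian-only vocab: ASCII words fall back to characters."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(list(word) if word.isascii() else [word])
    return tokens


found = length_regressions(["привет мир", "hello мир"], toy_old, toy_new)
print(found)
```

Mixed-language inputs, code, and rare named entities are plausible places to look for such regressions, though the summary does not say which edge cases the authors had in mind.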

Core Entities

Models

LLaMa-7b, Saiga7b (instruction-tuned LLaMa variant)

Metrics

mean score on Russian SuperGLUE, accuracy, inference time (s), fine-tune wall-clock time (h), memory consumption

Datasets

Russian SuperGLUE, Custom Russian corpus (3.5B words, 9M documents), RuMorphsWords (morphological labels)

Benchmarks

Russian SuperGLUE

Context Entities

Models

ruGPT-3.5, ruT5-large