Swap LLaMa's tokenizer for a Russian Unigram vocab to improve Russian quality and cut training/inference cost

Overview

Decision Snapshot: Needs Validation

The approach uses known techniques (vocab substitution, embedding initialization). Empirical gains are consistent but measured for one base model, one language, and one short continued-training regime.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 40%

Authors

Mikhail Tikhomirov, Daniil Chernyshev

Links

Abstract / PDF

Why It Matters For Business

Replacing an English-focused tokenizer with a language-specific Unigram vocab can improve non-English accuracy and cut fine-tuning and inference costs, lowering time-to-market and cloud bills for localized LLM products.

Who Should Care

ML Engineer, Engineering Lead, CTO, Product Manager

Summary TLDR

The authors adapt LLaMa-7b to Russian by replacing its tokenizer with target-language vocabularies (Unigram and BPE), reinitializing the embeddings and LM head, and training only those layers for one epoch on a 3.5B-word Russian corpus. The Unigram tokenizer preserves word roots better, yields consistent quality gains on Russian SuperGLUE (fine-tune mean 0.704 vs 0.681 baseline; zero-shot Saiga mean 0.509 vs 0.445 baseline), and is preferred in human pairwise tests. It also reduces compute: fine-tuning time fell from 27h to 20h (a ~1.35x speedup, about 26% less wall-clock time), and generating a 15-sentence output took ~17s vs ~27s (~1.6x faster, about 37% less time). Results are specific to LLaMa-7b, one epoch of embedding tuning, and the datasets reported.
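The timing figures above can be read either as a share of time saved or as a speedup factor, and the two give different percentages. A minimal sketch of that arithmetic follows; the function name is illustrative and not from the paper.

```python
from typing import Tuple


def time_reduction_and_speedup(baseline: float, new: float) -> Tuple[float, float]:
    """Return (fraction of wall-clock time saved, speedup factor)."""
    if baseline <= 0 or new <= 0:
        raise ValueError("times must be positive")
    saved = (baseline - new) / baseline  # share of the baseline time that disappears
    speedup = baseline / new             # how many times faster the new run is
    return saved, speedup


# Figures reported in the summary: fine-tuning hours and generation seconds.
for label, baseline, new in [
    ("fine-tuning (h)", 27.0, 20.0),
    ("generation (s)", 27.0, 17.0),
]:
    saved, speedup = time_reduction_and_speedup(baseline, new)
    print(f"{label}: {saved:.0%} less time, {speedup:.2f}x speedup")
```

Quoting both numbers side by side avoids confusing "35% faster" with "35% less time" when estimating GPU budgets.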

Problem Statement

Pretrained LLaMa uses an English-centered tokenizer that fragments Russian words. That harms both accuracy and efficiency for Russian downstream and instruction tasks. The paper asks whether swapping in a Russian tokenizer and reinitializing embeddings fixes quality and reduces compute.

Main Contribution

Compare BPE and Unigram tokenizers for Russian adaptation of LLaMa-7b.

Show that Unigram keeps morphological roots better and gives the highest downstream gains.

Key Findings

Unigram tokenization preserves word roots better than BPE.

Practical Use: Use Unigram to keep stems intact for inflected languages like Russian; this helps models understand morphology.

Evidence Ref: Fig. 2, Section V.A

Fine-tuned LLaMa with Unigram vocab improves Russian SuperGLUE mean score.

Numbers: mean 0.704 vs 0.681 (baseline)

Practical Use: Expect modest but consistent task gains by swapping to a language-specific Unigram vocab before fine-tuning.

Evidence Ref: Table I

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
RSG mean (LoRA fine-tune)	0.704 (llama7b rulm unigram)	0.681 (llama7b)	+0.023	Russian SuperGLUE	Table I reports mean scores across 9 tasks	Table I
RSG mean (Saiga zero-shot)	0.509 (saiga7b rulm unigram)	0.445 (saiga7b)	+0.064	Russian SuperGLUE (zero-shot after instruction tuning)	Table II reports zero-shot means	Table II

What To Try In 7 Days

Train a Unigram SentencePiece tokenizer on a representative Russian corpus.

Rebuild the embeddings and LM head for the new vocab, initializing each new token by averaging the original embeddings of the tokens it overlaps with.

Train only the embeddings and LM head (freeze the rest) for 1 epoch, then compare against the baseline on RSG or in-house tests, plus latency and memory.
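The embedding-initialization step can be sketched in a few lines. This digest only says new tokens are initialized "by averaging overlap", so the version below is one plausible reading and not the authors' confirmed procedure. It makes three assumptions: tokens shared by both vocabs copy their old vector, other tokens take the mean of the old-vocab pieces they split into, and tokens with no usable pieces fall back to the global mean. `init_new_embeddings` and `toy_old_tokenize` are hypothetical names.

```python
from typing import Callable, Dict, List

Vector = List[float]


def init_new_embeddings(
    new_vocab: List[str],
    old_embeddings: Dict[str, Vector],
    old_tokenize: Callable[[str], List[str]],
) -> Dict[str, Vector]:
    """Initialize each new token from the old embeddings of its pieces (assumed scheme)."""
    dim = len(next(iter(old_embeddings.values())))
    # Assumed fallback for tokens with no usable old pieces: mean of all old vectors.
    global_mean = [
        sum(vec[i] for vec in old_embeddings.values()) / len(old_embeddings)
        for i in range(dim)
    ]
    new_embeddings: Dict[str, Vector] = {}
    for token in new_vocab:
        if token in old_embeddings:
            # Token exists in both vocabs: keep its original vector.
            new_embeddings[token] = list(old_embeddings[token])
            continue
        pieces = [p for p in old_tokenize(token) if p in old_embeddings]
        if not pieces:
            new_embeddings[token] = list(global_mean)
            continue
        # Average the old vectors of the pieces the new token splits into.
        new_embeddings[token] = [
            sum(old_embeddings[p][i] for p in pieces) / len(pieces) for i in range(dim)
        ]
    return new_embeddings


# Toy old vocab with 2-dimensional vectors, for illustration only.
old = {"при": [1.0, 0.0], "вет": [0.0, 1.0], "a": [2.0, 2.0]}


def toy_old_tokenize(text: str) -> List[str]:
    """Greedy longest-match split over the toy vocab (stand-in for the real old tokenizer)."""
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in old:
                pieces.append(text[i:j])
                i = j
                break
        else:
            i += 1  # skip characters the toy vocab cannot cover
    return pieces


result = init_new_embeddings(["привет", "a", "я"], old, toy_old_tokenize)
print(result)
```

In a real run the same averaging would be applied to both the input embedding matrix and the LM head rows. The rare-token risk noted under Failure Modes shows up here as tokens that land on the fallback branch.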

Optimization Features

Token Efficiency

Unigram preserves morphology with comparable token counts

fewer tokens -> faster generation
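Token efficiency can be checked before any training by measuring tokens per whitespace-separated word on a sample of target-language text. The sketch below is a generic measurement, not the paper's evaluation code; `char_level` and `word_level` are toy stand-ins for the original and candidate tokenizers.

```python
from typing import Callable, Iterable, List


def tokens_per_word(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("sample contains no words")
    return total_tokens / total_words


# Toy tokenizers, hypothetical: one splits into characters, one into whole words.
def char_level(text: str) -> List[str]:
    return [c for c in text if not c.isspace()]


def word_level(text: str) -> List[str]:
    return text.split()


sample = ["привет мир", "как дела"]
baseline = tokens_per_word(sample, char_level)
candidate = tokens_per_word(sample, word_level)
print(f"baseline: {baseline:.2f} tokens/word, candidate: {candidate:.2f} tokens/word")
```

With real tokenizers, any callable that maps a string to a list of tokens can be passed in, for example a Hugging Face tokenizer's `tokenize` method. A lower ratio means fewer decoding steps for the same text.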

Infra Optimization

shorter wall-clock time reduces GPU hours

Model Optimization

vocabulary substitution

System Optimization

use FP16 weights and FlashAttention2 for speed

Training Optimization

freeze-base-model, train embeddings/LM head only

LoRA
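Freezing the base model while training only the embeddings and LM head comes down to choosing which parameters receive gradients. A framework-free sketch of that selection follows. It assumes Hugging Face style LLaMA parameter names (`model.embed_tokens.weight`, `lm_head.weight`); other checkpoints may use different names.

```python
from typing import Dict, Iterable, Tuple

# Assumed name fragments for the layers that stay trainable.
TRAINABLE_MARKERS: Tuple[str, ...] = ("embed_tokens", "lm_head")


def trainable_flags(
    param_names: Iterable[str],
    markers: Tuple[str, ...] = TRAINABLE_MARKERS,
) -> Dict[str, bool]:
    """Map each parameter name to True if it should be trained, False if frozen."""
    return {name: any(marker in name for marker in markers) for name in param_names}


# Illustrative subset of parameter names from a LLaMA-style checkpoint.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.norm.weight",
    "lm_head.weight",
]
flags = trainable_flags(names)
trainable = [name for name, is_on in flags.items() if is_on]
print(trainable)
```

With a PyTorch model, the flags could be applied by iterating over `model.named_parameters()` and calling `param.requires_grad_(flags[name])` on each parameter. Printing the trainable list before launching a run is a cheap check that nothing else was left unfrozen.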

Inference Optimization

reduced token count per text

lower memory footprint

faster sampling with ForceTokensLogitsProcessor

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Only embeddings and LM head were trained; full model effects unknown.

Continued pre-training lasted one epoch; authors note more training could change results.

When Not To Use

If the base model already has high pretraining coverage for the target language.

When you require end-to-end retraining of model weights rather than lightweight adaptation.

Failure Modes

Unigram may increase token length in edge cases and hurt some memory/latency trade-offs.

Embedding initialization by averaging could misrepresent rare tokens.

Core Entities

Models

LLaMa-7b

Saiga7b (instruction-tuned LLaMa variant)

Metrics

mean score on Russian SuperGLUE

Accuracy

inference time (s)

fine-tune wall-clock time (h)

memory consumption

Datasets

Russian SuperGLUE

Custom Russian corpus (3.5B words, 9M documents)

RuMorphsWords (morphological labels)
