Overview
Production Readiness
0.7
Novelty Score
0.4
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
Replacing an English-focused tokenizer with a language-specific Unigram vocabulary can improve non-English accuracy and cut fine-tuning and inference costs, which shortens time-to-market and lowers cloud bills for localized LLM products.
Summary TLDR
The authors adapt LLaMa-7b to Russian by replacing its tokenizer with target-language vocabularies (Unigram and BPE), reinitializing the embeddings and LM head, and training only those layers for one epoch on a 3.5B-word Russian corpus. The Unigram tokenizer preserves word roots better, yields consistent quality gains on Russian SuperGLUE (fine-tune mean 0.704 vs 0.681 baseline; zero-shot Saiga mean 0.509 vs 0.445 baseline), and is preferred in human pairwise tests. It also reduces compute: fine-tuning ran in 20h instead of 27h (~35% faster, about 26% less wall-clock time), and generating a 15-sentence output took ~17s instead of ~27s (~60% faster, about 37% less time). Results are specific to LLaMa-7b, one epoch of embedding tuning, and the datasets reported.
Problem Statement
Pretrained LLaMa uses an English-centered tokenizer that fragments Russian words into many short pieces. This harms both accuracy and efficiency on Russian downstream and instruction tasks. The paper asks whether swapping in a Russian tokenizer and reinitializing the embeddings improves quality and reduces compute.
Main Contribution
Compare BPE and Unigram tokenizers for Russian adaptation of LLaMa-7b.
Show that Unigram preserves morphological roots better and gives the largest downstream gains.
Provide a procedure for substituting the embeddings and LM head, followed by continued training of only those layers.
Measure automatic benchmark scores (Russian SuperGLUE), human preferences, and runtime and memory gains.
Key Findings
Unigram tokenization preserves word roots better than BPE.
Fine-tuned LLaMa with Unigram vocab improves Russian SuperGLUE mean score.
Instruction-tuned (Saiga) zero-shot performance rises after Unigram substitution.
Human annotators preferred Unigram-adapted Saiga outputs over original Saiga.
Vocabulary substitution reduces training and inference cost.
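To make the first finding more concrete, here is a minimal sketch of how a Unigram tokenizer chooses a segmentation: it picks the split whose pieces have the highest total log-probability (Viterbi decoding). The vocabulary and its probabilities are invented for illustration and do not come from the paper; a real Unigram model would learn them from a corpus.

```python
import math

def unigram_segment(word, log_probs):
    """Return the highest-scoring segmentation of `word` under a unigram model.

    `log_probs` maps each vocabulary piece to its log-probability.
    """
    n = len(word)
    # best[i] = (score of the best segmentation of word[:i], start index of its last piece)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError(f"cannot segment {word!r} with this vocabulary")
    # Walk the back-pointers from the end of the word to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]

# Hypothetical toy vocabulary; the probabilities are made up for this example.
toy_vocab = {piece: math.log(p) for piece, p in {
    "книг": 0.05, "ами": 0.04, "а": 0.03, "кни": 0.01, "га": 0.01, "ми": 0.01,
    "к": 0.002, "н": 0.002, "и": 0.002, "г": 0.002, "м": 0.002,
}.items()}

print(unigram_segment("книгами", toy_vocab))
print(unigram_segment("книга", toy_vocab))
```

With these toy probabilities, both word forms keep the shared root as a single piece, which is the kind of behavior the finding describes.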
Results
LoRA
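RSG mean (Saiga zero-shot): 0.509 with Unigram vs 0.445 baseline
Fine-tune wall-clock time: 20h with Unigram vs 27h baseline
Inference time (15-sentence story): ~17s with Unigram vs ~27s baseline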
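Timing gains can be quoted either as a speedup ratio or as a share of time saved, and the two give different percentages. The short sketch below computes both for the reported figures (27h to 20h for fine-tuning, ~27s to ~17s for generation); the function names are illustrative, and the inputs are the rounded values from the summary.

```python
def time_reduction(before, after):
    """Fraction of the original time that was saved."""
    return (before - after) / before

def speedup(before, after):
    """How many times faster the new run is."""
    return before / after

reported = [("fine-tuning (h)", 27, 20), ("generation (s)", 27, 17)]
for label, before, after in reported:
    print(
        f"{label}: {time_reduction(before, after):.0%} less time, "
        f"{speedup(before, after):.2f}x speedup"
    )
```

Read this way, the "~35%" and "~60%" figures correspond to speedup ratios of roughly 1.35x and 1.6x, while the time saved is closer to 26% and 37%.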
Who Should Care
What To Try In 7 Days
Train a Unigram SentencePiece tokenizer on a representative Russian corpus.
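Rebuild the embeddings and LM head for the new vocabulary, initializing each new token's vector by averaging the original embeddings of the tokens it overlaps with.
Train only the embeddings and LM head (freeze the rest) for 1 epoch, then compare RSG or in-house test scores, latency, and memory against the original model.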
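The embedding initialization step can be sketched as follows. This is one plausible reading of "averaging overlap", not a confirmed reproduction of the authors' code: each new token string is split by the old tokenizer, and its new vector is the mean of the corresponding old rows. The helper names, the zero-vector fallback, and the toy data are all assumptions made for illustration.

```python
def init_new_embeddings(new_vocab, old_tokenize, old_embeddings):
    """Build one embedding row per new token by averaging old rows.

    new_vocab:      list of token strings in the new tokenizer
    old_tokenize:   callable mapping a string to a list of old token ids
    old_embeddings: list of vectors (lists of floats), indexed by old token id
    """
    dim = len(old_embeddings[0])
    new_embeddings = []
    for token in new_vocab:
        old_ids = old_tokenize(token)
        if not old_ids:
            # Assumption: with no overlap at all, fall back to a zero vector.
            new_embeddings.append([0.0] * dim)
            continue
        rows = [old_embeddings[i] for i in old_ids]
        # Column-wise mean over the old rows that cover this token.
        new_embeddings.append([sum(col) / len(rows) for col in zip(*rows)])
    return new_embeddings

# Toy setup: the "old" tokenizer is character-level over three known characters.
old_vocab = {"к": 0, "о": 1, "т": 2}
old_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

def old_tokenize(text):
    return [old_vocab[ch] for ch in text if ch in old_vocab]

new_rows = init_new_embeddings(["ко", "т", "я"], old_tokenize, old_embeddings)
print(new_rows)
```

In a real model the same averaging would be applied to both the input embedding matrix and the LM head, since both are indexed by token id.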
Optimization Features
Token Efficiency
- Unigram preserves morphology with comparable token counts
- fewer tokens -> faster generation
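A simple way to check token efficiency on your own data is to measure tokens per word for each candidate tokenizer. The sketch below uses two stand-in tokenizers (character-level and greedy longest-match) purely so that it runs on its own; neither is the original LLaMa tokenizer nor a trained Unigram model, and in practice you would pass in the real tokenizers' encode functions.

```python
def tokens_per_word(tokenize, texts):
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = sum(len(tokenize(text)) for text in texts)
    total_words = sum(len(text.split()) for text in texts)
    return total_tokens / total_words

def char_tokenize(text):
    """Stand-in for a tokenizer that fragments words heavily."""
    return [ch for ch in text if not ch.isspace()]

def make_greedy_tokenizer(vocab):
    """Stand-in for a target-language tokenizer: greedy longest match."""
    def tokenize(text):
        tokens = []
        for word in text.split():
            i = 0
            while i < len(word):
                # Try the longest piece first; a single character always matches.
                for j in range(len(word), i, -1):
                    if word[i:j] in vocab or j == i + 1:
                        tokens.append(word[i:j])
                        i = j
                        break
        return tokens
    return tokenize

sample = ["кот спит"]
russian_tokenize = make_greedy_tokenizer({"кот", "спит"})

print(tokens_per_word(char_tokenize, sample))
print(tokens_per_word(russian_tokenize, sample))
```

A lower tokens-per-word value means shorter sequences for the same text, which is where the generation speedup comes from.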
Infra Optimization
- shorter wall-clock time reduces GPU hours
Model Optimization
- vocabulary substitution
System Optimization
- use FP16 weights and FlashAttention2 for speed
Training Optimization
- freeze the base model; train only the embeddings and LM head
- LoRA
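The freezing strategy above can be expressed as a rule over parameter names. The sketch below assumes Hugging Face style LLaMa parameter names (`embed_tokens`, `lm_head`), which the source does not specify; the helper name `select_trainable` is hypothetical, and the function is kept framework-free so it runs without a model loaded.

```python
def select_trainable(param_names, trainable_keys=("embed_tokens", "lm_head")):
    """Map each parameter name to True (train) or False (freeze)."""
    return {
        name: any(key in name for key in trainable_keys)
        for name in param_names
    }

# Example names, assuming the Hugging Face LLaMa naming scheme.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.norm.weight",
    "lm_head.weight",
]
plan = select_trainable(names)
print(plan)

# With PyTorch, the plan could then be applied like this:
#   for name, param in model.named_parameters():
#       param.requires_grad = plan[name]
```

Only the two vocabulary-dependent matrices end up trainable, so optimizer state is needed for a small fraction of the model's parameters.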
Inference Optimization
- reduced token count per text
- lower memory footprint
- faster sampling with ForceTokensLogitsProcessor
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Only embeddings and LM head were trained; full model effects unknown.
- Continued pre-training lasted one epoch; authors note more training could change results.
- The dataset link is hidden; replication requires assembling a similar Russian corpus.
- Experiments were run only on LLaMa-7b; results may not transfer to other sizes or architectures.
When Not To Use
- If the base model already has high pretraining coverage for the target language.
- When you require end-to-end retraining of model weights rather than lightweight adaptation.
- If you cannot assemble a representative target-language corpus for tokenizer training.
Failure Modes
- Unigram may produce longer token sequences in edge cases, which can hurt memory and latency.
- Embedding initialization by averaging could misrepresent rare tokens.
- Improvements may vanish if downstream tasks differ greatly from the training corpus.
Core Entities
Models
- LLaMa-7b
- Saiga7b (instruction-tuned LLaMa variant)
Metrics
- mean score on Russian SuperGLUE
- Accuracy
- inference time (s)
- fine-tune wall-clock time (h)
- memory consumption
Datasets
- Russian SuperGLUE
- Custom Russian corpus (3.5B words, 9M documents)
- RuMorphsWords (morphological labels)
Benchmarks
- Russian SuperGLUE
Context Entities
Models
- ruGPT-3.5
- ruT5-large

