Overview
Production Readiness
0.6
Novelty Score
0.3
Cost Impact Score
0.7
Citation Count
1
Why It Matters For Business
Fine-tuning a midsize LLM on your own translation memories can give big, focused quality gains — especially for low-resource languages — but only if you have enough in-domain data (roughly ≥5k examples).
Summary TLDR
The authors fine-tuned Llama 3 8B Instruct on company translation memories (TMs) for five English→X directions and varied training sizes from 1k to 100k+. Using QLoRA + LoRA (4-bit) and standard MT metrics (BLEU, chrF++, TER, COMET), they find fine-tuning hurts when using only 1–2k examples, improves from 5k upward, and gives the largest gains at 100k+ (avg BLEU +13.7, COMET +25 vs baseline). Low-resource languages (Korean) benefit most. Caveats: narrow domain, potential pretraining/test-set leakage, and limited human evaluation.
Problem Statement
Companies with translation memories want to know how much in-house data is needed to fine-tune a midsize LLM for better, faster, organisation-specific translation while keeping cost and time reasonable.
Main Contribution
Empirical study of fine-tuning Llama 3 8B on real in-house translation memories across five target languages and multiple dataset sizes (1k, 2k, 5k, 10k, 14.7k, 100k+).
Practical fine-tuning recipe: QLoRA (4-bit) + LoRA PEFT on 4× A100 GPUs; inference via CTranslate2 (8‑bit).
Quantified thresholds and returns: small sets (1–2k) often degrade quality; gains start at ≈5k and grow with data, with large gains at 100k+; human post-edit of 100 segments confirms qualitative issues (ambiguity, context).
Key Findings
Large-scale fine-tuning yields substantial metric gains versus the out-of-the-box model.
Very small fine-tuning sets can hurt translation quality.
A practical minimum appears around 5k aligned in-domain examples.
Low-resource directions benefit proportionally more from in-house fine-tuning.
Results
BLEU (avg across languages)
COMET (avg across languages)
Performance drop for very small sets
14.7k aligned set (avg change)
Who Should Care
What To Try In 7 Days
Inventory your TMs and count aligned segments per language; flag languages with ≥5k and those below 5k.
Run a small fine-tune pilot: use QLoRA + LoRA on 10k aligned segments for a key language and compare BLEU/COMET to the baseline.
If you only have <5k, avoid blind fine-tuning; try hyperparameter regularisation or augment data before tuning.
Optimization Features
Token Efficiency
- Structured JSON output to simplify post-processing
Model Optimization
- 4-bit quantisation (nf4) for weights
- LoRA
System Optimization
- Training on 4× A100 GPUs; largest runs ≈2.3–2.4 hours reported
Training Optimization
- LoRA
- SFT
Inference Optimization
- CTranslate2 conversion with 8-bit quantisation
- sampling top-k=1 for deterministic outputs
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Data comes from a single company's narrow software-domain TMs; results may not generalise to broad domains.
- Possible pretraining/test-set leakage acknowledged; gains may be overestimated.
- Human evaluation is small (100 post-edited segments) and qualitative.
When Not To Use
- If you only have 1–2k aligned high-quality segments and cannot tune hyperparameters — naive fine-tuning often hurts.
- When you need broad-domain translation rather than a narrow, domain-specific model.
- If you require strong document-level context — the model translates segments in isolation and can miss cross-segment info.
Failure Modes
- Overfitting and metric drop when fine-tuning on very small sets (1–2k).
- Overgeneration and stray HTML/tokens, especially with small-data models.
- Poor handling of ambiguous segments lacking context; human rework required.
Core Entities
Models
- Llama 3 8B Instruct
- GPT-3.5 (baseline comparator)
Metrics
- BLEU
- chrF++
- TER
- COMET
- COMET-Kiwi
Datasets
- In-house translation memories (software sector)
- Aligned subsets: 1k, 2k, 5k, 10k, 14.7k
- Full training sets per language: 100k+ ranges (107k–223k)
Benchmarks
- SacreBLEU evaluation suite
- COMET
- COMET-Kiwi (quality-estimation on training data)

