Overview
The experiments use a realistic in-house translation memory (TM) and standard metrics, but the results are scoped to one organisation and one model family, and may be inflated by pretraining/test-set leakage.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 30%
Why It Matters For Business
Fine-tuning a midsize LLM on your own translation memories can yield large, domain-specific quality gains, especially for low-resource languages, but only if you have enough in-domain data (roughly 5k examples or more).
Summary TLDR
The authors fine-tuned Llama 3 8B Instruct on company translation memories (TMs) for five English→X directions, varying the training set size from 1k to 100k+ segments. Using QLoRA (LoRA adapters on a 4-bit quantised base model) and standard MT metrics (BLEU, chrF++, TER, COMET), they find that fine-tuning hurts quality with only 1–2k examples, helps from 5k upward, and gives the largest gains at 100k+ (average BLEU +13.7, COMET +25 vs baseline). Low-resource languages (Korean) benefit most. Caveats: narrow domain, potential pretraining/test-set leakage, and limited human evaluation.
Problem Statement
Companies with translation memories want to know how much in-house data is needed to fine-tune a midsize LLM for better, faster, organisation-specific translation while keeping cost and time reasonable.
Main Contribution
Empirical study of fine-tuning Llama 3 8B on real in-house translation memories across five target languages and multiple dataset sizes (1k, 2k, 5k, 10k, 14.7k, 100k+).
Practical fine-tuning recipe: QLoRA (4-bit) + LoRA PEFT on 4× A100 GPUs; inference via CTranslate2 (8‑bit).
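The dataset-size sweep described above can be sketched in a few lines. This is a minimal illustration rather than the authors' pipeline: the helper name `make_subsets`, the fixed seed, and the choice to nest smaller subsets inside larger ones are all assumptions, since the summary only states the sizes.

```python
import random

# Smaller sizes from the sweep; pass other sizes explicitly as needed.
DEFAULT_SIZES = (1_000, 2_000, 5_000, 10_000)

def make_subsets(pairs, sizes=DEFAULT_SIZES, seed=0):
    """Return {size: list of (source, target) pairs}.

    Assumption for this sketch: subsets are nested, i.e. every smaller
    set is contained in each larger one.
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    # Prefixes of one shuffled list are nested by construction.
    # Sizes larger than the available data are skipped.
    return {n: shuffled[:n] for n in sizes if n <= len(shuffled)}
```

Nesting keeps comparisons between sizes cleaner, because a larger run sees everything a smaller run saw plus additional segments.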
Key Findings
Fine-tuning on the 100k+ sets yields substantial gains over the out-of-the-box model (average +13.7 BLEU, +25 COMET).
Very small fine-tuning sets (1–2k examples) can hurt translation quality relative to the baseline.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| BLEU (avg across languages) | +13.7 (100k+ vs baseline) | baseline Llama 3 8B | +13.7 | 100k+ full training sets | Section 3: '100k+ datasets... average increase of 13.7 BLEU' (Table 4) | Table 4 |
| COMET (avg across languages) | +25 (100k+ vs baseline) | baseline Llama 3 8B | +25 | 100k+ full training sets | Section 3: '100k+ datasets... average increase of 25 COMET' (Table 4) | Table 4 |
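The table reports deltas averaged across languages. As a rough sketch of that arithmetic, the hypothetical helper below averages per-language differences between a fine-tuned and a baseline score; the function name and the input layout (one score per language) are assumptions, and no per-language figures from the paper are reproduced here.

```python
def average_delta(baseline, tuned):
    """Return (mean delta, per-language deltas) for tuned minus baseline.

    Both arguments map a language code to a single metric score.
    """
    if not baseline or baseline.keys() != tuned.keys():
        raise ValueError("baseline and tuned must cover the same, non-empty set of languages")
    deltas = {lang: tuned[lang] - baseline[lang] for lang in baseline}
    return sum(deltas.values()) / len(deltas), deltas
```

Keeping the per-language deltas alongside the mean makes it easier to spot a single language driving the average.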
What To Try In 7 Days
Inventory your TMs and count aligned segments per language; flag languages with ≥5k and those below 5k.
Run a small fine-tune pilot: use QLoRA + LoRA on 10k aligned segments for a key language and compare BLEU/COMET to the baseline.
If you have fewer than 5k segments, avoid blind fine-tuning; tune hyperparameters for stronger regularisation or augment the data first.
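The inventory step above could start from something as simple as the following sketch. The function name `triage` and the input format (a mapping from language code to aligned-segment count) are assumptions; the 5k threshold is the rough figure quoted in this summary.

```python
MIN_SEGMENTS = 5_000  # rough threshold quoted in the summary

def triage(segment_counts, threshold=MIN_SEGMENTS):
    """Split languages into fine-tuning candidates and data-poor ones.

    segment_counts maps a language code to its number of aligned segments.
    """
    ready = sorted(lang for lang, n in segment_counts.items() if n >= threshold)
    short = sorted(lang for lang, n in segment_counts.items() if n < threshold)
    return ready, short
```

Languages in the second list are the ones where augmentation or more careful tuning would come before any fine-tuning pilot.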
Risks & Boundaries
Limitations
Data comes from a single company's narrow software-domain TMs; results may not generalise to broad domains.
The authors acknowledge possible pretraining/test-set leakage, so gains may be overestimated.
When Not To Use
If you have only 1–2k aligned, high-quality segments and cannot tune hyperparameters: naive fine-tuning often hurts.
When you need broad-domain translation rather than a narrow, domain-specific model.
Failure Modes
Overfitting and metric drop when fine-tuning on very small sets (1–2k).
Overgeneration and stray HTML/tokens, especially with small-data models.
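The last two failure modes can be screened for with a crude heuristic such as the one below. It is only a sketch: the function name, the tag regex, and the character-length ratio of 2.0 are assumptions, not something the paper describes, and a character ratio is a particularly rough proxy across scripts (for example English to Korean).

```python
import re

# Matches simple opening or closing tags such as <b> or </p>.
HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")

def flag_output(source, hypothesis, max_len_ratio=2.0):
    """Return warning labels for one translated segment."""
    warnings = []
    # A tag in the output with none in the source is likely stray.
    if HTML_TAG.search(hypothesis) and not HTML_TAG.search(source):
        warnings.append("stray_html")
    # An output far longer than its source suggests overgeneration.
    if len(hypothesis) > max_len_ratio * max(len(source), 1):
        warnings.append("overgeneration")
    return warnings
```

Flagged segments are candidates for manual review rather than automatic rejection, since legitimate translations can also trip either check.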

