Fine-tuning Llama 3 8B on translation memories improves translations — gains appear reliably once you have ~5k in-domain examples

September 5, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The experiments use a realistic in-house TM and standard metrics, but the results are scoped to one organisation and one model family, and they may be inflated by pretraining/test-set leakage.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 30%

Authors

Inacio Vieira, Will Allred, Séamus Lankford, Sheila Castilho, Andy Way

Why It Matters For Business

Fine-tuning a midsize LLM on your own translation memories can deliver large, domain-specific quality gains, especially for low-resource languages, but only if you have enough in-domain data (roughly ≥5k examples).

Summary TLDR

The authors fine-tuned Llama 3 8B Instruct on company translation memories (TMs) for five English→X directions, varying the training-set size from 1k to 100k+ segments. Using QLoRA (LoRA adapters trained on top of a 4-bit quantised base model) and standard MT metrics (BLEU, chrF++, TER, COMET), they find that fine-tuning hurts with only 1–2k examples, helps from 5k upward, and gives the largest gains at 100k+ (avg BLEU +13.7, avg COMET +25 vs baseline). Lower-resource target languages such as Korean benefit most. Caveats: narrow domain, potential pretraining/test-set leakage, and limited human evaluation.
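To run a similar data-size sweep on your own TM, one option is to draw nested subsets so that each smaller set is contained in the larger ones. The sketch below is an assumption about how such a sweep could be set up, not the authors' sampling procedure; the function name `nested_subsets` and the toy data are hypothetical.

```python
import random

def nested_subsets(segments, sizes, seed=0):
    """Return {size: subset}; every smaller subset is a prefix of the larger ones."""
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps runs repeatable
    subsets = {}
    for size in sorted(sizes):
        if size > len(shuffled):
            break  # not enough data for this size or any larger one
        subsets[size] = shuffled[:size]
    return subsets

# Toy TM of (source, target) pairs; a real TM would come from a TMX or CSV export.
tm = [(f"source {i}", f"target {i}") for i in range(12_000)]
subsets = nested_subsets(tm, sizes=[1_000, 2_000, 5_000, 10_000, 14_700])
print({size: len(rows) for size, rows in subsets.items()})
```

Nesting the subsets means that a quality difference between, say, the 2k and 5k runs reflects the added data rather than a different random draw.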

Problem Statement

Companies with translation memories want to know how much in-house data is needed to fine-tune a midsize LLM for better, faster, organisation-specific translation while keeping cost and time reasonable.

Main Contribution

Empirical study of fine-tuning Llama 3 8B on real in-house translation memories across five target languages and multiple dataset sizes (1k, 2k, 5k, 10k, 14.7k, 100k+).

Practical fine-tuning recipe: QLoRA (4-bit) + LoRA PEFT on 4× A100 GPUs; inference via CTranslate2 (8‑bit).
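A rough back-of-the-envelope calculation shows why the recipe relies on 4-bit weights for training and 8-bit weights for inference. The sketch below counts memory for the weights alone and assumes a nominal 8 billion parameters; it ignores quantisation constants, LoRA adapter weights, activations and optimiser state, so real usage will be higher.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 8e9  # nominal "8B"; the exact parameter count differs slightly

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights shrink from about 16 GB at 16-bit precision to about 8 GB at 8-bit and about 4 GB at 4-bit.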

Key Findings

Large-scale fine-tuning yields substantial metric gains versus the out-of-the-box model.

Numbers: avg BLEU +13.7; avg COMET +25 (100k+ vs baseline)

Practical Use: If you can gather 100k+ domain-aligned TM segments, fine-tune the model; expect clear automatic-metric improvements and stronger domain fit.

Evidence Ref: Results section & Table 4

Very small fine-tuning sets can hurt translation quality.

Numbers: 1k/2k training sets show lower metrics than baseline (e.g., PT-BR BLEU 2k 46.04 < baseline 48.25)

Practical Use: Do not blindly fine-tune with only 1–2k segments; either collect more data, tune regularisation/hyperparameters, or use other adaptation strategies.

Evidence Ref: Section 3.1 and Table 4
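The PT-BR example can be turned into a simple regression check. The sketch below uses only the two BLEU scores quoted above; the helper names are hypothetical, and the `higher_is_better` flag is there because TER, unlike BLEU, chrF++ and COMET, improves as it goes down.

```python
def delta_vs_baseline(score, baseline):
    """Signed difference between a fine-tuned score and the baseline score."""
    return round(score - baseline, 2)

def is_regression(score, baseline, higher_is_better=True):
    """True if the fine-tuned score is worse than the baseline."""
    delta = score - baseline
    return delta < 0 if higher_is_better else delta > 0

baseline_bleu = 48.25  # PT-BR baseline BLEU quoted above
tuned_2k_bleu = 46.04  # PT-BR BLEU after fine-tuning on 2k segments

delta = delta_vs_baseline(tuned_2k_bleu, baseline_bleu)
print(f"PT-BR BLEU delta at 2k: {delta:+.2f}")
print("regression:", is_regression(tuned_2k_bleu, baseline_bleu))
```

For this pair the fine-tuned model sits 2.21 BLEU below the baseline, which is the kind of drop a pilot should catch before rollout.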

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
BLEU (avg across languages) | +13.7 (100k+ vs baseline) | baseline Llama 3 8B | +13.7 | 100k+ full training sets | Section 3: '100k+ datasets... average increase of 13.7 BLEU' (Table 4) | Table 4
COMET (avg across languages) | +25 (100k+ vs baseline) | baseline Llama 3 8B | +25 | 100k+ full training sets | Section 3: '100k+ datasets... average increase of 25 COMET' (Table 4) | Table 4

What To Try In 7 Days

Inventory your TMs and count aligned segments per language; flag languages with ≥5k and those below 5k.

Run a small fine-tune pilot: use QLoRA + LoRA on 10k aligned segments for a key language and compare BLEU/COMET to the baseline.

If you only have <5k segments, avoid blind fine-tuning; try stronger regularisation and hyperparameter tuning, or augment the data before tuning.
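The inventory step can be sketched in a few lines. The example below assumes the TM has already been exported as (language, source, target) tuples; that layout, the function names and the toy data are hypothetical, and the 5k threshold is the rule of thumb from the findings rather than a hard limit.

```python
from collections import Counter

MIN_SEGMENTS = 5_000  # rule-of-thumb threshold taken from the findings

def inventory(segments):
    """Count aligned segments per target language; each item is (lang, source, target)."""
    counts = Counter()
    for lang, source, target in segments:
        if source.strip() and target.strip():  # skip pairs with an empty side
            counts[lang] += 1
    return counts

def flag_languages(counts, threshold=MIN_SEGMENTS):
    """Label each language as ready for a pilot or needing more data."""
    return {
        lang: "pilot" if n >= threshold else "collect more data"
        for lang, n in counts.items()
    }

# Toy data: 6,000 Korean pairs, 1,200 PT-BR pairs and one unaligned PT-BR entry.
toy = [("ko", f"s{i}", f"t{i}") for i in range(6_000)]
toy += [("pt-BR", f"s{i}", f"t{i}") for i in range(1_200)]
toy += [("pt-BR", "orphan source", "")]

counts = inventory(toy)
plan = flag_languages(counts)
print(dict(counts))
print(plan)
```

Counting only pairs with both sides filled keeps half-empty TM entries from inflating a language past the threshold.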

Optimization Features

Token Efficiency: Structured JSON output to simplify post-processing
Model Optimization: 4-bit quantisation (nf4) for weights; LoRA
System Optimization: Training on 4× A100 GPUs; largest runs ≈2.3–2.4 hours reported
Training Optimization: LoRA; SFT
Inference Optimization: CTranslate2 conversion with 8-bit quantisation; sampling top-k=1 for deterministic outputs
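Sampling with top-k=1 is deterministic because the candidate pool shrinks to the single highest-scoring token, which makes it equivalent to greedy decoding. The toy sampler below illustrates that property on made-up logits; it is a standalone sketch and not CTranslate2's implementation.

```python
import math
import random

def sample_top_k(logits, k, rng):
    """Sample an index from the k highest-scoring entries, weighted by softmax."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = logits[top[0]]  # subtract the max for numerical stability
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

logits = [0.1, 2.3, -1.0, 1.9]  # made-up scores for four candidate tokens
greedy = max(range(len(logits)), key=lambda i: logits[i])

# With k=1 every seed picks the same token as greedy decoding.
draws_k1 = {sample_top_k(logits, k=1, rng=random.Random(seed)) for seed in range(100)}
print(draws_k1 == {greedy})
```

With a larger k the same function can return different tokens for different seeds, which is why k=1 is the setting to use when outputs must be repeatable.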

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Data comes from a single company's narrow software-domain TMs; results may not generalise to broad domains.

Possible pretraining/test-set leakage acknowledged; gains may be overestimated.

When Not To Use

If you only have 1–2k aligned high-quality segments and cannot tune hyperparameters; naive fine-tuning often hurts in that setting.

When you need broad-domain translation rather than a narrow, domain-specific model.

Failure Modes

Overfitting and metric drop when fine-tuning on very small sets (1–2k).

Overgeneration and stray HTML/tokens, especially with small-data models.
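A light post-processing step can catch some of these failures when the model is asked for structured JSON output. The sketch below assumes the model returns a JSON object with a `translation` field; that key name, the length-ratio threshold and the function names are hypothetical and would need adjusting to your own prompt and data.

```python
import json
import re

TAG_RE = re.compile(r"<[^>]+>")

def extract_translation(raw_output, key="translation"):
    """Return the `key` field of the first JSON object in the output, or None."""
    start = raw_output.find("{")
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON value and ignores any trailing text.
        obj, _end = json.JSONDecoder().raw_decode(raw_output[start:])
    except json.JSONDecodeError:
        return None
    value = obj.get(key) if isinstance(obj, dict) else None
    return value if isinstance(value, str) else None

def clean(text, source, max_ratio=3.0):
    """Strip stray tags and flag outputs that are much longer than the source."""
    if not TAG_RE.search(source):  # keep tags when the source itself has markup
        text = TAG_RE.sub("", text)
    text = text.strip()
    overgenerated = len(text) > max_ratio * max(len(source), 1)
    return text, overgenerated

raw = 'Sure! {"translation": "<b>Olá</b> mundo"} and some stray tokens'
translation = extract_translation(raw)
cleaned, overgenerated = clean(translation, source="Hello world")
print(cleaned, overgenerated)
```

Tags are stripped only when the source segment has none, because software TMs often carry legitimate markup that must survive translation.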

Core Entities

Models

Llama 3 8B Instruct, GPT-3.5 (baseline comparator)

Metrics

BLEU, chrF++, TER, COMET, COMET-Kiwi

Datasets

In-house translation memories (software sector)

Aligned subsets: 1k, 2k, 5k, 10k, 14.7k

Full training sets per language: 100k+ ranges (107k–223k)

Benchmarks

SacreBLEU evaluation suite, COMET, COMET-Kiwi (quality estimation on training data)