Fine-tuning Llama 3 8B on translation memories improves translations — gains appear reliably once you have ~5k in-domain examples

Overview

Decision Snapshot: Ready For Pilot

The experiments use a realistic in-house TM and standard metrics, but the results are scoped to one organisation and one model family, and may be inflated by possible pretraining/test-set leakage.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 30%

Authors

Inacio Vieira, Will Allred, Séamus Lankford, Sheila Castilho, Andy Way

Links

Abstract / PDF

Why It Matters For Business

Fine-tuning a midsize LLM on your own translation memories can give big, focused quality gains — especially for low-resource languages — but only if you have enough in-domain data (roughly ≥5k examples).

Who Should Care

ML Engineer, Product Manager, CTO

Summary TLDR

The authors fine-tuned Llama 3 8B Instruct on company translation memories (TMs) for five English→X directions and varied training sizes from 1k to 100k+. Using QLoRA + LoRA (4-bit) and standard MT metrics (BLEU, chrF++, TER, COMET), they find fine-tuning hurts when using only 1–2k examples, improves from 5k upward, and gives the largest gains at 100k+ (avg BLEU +13.7, COMET +25 vs baseline). Low-resource languages (Korean) benefit most. Caveats: narrow domain, potential pretraining/test-set leakage, and limited human evaluation.

Problem Statement

Companies with translation memories want to know how much in-house data is needed to fine-tune a midsize LLM for better, faster, organisation-specific translation while keeping cost and time reasonable.

Main Contribution

Empirical study of fine-tuning Llama 3 8B on real in-house translation memories across five target languages and multiple dataset sizes (1k, 2k, 5k, 10k, 14.7k, 100k+).

Practical fine-tuning recipe: QLoRA (4-bit) + LoRA PEFT on 4× A100 GPUs; inference via CTranslate2 (8‑bit).
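The recipe above might be set up roughly as follows with the Hugging Face `transformers` and `peft` libraries. This is a minimal sketch, not the authors' code: the summary only reports 4-bit (nf4) QLoRA with LoRA adapters, so the rank, alpha, dropout, target modules, double quantisation, compute dtype and `device_map` below are illustrative assumptions.

```python
# Sketch of a 4-bit QLoRA setup. Values marked "assumption" are not
# reported in the summary and are placeholders only.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

QUANT_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",       # nf4 is named in the summary
    "bnb_4bit_use_double_quant": True,  # assumption
}

LORA_KWARGS = {
    "r": 16,               # assumption
    "lora_alpha": 32,      # assumption
    "lora_dropout": 0.05,  # assumption
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    "task_type": "CAUSAL_LM",
}


def build_qlora_model(model_id: str = MODEL_ID):
    """Load the base model in 4-bit and attach trainable LoRA adapters."""
    # Heavy imports are deferred so the settings above can be inspected
    # on a machine without the GPU stack installed.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    quant_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumption
        **QUANT_KWARGS,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # assumption
    )
    model = prepare_model_for_kbit_training(model)
    return get_peft_model(model, LoraConfig(**LORA_KWARGS))
```

Keeping the settings in plain dictionaries makes it easy to log them alongside each training-size run, which helps when comparing the 1k to 100k+ experiments.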

Key Findings

Large-scale fine-tuning yields substantial metric gains versus the out-of-the-box model.

Numbers: avg BLEU +13.7; avg COMET +25 (100k+ vs baseline)

Practical Use: If you can gather 100k+ domain-aligned TM segments, fine-tune the model and expect clear automatic-metric improvements and stronger domain fit.

Evidence Ref: Results section & Table 4

Very small fine-tuning sets can hurt translation quality.

Numbers: 1k/2k training sets show lower metrics than baseline (e.g., PT-BR BLEU 2k 46.04 < baseline 48.25)

Practical Use: Do not blindly fine-tune with only 1–2k segments; either collect more data, tune regularisation/hyperparameters, or use other adaptation strategies.

Evidence Ref: Section 3.1 and Table 4
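The baseline comparison behind both findings is simple arithmetic. The sketch below uses the PT-BR BLEU figures quoted above; the function names are hypothetical, and the `higher_is_better` switch is there because TER, unlike BLEU, chrF++ and COMET, improves as it goes down.

```python
def metric_delta(fine_tuned: float, baseline: float) -> float:
    """Signed difference between the fine-tuned and baseline scores."""
    return round(fine_tuned - baseline, 2)


def beats_baseline(fine_tuned: float, baseline: float,
                   higher_is_better: bool = True) -> bool:
    """True if the fine-tuned score is strictly better than the baseline."""
    delta = fine_tuned - baseline
    return delta > 0 if higher_is_better else delta < 0


# PT-BR BLEU with 2k training segments versus the out-of-the-box model.
print(metric_delta(46.04, 48.25))    # -2.21
print(beats_baseline(46.04, 48.25))  # False
```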

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
BLEU (avg across languages)	+13.7 (100k+ vs baseline)	baseline Llama 3 8B	+13.7	100k+ full training sets	Section 3: '100k+ datasets... average increase of 13.7 BLEU' (Table 4)	Table 4
COMET (avg across languages)	+25 (100k+ vs baseline)	baseline Llama 3 8B	+25	100k+ full training sets	Section 3: '100k+ datasets... average increase of 25 COMET' (Table 4)	Table 4

What To Try In 7 Days

Inventory your TMs and count aligned segments per language; flag languages with ≥5k and those below 5k.

Run a small fine-tune pilot: use QLoRA + LoRA on 10k aligned segments for a key language and compare BLEU/COMET to the baseline.

If you only have <5k segments, avoid blind fine-tuning; tune hyperparameters and regularisation, or augment the data, before training.
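The inventory step in the list above can be sketched as a short script. It assumes the TM has already been exported as `(target_language, source, target)` rows; that row layout, the function names and the status labels are hypothetical, while the 5k threshold comes from the guidance in this summary.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

MIN_SEGMENTS = 5_000  # rough threshold from the summary's ~5k guidance


def inventory(segments: Iterable[Tuple[str, str, str]]) -> Counter:
    """Count aligned segments per target language, skipping empty sides."""
    counts: Counter = Counter()
    for lang, source, target in segments:
        if source.strip() and target.strip():
            counts[lang] += 1
    return counts


def flag_languages(counts: Counter,
                   threshold: int = MIN_SEGMENTS) -> Dict[str, str]:
    """Label each language by whether it meets the segment threshold."""
    return {
        lang: "fine-tune candidate" if n >= threshold else "needs more data"
        for lang, n in counts.items()
    }


# Toy data with a lowered threshold so the example is easy to follow.
rows = [("ko", "Save", "저장")] * 3 + [("pt-BR", "Open", "Abrir")]
print(flag_languages(inventory(rows), threshold=2))
```

A real inventory would probably also deduplicate segments and filter by quality before counting, which this sketch leaves out.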

Optimization Features

Token Efficiency

Structured JSON output to simplify post-processing
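The summary does not show the prompt or the JSON schema, so the following is only a sketch of how structured output can simplify post-processing. The `{"translation": ...}` schema, the prompt wording and both function names are assumptions.

```python
import json
from typing import Optional


def build_prompt(source_text: str, target_lang: str) -> str:
    """Ask for a JSON-only answer (hypothetical prompt wording)."""
    return (
        f"Translate the following text from English to {target_lang}. "
        'Respond only with JSON of the form {"translation": "<text>"}.\n'
        f"Text: {json.dumps(source_text, ensure_ascii=False)}"
    )


def parse_translation(raw_output: str) -> Optional[str]:
    """Return the translation from the first JSON object in the output."""
    start = raw_output.find("{")
    if start == -1:
        return None
    try:
        # raw_decode stops at the end of the first JSON value, so any
        # text the model generates after the object is ignored.
        obj, _ = json.JSONDecoder().raw_decode(raw_output[start:])
    except json.JSONDecodeError:
        return None
    value = obj.get("translation") if isinstance(obj, dict) else None
    return value if isinstance(value, str) else None


print(parse_translation('{"translation": "Salvar"} and some extra text'))
```

Parsing only the first JSON object gives a cheap guard against trailing overgeneration, since anything after the closing brace is dropped.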

Model Optimization

4-bit quantisation (nf4) for weights, LoRA

System Optimization

Training on 4× A100 GPUs; largest runs ≈2.3–2.4 hours reported

Training Optimization

LoRA, SFT

Inference Optimization

CTranslate2 conversion with 8-bit quantisation

Sampling top-k=1 for deterministic outputs
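A minimal sketch of this inference path with the `ctranslate2` Python package follows. It assumes the LoRA adapters have already been merged into a standard Hugging Face checkpoint directory, uses `"int8"` as the 8-bit quantisation type, and picks an arbitrary `max_length`; none of these details are stated in the summary, and the function names are hypothetical.

```python
# Decoding settings: top-k of 1 means greedy decoding, which is
# deterministic for a given prompt.
GENERATION_KWARGS = {
    "sampling_topk": 1,
    "max_length": 256,                 # assumption
    "include_prompt_in_result": False,
}


def convert_to_int8(merged_model_dir: str, output_dir: str) -> str:
    """Convert a merged Hugging Face checkpoint to an 8-bit CTranslate2 model."""
    from ctranslate2.converters import TransformersConverter

    converter = TransformersConverter(merged_model_dir)
    return converter.convert(output_dir, quantization="int8")


def translate(prompt: str, ct2_dir: str, tokenizer_dir: str,
              device: str = "cuda") -> str:
    """Generate a completion for one prompt with the converted model."""
    import ctranslate2
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
    generator = ctranslate2.Generator(ct2_dir, device=device)
    # CTranslate2 generators take token strings rather than token ids.
    tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
    result = generator.generate_batch([tokens], **GENERATION_KWARGS)[0]
    return tokenizer.decode(result.sequences_ids[0])
```

In a service, the tokenizer and generator would normally be created once and reused across requests rather than loaded inside `translate`.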

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Data comes from a single company's narrow software-domain TMs; results may not generalise to broad domains.

Possible pretraining/test-set leakage acknowledged; gains may be overestimated.

When Not To Use

If you only have 1–2k aligned high-quality segments and cannot tune hyperparameters — naive fine-tuning often hurts.

When you need broad-domain translation rather than a narrow, domain-specific model.

Failure Modes

Overfitting and metric drop when fine-tuning on very small sets (1–2k).

Overgeneration and stray HTML/tokens, especially with small-data models.
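One way to catch the last failure mode is a cheap automatic check before translations reach reviewers. The sketch below is an assumption-laden heuristic, not something described in the paper: it flags tags that appear in the output but not in the source, and outputs that are much longer than their source. The regular expression, the function names and the 2.0 length ratio are all placeholders that would need tuning per language pair.

```python
import re
from typing import List

# Matches simple opening and closing tags such as <b>, </b> or <br>.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def stray_tags(source: str, output: str) -> List[str]:
    """Return tags present in the output that never occur in the source."""
    source_tags = set(TAG_RE.findall(source))
    return [tag for tag in TAG_RE.findall(output) if tag not in source_tags]


def looks_overgenerated(source: str, output: str,
                        max_ratio: float = 2.0) -> bool:
    """True if the output is far longer than the source (character count)."""
    return len(output) > max_ratio * max(len(source), 1)


def needs_review(source: str, output: str, max_ratio: float = 2.0) -> bool:
    """Flag an output that has stray tags or looks overgenerated."""
    return bool(stray_tags(source, output)) or looks_overgenerated(
        source, output, max_ratio
    )


print(stray_tags("Click <b>Save</b>", "Clique em <b>Salvar</b><br>"))
```

Tags that exist in the source are deliberately allowed through, because software-domain TMs often contain legitimate markup that must be preserved in the translation.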

Core Entities

Models

Llama 3 8B Instruct, GPT-3.5 (baseline comparator)

Metrics

BLEU, chrF++, TER, COMET, COMET-Kiwi

Datasets

In-house translation memories (software sector)

Aligned subsets: 1k, 2k, 5k, 10k, 14.7k

Full training sets per language: 100k+ ranges (107k–223k)

Benchmarks

SacreBLEU evaluation suite, COMET, COMET-Kiwi (quality-estimation on training data)
