Fine-tuning Llama 3 8B on translation memories improves translations — gains appear reliably once you have ~5k in-domain examples

September 5, 20248 min

Overview

Production Readiness

0.6

Novelty Score

0.3

Cost Impact Score

0.7

Citation Count

1

Authors

Inacio Vieira, Will Allred, Séamus Lankford, Sheila Castilho, Andy Way

Links

Abstract / PDF

Why It Matters For Business

Fine-tuning a midsize LLM on your own translation memories can give big, focused quality gains — especially for low-resource languages — but only if you have enough in-domain data (roughly ≥5k examples).

Summary TLDR

The authors fine-tuned Llama 3 8B Instruct on company translation memories (TMs) for five English→X directions and varied training sizes from 1k to 100k+. Using QLoRA + LoRA (4-bit) and standard MT metrics (BLEU, chrF++, TER, COMET), they find fine-tuning hurts when using only 1–2k examples, improves from 5k upward, and gives the largest gains at 100k+ (avg BLEU +13.7, COMET +25 vs baseline). Low-resource languages (Korean) benefit most. Caveats: narrow domain, potential pretraining/test-set leakage, and limited human evaluation.

Problem Statement

Companies with translation memories want to know how much in-house data is needed to fine-tune a midsize LLM for better, faster, organisation-specific translation while keeping cost and time reasonable.

Main Contribution

Empirical study of fine-tuning Llama 3 8B on real in-house translation memories across five target languages and multiple dataset sizes (1k, 2k, 5k, 10k, 14.7k, 100k+).

Practical fine-tuning recipe: QLoRA (4-bit) + LoRA PEFT on 4× A100 GPUs; inference via CTranslate2 (8‑bit).

Quantified thresholds and returns: small sets (1–2k) often degrade quality; gains start at ≈5k and grow with data, with large gains at 100k+; human post-edit of 100 segments confirms qualitative issues (ambiguity, context).

Key Findings

Large-scale fine-tuning yields substantial metric gains versus the out-of-the-box model.

Numbersavg BLEU +13.7; avg COMET +25 (100k+ vs baseline)

Very small fine-tuning sets can hurt translation quality.

Numbers1k/2k training sets show lower metrics than baseline (e.g., PT-BR BLEU 2k 46.04 < baseline 48.25)

A practical minimum appears around 5k aligned in-domain examples.

Numbers14.7k aligned set: avg BLEU +4.8 (17.4%), chrF++ +7.1, COMET +16.9 vs baseline

Low-resource directions benefit proportionally more from in-house fine-tuning.

NumbersKorean COMET 36.45 → 84.3 (≈+130% relative) from baseline to 100k+

Results

BLEU (avg across languages)

Value+13.7 (100k+ vs baseline)

Baselinebaseline Llama 3 8B

COMET (avg across languages)

Value+25 (100k+ vs baseline)

Baselinebaseline Llama 3 8B

Performance drop for very small sets

Valueworse than baseline for 1k and 2k

Baselinebaseline Llama 3 8B

14.7k aligned set (avg change)

ValueBLEU +4.8; chrF++ +7.1; COMET +16.9; TER −9

Baselinebaseline Llama 3 8B

Who Should Care

What To Try In 7 Days

Inventory your TMs and count aligned segments per language; flag languages with ≥5k and those below 5k.

Run a small fine-tune pilot: use QLoRA + LoRA on 10k aligned segments for a key language and compare BLEU/COMET to the baseline.

If you only have <5k, avoid blind fine-tuning; try hyperparameter regularisation or augment data before tuning.

Optimization Features

Token Efficiency

  • Structured JSON output to simplify post-processing

Model Optimization

  • 4-bit quantisation (nf4) for weights
  • LoRA

System Optimization

  • Training on 4× A100 GPUs; largest runs ≈2.3–2.4 hours reported

Training Optimization

  • LoRA
  • SFT

Inference Optimization

  • CTranslate2 conversion with 8-bit quantisation
  • sampling top-k=1 for deterministic outputs

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Data comes from a single company's narrow software-domain TMs; results may not generalise to broad domains.
  • Possible pretraining/test-set leakage acknowledged; gains may be overestimated.
  • Human evaluation is small (100 post-edited segments) and qualitative.

When Not To Use

  • If you only have 1–2k aligned high-quality segments and cannot tune hyperparameters — naive fine-tuning often hurts.
  • When you need broad-domain translation rather than a narrow, domain-specific model.
  • If you require strong document-level context — the model translates segments in isolation and can miss cross-segment info.

Failure Modes

  • Overfitting and metric drop when fine-tuning on very small sets (1–2k).
  • Overgeneration and stray HTML/tokens, especially with small-data models.
  • Poor handling of ambiguous segments lacking context; human rework required.

Core Entities

Models

  • Llama 3 8B Instruct
  • GPT-3.5 (baseline comparator)

Metrics

  • BLEU
  • chrF++
  • TER
  • COMET
  • COMET-Kiwi

Datasets

  • In-house translation memories (software sector)
  • Aligned subsets: 1k, 2k, 5k, 10k, 14.7k
  • Full training sets per language: 100k+ ranges (107k–223k)

Benchmarks

  • SacreBLEU evaluation suite
  • COMET
  • COMET-Kiwi (quality-estimation on training data)