Use GPT-3.5 to clean MTNT targets and build C-MTNT, a stronger noise benchmark

October 20, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The idea is practical and is validated with automatic metrics, human and GPT-4 checks, and MT experiments, but it relies on proprietary LLM calls and limited human annotation, and it is tested on EN/FR/JA only.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Quinten Bolding, Baohao Liao, Brandon James Denis, Jun Luo, Christof Monz

Why It Matters For Business

Cleaning noisy reference translations with an LLM yields low-noise evaluation sets that better reveal whether models truly handle noisy input; this helps teams avoid optimistic robustness claims and focus training effort where it actually helps.

Who Should Care

Summary TLDR

The authors apply GPT-3.5 prompts to clean noisy target sentences in the MTNT translation benchmark and release the cleaned dataset C-MTNT. They evaluate three prompt strategies (bilingual: source+noisy target, monolingual: noisy target only, translation: noisy source only). Human and GPT-4 judgments plus automatic measures show bilingual and translation methods reduce slang, emojis and profanities while preserving meaning. NMT models trained on augmented noisy sources show larger relative BLEU gains when evaluated on C-MTNT, suggesting it is a better benchmark of robustness than the original MTNT.

Problem Statement

MTNT contains natural noise on both source and target sides, which prevents using it to test whether an NMT model can translate noisy source text into a clean target. The paper aims to clean MTNT targets with an LLM so the dataset better measures robustness to noisy source input.

Main Contribution

Design three few-shot GPT-3.5 cleaning strategies: bilingual (use both source and noisy target), monolingual (target-only), and translation (source-only).

Create C-MTNT: cleaned target-side versions of MTNT for EN↔FR and EN→JA.
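The three strategies differ only in what the prompt shows the model. A minimal sketch of how such few-shot prompts could be assembled follows. The instruction wording, the field labels, and the `build_prompt` helper are assumptions for illustration; they are not the authors' actual prompts.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical instruction texts; the paper's real prompt wording is not reproduced here.
INSTRUCTIONS = {
    "bilingual": "Clean the noisy target so it is a fluent, faithful translation of the source.",
    "monolingual": "Clean the noisy target: fix spelling, slang, emojis and profanity.",
    "translation": "Translate the noisy source into clean target-language text.",
}

# A few-shot example is (source, noisy_target, clean_target).
Example = Tuple[str, str, str]


def _format_inputs(strategy: str, source: Optional[str], noisy_target: Optional[str]) -> List[str]:
    """Return only the input fields that the given strategy is allowed to see."""
    lines = []
    if strategy in ("bilingual", "translation"):
        if source is None:
            raise ValueError(f"{strategy} prompts need a source sentence")
        lines.append(f"Source: {source}")
    if strategy in ("bilingual", "monolingual"):
        if noisy_target is None:
            raise ValueError(f"{strategy} prompts need a noisy target sentence")
        lines.append(f"Noisy target: {noisy_target}")
    return lines


def build_prompt(
    strategy: str,
    examples: Sequence[Example],
    source: Optional[str] = None,
    noisy_target: Optional[str] = None,
) -> str:
    """Assemble instruction, few-shot examples, and the query into one prompt string."""
    if strategy not in INSTRUCTIONS:
        raise ValueError(f"unknown strategy: {strategy}")
    blocks = [INSTRUCTIONS[strategy]]
    for ex_source, ex_noisy, ex_clean in examples:
        lines = _format_inputs(strategy, ex_source, ex_noisy)
        lines.append(f"Clean target: {ex_clean}")
        blocks.append("\n".join(lines))
    query = _format_inputs(strategy, source, noisy_target)
    query.append("Clean target:")  # left open for the model to complete
    blocks.append("\n\n".join(["\n".join(query)]))
    return "\n\n".join(blocks)
```

Keeping the strategy as a single argument makes it easy to run all three variants over the same validation slice and compare their outputs side by side.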

Key Findings

Bilingual and translation cleaning reduce target-side noise much more than the rule-based correction tool for EN and FR.

Numbers: EN spelling/grammar errors per 100 tokens: MTNT 1.712 → Bilingual 0.687; FR: MTNT 7.125 → Bilingual 0.552 (Table 2)

Practical Use: Use bilingual or translation prompts to clean references when you need low-noise target references for robustness tests.

Evidence Ref: Table 2
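The noise figures in this finding are rates per 100 tokens. A minimal sketch of that normalization follows, using emoji counting as the example noise type. Whitespace tokenization and the simplified emoji code point ranges are assumptions; the paper's own tokenizer and noise detectors may differ.

```python
import re
from typing import Iterable

# Simplified emoji ranges (assumption): common pictographs and symbols, not every emoji.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def emojis_per_100_tokens(sentences: Iterable[str]) -> float:
    """Count emoji characters across a corpus and normalize by whitespace token count."""
    total_tokens = 0
    total_emojis = 0
    for sentence in sentences:
        total_tokens += len(sentence.split())
        total_emojis += len(EMOJI_RE.findall(sentence))
    if total_tokens == 0:
        return 0.0  # empty corpus: report no noise rather than divide by zero
    return 100.0 * total_emojis / total_tokens
```

Running the same function on the original and the cleaned targets gives a before/after pair in the same units as the numbers quoted here.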

Cleaned sentences remain semantically close to originals under bilingual and monolingual methods.

Numbers: LASER similarity: bilingual 0.94, monolingual 0.95, translation 0.89 (Fr→En; Table 5)

Practical Use: LLM cleaning preserves meaning enough for evaluation if you choose bilingual or monolingual prompting and check embedding similarity.

Evidence Ref: Table 5
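The LASER score in this finding is a cosine similarity between sentence embeddings of the original and the cleaned text. A minimal sketch of that check follows. It takes precomputed vectors from whatever encoder is in use; the `flag_low_similarity` helper and its 0.8 threshold are illustrative assumptions, not values from the paper.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def flag_low_similarity(
    original_vecs: Sequence[Sequence[float]],
    cleaned_vecs: Sequence[Sequence[float]],
    threshold: float = 0.8,  # assumed cutoff; tune on a hand-checked sample
) -> List[int]:
    """Return indices of sentence pairs whose similarity falls below the threshold."""
    return [
        i
        for i, (orig, clean) in enumerate(zip(original_vecs, cleaned_vecs))
        if cosine_similarity(orig, clean) < threshold
    ]
```

Flagged pairs are candidates for manual review, since a low score can mean the cleaning step dropped or added content.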

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Spelling/Grammar errors (per 100 tokens) in EN targets | MTNT 1.712 → Bilingual 0.687 | MTNT 1.712 | -1.025 per 100 tokens | EN targets (Table 2) | Table 2 reports counts per 100 tokens | Table 2 |
| Emojis (per 100 tokens) in EN targets | MTNT 0.031 → Bilingual 0.0 | MTNT 0.031 | -0.031 per 100 tokens | EN targets (Table 2) | Table 2 shows bilingual cleaning removes emojis in EN targets | Table 2 |
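Each delta in these results is the cleaned rate minus the MTNT rate. The sketch below recomputes it from the Table 2 figures quoted in this summary and adds a relative reduction, which is a derived quantity not reported here; the helper names are hypothetical.

```python
def delta(baseline: float, cleaned: float) -> float:
    """Absolute change in noise rate (negative means less noise)."""
    return cleaned - baseline


def relative_reduction(baseline: float, cleaned: float) -> float:
    """Fraction of the baseline noise rate removed by cleaning."""
    if baseline == 0:
        raise ValueError("baseline rate must be non-zero")
    return (baseline - cleaned) / baseline


# (MTNT rate, bilingual-cleaned rate) per 100 tokens, as quoted from Table 2.
rates = {
    "EN spelling/grammar": (1.712, 0.687),
    "FR spelling/grammar": (7.125, 0.552),
    "EN emojis": (0.031, 0.0),
}

for name, (mtnt, cleaned) in rates.items():
    print(
        f"{name}: delta {delta(mtnt, cleaned):+.3f} per 100 tokens, "
        f"relative reduction {relative_reduction(mtnt, cleaned):.1%}"
    )
```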

What To Try In 7 Days

Run a small bilingual prompt (source+noisy target) with GPT-3.5/GPT-4 on a validation slice to see noise removal quality.

Filter or replace noisy references with LLM-cleaned versions and compare model gains on that eval set vs original.

Add noisy-source / clean-target training examples and measure relative BLEU gain to detect robustness improvements.
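The comparison in the last two steps comes down to the relative BLEU gain on each evaluation set. A minimal sketch follows. The BLEU scores in it are made-up placeholders, not results from the paper, and `relative_gain` is a hypothetical helper.

```python
def relative_gain(baseline_bleu: float, augmented_bleu: float) -> float:
    """Relative BLEU improvement of the augmented model over the baseline."""
    if baseline_bleu <= 0:
        raise ValueError("baseline BLEU must be positive")
    return (augmented_bleu - baseline_bleu) / baseline_bleu


# Placeholder (baseline, augmented) BLEU scores; replace with your own measurements.
scores = {
    "MTNT": (20.0, 21.0),
    "C-MTNT": (20.0, 23.0),
}

gains = {name: relative_gain(base, aug) for name, (base, aug) in scores.items()}
for name, gain in gains.items():
    print(f"{name}: relative BLEU gain {gain:.1%}")
```

A larger relative gain on the cleaned set than on the original set is the pattern this summary reports for models trained on augmented noisy sources.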

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Relies on proprietary GPT-3.5 API; cost and availability may limit large-scale cleaning.

Evaluations and examples focus on English, French and Japanese; other languages untested.

When Not To Use

When you must preserve original noisy target text verbatim (e.g., forensic analysis).

For languages or dialects not represented in the paper (unknown LLM performance).

Failure Modes

Monolingual cleaning can delete or add information and break alignment with the source (paper notes misalignment).

LLM hallucination or stylistic changes that reduce faithfulness to source meaning.

Core Entities

Models

GPT-3.5 (text-davinci-003), GPT-4, Transformer (vanilla), BERT (used for augmentation)

Metrics

BLEU, LASER (sentence embedding cosine), ROUGE-1, Jaccard, Jaro-Winkler

Datasets

MTNT, C-MTNT (new), Newstest2014, TED, KFTT, JESC, europarl-v7, news-commentary v10

Benchmarks

MTNT, C-MTNT, Newstest2014, TED, KFTT, JESC

Context Entities

Models

Llama 2 (mentioned as likely capable)

Datasets

WMT15 dev/test (used for validation/testing contexts)