Overview
The approach is practical and is validated with automatic metrics, human and GPT-4 judgments, and MT experiments, but it relies on proprietary LLM calls, uses limited human annotation, and is tested only on English, French and Japanese.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
Cleaning noisy reference translations with an LLM yields low-noise evaluation sets that better reveal whether models truly handle noisy input; this helps teams avoid optimistic robustness claims and focus training effort where it actually helps.
Summary TLDR
The authors use GPT-3.5 prompts to clean the noisy target sentences in the MTNT translation benchmark and release the cleaned dataset as C-MTNT. They evaluate three prompt strategies: bilingual (source plus noisy target), monolingual (noisy target only), and translation (noisy source only). Human and GPT-4 judgments, together with automatic measures, show that the bilingual and translation methods reduce slang, emojis and profanities while preserving meaning. NMT models trained on augmented noisy sources show larger relative BLEU gains when evaluated on C-MTNT than on the original MTNT, which suggests that C-MTNT is the better benchmark of robustness.
Problem Statement
MTNT contains natural noise on both source and target sides, which prevents using it to test whether an NMT model can translate noisy source text into a clean target. The paper aims to clean MTNT targets with an LLM so the dataset better measures robustness to noisy source input.
Main Contribution
Design three few-shot GPT-3.5 cleaning strategies: bilingual (use both source and noisy target), monolingual (target-only), and translation (source-only).
Create C-MTNT: cleaned target-side versions of MTNT for EN↔FR and EN→JA.
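The three strategies differ only in which fields the model is shown. The following is a minimal sketch of that difference, not the paper's prompts: the instruction wording, the field labels, the `build_prompt` name and the example dictionary keys are all assumptions made for illustration.

```python
# Hypothetical instruction texts; the paper's actual prompts are not reproduced here.
INSTRUCTIONS = {
    "bilingual": "Rewrite the noisy translation as a clean translation of the source.",
    "monolingual": "Rewrite the noisy sentence as clean text with the same meaning.",
    "translation": "Translate the source sentence into clean {tgt_lang} text.",
}


def _visible_fields(strategy, source, noisy_target):
    """Return only the input fields that the given strategy exposes to the model."""
    fields = []
    if strategy in ("bilingual", "translation"):
        fields.append(f"Source: {source}")
    if strategy in ("bilingual", "monolingual"):
        fields.append(f"Noisy target: {noisy_target}")
    return fields


def build_prompt(strategy, source=None, noisy_target=None,
                 tgt_lang="French", examples=()):
    """Assemble a few-shot cleaning prompt for one sentence pair.

    `examples` is an iterable of dicts with the (assumed) keys
    'source', 'noisy_target' and 'clean_target'.
    """
    if strategy not in INSTRUCTIONS:
        raise ValueError(f"unknown strategy: {strategy}")
    lines = [INSTRUCTIONS[strategy].format(tgt_lang=tgt_lang)]
    for ex in examples:
        lines += _visible_fields(strategy, ex["source"], ex["noisy_target"])
        lines.append(f"Clean target: {ex['clean_target']}")
    lines += _visible_fields(strategy, source, noisy_target)
    lines.append("Clean target:")  # left open for the model to complete
    return "\n".join(lines)
```

The resulting string would then be sent to whichever LLM endpoint is in use; the API call is deliberately left out because it depends on the provider and client version.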
Key Findings
Bilingual and translation cleaning reduce target-side noise much more than the rule-based correction tool for EN and FR.
Cleaned sentences remain semantically close to originals under bilingual and monolingual methods.
Results
| Metric | Value (Bilingual) | Baseline (MTNT) | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Spelling/grammar errors per 100 tokens | 0.687 | 1.712 | -1.025 | EN targets | Table 2 reports counts per 100 tokens | Table 2 |
| Emojis per 100 tokens | 0.0 | 0.031 | -0.031 | EN targets | Table 2 shows bilingual cleaning removes emojis in EN targets | Table 2 |
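The deltas in the table are simply the cleaned rate minus the MTNT rate, both expressed per 100 tokens. As a small sketch of that arithmetic (the function names are illustrative, and how noise items are counted is not specified here):

```python
def per_100_tokens(noise_count, token_count):
    """Normalise a raw noise count to a rate per 100 tokens."""
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    return 100.0 * noise_count / token_count


def delta(cleaned_rate, baseline_rate):
    """Change relative to the baseline; negative means less noise after cleaning."""
    return cleaned_rate - baseline_rate


# (cleaned, baseline) rates for EN targets, taken from the table above.
rows = {
    "spelling_grammar": (0.687, 1.712),
    "emojis": (0.0, 0.031),
}
deltas = {name: round(delta(c, b), 3) for name, (c, b) in rows.items()}
```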
What To Try In 7 Days
Run a small bilingual prompt (source+noisy target) with GPT-3.5/GPT-4 on a validation slice to gauge noise-removal quality.
Filter or replace noisy references with LLM-cleaned versions and compare model gains on the cleaned eval set against gains on the original.
Add noisy-source / clean-target training examples and measure relative BLEU gain to detect robustness improvements.
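For the last two steps, the quantity to compare is the relative BLEU gain of the augmented model over its baseline, computed separately on each evaluation set. A minimal sketch, assuming BLEU scores have already been computed with a tool of your choice; the numbers below are placeholders, not results from the paper.

```python
def relative_gain(baseline_bleu, augmented_bleu):
    """Relative BLEU gain in percent of the augmented model over the baseline."""
    if baseline_bleu <= 0:
        raise ValueError("baseline BLEU must be positive")
    return 100.0 * (augmented_bleu - baseline_bleu) / baseline_bleu


def compare_eval_sets(scores):
    """Map {eval_set: (baseline_bleu, augmented_bleu)} to {eval_set: gain %}."""
    return {name: relative_gain(b, a) for name, (b, a) in scores.items()}


# Placeholder scores for illustration only.
example_scores = {
    "original_refs": (20.0, 21.0),
    "cleaned_refs": (20.0, 22.0),
}
gains = compare_eval_sets(example_scores)
```

If the gain on the cleaned references is clearly larger than on the original ones, that is the pattern the summary describes as evidence of improved robustness.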
Risks & Boundaries
Limitations
Relies on proprietary GPT-3.5 API; cost and availability may limit large-scale cleaning.
Evaluations and examples focus on English, French and Japanese; other languages untested.
When Not To Use
When you must preserve original noisy target text verbatim (e.g., forensic analysis).
For languages or dialects not represented in the paper (unknown LLM performance).
Failure Modes
Monolingual cleaning can delete or add information and break alignment with the source (paper notes misalignment).
LLM hallucination or stylistic changes that reduce faithfulness to source meaning.
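One cheap guard against the deletion and addition failure mode is to flag cleaned sentences whose length changes sharply relative to the original target. This is a heuristic sketch and not a method from the paper: the thresholds are arbitrary assumptions, and whitespace tokenisation is unsuitable for Japanese, where a proper tokenizer would be needed.

```python
def length_ratio(original, cleaned):
    """Ratio of whitespace-token counts: cleaned / original."""
    n_orig = len(original.split())
    n_clean = len(cleaned.split())
    if n_orig == 0:
        raise ValueError("original sentence has no tokens")
    return n_clean / n_orig


def flag_suspicious(pairs, low=0.5, high=1.5):
    """Return indices of (original, cleaned) pairs whose length ratio
    falls outside [low, high]; these are candidates for manual review."""
    return [
        i for i, (orig, clean) in enumerate(pairs)
        if not low <= length_ratio(orig, clean) <= high
    ]
```

Flagged pairs are only candidates for review; a length check cannot detect meaning changes that keep the sentence length roughly the same.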

