Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Cleaning noisy reference translations with an LLM yields a low-noise evaluation set that shows more clearly whether a model actually handles noisy input. This helps teams avoid overstated robustness claims and direct training effort where it pays off.
Summary TLDR
The authors apply GPT-3.5 prompts to clean noisy target sentences in the MTNT translation benchmark and release the cleaned dataset C-MTNT. They evaluate three prompt strategies (bilingual: source+noisy target, monolingual: noisy target only, translation: noisy source only). Human and GPT-4 judgments plus automatic measures show bilingual and translation methods reduce slang, emojis and profanities while preserving meaning. NMT models trained on augmented noisy sources show larger relative BLEU gains when evaluated on C-MTNT, suggesting it is a better benchmark of robustness than the original MTNT.
Problem Statement
MTNT contains natural noise on both source and target sides, which prevents using it to test whether an NMT model can translate noisy source text into a clean target. The paper aims to clean MTNT targets with an LLM so the dataset better measures robustness to noisy source input.
Main Contribution
Design three few-shot GPT-3.5 cleaning strategies: bilingual (use both source and noisy target), monolingual (target-only), and translation (source-only).
Create C-MTNT: cleaned target-side versions of MTNT for EN↔FR and EN→JA.
Quantify noise reduction (spelling/grammar, emojis, slang, profanities) and meaning preservation with automatic metrics and human/GPT-4 judgments.
Show that NMT models trained on noisy-source / clean-target data get larger relative BLEU gains on C-MTNT than on MTNT, supporting C-MTNT as a noise-evaluation benchmark.
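The three cleaning strategies differ only in which fields the prompt includes. A minimal sketch of that difference, assuming hypothetical instruction wording and field labels (the paper's actual prompts and few-shot demonstrations are not reproduced here; in practice the demonstrations would be prepended to the returned string):

```python
from typing import Optional

# Hypothetical instruction texts, for illustration only.
INSTRUCTIONS = {
    "bilingual": "Rewrite the noisy translation so it is clean and faithful to the original.",
    "monolingual": "Rewrite the noisy sentence so it is clean without changing its meaning.",
    "translation": "Translate the sentence into clean, standard language.",
}


def build_cleaning_prompt(strategy: str,
                          source: Optional[str] = None,
                          noisy_target: Optional[str] = None) -> str:
    """Assemble a prompt for one of the three strategies.

    bilingual   -> source + noisy target
    monolingual -> noisy target only
    translation -> noisy source only
    """
    if strategy not in INSTRUCTIONS:
        raise ValueError(f"unknown strategy: {strategy!r}")

    lines = [INSTRUCTIONS[strategy]]

    # The source sentence is shown for bilingual and translation prompts.
    if strategy in ("bilingual", "translation"):
        if source is None:
            raise ValueError(f"{strategy} strategy needs a source sentence")
        lines.append(f"Source: {source}")

    # The noisy target is shown for bilingual and monolingual prompts.
    if strategy in ("bilingual", "monolingual"):
        if noisy_target is None:
            raise ValueError(f"{strategy} strategy needs a noisy target")
        lines.append(f"Noisy target: {noisy_target}")

    # The model is expected to complete this final field.
    lines.append("Clean target:")
    return "\n".join(lines)
```

Keeping the strategy choice in one function makes it easy to run all three variants over the same sentence pair and compare the outputs side by side.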
Key Findings
Bilingual and translation cleaning reduce target-side noise much more than the rule-based correction tool for EN and FR.
Cleaned sentences remain semantically close to originals under bilingual and monolingual methods.
Human and GPT-4 preferences agree: bilingual > translation > monolingual for overall cleaning quality.
NMT models trained on noisy-source / clean-target data get higher relative BLEU gains on C-MTNT than on MTNT.
Results
Spelling/Grammar errors (per 100 tokens) in EN targets
Emojis (per 100 tokens) in EN targets
Semantic similarity (LASER)
Relative BLEU gain G (%) from robustness training
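A per-100-token noise rate such as the emoji measure above could be computed along these lines. This is a rough sketch, not the paper's measurement code: the emoji ranges in `EMOJI_RE` and the whitespace tokenization are assumptions, and whitespace splitting would not be adequate for Japanese.

```python
import re
from typing import Iterable

# Assumed emoji coverage: the main pictograph blocks plus
# miscellaneous symbols and dingbats. Not exhaustive.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def emojis_per_100_tokens(sentences: Iterable[str]) -> float:
    """Return the number of emoji characters per 100 whitespace tokens."""
    total_tokens = 0
    total_emojis = 0
    for sentence in sentences:
        total_tokens += len(sentence.split())
        total_emojis += len(EMOJI_RE.findall(sentence))
    if total_tokens == 0:
        return 0.0
    return 100.0 * total_emojis / total_tokens
```

Running the same function over the original and cleaned targets gives a before/after comparison on a common scale.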
Who Should Care
What To Try In 7 Days
Run a small bilingual prompt (source+noisy target) with GPT-3.5/GPT-4 on a validation slice to see noise removal quality.
Filter or replace noisy references with LLM-cleaned versions and compare model gains on that eval set vs original.
Add noisy-source / clean-target training examples and measure relative BLEU gain to detect robustness improvements.
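For the last step, a relative gain can be computed from two BLEU scores on the same evaluation set. The sketch below assumes the relative gain G is the percentage improvement of the robustness-trained model over the baseline; the summary does not state the paper's exact formula, so treat this definition and the function name as illustrative.

```python
def relative_bleu_gain(baseline_bleu: float, robust_bleu: float) -> float:
    """Percentage gain of the robustness-trained model over the baseline.

    Assumed definition: G = 100 * (robust - baseline) / baseline.
    """
    if baseline_bleu <= 0:
        raise ValueError("baseline BLEU must be positive")
    return 100.0 * (robust_bleu - baseline_bleu) / baseline_bleu
```

Computing G once on the original references and once on the cleaned references, with the same pair of models, shows whether the cleaned set exposes a larger robustness effect.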
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Relies on proprietary GPT-3.5 API; cost and availability may limit large-scale cleaning.
- Evaluations and examples focus on English, French and Japanese; other languages untested.
- Human evaluation is small and performed by paper authors, not an external crowd.
- LLM-cleaned data can inherit biases from the LLM and original corpus.
When Not To Use
- When you must preserve original noisy target text verbatim (e.g., forensic analysis).
- For languages or dialects not represented in the paper (unknown LLM performance).
- If API cost, privacy, or licensing blocks sending data to an external LLM.
Failure Modes
- Monolingual cleaning can delete or add information and break alignment with the source (paper notes misalignment).
- LLM hallucination or stylistic changes that reduce faithfulness to source meaning.
- Lower effectiveness on Japanese: the paper measures more residual slang and profanity and lower similarity scores.
- Biases in LLM outputs reproduce or amplify dataset bias.
Core Entities
Models
- GPT-3.5 (text-davinci-003)
- GPT-4
- Transformer (vanilla)
- BERT (used for augmentation)
Metrics
- BLEU
- LASER (sentence embedding cosine)
- ROUGE-1
- Jaccard
- Jaro-Winkler
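Of the surface-overlap metrics listed, Jaccard is the simplest to illustrate. A minimal sketch over lowercased whitespace tokens follows; the tokenization, the lowercasing, and the handling of two empty strings are assumptions, not details taken from the paper.

```python
def jaccard_similarity(original: str, cleaned: str) -> float:
    """Jaccard index between the token sets of two sentences."""
    a = set(original.lower().split())
    b = set(cleaned.lower().split())
    if not a and not b:
        # Two empty sentences are treated as identical (assumption).
        return 1.0
    return len(a & b) / len(a | b)
```

A score near 1 means cleaning changed few words, while a low score flags sentences that were heavily rewritten and may deserve a manual check for meaning drift.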
Datasets
- MTNT
- C-MTNT (new)
- Newstest2014
- TED
- KFTT
- JESC
- europarl-v7
- news-commentary v10
Benchmarks
- MTNT
- C-MTNT
- Newstest2014
- TED
- KFTT
- JESC
Context Entities
Models
- Llama 2 (mentioned as likely capable)
Datasets
- WMT15 dev/test (used for validation/testing contexts)

