Use GPT-3.5 to clean MTNT targets and build C-MTNT, a stronger noise benchmark

October 20, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Quinten Bolding, Baohao Liao, Brandon James Denis, Jun Luo, Christof Monz

Why It Matters For Business

Cleaning noisy reference translations with an LLM yields low-noise evaluation sets that better reveal whether models truly handle noisy input; this helps teams avoid optimistic robustness claims and focus training effort where it actually helps.

Summary TLDR

The authors apply GPT-3.5 prompts to clean noisy target sentences in the MTNT translation benchmark and release the cleaned dataset C-MTNT. They evaluate three prompt strategies (bilingual: source+noisy target, monolingual: noisy target only, translation: noisy source only). Human and GPT-4 judgments plus automatic measures show bilingual and translation methods reduce slang, emojis and profanities while preserving meaning. NMT models trained on augmented noisy sources show larger relative BLEU gains when evaluated on C-MTNT, suggesting it is a better benchmark of robustness than the original MTNT.

Problem Statement

MTNT contains natural noise on both source and target sides, which prevents using it to test whether an NMT model can translate noisy source text into a clean target. The paper aims to clean MTNT targets with an LLM so the dataset better measures robustness to noisy source input.

Main Contribution

Design three few-shot GPT-3.5 cleaning strategies: bilingual (use both source and noisy target), monolingual (target-only), and translation (source-only).

Create C-MTNT: cleaned target-side versions of MTNT for EN↔FR and EN→JA.

Quantify noise reduction (spelling/grammar, emojis, slang, profanities) and meaning preservation with automatic metrics and human/GPT-4 judgments.

Show that NMT models trained on noisy-source / clean-target data achieve larger relative BLEU gains on C-MTNT than on MTNT, supporting C-MTNT as a noise-evaluation benchmark.
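The three cleaning strategies differ only in which fields the LLM sees. A minimal sketch of a prompt builder follows, assuming a simple instruction-plus-examples layout. The instruction wording, field labels, and the `build_prompt` helper are hypothetical illustrations, not the paper's actual prompts, and no API call is made.

```python
# Hypothetical instruction text per strategy (not the paper's wording).
INSTRUCTIONS = {
    "bilingual": "Rewrite the noisy translation as a clean, faithful translation.",
    "monolingual": "Clean the noisy sentence: fix spelling and grammar, remove emojis, slang and profanity.",
    "translation": "Translate the sentence into clean, standard {tgt_lang}.",
}

# Which input fields each strategy exposes to the model.
REQUIRED_FIELDS = {
    "bilingual": ("source", "noisy_target"),
    "monolingual": ("noisy_target",),
    "translation": ("source",),
}

LABELS = {"source": "Source", "noisy_target": "Noisy target"}


def _render(fields, example):
    """Render only the fields that the chosen strategy is allowed to see."""
    return "\n".join(f"{LABELS[f]}: {example[f]}" for f in fields)


def build_prompt(strategy, example, shots=(), tgt_lang="English"):
    """Build a few-shot cleaning prompt.

    `example` is a dict with "source" and/or "noisy_target";
    each item in `shots` additionally carries a "clean_target".
    """
    if strategy not in REQUIRED_FIELDS:
        raise ValueError(f"unknown strategy: {strategy}")
    fields = REQUIRED_FIELDS[strategy]
    missing = [f for f in fields if not example.get(f)]
    if missing:
        raise ValueError(f"{strategy} strategy needs fields: {missing}")

    parts = [INSTRUCTIONS[strategy].format(tgt_lang=tgt_lang)]
    for shot in shots:
        parts.append(_render(fields, shot) + f"\nClean target: {shot['clean_target']}")
    # The final block leaves the answer slot open for the model to complete.
    parts.append(_render(fields, example) + "\nClean target:")
    return "\n\n".join(parts)
```

Calling `build_prompt("bilingual", {"source": ..., "noisy_target": ...})` yields a prompt containing both fields, while `"translation"` omits the noisy target entirely, which is why that strategy cannot inherit target-side noise.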

Key Findings

Bilingual and translation cleaning reduce target-side noise much more than the rule-based correction tool for EN and FR.

Numbers: EN spelling/grammar errors per 100 tokens: MTNT 1.712 → Bilingual 0.687; FR: MTNT 7.125 → Bilingual 0.552 (Table 2)

Cleaned sentences remain semantically close to originals under bilingual and monolingual methods.

Numbers: LASER similarity: bilingual 0.94, monolingual 0.95, translation 0.89 (Fr→En; Table 5)

Human and GPT-4 preferences agree: bilingual > translation > monolingual for overall cleaning quality.

Numbers: Binary preference trend shown in Figures 3a and 3b (majority preference for bilingual)

NMT models trained on noisy-source / clean-target data get higher relative BLEU gains on C-MTNT than on MTNT.

Numbers: Example: bilingual C-MTNT relative gain G up to +12.2% vs baseline across augmentations (Table 3)
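The noise figures above are rates per 100 tokens. A minimal sketch of how such a rate could be computed is shown here, assuming whitespace tokenization, a rough emoji code-point range, and a user-supplied word list for slang or profanity. All three are simplifying assumptions and not the detectors used in the paper.

```python
import re

# Rough emoji code-point ranges; an assumption, not the paper's detector.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def count_emojis(sentence):
    """Count characters that fall in the assumed emoji ranges."""
    return len(EMOJI_RE.findall(sentence))


def make_lexicon_counter(lexicon):
    """Return a counter for tokens found in a (hypothetical) slang/profanity list."""
    lex = {w.lower() for w in lexicon}

    def count(sentence):
        tokens = sentence.lower().split()
        return sum(1 for tok in tokens if tok.strip(".,!?") in lex)

    return count


def rate_per_100_tokens(sentences, count_fn):
    """Corpus-level rate: 100 * total hits / total whitespace tokens."""
    total_tokens = sum(len(s.split()) for s in sentences)
    if total_tokens == 0:
        return 0.0
    total_hits = sum(count_fn(s) for s in sentences)
    return 100.0 * total_hits / total_tokens
```

Running the same counter over the original and the cleaned targets gives a before/after pair comparable in form to the "MTNT → Bilingual" numbers, though absolute values will depend on the tokenizer and detector chosen.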

Results

Spelling/Grammar errors (per 100 tokens) in EN targets

Value: MTNT 1.712 → Bilingual 0.687

Baseline: MTNT 1.712

Emojis (per 100 tokens) in EN targets

Value: MTNT 0.031 → Bilingual 0.0

Baseline: MTNT 0.031

Semantic similarity (LASER)

Value: Bilingual 0.94, Translation 0.89, Monolingual 0.95

Baseline: Correction-tool 0.99

Relative BLEU gain G (%) from robustness training

Value: Bilingual C-MTNT up to +12.2% average G across augmentations

Baseline: BLEU from base (no augmentation)
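The relative gain G compares a robustness-trained model against the base model on the same test set. The sketch below assumes the usual percentage-change definition, G = 100 × (BLEU_aug − BLEU_base) / BLEU_base. That formula is an assumption here, since this summary does not spell out the paper's exact definition, and the function names are illustrative.

```python
def relative_gain(bleu_aug, bleu_base):
    """Percentage BLEU change of an augmented model over the base model.

    Assumed definition: 100 * (bleu_aug - bleu_base) / bleu_base.
    """
    if bleu_base <= 0:
        raise ValueError("base BLEU must be positive")
    return 100.0 * (bleu_aug - bleu_base) / bleu_base


def average_gain(bleu_base, bleu_by_augmentation):
    """Mean relative gain over several augmentation methods on one test set."""
    if not bleu_by_augmentation:
        raise ValueError("need at least one augmented score")
    gains = [relative_gain(b, bleu_base) for b in bleu_by_augmentation]
    return sum(gains) / len(gains)
```

Computing `average_gain` once with BLEU scores measured on MTNT and once with scores measured on C-MTNT gives the two numbers to compare. A larger value on C-MTNT would match the trend the paper reports.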

Who Should Care

What To Try In 7 Days

Run a small bilingual prompt (source+noisy target) with GPT-3.5/GPT-4 on a validation slice to see noise removal quality.

Filter or replace noisy references with LLM-cleaned versions and compare model gains on that eval set vs original.

Add noisy-source / clean-target training examples and measure relative BLEU gain to detect robustness improvements.
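For the reference-replacement step, one cautious approach is to accept an LLM-cleaned reference only when it still overlaps with the original, and to flag the rest for manual review. The sketch below uses token-level Jaccard overlap for that check. The `swap_references` helper and the `min_overlap=0.3` threshold are hypothetical placeholders, not values from the paper.

```python
def jaccard(a, b):
    """Token-set Jaccard overlap between two sentences (lowercased, whitespace split)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def swap_references(original_refs, cleaned_refs, min_overlap=0.3):
    """Replace references with cleaned versions when overlap is high enough.

    Returns (new_refs, flagged_indices). Flagged items keep the original
    reference and should be reviewed by hand.
    """
    if len(original_refs) != len(cleaned_refs):
        raise ValueError("reference lists must be the same length")
    new_refs, flagged = [], []
    for i, (orig, clean) in enumerate(zip(original_refs, cleaned_refs)):
        if clean.strip() and jaccard(orig, clean) >= min_overlap:
            new_refs.append(clean)
        else:
            new_refs.append(orig)
            flagged.append(i)
    return new_refs, flagged
```

Surface overlap is a blunt check: a legitimately heavy clean-up of a very noisy sentence can fall below the threshold, so flagged items are candidates for review and not necessarily errors. An embedding-based similarity such as LASER would likely be a better guard where it is available.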

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Relies on proprietary GPT-3.5 API; cost and availability may limit large-scale cleaning.
  • Evaluations and examples focus on English, French and Japanese; other languages untested.
  • Human evaluation is small and performed by paper authors, not an external crowd.
  • LLM-cleaned data can inherit biases from the LLM and original corpus.

When Not To Use

  • When you must preserve original noisy target text verbatim (e.g., forensic analysis).
  • For languages or dialects not represented in the paper (unknown LLM performance).
  • If API cost, privacy, or licensing blocks sending data to an external LLM.

Failure Modes

  • Monolingual cleaning can delete or add information and break alignment with the source (paper notes misalignment).
  • LLM hallucination or stylistic changes that reduce faithfulness to source meaning.
  • Less effective on Japanese: residual slang/profanity is higher and similarity scores are lower than for English and French.
  • Biases in LLM outputs reproduce or amplify dataset bias.

Core Entities

Models

  • GPT-3.5 (text-davinci-003)
  • GPT-4
  • Transformer (vanilla)
  • BERT (used for augmentation)

Metrics

  • BLEU
  • LASER (sentence embedding cosine)
  • ROUGE-1
  • Jaccard
  • Jaro-Winkler

Datasets

  • MTNT
  • C-MTNT (new)
  • Newstest2014
  • TED
  • KFTT
  • JESC
  • europarl-v7
  • news-commentary v10

Benchmarks

  • MTNT
  • C-MTNT
  • Newstest2014
  • TED
  • KFTT
  • JESC

Context Entities

Models

  • Llama 2 (mentioned as likely capable)

Datasets

  • WMT15 dev/test (used for validation/testing contexts)