Use GPT-3.5 to clean MTNT targets and build C-MTNT, a stronger noise benchmark

Overview

Decision Snapshot: Needs Validation

The idea is practical and is validated with automatic metrics, human and GPT-4 judgments, and MT experiments. However, it relies on proprietary LLM calls and limited human annotation, and it is tested on EN, FR, and JA only.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Quinten Bolding, Baohao Liao, Brandon James Denis, Jun Luo, Christof Monz

Why It Matters For Business

Cleaning noisy reference translations with an LLM yields low-noise evaluation sets that better reveal whether models truly handle noisy input; this helps teams avoid optimistic robustness claims and focus training effort where it actually helps.

Who Should Care

ML Engineer, Data Scientist, Engineering Lead, CTO, Product Manager

Summary TLDR

The authors apply GPT-3.5 prompts to clean noisy target sentences in the MTNT translation benchmark and release the cleaned dataset C-MTNT. They evaluate three prompt strategies (bilingual: source+noisy target, monolingual: noisy target only, translation: noisy source only). Human and GPT-4 judgments plus automatic measures show bilingual and translation methods reduce slang, emojis and profanities while preserving meaning. NMT models trained on augmented noisy sources show larger relative BLEU gains when evaluated on C-MTNT, suggesting it is a better benchmark of robustness than the original MTNT.

Problem Statement

MTNT contains natural noise on both source and target sides, which prevents using it to test whether an NMT model can translate noisy source text into a clean target. The paper aims to clean MTNT targets with an LLM so the dataset better measures robustness to noisy source input.

Main Contribution

Design three few-shot GPT-3.5 cleaning strategies: bilingual (use both source and noisy target), monolingual (target-only), and translation (source-only).

Create C-MTNT: cleaned target-side versions of MTNT for EN↔FR and EN→JA.
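The three strategies differ only in which fields the prompt exposes to the model. The sketch below shows that difference as plain prompt assembly. The instruction wording, the field labels, and the function name `build_cleaning_prompt` are hypothetical; the paper's actual prompts and few-shot examples are not reproduced here, and no API call is made.

```python
from typing import Optional

# Hypothetical instruction text and required fields per strategy.
# The paper's real prompt wording and few-shot examples are NOT reproduced.
STRATEGIES = {
    "bilingual": (
        "Rewrite the noisy target as a clean translation of the source.",
        ("source", "noisy_target"),
    ),
    "monolingual": (
        "Rewrite the noisy target as clean text, keeping its meaning.",
        ("noisy_target",),
    ),
    "translation": (
        "Translate the source into clean text without slang, emojis, or profanity.",
        ("source",),
    ),
}

LABELS = {"source": "Source", "noisy_target": "Noisy target"}


def build_cleaning_prompt(
    strategy: str,
    source: Optional[str] = None,
    noisy_target: Optional[str] = None,
    target_lang: str = "English",
) -> str:
    """Assemble prompt text that exposes only the fields the strategy uses."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    instruction, needed = STRATEGIES[strategy]
    fields = {"source": source, "noisy_target": noisy_target}
    lines = [instruction]
    for name in needed:
        if fields[name] is None:
            raise ValueError(f"{strategy} strategy requires {name}")
        lines.append(f"{LABELS[name]}: {fields[name]}")
    lines.append(f"Clean {target_lang} target:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_cleaning_prompt(
        "bilingual", source="c'est ouf lol", noisy_target="thats crazy lol"
    ))
```

Keeping the required fields in one table makes it easy to run all three strategies over the same sentence pairs and compare the outputs side by side.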

Key Findings

Bilingual and translation cleaning reduce target-side noise much more than the rule-based correction tool for EN and FR.

Numbers: EN spelling/grammar errors per 100 tokens: MTNT 1.712 → Bilingual 0.687; FR: MTNT 7.125 → Bilingual 0.552 (Table 2)

Practical Use: Use bilingual or translation prompts to clean references when you need low-noise target references for robustness tests.

Evidence Ref: Table 2

Cleaned sentences remain semantically close to originals under bilingual and monolingual methods.

Numbers: LASER similarity: bilingual 0.94, monolingual 0.95, translation 0.89 (Fr→En; Table 5)

Practical Use: LLM cleaning preserves meaning enough for evaluation if you choose bilingual or monolingual prompting and check embedding similarity.

Evidence Ref: Table 5
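An embedding-similarity check of this kind can be sketched as a cosine comparison between the original and cleaned sentence vectors. The example assumes the vectors have already been produced by a sentence encoder such as LASER; the toy vectors, the function names, and the 0.85 threshold are illustrative assumptions and do not come from the paper.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def flag_low_similarity(
    original_vecs: Sequence[Sequence[float]],
    cleaned_vecs: Sequence[Sequence[float]],
    threshold: float = 0.85,  # hypothetical cut-off, not from the paper
) -> List[int]:
    """Return indices of pairs whose similarity falls below the threshold."""
    return [
        i
        for i, (orig, clean) in enumerate(zip(original_vecs, cleaned_vecs))
        if cosine_similarity(orig, clean) < threshold
    ]


if __name__ == "__main__":
    # Toy 2-d vectors standing in for real sentence embeddings.
    originals = [[1.0, 0.0], [1.0, 0.0]]
    cleaned = [[2.0, 0.0], [0.0, 1.0]]
    print(flag_low_similarity(originals, cleaned))
```

Flagged indices are candidates for manual review, since a low score suggests the cleaned sentence may have drifted from the original meaning.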

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Spelling/Grammar errors (per 100 tokens) in EN targets	MTNT 1.712 → Bilingual 0.687	MTNT 1.712	-1.025 per 100 toks	EN targets (Table 2)	Table 2 reports counts per 100 tokens	Table 2
Emojis (per 100 tokens) in EN targets	MTNT 0.031 → Bilingual 0.0	MTNT 0.031	-0.031 per 100 toks	EN targets (Table 2)	Table 2 shows bilingual cleaning removes emojis in EN targets	Table 2
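The Delta column is the cleaned rate minus the MTNT baseline rate. A minimal sketch of that arithmetic follows, using the Table 2 values quoted above; the helper names are hypothetical, and how the paper counts errors and tokens is not specified here.

```python
def rate_per_100_tokens(error_count: int, token_count: int) -> float:
    """Normalize a raw error count to errors per 100 tokens."""
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    return 100.0 * error_count / token_count


def delta(baseline_rate: float, cleaned_rate: float) -> float:
    """Cleaned rate minus baseline rate; negative means less noise."""
    return round(cleaned_rate - baseline_rate, 3)


if __name__ == "__main__":
    # Values quoted from the Results table above (EN targets, Table 2).
    print(delta(1.712, 0.687))  # spelling/grammar errors: -1.025
    print(delta(0.031, 0.0))    # emojis: -0.031
```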

What To Try In 7 Days

Run a small bilingual prompt (source+noisy target) with GPT-3.5/GPT-4 on a validation slice to see noise removal quality.

Filter or replace noisy references with LLM-cleaned versions and compare model gains on that eval set vs original.

Add noisy-source / clean-target training examples and measure relative BLEU gain to detect robustness improvements.
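The last comparison can be sketched as a percent change in BLEU over the baseline model, computed separately on each evaluation set. This assumes "relative BLEU gain" means percent change over the baseline score, which may differ from the paper's exact definition. The BLEU scores below are made-up placeholders, not results from the paper.

```python
def relative_gain(baseline_bleu: float, augmented_bleu: float) -> float:
    """Percent change in BLEU of the augmented model over the baseline."""
    if baseline_bleu <= 0:
        raise ValueError("baseline_bleu must be positive")
    return 100.0 * (augmented_bleu - baseline_bleu) / baseline_bleu


if __name__ == "__main__":
    # Placeholder scores for illustration only: (baseline, augmented).
    scores = {
        "original eval set": (20.0, 21.0),
        "LLM-cleaned eval set": (20.0, 22.0),
    }
    for name, (base, aug) in scores.items():
        print(f"{name}: {relative_gain(base, aug):+.1f}% relative BLEU")
```

If the relative gain on the cleaned set is consistently larger than on the original set, that pattern matches what the summary reports for C-MTNT versus MTNT.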

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Relies on proprietary GPT-3.5 API; cost and availability may limit large-scale cleaning.

Evaluations and examples focus on English, French and Japanese; other languages untested.

When Not To Use

When you must preserve original noisy target text verbatim (e.g., forensic analysis).

For languages or dialects not represented in the paper (unknown LLM performance).

Failure Modes

Monolingual cleaning can delete or add information and break alignment with the source (paper notes misalignment).

LLM hallucination or stylistic changes that reduce faithfulness to source meaning.

Core Entities

Models

GPT-3.5 (text-davinci-003), GPT-4, Transformer (vanilla), BERT (used for augmentation)

Metrics

BLEU, LASER (sentence embedding cosine), ROUGE-1, Jaccard, Jaro-Winkler

Datasets

MTNT, C-MTNT (new), Newstest2014, TED, KFTT, JESC, europarl-v7, news-commentary v10

Benchmarks

MTNT, C-MTNT, Newstest2014, TED, KFTT, JESC

Context Entities

Models

Llama 2 (mentioned as likely capable)

Datasets

WMT15 dev/test (used for validation/testing contexts)
