Filter noisy ASR correction pairs by two likelihood tests and train the model to be conservative

Overview

Decision Snapshot: Ready For Pilot

The paper reports consistent gains across multiple internal Japanese benchmarks, with ablations on two LLMs (Swallow-Mistral and Sarashina-2). Because the training data and tests are internal and Japanese-only, external replication requires access to comparable data and LLMs.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Takuma Udagawa, Masayuki Suzuki, Masayasu Muraoka, Gakuto Kurata


Why It Matters For Business

Train ASR error-correction (EC) models to be conservative on noisy auto-paired data so they avoid risky, domain-blind edits that degrade real-world ASR output. This reduces the error rate in out-of-domain (OOD) scenarios without collecting new labeled data.

Who Should Care

ML Engineer, Engineering Lead, Product Manager, CTO

Summary TLDR

ASR error-correction models trained on automatic ASR-hypothesis → gold-reference pairs suffer from noisy pairs that cause overcorrection, especially out-of-domain (OOD). The authors propose two filters: (C1) prefer targets that improve linguistic acceptability measured by an LLM likelihood ratio, and (C2) prefer targets inferable from source phonemes using a phoneme-conditioned EC model. For filtered pairs they replace the target with the source (train the model to copy), enforcing conservative behavior. On 21 Japanese OOD benchmarks, conservative filtering cuts correction frequency (e.g., %EC from ~43% → ~14% for Swallow-Mistral) and improves OOD robustness: Swallow-Mistral C1+C2 improves /

Problem Statement

Current LLM-based ASR error correction often makes unnecessary or wrong edits ('overcorrection') when trained on noisy automatic pairs. Noisy pairs include corrections that don't improve acceptability or cannot be inferred from ASR output. This brittleness is worst in zero-resource OOD settings where domain-specific fine-tuning is not available. Evidence: a substantial share of training pairs fail the proposed C1/C2 tests (see §3).

Main Contribution

Two practical filtering criteria for EC training pairs: C1 (target is more linguistically acceptable than source) and C2 (target is inferable from source phonemes). (§3)

Conservative training procedure: replace filtered (noisy) targets with the source so the model learns to copy on ambiguous examples. (Figure 1, §3)
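The filter-then-copy procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `ScoredPair` fields, the zero thresholds, the rule that a pair must pass both C1 and C2 to keep its target, and the example scores are all assumptions made for this sketch. The scores are taken as precomputed (for example, an LLM log-likelihood ratio for C1 and a phoneme-conditioned EC model score for C2).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ScoredPair:
    source: str          # 1-best ASR hypothesis
    target: str          # gold reference transcript
    c1_log_ratio: float  # assumed: log p_LLM(target) - log p_LLM(source)
    c2_score: float      # assumed: inferability score from a phoneme-conditioned EC model


def conservative_pairs(
    pairs: List[ScoredPair],
    c1_threshold: float = 0.0,  # hypothetical threshold, not from the paper
    c2_threshold: float = 0.0,  # hypothetical threshold, not from the paper
) -> List[Tuple[str, str]]:
    """Keep the target only when both criteria pass; otherwise train to copy the source."""
    out = []
    for p in pairs:
        is_clean = p.c1_log_ratio > c1_threshold and p.c2_score > c2_threshold
        out.append((p.source, p.target if is_clean else p.source))
    return out


# Illustrative scores only; real values would come from the two scoring models.
examples = [
    ScoredPair("the whether is nice", "the weather is nice", 2.3, 1.1),
    ScoredPair("meeting at ten", "meeting with doctor tanaka at ten", 0.8, -4.0),
]
train_pairs = conservative_pairs(examples)
for src, tgt in train_pairs:
    print(f"{src!r} -> {tgt!r}")
```

In this sketch the second pair fails the assumed C2 test, since the added words cannot be recovered from the source phonemes, so its training target becomes a copy of the source.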

Key Findings

Unfiltered EC training worsens OOD CER due to overcorrection.

Numbers: Avg CER 11.84 → 12.51; %EC 43.0% (Swallow-Mistral, Table 2)

Practical Use: Don't train EC blindly on all auto-paired data; unfiltered pairs can degrade OOD performance.

Evidence Ref: Table 2; §5

Combined C1+C2 conservative filtering greatly reduces corrections and improves OOD robustness.

Numbers: Swallow-Mistral: %EC 43.0% → 13.9%, Avg CER 11.84 → 11.69, improved on 71.4% (15/21) tests (Table 2)

Practical Use: Applying both filters and training the model to copy filtered pairs yields fewer risky edits and better average OOD accuracy.

Evidence Ref: Table 2; §5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Swallow-Mistral 7B: No Filter vs C1+C2 (Avg over 21 tests)	No Filter: CER 12.51, %EC 43.0%; C1+C2: CER 11.69, %EC 13.9%	Original ASR CER 11.84	No Filter worsens Avg CER (+0.67); C1+C2 improves Avg CER (-0.15)	21 internal OOD test sets (macro avg)	Table 2 (Avg. row)	Table 2
Swallow-Mistral 7B: C1+C2 improvement ratio	C1+C2 reduced corrections and improved CER on 71.4% (15/21) of test sets	Original ASR	Improved on 15 of 21 tests	21 internal tests	Table 2 (< Orig. row)	Table 2
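As a quick check on the table, the deltas and the improvement ratio can be recomputed from the reported values. The numbers below are copied from the rows above; the variable names are only for illustration.

```python
# Values as reported for Swallow-Mistral 7B (Table 2, Avg. row).
orig_cer = 11.84       # original ASR output, no error correction
no_filter_cer = 12.51  # EC trained on all auto-paired data
c1c2_cer = 11.69       # EC trained with C1+C2 conservative filtering

delta_no_filter = round(no_filter_cer - orig_cer, 2)  # +0.67: worse than original ASR
delta_c1c2 = round(c1c2_cer - orig_cer, 2)            # -0.15: better than original ASR

# Share of test sets where C1+C2 beats the original ASR output.
improved_share = round(100 * 15 / 21, 1)              # 71.4

print(delta_no_filter, delta_c1c2, improved_share)
```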

What To Try In 7 Days

Compute LLM log-likelihoods for source vs target (C1) on a sample of your EC pairs.

Train a lightweight phoneme-conditioned EC model and compute inferability scores (C2) on the same sample.

Replace flagged targets with sources and finetune an LLM EC model with LoRA for a few hundred steps; evaluate CER and %EC on held-out OOD sets.
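For the evaluation part of the last step, the two headline metrics can be computed with the standard library alone. This is a sketch under assumptions: CER is taken as total character edit distance divided by total reference characters within one test set, and %EC as the share of hypotheses that the EC model changed at all. The paper's exact normalization and averaging may differ, and the function names and toy strings are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Character error rate in percent over one test set."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return 100.0 * errors / chars


def percent_ec(asr_hyps: Sequence[str], corrected: Sequence[str]) -> float:
    """Percent of ASR hypotheses altered by the EC model."""
    changed = sum(a != c for a, c in zip(asr_hyps, corrected))
    return 100.0 * changed / len(asr_hyps)


# Toy data: the EC model fixes one utterance and breaks another.
refs = ["abcd", "efgh"]
asr = ["abxd", "efgh"]
corrected = ["abcd", "efgg"]
print(cer(refs, asr), cer(refs, corrected), percent_ec(asr, corrected))
```

Comparing CER before and after EC alongside %EC on each held-out set shows whether a drop in correction frequency comes with better or worse accuracy.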

Optimization Features

Training Optimization

LoRA

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://huggingface.co/tokyotech-llm/Swallow-MS-7b-v0.1

https://huggingface.co/sbintuitions/sarashina2-7b

Risks & Boundaries

Limitations

Experiments use internal Japanese training data and 21 internal test sets; external generalization unverified.

Only 1-best ASR hypothesis used; N-best or speech units not explored.

When Not To Use

When you can fine-tune EC on in-domain labeled examples — in-domain EC may outperform zero-resource approaches.

When you lack an LLM or phoneme extraction pipeline and cannot compute the filter signals.

Failure Modes

Over-conservatism: filtering out too many useful pairs leaves the model unable to fix correctable errors.

Phoneme errors or poor phoneme alignment weaken the inferability signal, so clean pairs are misclassified as noisy.
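One way to watch for over-conservatism is to track how many training pairs are turned into copy examples as the filter threshold moves. The sketch below assumes a single per-pair score compared against a threshold; the `copy_rate` helper, the scores, and the thresholds are hypothetical and not taken from the paper.

```python
from typing import Sequence


def copy_rate(scores: Sequence[float], threshold: float) -> float:
    """Percent of pairs whose score does not exceed the threshold (trained as copies)."""
    failed = sum(s <= threshold for s in scores)
    return 100.0 * failed / len(scores)


# Illustrative filter scores for eight training pairs.
scores = [-1.2, -0.3, 0.1, 0.4, 0.9, 1.5, 2.0, 2.2]

# A stricter threshold converts more pairs into copy examples.
rates = {t: copy_rate(scores, t) for t in (0.0, 0.5, 1.0)}
for threshold, rate in rates.items():
    print(f"threshold={threshold}: {rate:.1f}% of pairs become copy examples")
```

A copy rate that climbs steeply with small threshold changes suggests the filter is sensitive and worth validating against CER on held-out data before committing to a setting.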

Core Entities

Models

Swallow-Mistral 7B

Sarashina-2 7B

Conformer-CTC (internal ASR baseline)

DeBERTa V2 large (Japanese) for %LA scoring

Metrics

CER (character error rate)

%EC (percent of hypotheses altered after EC)

%LA (percent improved linguistic acceptability by masked-LM score)

Datasets

ASR training data (~8000 hours transcribed speech, internal)

21 internal test benchmarks (various domains, Table 5)

Benchmarks

21 internal benchmarks (business, lectures, presentations, news, customer support, etc.)

