Filter noisy ASR correction pairs by two likelihood tests and train the model to be conservative

July 18, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The paper reports consistent gains across multiple internal Japanese benchmarks and ablations (Swallow-Mistral and Sarashina-2). It relies on internal data and Japanese-only tests, so external replication requires access to comparable data and LLMs.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Takuma Udagawa, Masayuki Suzuki, Masayasu Muraoka, Gakuto Kurata

Why It Matters For Business

Train ASR error-correction (EC) models to behave conservatively on noisy auto-paired data, so they avoid risky, domain-blind edits that degrade real-world ASR output. This lowers the error rate in out-of-domain (OOD) scenarios without collecting new labeled data.

Summary TLDR

ASR error-correction (EC) models trained on automatically paired data (ASR hypothesis → gold reference) suffer from noisy pairs that cause overcorrection, especially out-of-domain (OOD). The authors propose two filters: (C1) keep targets that improve linguistic acceptability, measured by an LLM likelihood ratio, and (C2) keep targets that are inferable from the source phonemes, judged by a phoneme-conditioned EC model. For filtered pairs they replace the target with the source, so the model is trained to copy and therefore behaves conservatively. On 21 Japanese OOD benchmarks, conservative filtering cuts correction frequency (e.g., %EC from ~43% to ~14% for Swallow-Mistral) and improves OOD robustness: Swallow-Mistral with C1+C2 improves over the original ASR output on 15/21 test sets (Table 2).

Problem Statement

Current LLM-based ASR error correction often makes unnecessary or wrong edits ('overcorrection') when trained on noisy automatic pairs. Noisy pairs include corrections that don't improve acceptability or cannot be inferred from ASR output. This brittleness is worst in zero-resource OOD settings where domain-specific fine-tuning is not available. Evidence: a substantial share of training pairs fail the proposed C1/C2 tests (see §3).

Main Contribution

Two practical filtering criteria for EC training pairs: C1 (target is more linguistically acceptable than source) and C2 (target is inferable from source phonemes). (§3)

Conservative training procedure: replace filtered (noisy) targets with the source so the model learns to copy on ambiguous examples. (Figure 1, §3)
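The conservative procedure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `Pair`, `conservative_relabel`, the two scorer callables, and the zero thresholds are hypothetical names and defaults introduced here. The scorers stand in for the LLM likelihood comparison (C1) and the phoneme-conditioned EC model (C2), whose exact scoring rules are not given in this summary.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Pair:
    source: str  # ASR hypothesis
    target: str  # gold reference (or the source itself after relabeling)


def conservative_relabel(
    pairs: List[Pair],
    acceptability_gain: Callable[[str, str], float],  # assumed C1 score
    inferability: Callable[[str, str], float],        # assumed C2 score
    c1_threshold: float = 0.0,  # assumption: a positive score means "pass"
    c2_threshold: float = 0.0,  # assumption: a positive score means "pass"
) -> List[Pair]:
    """Keep a target only if it passes both C1 and C2; otherwise copy the source."""
    relabeled = []
    for pair in pairs:
        passes_c1 = acceptability_gain(pair.source, pair.target) > c1_threshold
        passes_c2 = inferability(pair.source, pair.target) > c2_threshold
        if passes_c1 and passes_c2:
            relabeled.append(pair)
        else:
            # Noisy pair: train the model to copy the ASR hypothesis unchanged.
            relabeled.append(Pair(pair.source, pair.source))
    return relabeled
```

The number of pairs stays the same; only the targets of filtered pairs change. That is what separates this from simply dropping noisy pairs: the model still sees those inputs and learns to leave them alone.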

Key Findings

Unfiltered EC training worsens OOD CER due to overcorrection.

Numbers: Avg CER 11.84 → 12.51; %EC 43.0% (Swallow-Mistral, Table 2)

Practical Use: Don't train EC blindly on all auto-paired data; unfiltered pairs can degrade OOD performance.

Evidence Ref: Table 2; §5

Combined C1+C2 conservative filtering greatly reduces corrections and improves OOD robustness.

Numbers: Swallow-Mistral: %EC 43.0% → 13.9%, Avg CER 11.84 → 11.69, improved on 71.4% (15/21) of tests (Table 2)

Practical Use: Applying both filters and training the model to copy filtered pairs yields fewer risky edits and better average OOD accuracy.

Evidence Ref: Table 2; §5

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Swallow-Mistral 7B: No Filter vs C1+C2 (Avg over 21 tests) | No Filter: CER 12.51, %EC 43.0%; C1+C2: CER 11.69, %EC 13.9% | Original ASR CER 11.84 | No Filter worsens Avg CER (+0.67); C1+C2 improves Avg CER (-0.15) | 21 internal OOD test sets (macro avg) | Table 2 (Avg. row) | Table 2 |
| Swallow-Mistral 7B: C1+C2 improvement ratio | C1+C2 reduced corrections and improved CER on 71.4% (15/21) of test sets | Original ASR | Improved on 15 of 21 tests | 21 internal tests | Table 2 (< Orig. row) | Table 2 |
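The deltas in the Results rows can be checked with a few lines of arithmetic. The sketch below only re-derives figures from the numbers reported for Swallow-Mistral 7B (Table 2). The one exception is `ec_relative_drop`, a derived quantity that the source does not state. All variable names are illustrative.

```python
# Figures copied from the Results rows (Swallow-Mistral 7B, Table 2).
ORIG_CER = 11.84       # original ASR output, macro average over 21 tests
NO_FILTER_CER = 12.51  # EC trained on all pairs
C1C2_CER = 11.69       # EC trained with C1+C2 conservative filtering
NO_FILTER_EC = 43.0    # %EC without filtering
C1C2_EC = 13.9         # %EC with C1+C2
IMPROVED, TOTAL = 15, 21


def delta(value: float, baseline: float) -> float:
    """Absolute CER change versus the baseline, rounded to two decimals."""
    return round(value - baseline, 2)


no_filter_delta = delta(NO_FILTER_CER, ORIG_CER)  # positive means worse
c1c2_delta = delta(C1C2_CER, ORIG_CER)            # negative means better
improved_share = round(100 * IMPROVED / TOTAL, 1)

# Derived here, not reported in the source: relative drop in correction rate.
ec_relative_drop = round(100 * (NO_FILTER_EC - C1C2_EC) / NO_FILTER_EC, 1)

print(no_filter_delta, c1c2_delta, improved_share, ec_relative_drop)
```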

What To Try In 7 Days

Compute LLM log-likelihoods for source vs target (C1) on a sample of your EC pairs.

Train a lightweight phoneme-conditioned EC model and compute inferability scores (C2) on the same sample.

Replace flagged targets with sources and finetune an LLM EC model with LoRA for a few hundred steps; evaluate CER and %EC on held-out OOD sets.
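Before any fine-tuning, it may help to audit how many sampled pairs each filter would flag, since flagging too many pairs leaves little to learn from. The sketch below assumes the scores are already computed: the C1 score as log-likelihood of the target minus log-likelihood of the source under the LLM, and the C2 score as whatever inferability score the phoneme-conditioned model yields. The function name, the score convention, and the zero thresholds are assumptions for illustration only.

```python
from typing import Dict, Iterable, Tuple


def audit_filters(
    scores: Iterable[Tuple[float, float]],  # (c1_score, c2_score) per pair
    c1_threshold: float = 0.0,  # assumption: score must exceed this to pass C1
    c2_threshold: float = 0.0,  # assumption: score must exceed this to pass C2
) -> Dict[str, float]:
    """Report the percentage of pairs failing C1, failing C2, failing either, and kept."""
    rows = list(scores)
    if not rows:
        raise ValueError("need at least one scored pair")
    n = len(rows)
    fail_c1 = sum(c1 <= c1_threshold for c1, _ in rows)
    fail_c2 = sum(c2 <= c2_threshold for _, c2 in rows)
    fail_any = sum(c1 <= c1_threshold or c2 <= c2_threshold for c1, c2 in rows)
    return {
        "fail_c1_pct": 100.0 * fail_c1 / n,
        "fail_c2_pct": 100.0 * fail_c2 / n,
        "fail_any_pct": 100.0 * fail_any / n,
        "kept_pct": 100.0 * (n - fail_any) / n,
    }


if __name__ == "__main__":
    sample = [(1.2, 0.4), (-0.3, 0.9), (0.8, -0.5), (-1.0, -0.2)]
    print(audit_filters(sample))
```

Sweeping the two thresholds over a held-out sample and re-running the audit is one way to see the trade-off between how many pairs are kept and how conservative the model becomes.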

Optimization Features

Training Optimization
LoRA

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Experiments use internal Japanese training data and 21 internal test sets; external generalization unverified.

Only 1-best ASR hypothesis used; N-best or speech units not explored.

When Not To Use

When you can fine-tune EC on in-domain labeled examples — in-domain EC may outperform zero-resource approaches.

When you lack an LLM or phoneme extraction pipeline and cannot compute the filter signals.

Failure Modes

Over-conservatism: filtering out too many useful pairs leaves the model unable to fix errors it could have corrected.

Phoneme errors or poor phoneme alignment weaken the inferability signal, so clean pairs can be misclassified as noisy.

Core Entities

Models

Swallow-Mistral 7B
Sarashina-2 7B
Conformer-CTC (internal ASR baseline)
DeBERTa V2 large (Japanese) for %LA scoring

Metrics

CER (character error rate)
%EC (percent of hypotheses altered after EC)
%LA (percent improved linguistic acceptability by masked-LM score)
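For evaluating your own data, CER and %EC can be computed as in the sketch below. It assumes corpus-level CER (total character edits divided by total reference characters) and no text normalization; the paper's exact scoring setup is not described in this summary, so treat both choices, and the function names, as assumptions.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level character error rate, in percent."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return 100.0 * total_edits / total_chars


def percent_ec(asr_hyps: Sequence[str], corrected: Sequence[str]) -> float:
    """Percent of ASR hypotheses that the EC model changed at all."""
    changed = sum(a != c for a, c in zip(asr_hyps, corrected))
    return 100.0 * changed / len(asr_hyps)


if __name__ == "__main__":
    refs = ["今日は晴れ"]
    asr = ["今日は腫れ"]  # one substituted character out of five
    print(cer(refs, asr), percent_ec(asr, asr))
```

Reporting %EC alongside CER matters here: a model can lower average CER while still editing far more hypotheses than necessary, which is the overcorrection this work targets.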

Datasets

ASR training data (~8000 hours transcribed speech, internal)
21 internal test benchmarks (various domains, Table 5)

Benchmarks

21 internal benchmarks (business, lectures, presentations, news, customer support, etc.)