Filter noisy ASR correction pairs by two likelihood tests and train the model to be conservative

July 18, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Takuma Udagawa, Masayuki Suzuki, Masayasu Muraoka, Gakuto Kurata

Why It Matters For Business

Train ASR error-correction (EC) models to be conservative on noisy, automatically paired data so they avoid risky, domain-blind edits that degrade real-world ASR output. This reduces the error rate in out-of-domain (OOD) scenarios without collecting new labeled data.

Summary TLDR

ASR error-correction (EC) models trained on automatically paired data (ASR hypothesis → gold reference) suffer from noisy pairs that cause overcorrection, especially out of domain (OOD). The authors propose two filters: (C1) keep targets that improve linguistic acceptability, measured by an LLM likelihood ratio, and (C2) keep targets that are inferable from the source phonemes, measured by a phoneme-conditioned EC model. For pairs that fail a filter, they replace the target with the source, which trains the model to copy and enforces conservative behavior. On 21 Japanese OOD benchmarks, conservative filtering cuts correction frequency (e.g., %EC from ~43% to ~14% for Swallow-Mistral) and improves OOD robustness: Swallow-Mistral with C1+C2 improves CER on 15 of 21 test sets (Table 2).
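A minimal sketch of how the two tests might be combined into a keep/flag decision, assuming the per-pair log-likelihoods have already been computed. The field names, the per-token normalization in C2, and the default thresholds are illustrative assumptions, not the authors' exact formulation.

```python
from dataclasses import dataclass


@dataclass
class PairScores:
    """Precomputed scores for one (source, target) pair. All names are hypothetical."""
    ll_source: float                # LLM log-likelihood of the ASR hypothesis (source)
    ll_target: float                # LLM log-likelihood of the reference (target)
    ll_target_from_phonemes: float  # log-likelihood of the target given source phonemes
    target_len: int                 # target length in tokens, used for normalization


def passes_c1(s: PairScores, c1: float = 0.0) -> bool:
    # C1: the target should be more acceptable than the source,
    # expressed as a log-likelihood ratio compared with threshold c1.
    return (s.ll_target - s.ll_source) > c1


def passes_c2(s: PairScores, c2: float = -1.0) -> bool:
    # C2: the target should be inferable from the source phonemes.
    # Assumption: per-token normalized log-likelihood compared with threshold c2.
    return s.ll_target_from_phonemes / max(s.target_len, 1) > c2


def is_clean(s: PairScores, c1: float = 0.0, c2: float = -1.0) -> bool:
    # A pair is kept as-is only if it passes both tests.
    return passes_c1(s, c1) and passes_c2(s, c2)


good = PairScores(ll_source=-40.0, ll_target=-30.0, ll_target_from_phonemes=-5.0, target_len=10)
fails_c1 = PairScores(ll_source=-30.0, ll_target=-40.0, ll_target_from_phonemes=-5.0, target_len=10)
fails_c2 = PairScores(ll_source=-40.0, ll_target=-30.0, ll_target_from_phonemes=-50.0, target_len=10)
```

Under these made-up scores, only `good` passes both tests; the other two pairs would be flagged as noisy.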

Problem Statement

Current LLM-based ASR error correction often makes unnecessary or wrong edits ('overcorrection') when trained on noisy automatic pairs. Noisy pairs include corrections that do not improve acceptability or that cannot be inferred from the ASR output. This brittleness is worst in zero-resource OOD settings, where domain-specific fine-tuning is not available. Evidence: a substantial share of training pairs fails the proposed C1/C2 tests (34% are flagged by C1 and 33% by C2 among effective pairs; see §3).

Main Contribution

Two practical filtering criteria for EC training pairs: C1 (target is more linguistically acceptable than source) and C2 (target is inferable from source phonemes). (§3)

Conservative training procedure: replace filtered (noisy) targets with the source so the model learns to copy on ambiguous examples. (Figure 1, §3)

Large-scale Japanese experiments (Conformer-CTC baseline; Swallow-Mistral and Sarashina-2 7B LLMs) on 21 OOD benchmarks showing reduced overcorrection and improved OOD CER. (Tables 2, 7)
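The conservative training procedure listed above (replace noisy targets with the source so the model learns to copy) can be sketched as a relabeling pass over the training pairs. The function name, the boolean flag list, and the English toy strings are hypothetical; the flags would come from whatever filter is in use.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (source ASR hypothesis, target reference)


def make_conservative(pairs: List[Pair], clean_flags: List[bool]) -> List[Pair]:
    """Keep clean pairs unchanged; turn flagged pairs into copy examples."""
    if len(pairs) != len(clean_flags):
        raise ValueError("pairs and clean_flags must have the same length")
    # Flagged pairs are not dropped: the target is replaced with the source,
    # so the dataset size stays the same and the model sees "copy" examples.
    return [(src, tgt if clean else src) for (src, tgt), clean in zip(pairs, clean_flags)]


pairs = [
    ("recognize speech", "recognise speech"),    # flagged as noisy in this toy example
    ("wreck a nice beach", "recognize speech"),  # treated as a clean correction
]
relabeled = make_conservative(pairs, clean_flags=[False, True])
```

Replacing rather than deleting flagged pairs is the point of the design: the model is explicitly taught to leave ambiguous inputs untouched.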

Key Findings

Unfiltered EC training worsens OOD CER due to overcorrection.

Numbers: Avg CER 11.84 (original ASR) → 12.51 (No Filter); %EC 43.0% (Swallow-Mistral, Table 2)

Combined C1+C2 conservative filtering greatly reduces corrections and improves OOD robustness.

Numbers: Swallow-Mistral: %EC 43.0% (No Filter) → 13.9%; Avg CER 11.84 (original ASR) → 11.69; CER improved on 71.4% (15/21) of test sets (Table 2)

A stronger LLM yields larger gains and shows the same pattern.

Numbers: Sarashina-2: best-case Avg CER 11.41; C1+C2 improved CER on 85.7% (18/21) of test sets (Table 7, §C)

Simple edit-distance filtering is ineffective for OOD robustness.

Numbers: Edit-distance filter (0.5): Avg CER 12.65; improved on 42.9% of test sets (Table 3 vs Table 2)

A sizable fraction of training pairs are noisy by these criteria.

Numbers: Of effective pairs: 34% flagged by C1, 33% by C2, 42% by at least one of the two (§3)

Results

Swallow-Mistral 7B: No Filter vs C1+C2 (Avg over 21 tests)

Value: No Filter: CER 12.51, %EC 43.0%; C1+C2: CER 11.69, %EC 13.9%

Baseline: Original ASR CER 11.84
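To put these averages in relative terms, the short calculation below compares each system with the original ASR baseline. It only uses the numbers reported above; the helper name is hypothetical.

```python
def relative_change_pct(baseline: float, value: float) -> float:
    """Relative change of `value` versus `baseline`, in percent (negative = improvement for CER)."""
    return (value - baseline) / baseline * 100.0


asr_cer = 11.84        # original ASR, Avg CER
no_filter_cer = 12.51  # EC trained without filtering
c1c2_cer = 11.69       # EC trained with C1+C2 conservative filtering

no_filter_delta = relative_change_pct(asr_cer, no_filter_cer)
c1c2_delta = relative_change_pct(asr_cer, c1c2_cer)

print(f"No Filter: {no_filter_delta:+.2f}% relative CER change")
print(f"C1+C2:     {c1c2_delta:+.2f}% relative CER change")
```

The unfiltered model is roughly 5.7% worse than the baseline in relative terms, while C1+C2 is roughly 1.3% better.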

Swallow-Mistral 7B: C1+C2 improvement ratio

Value: C1+C2 reduced corrections and improved CER on 71.4% (15/21) of test sets

Baseline: Original ASR

Sarashina-2 7B: Best runs

Value: Best: Avg CER 11.41 (C1-only); C1+C2: Avg CER 11.47, improved on 85.7% (18/21) of test sets

Baseline: Original ASR CER 11.84

Edit-distance filtering baseline

Value: Edit-distance filter (normalized, 0.5): Avg CER 12.65; improved on 42.9% of test sets

Baseline: Original ASR CER 11.84
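For comparison, an edit-distance filter of the kind used as a baseline here can be sketched as follows. The source only gives the value 0.5; normalizing by the longer string and flagging pairs whose distance exceeds the threshold are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(source: str, target: str) -> float:
    # Assumption: normalize by the longer of the two strings.
    return levenshtein(source, target) / max(len(source), len(target), 1)


def flag_by_edit_distance(source: str, target: str, threshold: float = 0.5) -> bool:
    # Assumption: pairs that differ too much are treated as noisy.
    return normalized_edit_distance(source, target) > threshold
```

A filter like this only looks at surface differences, which may be one reason it does not help OOD robustness: it cannot tell whether an edit improves acceptability or is recoverable from the audio.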

Noisy-pair prevalence in training

Value: ASR exact-match rate ~34%; of effective pairs, 34% flagged by C1, 33% by C2, 42% by at least one of the two

Baseline: Full training set

Who Should Care

What To Try In 7 Days

Compute LLM log-likelihoods for source vs target (C1) on a sample of your EC pairs.

Train a lightweight phoneme-conditioned EC model and compute inferability scores (C2) on the same sample.

Replace flagged targets with their sources and fine-tune an LLM EC model with LoRA for a few hundred steps; evaluate CER and %EC on held-out OOD sets.

Optimization Features

Training Optimization

  • LoRA

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments use internal Japanese training data and 21 internal test sets; external generalization unverified.
  • Only 1-best ASR hypothesis used; N-best or speech units not explored.
  • Filtering thresholds (c1, c2) need tuning per language/model.
  • Method requires an LLM and phoneme extractor; compute and tooling cost may be nontrivial.
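Because the thresholds need tuning, it can help to see how many pairs a given setting would turn into copy examples before any training run. The sketch below sweeps a threshold over precomputed scores; the scores are made up, and the flagging rule (score not above the threshold) is an assumption.

```python
from typing import Dict, Sequence


def flagged_fraction(scores: Sequence[float], threshold: float) -> float:
    """Share of pairs whose score does not exceed the threshold (these would be flagged)."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s <= threshold) / len(scores)


def sweep(scores: Sequence[float], thresholds: Sequence[float]) -> Dict[float, float]:
    """Map each candidate threshold to the fraction of pairs it would flag."""
    return {t: flagged_fraction(scores, t) for t in thresholds}


# Made-up C1 scores (target log-likelihood minus source log-likelihood) for eight pairs.
toy_c1_scores = [-3.0, -1.5, -0.2, 0.1, 0.4, 1.0, 2.5, 4.0]
result = sweep(toy_c1_scores, [-1.0, 0.0, 1.0])

for t, frac in result.items():
    print(f"c1={t:+.1f}: {frac:.0%} of pairs would be replaced with copies")
```

A higher threshold flags more pairs and pushes the model toward copying, so the sweep gives a quick view of the trade-off between continued overcorrection and excessive conservatism.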

When Not To Use

  • When you can fine-tune EC on in-domain labeled examples: in-domain EC may outperform zero-resource approaches.
  • When you lack an LLM or a phoneme extraction pipeline and cannot compute the filter signals.
  • When aggressive correction is desired (e.g., high-recall editing): conservative behavior may be too cautious.

Failure Modes

  • Over-conservatism: filtering out too many useful pairs makes the model miss fixable errors.
  • Phoneme errors or poor phoneme alignment weaken the inferability signal, so clean pairs are misclassified as noisy.
  • Poorly set thresholds cause either continued overcorrection or excessive copying.

Core Entities

Models

  • Swallow-Mistral 7B
  • Sarashina-2 7B
  • Conformer-CTC (internal ASR baseline)
  • DeBERTa V2 large (Japanese) for %LA scoring

Metrics

  • CER (character error rate)
  • %EC (percent of hypotheses altered after EC)
  • %LA (percent of corrections that improve linguistic acceptability, as scored by a masked LM)
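A minimal sketch of how CER and %EC could be computed on a held-out set, assuming CER is aggregated as total character edits over total reference characters (the source does not state the aggregation). Function names and the toy strings are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def corpus_cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """CER in percent: total character edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return 100.0 * total_edits / total_chars


def percent_ec(asr_hyps: Sequence[str], corrected: Sequence[str]) -> float:
    """%EC: percent of ASR hypotheses that the EC model altered."""
    changed = sum(1 for a, c in zip(asr_hyps, corrected) if a != c)
    return 100.0 * changed / len(asr_hyps)


refs = ["abcd", "efgh"]
asr = ["abxd", "efgh"]        # one character error in the first hypothesis
corrected = ["abcd", "efgh"]  # EC fixes it and leaves the second untouched
```

In this toy case the ASR output has a CER of 12.5%, the corrected output has a CER of 0%, and %EC is 50% because one of the two hypotheses was changed.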

Datasets

  • ASR training data (~8,000 hours of transcribed speech, internal)
  • 21 internal test benchmarks (various domains, Table 5)

Benchmarks

  • 21 internal benchmarks (business, lectures, presentations, news, customer support, etc.)