Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Train ASR error-correction (EC) models to be conservative on noisy, automatically paired data so they avoid risky, domain-blind edits that degrade real-world ASR output. This reduces error rates in out-of-domain (OOD) scenarios without collecting new labeled data.
Summary TLDR
ASR error-correction models trained on automatic ASR-hypothesis → gold-reference pairs suffer from noisy pairs that cause overcorrection, especially out-of-domain (OOD). The authors propose two filters: (C1) prefer targets that improve linguistic acceptability measured by an LLM likelihood ratio, and (C2) prefer targets inferable from source phonemes using a phoneme-conditioned EC model. For filtered pairs they replace the target with the source (train the model to copy), enforcing conservative behavior. On 21 Japanese OOD benchmarks, conservative filtering cuts correction frequency (e.g., %EC from ~43% → ~14% for Swallow-Mistral) and improves OOD robustness: Swallow-Mistral C1+C2 improves /
Problem Statement
Current LLM-based ASR error correction often makes unnecessary or wrong edits ('overcorrection') when trained on noisy automatic pairs. Noisy pairs include corrections that don't improve acceptability or cannot be inferred from ASR output. This brittleness is worst in zero-resource OOD settings where domain-specific fine-tuning is not available. Evidence: a substantial share of training pairs fail the proposed C1/C2 tests (see §3).
Main Contribution
Two practical filtering criteria for EC training pairs: C1 (target is more linguistically acceptable than source) and C2 (target is inferable from source phonemes). (§3)
Conservative training procedure: replace filtered (noisy) targets with the source so the model learns to copy on ambiguous examples. (Figure 1, §3)
Large-scale Japanese experiments (Conformer-CTC baseline; Swallow-Mistral and Sarashina-2 7B LLMs) on 21 OOD benchmarks showing reduced overcorrection and improved OOD CER. (Tables 2, 7)
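The conservative training procedure can be sketched as a relabeling pass over the training pairs. This is a minimal sketch, not the authors' implementation: the `Pair` type, the `c1_score` and `c2_score` callables, and the rule that a pair must pass both thresholds to keep its target are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Pair:
    source: str  # 1-best ASR hypothesis
    target: str  # gold reference transcript


def conservative_relabel(
    pairs: List[Pair],
    c1_score: Callable[[Pair], float],  # hypothetical: acceptability gain of target over source
    c2_score: Callable[[Pair], float],  # hypothetical: inferability of target from source phonemes
    c1_threshold: float,
    c2_threshold: float,
) -> List[Pair]:
    """Keep pairs that pass both criteria; turn the rest into copy examples."""
    relabeled = []
    for pair in pairs:
        # Assumption: C1 and C2 are combined with a logical AND.
        passes = c1_score(pair) > c1_threshold and c2_score(pair) > c2_threshold
        # A filtered pair is not dropped: its target is replaced by its source,
        # so the model is trained to leave such inputs unchanged.
        relabeled.append(pair if passes else Pair(pair.source, pair.source))
    return relabeled
```

Keeping filtered pairs as copy examples, rather than discarding them, is what teaches the model to abstain from editing when the evidence for a correction is weak.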
Key Findings
Unfiltered EC training worsens OOD CER due to overcorrection.
Combined C1+C2 conservative filtering greatly reduces corrections and improves OOD robustness.
Using a stronger LLM yields larger gains, but the same pattern holds.
Simple edit-distance filtering is ineffective for OOD robustness.
A sizable fraction of training pairs are noisy by these criteria.
Results
Swallow-Mistral 7B: No Filter vs C1+C2 (Avg over 21 tests)
Swallow-Mistral 7B: C1+C2 improvement ratio
Sarashina-2 7B: Best runs
Edit-distance filtering baseline
Noisy-pair prevalence in training
Who Should Care
What To Try In 7 Days
Compute LLM log-likelihoods for source vs target (C1) on a sample of your EC pairs.
Train a lightweight phoneme-conditioned EC model and compute inferability scores (C2) on the same sample.
Replace flagged targets with sources and finetune an LLM EC model with LoRA for a few hundred steps; evaluate CER and %EC on held-out OOD sets.
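The first step can be prototyped before committing to a specific LLM. The sketch below computes a C1-style score as a log-likelihood difference between target and source, with the scorer passed in as a function. The character unigram scorer is only a toy stand-in so the example runs without model downloads; in practice `log_likelihood` would wrap an LLM, and all function names here are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def c1_scores(
    pairs: Iterable[Tuple[str, str]],
    log_likelihood: Callable[[str], float],
) -> List[float]:
    """Log-likelihood ratio log p(target) - log p(source) for each (source, target) pair."""
    return [log_likelihood(target) - log_likelihood(source) for source, target in pairs]


def make_char_unigram_scorer(corpus: Iterable[str]) -> Callable[[str], float]:
    """Toy stand-in for an LLM: add-one-smoothed character unigram log-likelihood."""
    counts = Counter(ch for text in corpus for ch in text)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen characters

    def log_likelihood(text: str) -> float:
        # Sum of per-character log-probabilities; unseen characters get the smoothed floor.
        return sum(math.log((counts[ch] + 1) / (total + vocab)) for ch in text)

    return log_likelihood


if __name__ == "__main__":
    scorer = make_char_unigram_scorer(["aaaa", "ab"])
    # A positive score means the target is more likely than the source under the scorer.
    print(c1_scores([("b", "a"), ("a", "z")], scorer))
```

An unnormalized log-likelihood favors shorter strings, so pairs whose source and target differ a lot in length may need a length-normalized variant; whether the paper normalizes is not stated here.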
Optimization Features
Training Optimization
- LoRA
Reproducibility
Code Urls
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments use internal Japanese training data and 21 internal test sets; external generalization unverified.
- Only 1-best ASR hypothesis used; N-best or speech units not explored.
- Filtering thresholds (c1, c2) need tuning per language/model.
- Method requires an LLM and phoneme extractor; compute and tooling cost may be nontrivial.
When Not To Use
- When you can fine-tune EC on in-domain labeled examples: in-domain EC may outperform zero-resource approaches.
- When you lack an LLM or phoneme extraction pipeline and cannot compute the filter signals.
- When aggressive correction is desired (e.g., high-recall editing), conservative behavior may be too cautious.
Failure Modes
- Over-conservatism: filtering out too many useful pairs causes the model to miss fixable errors.
- Phoneme errors or poor phoneme alignment reduce inferability signal, misclassifying clean pairs.
- Poorly set thresholds cause either continued overcorrection or excessive copying.
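One way to watch for the threshold-related failure modes above is to check how much of the training set each candidate threshold would turn into copy examples. This is a minimal sketch under the assumption that a pair is copied when its filter score does not exceed the threshold; the function name and the example scores are hypothetical.

```python
from typing import Dict, Sequence


def copy_rate_by_threshold(
    scores: Sequence[float], thresholds: Sequence[float]
) -> Dict[float, float]:
    """For each threshold, the fraction of pairs that would be relabeled as copy examples."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return {
        # Assumption: a pair is copied when its score is at or below the threshold.
        t: sum(score <= t for score in scores) / len(scores)
        for t in thresholds
    }


if __name__ == "__main__":
    example_scores = [-1.0, 0.5, 2.0, 3.0]  # made-up filter scores for four pairs
    for threshold, rate in copy_rate_by_threshold(example_scores, [0.0, 1.0, 5.0]).items():
        print(f"threshold={threshold}: copy rate={rate:.2f}")
```

A copy rate near 0 suggests the filter is doing little, while a rate near 1 suggests the model will see almost no real corrections during training.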
Core Entities
Models
- Swallow-Mistral 7B
- Sarashina-2 7B
- Conformer-CTC (internal ASR baseline)
- DeBERTa V2 large (Japanese) for %LA scoring
Metrics
- CER (character error rate)
- %EC (percent of hypotheses altered after EC)
- %LA (percent of hypotheses with improved linguistic acceptability, as measured by masked-LM score)
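CER and %EC can be computed with a few lines of plain Python. The sketch below assumes corpus-level CER (total character edits over total reference characters) and treats any string difference between the ASR hypothesis and the EC output as an alteration; the paper's exact averaging and text normalization are not specified here, so treat both as assumptions.

```python
from typing import Sequence


def edit_distance(hyp: str, ref: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (h != r)))
        prev = curr
    return prev[-1]


def cer(hyps: Sequence[str], refs: Sequence[str]) -> float:
    """Corpus-level CER: total character edits divided by total reference characters."""
    total_edits = sum(edit_distance(h, r) for h, r in zip(hyps, refs))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars


def percent_ec(asr_hyps: Sequence[str], ec_outputs: Sequence[str]) -> float:
    """%EC: percentage of ASR hypotheses that the EC model changed in any way."""
    changed = sum(a != e for a, e in zip(asr_hyps, ec_outputs))
    return 100.0 * changed / len(asr_hyps)
```

Reporting %EC next to CER makes overcorrection visible: a model that edits many hypotheses without lowering CER is making changes that do not pay off.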
Datasets
- ASR training data (~8000 hours transcribed speech, internal)
- 21 internal test benchmarks (various domains, Table 5)
Benchmarks
- 21 internal benchmarks (business, lectures, presentations, news, customer support, etc.)

