Overview
The paper reports consistent gains across multiple internal Japanese benchmarks and ablations (Swallow-Mistral and Sarashina-2). Because it relies on internal data and Japanese-only tests, external replication requires access to similar data and LLMs.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 50%
Why It Matters For Business
Training ASR error-correction (EC) models to be conservative on noisy, automatically paired data avoids risky, domain-blind edits that degrade real-world ASR output. This reduces the error rate in out-of-domain (OOD) scenarios without collecting new labeled data.
Who Should Care
Summary TLDR
ASR error-correction (EC) models trained on automatically paired ASR hypotheses and gold references suffer from noisy pairs that cause overcorrection, especially out-of-domain (OOD). The authors propose two filters: (C1) prefer targets that improve linguistic acceptability, measured by an LLM likelihood ratio, and (C2) prefer targets that are inferable from the source phonemes, judged by a phoneme-conditioned EC model. For pairs that fail a filter, they replace the target with the source, which trains the model to copy and enforces conservative behavior. On 21 Japanese OOD benchmarks, conservative filtering cuts correction frequency (e.g., %EC from ~43% to ~14% for Swallow-Mistral) and improves OOD robustness: Swallow-Mistral C1+C2 improves /
Problem Statement
Current LLM-based ASR error correction often makes unnecessary or wrong edits ('overcorrection') when trained on noisy automatic pairs. Noisy pairs include corrections that don't improve acceptability or cannot be inferred from ASR output. This brittleness is worst in zero-resource OOD settings where domain-specific fine-tuning is not available. Evidence: a substantial share of training pairs fail the proposed C1/C2 tests (see §3).
Main Contribution
Two practical filtering criteria for EC training pairs: C1 (target is more linguistically acceptable than source) and C2 (target is inferable from source phonemes). (§3)
Conservative training procedure: replace filtered (noisy) targets with the source so the model learns to copy on ambiguous examples. (Figure 1, §3)
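The two criteria and the copy-on-failure rule can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Pair` type, the `log_likelihood` and `inferability` callables, the `margin` and `min_score` thresholds, and the choice to require both criteria at once are all assumptions made for the example, and the toy scores are invented.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Pair:
    source: str  # 1-best ASR hypothesis
    target: str  # gold reference transcript


def passes_c1(pair: Pair, log_likelihood: Callable[[str], float],
              margin: float = 0.0) -> bool:
    """C1 (assumed form): the target is more acceptable than the source.

    The log-likelihood ratio is log p(target) - log p(source).
    """
    return log_likelihood(pair.target) - log_likelihood(pair.source) > margin


def passes_c2(pair: Pair, inferability: Callable[[Pair], float],
              min_score: float = 0.5) -> bool:
    """C2 (assumed form): a phoneme-conditioned scorer rates the target as inferable."""
    return inferability(pair) >= min_score


def make_conservative(pairs: Iterable[Pair],
                      log_likelihood: Callable[[str], float],
                      inferability: Callable[[Pair], float]) -> List[Pair]:
    """Keep pairs that pass both criteria; otherwise train the model to copy."""
    out = []
    for p in pairs:
        if passes_c1(p, log_likelihood) and passes_c2(p, inferability):
            out.append(p)
        else:
            out.append(Pair(p.source, p.source))  # target replaced by source
    return out


# Toy scores standing in for a real LLM and a real phoneme-conditioned model.
toy_ll = {
    "i sea a cat": -9.0,
    "i see a cat": -5.0,
    "call jon": -4.0,
    "call dr. john smith": -12.0,
}
pairs = [
    Pair("i sea a cat", "i see a cat"),
    Pair("call jon", "call dr. john smith"),
]
result = make_conservative(pairs, toy_ll.__getitem__, lambda p: 0.9)
# The first pair is kept; the second fails C1 and becomes a copy example.
```

In practice the two callables would wrap an LLM scorer and a phoneme-conditioned EC model, and the thresholds would need tuning on held-out data.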
Key Findings
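Unfiltered EC training worsens OOD CER through overcorrection: for Swallow-Mistral 7B, average CER rises from 11.84 (original ASR) to 12.51 (Table 2).
Combined C1+C2 conservative filtering cuts %EC from 43.0% to 13.9% and lowers average CER to 11.69, improving on the original ASR on 15 of 21 test sets (Table 2).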
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Swallow-Mistral 7B: No Filter vs C1+C2 (Avg over 21 tests) | No Filter: CER 12.51, %EC 43.0%; C1+C2: CER 11.69, %EC 13.9% | Original ASR CER 11.84 | No Filter worsens Avg CER (+0.67); C1+C2 improves Avg CER (-0.15) | 21 internal OOD test sets (macro avg) | Table 2 (Avg. row) | Table 2 |
| Swallow-Mistral 7B: C1+C2 improvement ratio | C1+C2 reduced corrections and improved CER on 71.4% (15/21) of test sets | Original ASR | Improved on 15 of 21 tests | 21 internal tests | Table 2 (< Orig. row) | Table 2 |
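The deltas in the table follow directly from the reported averages. The short check below recomputes them from the numbers in the table; the variable names are illustrative, and the relative-change figures are derived here rather than reported by the paper.

```python
# Reported macro averages over 21 test sets (Table 2, Swallow-Mistral 7B).
orig_cer = 11.84       # original ASR output
no_filter_cer = 12.51  # EC trained on unfiltered pairs
c1c2_cer = 11.69       # EC trained with C1+C2 conservative filtering

delta_no_filter = round(no_filter_cer - orig_cer, 2)  # +0.67 (worse)
delta_c1c2 = round(c1c2_cer - orig_cer, 2)            # -0.15 (better)

# Relative change versus the original ASR CER, in percent (derived, not reported).
rel_no_filter = round(100 * (no_filter_cer - orig_cer) / orig_cer, 2)
rel_c1c2 = round(100 * (c1c2_cer - orig_cer) / orig_cer, 2)

# Share of test sets where C1+C2 beats the original ASR.
improved_share = round(100 * 15 / 21, 1)  # 71.4

print(delta_no_filter, delta_c1c2, rel_no_filter, rel_c1c2, improved_share)
```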
What To Try In 7 Days
Compute LLM log-likelihoods for source vs target (C1) on a sample of your EC pairs.
Train a lightweight phoneme-conditioned EC model and compute inferability scores (C2) on the same sample.
Replace flagged targets with sources and fine-tune an LLM EC model with LoRA for a few hundred steps; evaluate CER and %EC on held-out OOD sets.
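For the evaluation step, a minimal sketch of the two metrics is shown below. It assumes CER is the character-level edit distance summed over a test set and divided by the total reference length, and that %EC is the share of utterances where the EC output differs from the ASR hypothesis; the paper's exact definitions may differ. All function names are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def corpus_cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """CER in percent: total edit distance over total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return 100.0 * errors / chars


def pct_ec(asr_hyps: Sequence[str], ec_outputs: Sequence[str]) -> float:
    """%EC (assumed definition): share of utterances the EC model changed."""
    changed = sum(a != e for a, e in zip(asr_hyps, ec_outputs))
    return 100.0 * changed / len(asr_hyps)


def macro_average(values: Sequence[float]) -> float:
    """Unweighted mean across test sets, as in a macro average."""
    return sum(values) / len(values)
```

Comparing `corpus_cer(refs, asr_hyps)` with `corpus_cer(refs, ec_outputs)` per test set shows whether correction helps or hurts, and `pct_ec` shows how often the model intervenes.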
Risks & Boundaries
Limitations
Experiments use internal Japanese training data and 21 internal test sets; external generalization unverified.
Only 1-best ASR hypothesis used; N-best or speech units not explored.
When Not To Use
When you can fine-tune EC on in-domain labeled examples, since in-domain EC may outperform zero-resource approaches.
When you lack an LLM or phoneme extraction pipeline and cannot compute the filter signals.
Failure Modes
Over-conservatism: filtering out too many useful pairs teaches the model to miss fixable errors.
Phoneme errors or poor phoneme alignment reduce inferability signal, misclassifying clean pairs.
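One way to watch for over-conservatism is to track how many training pairs would be turned into copy examples as a filter threshold moves. The sketch below is a diagnostic under stated assumptions: `scores` is a hypothetical list of per-pair filter scores (for example, a log-likelihood ratio), and a pair is flagged when its score is at or below the threshold.

```python
from typing import Dict, Sequence


def copy_rate(scores: Sequence[float], threshold: float) -> float:
    """Share of pairs whose target would be replaced by the source."""
    flagged = sum(s <= threshold for s in scores)
    return flagged / len(scores)


def sweep(scores: Sequence[float],
          thresholds: Sequence[float]) -> Dict[float, float]:
    """Copy rate at each candidate threshold."""
    return {t: copy_rate(scores, t) for t in thresholds}


# Invented scores for illustration only.
scores = [-1.0, 0.5, 2.0, 3.0]
rates = sweep(scores, [0.0, 2.0])
# rates == {0.0: 0.25, 2.0: 0.75}
```

A copy rate that climbs sharply with a small threshold change suggests the filter is discarding many borderline but useful corrections.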

