Overview
Architecture changes are practical and validated by controlled ablations and multiple benchmarks; code and data releases make reproduction feasible.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 6/6
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 70%
Why It Matters For Business
MoST shows modality-aware MoE can improve speech/text quality while using only open data and sparse activation, enabling efficient production systems for ASR, TTS, and spoken QA.
Summary TLDR
MoST builds a speech+text large model by adding a Modality-Aware Mixture-of-Experts (MAMoE): separate expert groups for audio and text, plus shared experts for cross-modal transfer. The team adapts a pretrained MoE LLM with two-stage training (ASR/TTS post-training, then mixed speech-text instruction fine-tuning) using only open datasets. On the evaluated benchmarks MoST is competitive with or better than similar-size open models (e.g., ASR: 2.0% WER on LibriSpeech-clean vs. 1.8% for Qwen2-Audio; audio-LM average ≈71.9%). Ablations show that modality-aware routing and shared experts both drive the gains. Code, checkpoints, and data are released.
Problem Statement
Current multimodal LLMs often force speech and text through the same parameters, causing interference. We need an architecture that (1) gives each modality room to specialize and (2) allows safe cross-modal transfer while remaining compute-efficient and reproducible.
Main Contribution
Modality-Aware Mixture of Experts (MAMoE): partitioned text and audio expert groups plus shared experts and a modality-aware router.
An efficient two-stage LLM→speech-text pipeline: ASR/TTS post-training then mixed speech-text instruction fine-tuning using only open data.
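The routing idea behind MAMoE can be illustrated with a minimal sketch. It assumes a per-token modality flag and a top-k gate restricted to the token's own expert group plus the shared experts. The function name, the `groups`/`shared` layout, and `k=2` are hypothetical; this summary does not say whether MoST routes to shared experts or keeps them always active, so the sketch simply treats them as routable candidates.

```python
import math

TEXT, AUDIO = "text", "audio"

def modality_aware_route(logits, modality, groups, shared, k=2):
    """Pick top-k experts for one token from its modality group plus shared experts.

    logits:   router scores, one per expert
    modality: TEXT or AUDIO (the flag attached to the token)
    groups:   dict mapping modality -> set of expert indices (hypothetical layout)
    shared:   set of expert indices available to every modality
    """
    allowed = groups[modality] | shared
    # Experts from the other modality are never candidates, whatever their score.
    candidates = sorted(allowed, key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected experts only, so the gate weights sum to 1.
    top = max(logits[i] for i in candidates)
    exps = {i: math.exp(logits[i] - top) for i in candidates}
    total = sum(exps.values())
    return {i: exps[i] / total for i in candidates}

# Illustrative layout: experts 0-2 text, 3-5 audio, 6-7 shared.
groups = {TEXT: {0, 1, 2}, AUDIO: {3, 4, 5}}
shared = {6, 7}
logits = [0.1, 2.0, 0.3, 5.0, 0.2, 0.1, 1.0, 0.5]
text_gate = modality_aware_route(logits, TEXT, groups, shared)
audio_gate = modality_aware_route(logits, AUDIO, groups, shared)
```

In this example the text token cannot reach expert 3 even though it has the highest score, while both modalities can select shared expert 6, which is where cross-modal transfer would happen.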
Key Findings
MoST delivers competitive ASR and TTS accuracy on standard English benchmarks (e.g., 2.0% WER on LibriSpeech-clean and 3.7% on LibriSpeech-other).
MoST improves audio-language modeling compared to many baselines (audio-LM average ≈71.9%).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| ASR WER (LibriSpeech-clean) | 2.0% | Qwen2-Audio 1.8% | +0.2 pp (slightly worse than best open model) | LibriSpeech-clean | Table 1 shows MoST 2.0% vs Qwen2-Audio 1.8% | Table 1 |
| ASR WER (LibriSpeech-other) | 3.7% | Qwen2-Audio 3.6% | +0.1 pp (comparable to top open models) | LibriSpeech-other | Table 1 lists MoST 3.7% vs Qwen2-Audio 3.6% | Table 1 |
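For readers comparing these numbers against their own data, the WER metric in the table is the word-level edit distance divided by the reference length. The sketch below is a minimal single-utterance version, with whitespace tokenization as an assumption. Benchmark scripts typically normalize text first and aggregate errors over the whole corpus, so this will not reproduce the reported figures exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)

# One dropped word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```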
What To Try In 7 Days
Download MoST repo and run the supplied eval on your English ASR/TTS dev set to compare real-world performance.
Prototype modality-aware gating in an existing MoE upcycling path: add a modality flag and split experts into modality groups.
Use the provided instruction-mixing recipe (ASR/TTS + speech-instruct) to fine-tune a small MoE for better speech-text instruction following.
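As a rough illustration of the third item, mixing ASR, TTS, and speech-instruction examples can be prototyped as weighted sampling across task pools. Everything here is an assumption for illustration: the pool names, the weights, and the sampler itself are placeholders, and the released recipe may use different tasks, ratios, or a different mixing scheme.

```python
import random

def mixed_batches(pools, weights, batch_size, num_batches, seed=0):
    """Yield batches whose examples are drawn from several task pools.

    pools:   dict mapping task name -> list of examples
    weights: dict mapping task name -> relative sampling weight (placeholder values)
    """
    rng = random.Random(seed)
    names = list(pools)
    probs = [weights[name] for name in names]
    for _ in range(num_batches):
        # Choose a task for each slot in the batch, then an example from that task.
        picks = rng.choices(names, weights=probs, k=batch_size)
        yield [(name, rng.choice(pools[name])) for name in picks]

# Hypothetical pools and weights; replace with your own data and ratios.
pools = {
    "asr": ["asr_0", "asr_1"],
    "tts": ["tts_0", "tts_1"],
    "speech_instruct": ["si_0", "si_1", "si_2"],
}
weights = {"asr": 1.0, "tts": 1.0, "speech_instruct": 2.0}
batches = list(mixed_batches(pools, weights, batch_size=4, num_batches=3))
```

Fixing the seed keeps the mixture reproducible across runs, which helps when comparing ablations of the mixing ratios.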
Risks & Boundaries
Limitations
Expert partition uses a simple 50% index-based split; better initialization may improve results.
Evaluations focus on English and common public datasets; multilingual or low-resource behavior is untested.
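The index-based split mentioned in the first limitation can be sketched as follows. This is an assumption-laden illustration: which half maps to which modality, and the choice to reserve the last `num_shared` indices as shared experts, are hypothetical and not taken from the paper.

```python
def split_experts_by_index(num_experts: int, num_shared: int = 0):
    """Split expert indices 50/50 by position, optionally reserving shared experts.

    The first half of the non-shared experts is assigned to text and the second
    half to audio; the trailing num_shared indices are shared (assumed layout).
    """
    if not 0 <= num_shared <= num_experts:
        raise ValueError("num_shared must be between 0 and num_experts")
    routed = num_experts - num_shared
    half = routed // 2
    return {
        "text": list(range(0, half)),
        "audio": list(range(half, routed)),
        "shared": list(range(routed, num_experts)),
    }

partition = split_experts_by_index(10, num_shared=2)
```

Because the split depends only on index position, it ignores what each pretrained expert already learned, which is why a better initialization might help.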
When Not To Use
When you need a tiny on-device model: MoE sparsity reduces compute per token, but the model head and routing still assume server-class hardware.
When training data or evaluation language is not English: paper evaluates mainly on English corpora.
Failure Modes
Incorrect modality indicator could route tokens to the wrong expert group and degrade performance.
Hard partitioning might drop rare but important general knowledge if shared experts are insufficient.

