Overview
Production Readiness
0.7
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
MoST shows modality-aware MoE can improve speech/text quality while using only open data and sparse activation, enabling efficient production systems for ASR, TTS, and spoken QA.
Summary TLDR
MoST builds a speech+text large model by adding a Modality-Aware Mixture-of-Experts (MAMoE): separate expert groups for audio and text plus shared experts for cross-modal transfer. The team adapts a pretrained MoE LLM with two-stage training (ASR/TTS post-training, then mixed speech-text instruction fine-tuning) using only open datasets. On evaluated benchmarks MoST matches or beats similar-size open models (e.g., ASR: 2.0% WER on LibriSpeech-clean; audio-LM average ≈71.9%). Ablations show modality-aware routing and shared experts both drive gains. Code, checkpoints and data are released.
Problem Statement
Current multimodal LLMs often force speech and text through the same parameters, causing interference. We need an architecture that (1) gives each modality room to specialize and (2) allows safe cross-modal transfer while remaining compute-efficient and reproducible.
Main Contribution
Modality-Aware Mixture of Experts (MAMoE): partitioned text and audio expert groups plus shared experts and a modality-aware router.
An efficient two-stage LLM→speech-text pipeline: ASR/TTS post-training then mixed speech-text instruction fine-tuning using only open data.
Full open release of model weights, training/inference code, and curated data to enable reproduction.
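To make the routing idea concrete, here is a minimal NumPy sketch of modality-aware top-k gating. It is an illustration under assumptions, not the released MoST code: the function name `modality_aware_route`, the `TEXT`/`AUDIO` ids, the `top_k=2` default, and the choice of which half of the experts serves which modality are all hypothetical. Shared experts are not routed at all in this sketch; they would simply be applied to every token alongside the routed ones.

```python
import numpy as np

TEXT, AUDIO = 0, 1  # hypothetical modality ids


def modality_aware_route(logits, modality, expert_group, top_k=2):
    """Select top_k routed experts per token, restricted to the token's modality group.

    logits:       (tokens, experts) raw router scores
    modality:     (tokens,) TEXT or AUDIO for each token
    expert_group: (experts,) TEXT or AUDIO for each routed expert
    """
    # True where an expert belongs to the same modality group as the token.
    allowed = modality[:, None] == expert_group[None, :]
    # Experts outside the token's group can never win the top-k selection.
    masked = np.where(allowed, logits, -np.inf)
    top_idx = np.argsort(-masked, axis=1)[:, :top_k]
    top_scores = np.take_along_axis(masked, top_idx, axis=1)
    # Softmax over the selected experts only (numerically stabilized).
    weights = np.exp(top_scores - top_scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return top_idx, weights


rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))                    # 4 tokens, 8 routed experts
modality = np.array([TEXT, TEXT, AUDIO, AUDIO])
expert_group = np.array([TEXT] * 4 + [AUDIO] * 4)   # 50% index-based split (assumed order)
idx, w = modality_aware_route(logits, modality, expert_group)
```

Each modality group needs at least `top_k` experts, otherwise a masked expert would be selected. A wrong modality flag silently sends a token to the other group, which is why the flag should be validated upstream.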
Key Findings
MoST delivers competitive ASR and TTS accuracy on standard English benchmarks (e.g., 2.0% WER on LibriSpeech-clean for ASR).
MoST improves audio-language modeling compared to many baselines, reaching an audio-LM average of ≈71.9%.
Spoken question answering shows notable gains versus competitors.
Ablations confirm architecture matters: modality-aware routing + shared experts give measurable gains.
Results
ASR WER (LibriSpeech-clean)
ASR WER (LibriSpeech-other)
TTS WER (LibriSpeech-clean)
Accuracy
Spoken QA (LlamaQ S→T)
Text-task (GSM8K)
Who Should Care
What To Try In 7 Days
Download MoST repo and run the supplied eval on your English ASR/TTS dev set to compare real-world performance.
Prototype modality-aware gating in an existing MoE upcycling path: add a modality flag and split experts into modality groups.
Use the provided instruction-mixing recipe (ASR/TTS + speech-instruct) to fine-tune a small MoE for better speech-text instruction following.
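When comparing MoST against an existing system on your own dev set, it helps to score both with the same WER routine. The sketch below is a self-contained corpus-level WER using word-level edit distance. It is independent of whatever evaluation script the MoST repo ships, and the function names and the lowercase-only normalization are assumptions. Published benchmark numbers typically apply heavier text normalization, so scores from this sketch may not be directly comparable to them.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two word lists (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = 0
    words = 0
    for reference, hypothesis in pairs:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors += word_edit_distance(ref, hyp)
        words += len(ref)
    if words == 0:
        raise ValueError("references contain no words")
    return errors / words


pairs = [("the cat sat", "the bat sat"), ("hello world", "hello world")]
wer = corpus_wer(pairs)  # 1 substitution over 5 reference words
```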
Agent Features
Memory
- HuBERT encoder (frozen) for audio features
Frameworks
- MAMoE
Architectures
- MoE
- Transformer decoder
Optimization Features
Token Efficiency
- Claims data efficiency vs SpiritLM and Moshi (fewer training tokens to reach strong results)
Infra Optimization
- Distributed training on 48 A100 GPUs (6 nodes × 8 GPUs)
Model Optimization
- Sparse MoE increases parameter capacity with limited compute
- 50% index-based expert partition for modality specialization
System Optimization
- Shared experts preserve general capabilities during hard partition
Training Optimization
- Two-stage: ASR/TTS post-training then mixed instruction fine-tuning
- Task mixing to avoid catastrophic forgetting
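Task mixing can be sketched as weighted sampling across task datasets, so that every stretch of training contains all task types. The example below is a minimal illustration, not the paper's recipe: the function name `mixed_task_stream`, the task names, and the mixing weights are all hypothetical, since this summary does not state the actual ratios.

```python
import random


def mixed_task_stream(datasets, weights, num_samples, seed=0):
    """Draw (task, example) pairs, choosing the task at random by weight.

    datasets: dict mapping task name to a non-empty list of examples
    weights:  dict mapping task name to a relative sampling weight
    """
    rng = random.Random(seed)
    names = list(datasets)
    probs = [weights[name] for name in names]
    cursors = {name: 0 for name in names}
    stream = []
    for _ in range(num_samples):
        name = rng.choices(names, weights=probs, k=1)[0]
        items = datasets[name]
        # Cycle through each dataset in order, wrapping around when exhausted.
        stream.append((name, items[cursors[name] % len(items)]))
        cursors[name] += 1
    return stream


# Placeholder examples and weights, for illustration only.
datasets = {
    "asr": ["asr-0", "asr-1"],
    "tts": ["tts-0", "tts-1"],
    "speech_instruct": ["si-0", "si-1", "si-2"],
}
weights = {"asr": 1.0, "tts": 1.0, "speech_instruct": 2.0}
stream = mixed_task_stream(datasets, weights, num_samples=12)
```

Keeping ASR and TTS examples in the second-stage mix is what guards the first-stage skills against forgetting; the weights control how much of each batch they occupy.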
Inference Optimization
- Sparse expert activation reduces per-token compute compared to dense models
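The compute saving comes from the gap between stored and active expert parameters. A rough sketch of that arithmetic, counting only the expert feed-forward layers and using made-up sizes (the function name and every number here are illustrative assumptions, not MoST's configuration):

```python
def moe_ffn_params(num_routed, num_shared, top_k, params_per_expert):
    """Return (total, active) expert parameters for one MoE layer.

    total:  every routed and shared expert must be stored
    active: each token only runs its top_k routed experts plus all shared experts
    """
    total = (num_routed + num_shared) * params_per_expert
    active = (top_k + num_shared) * params_per_expert
    return total, active


# Illustrative sizes only: 8 routed experts, 1 shared expert, top-2 routing.
total, active = moe_ffn_params(num_routed=8, num_shared=1, top_k=2,
                               params_per_expert=1_000_000)
active_fraction = active / total
```

Memory still scales with `total`, which is why sparse activation lowers per-token compute without making the model small enough for constrained devices.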
Reproducibility
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Expert partition uses a simple 50% index-based split; better initialization may improve results.
- Evaluations focus on English and common public datasets; multilingual or low-resource behavior is untested.
- Risk of voice spoofing and misuse from high-quality TTS outputs.
When Not To Use
- When you need a tiny on-device model: MoE sparsity reduces compute per token, but the model head and routing still assume server-class hardware.
- When the training data or evaluation language is not English: the paper evaluates mainly on English corpora.
Failure Modes
- Incorrect modality indicator could route tokens to the wrong expert group and degrade performance.
- Hard partitioning might drop rare but important general knowledge if shared experts are insufficient.
- TTS outputs could be misused for spoofing or impersonation if not guarded.
Core Entities
Models
- MoST
- MAMoE
- DeepSeek-v2 Lite
- Llama3.2 3B
- HuBERT
- HiFi-GAN
- Vanilla MoE
Metrics
- WER
- CER
- Accuracy
- Negative Log-Likelihood (NLL)
Datasets
- LibriHeavy
- LibriSpeech
- Common Voice
- VoxPopuli
- RefinedWeb
- SmolTalk
Benchmarks
- ASR
- TTS
- sWUGGY
- sBLIMP
- sTopic-StoryCloze
- sStoryCloze
- LlamaQ
- TriviaQA
- WebQ
- MMLU
- GSM8K
- HumanEval
Context Entities
Models
- AudioLM
- SpeechGPT
- SpiritLM
- Moshi
- Qwen2-Audio
- Phi-4 Multimodal
- SeamlessM4T-v2
- MinMo
- LLaMA-Omni2

