MoST: a modality-aware Mixture-of-Experts that mixes speech and text in one LLM

Overview

Decision Snapshot: Ready For Pilot

Architecture changes are practical and validated by controlled ablations and multiple benchmarks; code and data releases make reproduction feasible.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 70%

Authors

Yuxuan Lou, Kai Yang, Yang You

Links

Abstract / PDF / Code / Data

Why It Matters For Business

MoST shows modality-aware MoE can improve speech/text quality while using only open data and sparse activation, enabling efficient production systems for ASR, TTS, and spoken QA.

Who Should Care

ML Engineer, Product Manager, CTO, Engineering Lead, Data Scientist

Summary TLDR

MoST builds a speech+text large model by adding a Modality-Aware Mixture-of-Experts (MAMoE): separate expert groups for audio and text plus shared experts for cross-modal transfer. The team adapts a pretrained MoE LLM with two-stage training (ASR/TTS post-training, then mixed speech-text instruction fine-tuning) using only open datasets. On the evaluated benchmarks, MoST is competitive with or better than similar-size open models (e.g., ASR: 2.0% WER on LibriSpeech-clean; audio-LM average ≈71.9%). Ablations show that modality-aware routing and shared experts both drive gains. Code, checkpoints, and data are released.

Problem Statement

Current multimodal LLMs often force speech and text through the same parameters, causing interference. We need an architecture that (1) gives each modality room to specialize and (2) allows safe cross-modal transfer while remaining compute-efficient and reproducible.

Main Contribution

Modality-Aware Mixture of Experts (MAMoE): partitioned text and audio expert groups plus shared experts and a modality-aware router.

An efficient two-stage LLM→speech-text pipeline: ASR/TTS post-training then mixed speech-text instruction fine-tuning using only open data.
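The routing idea behind MAMoE can be illustrated with a minimal sketch. It assumes a top-k router whose candidate experts are restricted to the token's modality group; the function name `modality_aware_route`, the group layout, and `top_k=2` are illustrative assumptions, not the paper's implementation. Shared experts, which the summary describes as available to both modalities, would be applied in addition to the routed experts and are left out here for brevity.

```python
import math

def modality_aware_route(logits, modality, groups, top_k=2):
    """Pick top_k routed experts for one token, restricted to its modality group.

    logits   : router scores, one per routed expert
    modality : "text" or "audio" (the modality flag carried by the token)
    groups   : dict mapping modality -> list of allowed expert indices
    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    allowed = groups[modality]
    # Rank only the experts in this token's modality group.
    chosen = sorted(allowed, key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax over the chosen experts (subtract the max for numerical stability).
    m = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - m) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Hypothetical layout: 8 routed experts, first half text, second half audio.
groups = {"text": [0, 1, 2, 3], "audio": [4, 5, 6, 7]}
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.2, 1.5, -0.3]

text_route = modality_aware_route(logits, "text", groups)
audio_route = modality_aware_route(logits, "audio", groups)
```

In this toy example expert 4 has the highest score overall, yet a text token never reaches it, because the mask limits the choice to the text group.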

Key Findings

MoST delivers competitive ASR and TTS accuracy on standard English benchmarks.

Numbers: ASR WER LS-Clean 2.0%, LS-Other 3.7%; TTS WER LS-Clean 6.0%

Practical Use: Use MAMoE-style models to get strong speech recognition/synthesis from open data and MoE backbones without proprietary corpora.

Evidence Ref: Table 1

MoST improves audio-language modeling compared to many baselines.

Numbers: Audio-LM average 71.94 (↑2.9% vs MinMo on reported average)

Practical Use: If your task needs spoken-language coherence, MAMoE can raise audio-LM accuracy modestly over prior open models.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ASR WER (LibriSpeech-clean)	2.0%	Qwen2-Audio 1.8%	+0.2 pp (slightly worse than best open model)	LibriSpeech-clean	Table 1 shows MoST 2.0% vs Qwen2-Audio 1.8%	Table 1
ASR WER (LibriSpeech-other)	3.7%	Qwen2-Audio 3.6%	+0.1 pp (comparable to top open models)	LibriSpeech-other	Table 1 lists MoST 3.7% vs Qwen2-Audio 3.6%	Table 1
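The gaps in the table can be expressed both in absolute percentage points and relative to the baseline. The short sketch below does that arithmetic using only the WER values listed above; the helper name `wer_delta` is an assumption made for illustration.

```python
def wer_delta(model_wer, baseline_wer):
    """Return (absolute gap in percentage points, relative gap in percent).

    Positive values mean the model has a higher (worse) WER than the baseline.
    """
    abs_pp = round(model_wer - baseline_wer, 2)
    rel_pct = round(100 * (model_wer - baseline_wer) / baseline_wer, 1)
    return abs_pp, rel_pct

# Values taken from the Results table (MoST vs Qwen2-Audio).
clean = wer_delta(2.0, 1.8)   # LibriSpeech-clean
other = wer_delta(3.7, 3.6)   # LibriSpeech-other
print(clean, other)
```

Relative gaps look larger than absolute ones at low error rates, so it helps to report both when comparing against your own dev-set numbers.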

What To Try In 7 Days

Download MoST repo and run the supplied eval on your English ASR/TTS dev set to compare real-world performance.

Prototype modality-aware gating in an existing MoE upcycling path: add a modality flag and split experts into modality groups.

Use the provided instruction-mixing recipe (ASR/TTS + speech-instruct) to fine-tune a small MoE for better speech-text instruction following.
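When comparing the repo's reported numbers against your own dev set, an independent WER check can be useful. The sketch below is a standard word-level edit-distance WER, not the evaluation script shipped with MoST; the lowercase-and-split normalization is an assumption, and benchmark scripts usually normalize text more carefully.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)

# One dropped word out of six reference words.
example = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Corpus-level WER is normally computed by summing edit distances and reference lengths over all utterances before dividing, rather than averaging per-utterance rates.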

Agent Features

Memory

HuBERT encoder (frozen) for audio features

Frameworks

MAMoE

Architectures

MoE, Transformer decoder

Optimization Features

Token Efficiency

Claims data efficiency vs SpiritLM and Moshi (fewer training tokens to reach strong results)

Infra Optimization

Distributed training on 48 A100 GPUs (6 nodes × 8 GPUs)

Model Optimization

Sparse MoE increases parameter capacity with limited compute

50% index-based expert partition for modality specialization
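An index-based 50% partition can be sketched in a few lines. Which half goes to which modality is an assumption here (lower indices to text, upper to audio), as are the function name `partition_experts` and the `text_fraction` parameter.

```python
def partition_experts(num_routed_experts, text_fraction=0.5):
    """Split routed expert indices into text and audio groups by position."""
    split = int(num_routed_experts * text_fraction)
    if split == 0 or split == num_routed_experts:
        raise ValueError("both modality groups need at least one expert")
    return {
        "text": list(range(split)),                       # lower indices
        "audio": list(range(split, num_routed_experts)),  # upper indices
    }

groups = partition_experts(8)
```

Because the split depends only on expert position, it ignores what each pretrained expert has already learned, which is the motivation for the limitation noted later about better initialization.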

System Optimization

Shared experts preserve general capabilities during hard partition

Training Optimization

Two-stage: ASR/TTS post-training then mixed instruction fine-tuning

Task mixing to avoid catastrophic forgetting
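Task mixing is commonly implemented as weighted sampling across task datasets, so that earlier skills keep appearing during later fine-tuning. The sketch below shows that general pattern; the task names, example items, and mixing weights are hypothetical and are not the ratios used in the paper.

```python
import random

def mixed_task_stream(datasets, weights, num_samples, seed=0):
    """Draw (task_name, example) pairs, choosing the task by weight each step."""
    rng = random.Random(seed)
    names = list(datasets)
    cursors = {name: 0 for name in names}
    stream = []
    for _ in range(num_samples):
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        items = datasets[name]
        # Cycle through each dataset so small ones are simply repeated.
        stream.append((name, items[cursors[name] % len(items)]))
        cursors[name] += 1
    return stream

# Hypothetical toy datasets and weights, for illustration only.
datasets = {
    "asr": ["asr_0", "asr_1"],
    "tts": ["tts_0", "tts_1"],
    "speech_instruct": ["inst_0", "inst_1", "inst_2"],
}
weights = {"asr": 0.4, "tts": 0.3, "speech_instruct": 0.3}
stream = mixed_task_stream(datasets, weights, num_samples=300)
```

Keeping ASR and TTS examples in the instruction-tuning mix is what lets the model retain the stage-one skills while learning the new task.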

Inference Optimization

Sparse expert activation reduces per-token compute compared to dense models
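The compute saving from sparse activation can be estimated as the share of parameters a single token actually touches. The sketch below is back-of-the-envelope arithmetic under simplifying assumptions: every expert has the same size, shared experts are always active, and the numbers in the example are toy values rather than MoST's configuration.

```python
def active_fraction(num_routed, top_k, num_shared, expert_params, dense_params):
    """Fraction of model parameters used per token in a sparse MoE.

    dense_params covers everything outside the experts (attention, embeddings).
    """
    total = dense_params + (num_routed + num_shared) * expert_params
    active = dense_params + (top_k + num_shared) * expert_params
    return active / total

# Toy configuration: 8 routed experts, top-2 routing, 1 shared expert.
frac = active_fraction(
    num_routed=8, top_k=2, num_shared=1, expert_params=1.0, dense_params=4.0
)
```

All experts still have to be held in memory, so this ratio describes per-token compute rather than memory footprint.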

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/anonymous29008/MoST/tree/main

https://github.com/NUS-HPC-AI-Lab/MoST

Data URLs

https://commonvoice.mozilla.org/

https://arxiv.org/abs/2309.08105 (LibriHeavy description)

https://voxpopuli.example (VoxPopuli reference in paper)

Risks & Boundaries

Limitations

Expert partition uses a simple 50% index-based split; better initialization may improve results.

Evaluations focus on English and common public datasets; multilingual or low-resource behavior is untested.

When Not To Use

When you need a tiny on-device model: MoE sparsity reduces compute per token, but the model and its routing still assume server-class hardware.

When your training data or evaluation language is not English: the paper evaluates mainly on English corpora.

Failure Modes

Incorrect modality indicator could route tokens to the wrong expert group and degrade performance.

Hard partitioning might drop rare but important general knowledge if shared experts are insufficient.

Core Entities

Models

MoST, MAMoE, DeepSeek-v2 Lite, Llama3.2 3B, HuBERT, HifiGAN, Vanilla MoE

Metrics

WER, CER, Accuracy, Negative Log-Likelihood (NLL)

Datasets

LibriHeavy, LibriSpeech, Common Voice, VoxPopuli, RefinedWeb, SmolTalk

Benchmarks

ASR, TTS, sWUGGY, sBLIMP, sTopic-StoryCloze, sStoryCloze, LlamaQ, TrivialQA, WebQ, MMLU, TriviaQA, GSM8K, HumanEval

Context Entities

Models

AudioLM, SpeechGPT, SpiritLM, Moshi, Qwen2-Audio, Phi-4 Multimodal, SeamlessM4T-v2, MinMo, LLaMA-Omni2
