MoST: a modality-aware Mixture-of-Experts that mixes speech and text in one LLM

January 15, 2026 · 7 min

Overview

Decision Snapshot: Ready For Pilot

Architecture changes are practical and validated by controlled ablations and multiple benchmarks; code and data releases make reproduction feasible.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 70%

Authors

Yuxuan Lou, Kai Yang, Yang You

Links

Abstract / PDF / Code / Data

Why It Matters For Business

MoST shows modality-aware MoE can improve speech/text quality while using only open data and sparse activation, enabling efficient production systems for ASR, TTS, and spoken QA.

Summary TLDR

MoST builds a speech+text large model by adding a Modality-Aware Mixture-of-Experts (MAMoE): separate expert groups for audio and text plus shared experts for cross-modal transfer. The team adapts a pretrained MoE LLM with two-stage training (ASR/TTS post-training, then mixed speech-text instruction fine-tuning) using only open datasets. On evaluated benchmarks MoST matches or beats similar-size open models (e.g., ASR: 2.0% WER on LibriSpeech-clean; audio-LM average ≈71.9%). Ablations show modality-aware routing and shared experts both drive gains. Code, checkpoints and data are released.

Problem Statement

Current multimodal LLMs often force speech and text through the same parameters, causing interference. We need an architecture that (1) gives each modality room to specialize and (2) allows safe cross-modal transfer while remaining compute-efficient and reproducible.

Main Contribution

Modality-Aware Mixture of Experts (MAMoE): partitioned text and audio expert groups plus shared experts and a modality-aware router.

An efficient two-stage LLM→speech-text pipeline: ASR/TTS post-training then mixed speech-text instruction fine-tuning using only open data.
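The modality-aware router named in the first contribution can be illustrated with a minimal sketch. It assumes each token carries a modality flag, that routing is top-k over router scores, and that shared experts are simply always eligible candidates. The names `route_token`, `TEXT`, `AUDIO`, the expert counts, and the way shared experts are combined are all hypothetical and are not taken from the MoST code.

```python
import math

TEXT, AUDIO = 0, 1  # hypothetical modality flags, not from the MoST code


def route_token(logits, modality, text_experts, audio_experts, shared_experts, top_k=2):
    """Pick top_k experts for one token from its modality group plus shared experts.

    logits: list of router scores, one per expert.
    Returns {expert_index: gate_weight}; the weights sum to 1.
    """
    group = text_experts if modality == TEXT else audio_experts
    allowed = set(group) | set(shared_experts)
    # Rank only the allowed experts by router score; other experts are never chosen.
    ranked = sorted(allowed, key=lambda e: logits[e], reverse=True)[:top_k]
    # Softmax over the selected experts only (shifted by the max for stability).
    peak = max(logits[e] for e in ranked)
    exps = {e: math.exp(logits[e] - peak) for e in ranked}
    total = sum(exps.values())
    return {e: v / total for e, v in exps.items()}


# Illustrative layout: experts 0-3 are text, 4-7 are audio, expert 8 is shared.
logits = [0.1, 2.0, 0.3, 0.2, 3.0, 0.5, 0.4, 0.1, 1.0]
text_gates = route_token(logits, TEXT, range(0, 4), range(4, 8), [8])
audio_gates = route_token(logits, AUDIO, range(0, 4), range(4, 8), [8])
```

In this toy example the text token never reaches audio expert 4 even though it has the highest score overall, which is the interference-avoidance behaviour the partition is meant to provide.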

Key Findings

MoST delivers competitive ASR and TTS accuracy on standard English benchmarks.

Numbers: ASR WER LS-Clean 2.0%, LS-Other 3.7%; TTS WER LS-Clean 6.0%

Practical Use: Use MAMoE-style models to get strong speech recognition/synthesis from open data and MoE backbones without proprietary corpora.

Evidence Ref: Table 1
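For readers comparing these numbers against their own data, word error rate (WER) is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a plain implementation of that standard definition; it applies no text normalization (casing, punctuation), which published evaluations usually do, so it is not necessarily the exact scoring used in the paper. TTS WER is commonly obtained by transcribing the synthesized audio with an ASR system and scoring the transcript the same way; whether MoST does exactly this is not stated here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```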

MoST improves audio-language modeling compared to many baselines.

Numbers: Audio-LM average 71.94 (↑2.9% vs MinMo on reported average)

Practical Use: If your task needs spoken-language coherence, MAMoE can raise audio-LM accuracy modestly over prior open models.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
ASR WER (LibriSpeech-clean) | 2.0% | Qwen2-Audio 1.8% | slightly worse than best open model | LibriSpeech-clean | Table 1 shows MoST 2.0% vs Qwen2-Audio 1.8% | Table 1
ASR WER (LibriSpeech-other) | 3.7% | Qwen2-Audio 3.6% | comparable to top open models | LibriSpeech-other | Table 1 lists MoST 3.7% vs Qwen2-Audio 3.6% | Table 1
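The Delta column above is qualitative. A small sketch of the arithmetic behind it, using only the MoST and Qwen2-Audio values quoted in the rows, makes the gaps explicit. The helper name `wer_delta` is hypothetical; a positive delta means MoST has the higher (worse) WER.

```python
def wer_delta(model_wer, baseline_wer):
    """Return (absolute delta in WER points, relative delta in percent of baseline)."""
    absolute = model_wer - baseline_wer
    relative = 100.0 * absolute / baseline_wer
    return round(absolute, 2), round(relative, 1)


# (MoST WER %, Qwen2-Audio WER %) as quoted in the results rows.
rows = {
    "LibriSpeech-clean": (2.0, 1.8),
    "LibriSpeech-other": (3.7, 3.6),
}
deltas = {split: wer_delta(model, baseline) for split, (model, baseline) in rows.items()}
```

On these figures MoST trails Qwen2-Audio by 0.2 WER points on LibriSpeech-clean and 0.1 points on LibriSpeech-other.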

What To Try In 7 Days

Download MoST repo and run the supplied eval on your English ASR/TTS dev set to compare real-world performance.

Prototype modality-aware gating in an existing MoE upcycling path: add a modality flag and split experts into modality groups.

Use the provided instruction-mixing recipe (ASR/TTS + speech-instruct) to fine-tune a small MoE for better speech-text instruction following.
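The task-mixing idea in the last step can be prototyped before touching the released recipe. The sketch below draws training examples from several task pools according to sampling weights, so that ASR and TTS data keep appearing during instruction fine-tuning. The function name `mix_tasks`, the task names, and the weights are illustrative assumptions; the actual mixing ratios should be taken from the MoST repository.

```python
import random


def mix_tasks(datasets, weights, num_samples, seed=0):
    """Sample (task, example) pairs, choosing the task by weight for each draw.

    datasets: {task_name: list of examples}
    weights:  {task_name: non-negative sampling weight}
    """
    rng = random.Random(seed)  # seeded so the mixture is reproducible
    names = list(datasets)
    probs = [weights[name] for name in names]
    mixed = []
    for _ in range(num_samples):
        task = rng.choices(names, weights=probs, k=1)[0]
        mixed.append((task, rng.choice(datasets[task])))
    return mixed


# Hypothetical pools and weights, for illustration only.
pools = {
    "asr": ["asr_0", "asr_1"],
    "tts": ["tts_0", "tts_1"],
    "speech_instruct": ["inst_0", "inst_1", "inst_2"],
}
mixture = mix_tasks(pools, {"asr": 1.0, "tts": 1.0, "speech_instruct": 2.0}, num_samples=200)
```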

Agent Features

Memory: HuBERT encoder (frozen) for audio features
Frameworks: MAMoE
Architectures: MoE, Transformer decoder

Optimization Features

Token Efficiency: Claims data efficiency vs SpiritLM and Moshi (fewer training tokens to reach strong results)
Infra Optimization: Distributed training on 48 A100 GPUs (6 nodes × 8 GPUs)
Model Optimization: Sparse MoE increases parameter capacity with limited compute; 50% index-based expert partition for modality specialization
System Optimization: Shared experts preserve general capabilities during hard partition
Training Optimization: Two-stage training (ASR/TTS post-training, then mixed instruction fine-tuning); task mixing to avoid catastrophic forgetting
Inference Optimization: Sparse expert activation reduces per-token compute compared to dense models
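Two of the items above lend themselves to a quick sketch: the 50% index-based expert partition and the compute saving from sparse activation. The code assumes the lower half of the routed expert indices goes to text and the upper half to audio (the source does not say which half is which), and it counts experts only, treating them as equal-sized and ignoring attention and embedding parameters. The expert counts in the usage lines are illustrative, not taken from the source.

```python
def partition_experts(num_routed, text_fraction=0.5):
    """Split routed expert indices into a text group and an audio group by index."""
    split = int(num_routed * text_fraction)
    if split == 0 or split == num_routed:
        raise ValueError("both modality groups must be non-empty")
    # Assumption: lower indices serve text, upper indices serve audio.
    return list(range(split)), list(range(split, num_routed))


def active_fraction(top_k, num_shared, num_routed):
    """Fraction of expert blocks evaluated per token (shared experts always run)."""
    return (top_k + num_shared) / (num_routed + num_shared)


# Illustrative configuration: 64 routed experts, 2 shared, 6 routed experts per token.
text_group, audio_group = partition_experts(64)
share = active_fraction(top_k=6, num_shared=2, num_routed=64)
```

With these illustrative counts, each token evaluates 8 of 66 expert blocks, roughly 12% of the expert parameters, which is where the per-token saving over a dense model of equal capacity comes from.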

Risks & Boundaries

Limitations

Expert partition uses a simple 50% index-based split; better initialization may improve results.

Evaluations focus on English and common public datasets; multilingual or low-resource behavior is untested.

When Not To Use

When you need a tiny on-device model: MoE sparsity reduces compute per token but model head and routing still assume server-class hardware.

When training data or evaluation language is not English: paper evaluates mainly on English corpora.

Failure Modes

Incorrect modality indicator could route tokens to the wrong expert group and degrade performance.

Hard partitioning might drop rare but important general knowledge if shared experts are insufficient.
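The first failure mode can be guarded against with a cheap consistency check before tokens reach the router. The sketch below assumes audio tokens occupy one contiguous range of token IDs, which is a common way to extend a text vocabulary but is not confirmed for MoST. The function name `check_modality_flags` and the ID bounds are hypothetical.

```python
def check_modality_flags(token_ids, flags, audio_id_start, audio_id_end):
    """Return positions whose modality flag disagrees with the token ID range.

    Assumption: IDs in [audio_id_start, audio_id_end) are audio tokens and
    every other ID is a text token.
    """
    if len(token_ids) != len(flags):
        raise ValueError("token_ids and flags must have the same length")
    mismatches = []
    for position, (token_id, flag) in enumerate(zip(token_ids, flags)):
        looks_like_audio = audio_id_start <= token_id < audio_id_end
        if looks_like_audio != (flag == "audio"):
            mismatches.append(position)
    return mismatches


# Token 7 is in the text range but flagged as audio, so position 2 is reported.
bad_positions = check_modality_flags([5, 1005, 7], ["text", "audio", "audio"], 1000, 2000)
```

Failing fast on a non-empty result during data loading is usually cheaper than diagnosing degraded quality after tokens have been sent to the wrong expert group.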

Core Entities

Models

MoST, MAMoE, DeepSeek-v2 Lite, Llama3.2 3B, HuBERT, HifiGAN, Vanilla MoE

Metrics

WER, CER, Accuracy, Negative Log-Likelihood (NLL)

Datasets

LibriHeavy, LibriSpeech, Common Voice, VoxPopuli, RefinedWeb, SmolTalk

Benchmarks

ASR, TTS, sWUGGY, sBLIMP, sTopic-StoryCloze, sStoryCloze, LlamaQ, TrivialQA, WebQ, MMLU, TriviaQA, GSM8K, HumanEval

Context Entities

Models

AudioLM, SpeechGPT, SpiritLM, Moshi, Qwen2-Audio, Phi-4 Multimodal, SeamlessM4T-v2, MinMo, LLaMA-Omni2