Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
SpeechSSM lets teams generate multi-minute, coherent speech with fixed memory and much faster decoding, lowering infrastructure and latency costs for audiobook, podcast, or voice-agent products.
Summary TLDR
The paper introduces SpeechSSM, a family of spoken language models built with hybrid state-space layers to generate speech for many minutes in a single decoding session without text intermediates. SpeechSSM pairs a semantic tokenizer (USM-v2) with a speaker-conditioned acoustic stage (SoundStorm + SoundStream) and windowed tokenization for continuous decoding. The models (2B and 9B) match prior spoken LMs on short clips and substantially outperform Transformer baselines on multi-minute coherence, while using constant memory at decode time and offering much higher throughput. The authors also release LibriSpeech-Long (long-form evaluation splits) and advocate embedding-based, LLM-judged, and time-stratified evaluation metrics.
Problem Statement
Current spoken language models fail to keep semantic and speaker coherence for generations beyond tens of seconds because speech token rates are high, Transformers scale poorly to long token sequences, and memory and compute blow up at inference. The paper asks: can we build a spoken LM that (1) generates arbitrarily long speech in bounded memory, (2) extrapolates quality beyond training lengths, and (3) can be evaluated with suitable long-form benchmarks?
Main Contribution
SpeechSSM: first spoken LM family using hybrid state-space layers (Griffin mix of gated recurrences + local attention) to generate unbounded long-form speech in fixed memory.
Practical pipeline: USM-v2 semantic tokens + speaker-prompted SoundStorm→SoundStream acoustic stage and windowed tokenization/decoding.
LibriSpeech-Long: a 4min-target reprocessing of LibriSpeech dev/test to enable long-form reference evaluations.
New long-form evaluation suite: embedding-based semantic metrics (Gecko), LLM-as-judge side-by-side win-rates, and time-stratified metrics (N-MOST, SCL).
Extemporaneous variant SpeechSSM-X trained on 216k hours of informal monologues for spontaneous-style generation.
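The time-stratified idea behind metrics such as N-MOST can be illustrated by bucketing per-segment scores by their position in the generation and averaging each bucket. The sketch below is a minimal illustration, not the paper's evaluation code: the function name `stratify_by_time`, the 60 s bucket width, and the sample ratings are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def stratify_by_time(ratings, bucket_seconds=60.0):
    """Average scores within fixed-width time buckets.

    ratings: iterable of (start_time_seconds, score) pairs.
    Returns a dict mapping bucket index -> mean score.
    """
    buckets = defaultdict(list)
    for start, score in ratings:
        buckets[int(start // bucket_seconds)].append(score)
    return {idx: mean(scores) for idx, scores in sorted(buckets.items())}


# Hypothetical ratings: (segment start in seconds, naturalness score 1-5).
sample = [(5, 4.5), (40, 4.0), (70, 4.0), (110, 3.0), (130, 2.0)]
per_minute = stratify_by_time(sample)
print(per_minute)
```

A single pooled average over these ratings would hide the drop in the later buckets, which is the kind of degradation over time that a stratified view is meant to expose.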
Key Findings
SpeechSSM-2B achieves lower (better) long-form transcript perplexity (ASR PPL) than baselines on 4min continuations.
SpeechSSM wins more side-by-side transcript judgments versus other models on 4min continuations.
SpeechSSM maintains perceived naturalness across minutes while many baselines degrade rapidly.
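Decoding throughput and memory use improve substantially versus a matched Transformer LM: over 120× throughput at long lengths with batch decoding on TPU v5e, and constant memory at decode time.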
Windowed extension (slide-and-prompt) of short-sequence models is a poor substitute for an SSM trained for long sequences.
Results
ASR PPL (4min continuations)
LLM-as-judge win rate vs baseline models (4min)
Naturalness MOS over time (N-MOST)
Throughput / decoding speed
16min unconditional PPL
Who Should Care
What To Try In 7 Days
Download LibriSpeech-Long and rerun your long-continuation evaluations using Gecko embeddings and LLM side-by-side judging.
Prototype a hybrid SSM decoder (or use Griffin-like blocks) for long audio tokens to test throughput and memory benefits on your hardware.
Adopt windowed tokenization plus a speaker-conditioned acoustic stage (SoundStorm→SoundStream) to keep speaker identity when extending short models.
Agent Features
Memory
- constant-size recurrent state at decode time
- windowed tokenization to bound tokenizer/decoder memory
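To make the constant-size state concrete, here is a toy diagonal gated linear recurrence decoded for different numbers of steps. It is a simplified illustration in the spirit of gated recurrent layers, not the model's actual layer: the constant gate value, the state width, and the sine-based stand-in inputs are assumptions chosen only for the demo.

```python
import math


def gated_recurrence_step(state, x, gate):
    """One step of a toy diagonal gated linear recurrence.

    h_t = a * h_{t-1} + sqrt(1 - a^2) * x_t, elementwise, with 0 < a < 1.
    """
    return [a * h + math.sqrt(1.0 - a * a) * xi
            for h, xi, a in zip(state, x, gate)]


def decode(num_steps, width=4):
    state = [0.0] * width   # fixed-size state, allocated once
    gate = [0.9] * width    # hypothetical constant gate; real gates depend on the input
    for t in range(num_steps):
        # Stand-in for a token embedding (assumption, for illustration only).
        x = [math.sin(t + i) for i in range(width)]
        state = gated_recurrence_step(state, x, gate)
    return state


short_state = decode(10)
long_state = decode(10_000)
print(len(short_state), len(long_state))
```

The state has the same number of entries after 10 steps as after 10,000, which is the property that keeps decode-time memory independent of generation length.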
Architectures
- hybrid state-space model (SSM + local attention)
- Griffin: gated LRUs interleaved with local multi-query attention
- two-stage pipeline: semantic tokens → speaker-conditioned acoustic tokens
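The memory effect of local attention can be sketched by comparing a cache that keeps every past token (global attention) with one that keeps only the most recent ones. This is an illustrative count of cache entries, not the model's implementation; the window of 1024 tokens is a hypothetical value, and 6000 tokens corresponds to 4 minutes at the 25 Hz token rate.

```python
from collections import deque


def cache_sizes(num_tokens, window):
    """Return (full_cache_len, local_cache_len) after decoding num_tokens."""
    full_cache = []                     # global attention keeps every past token
    local_cache = deque(maxlen=window)  # local attention keeps only the last `window`
    for token_id in range(num_tokens):
        full_cache.append(token_id)
        local_cache.append(token_id)    # oldest entry is evicted once the window is full
    return len(full_cache), len(local_cache)


full_len, local_len = cache_sizes(num_tokens=6000, window=1024)
print(full_len, local_len)
```

The full cache grows linearly with the sequence, while the local cache stops growing once it reaches the window size.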
Optimization Features
Token Efficiency
- USM-v2 semantic tokens at 25Hz (fixed-rate pseudo-text, 32k vocab)
Infra Optimization
- demonstrated decoding 16.4k tokens (~10.9 min) in ~100s on TPU v5e
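The token counts and durations quoted here follow directly from the 25 Hz semantic token rate, and the arithmetic can be checked in a few lines. The real-time factor below takes the ~100 s figure at face value as wall-clock time for one sequence, which is an assumption; the helper names are hypothetical.

```python
TOKEN_RATE_HZ = 25  # semantic token rate stated in this summary


def tokens_to_seconds(num_tokens, rate_hz=TOKEN_RATE_HZ):
    """Audio duration covered by a number of semantic tokens."""
    return num_tokens / rate_hz


def seconds_to_tokens(seconds, rate_hz=TOKEN_RATE_HZ):
    """Number of semantic tokens needed to cover a duration."""
    return int(seconds * rate_hz)


audio_seconds = tokens_to_seconds(16_400)   # 656.0 s of audio
audio_minutes = audio_seconds / 60          # about 10.9 min
realtime_factor = audio_seconds / 100       # assumes ~100 s wall-clock decode time

# Token budgets for the 30 s, 4 min, and 16 min lengths used in the summary.
segment_tokens = {label: seconds_to_tokens(s)
                  for label, s in [("30s", 30), ("4min", 240), ("16min", 960)]}
print(audio_minutes, realtime_factor, segment_tokens)
```

Under that assumption, token decoding runs at roughly 6.6 times real time, and a 16 min sequence is 24,000 tokens, which is why sequence length dominates cost for speech models.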
Model Optimization
- initialize from text-pretrained RecurrentGemma to transfer LM knowledge
- remove explicit positional encodings (NoPE) to help length extrapolation
System Optimization
- higher throughput on TPU v5e with batch decoding (>120× vs Transformer at long lengths)
Training Optimization
- train with subquadratic SSM layers to enable long-sequence training
- segment training sequences to 30s/4min/16min to study length effects
Inference Optimization
- constant memory autoregressive decoding via recurrent SSM state
- windowed non-autoregressive acoustic decoding for parallel synthesis
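Windowed acoustic decoding can be sketched as splitting the semantic token stream into fixed-size windows, decoding each window independently, and stitching the results. The version below is a hedged illustration only: the window size, the overlap, the stitching rule, and the identity `fake_acoustic_decode` stand-in are all assumptions, not details taken from the paper.

```python
def split_into_windows(tokens, window, overlap):
    """Split a token list into windows of `window` tokens sharing `overlap` tokens."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the stream
        start += step
    return windows


def stitch(decoded_windows, overlap):
    """Concatenate decoded windows, dropping the overlapped prefix of later windows."""
    out = list(decoded_windows[0]) if decoded_windows else []
    for w in decoded_windows[1:]:
        out.extend(w[overlap:])
    return out


def fake_acoustic_decode(window_tokens):
    """Hypothetical placeholder for a non-autoregressive acoustic model (identity)."""
    return list(window_tokens)


semantic = list(range(2000))  # 80 s of tokens at 25 Hz
windows = split_into_windows(semantic, window=750, overlap=100)
audio = stitch([fake_acoustic_decode(w) for w in windows], overlap=100)
print(len(windows), len(audio))
```

Because each window is decoded without reference to the others, the per-window calls could run in parallel, and peak memory depends on the window size rather than on the total length.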
Reproducibility
License
- LibriSpeech-Long released under CC-BY 4.0; model weights not released
Data Urls
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Model weights are not released, limiting reproducibility of model-level claims.
- Experiments focus on English audiobook-style data; real-world conversational domains may differ.
- Acoustic edge cases like extended silences or non-speech voiced sounds can still fool ASR and challenge evaluations.
When Not To Use
- If you need exact text-conditioned generation with guaranteed text fidelity (text-interleaved models or TTS cascades may be preferable).
- If your use case requires released, off-the-shelf model weights for immediate deployment.
Failure Modes
- Degeneration into noise or silence when models are not trained for the target generation length (observed for Transformer baselines).
- Implicit end-of-sequence signals from non-causal tokenizers causing premature stops unless padded carefully.
- ASR-based metrics can hide audio-native failures like long silences or voiced non-speech.
Core Entities
Models
- SpeechSSM-2B
- SpeechSSM-9B
- SpeechSSM-X (extemporaneous)
- Griffin (hybrid SSM architecture)
- SpeechTransformer (matched Transformer baseline)
Metrics
- ASR PPL
- Gecko semantic similarity
- SBERT (short-form)
- Win% (LLM-as-judge side-by-side)
- N-MOS / N-MOST
- SCL (Semantic Coherence over Length)
- SpkrSim
- sWUGGY
- sBLiMP
Datasets
- LibriSpeech-Long
- LibriLight unlab-60k
- LibriSpeech
Benchmarks
- LibriSpeech-Long
Context Entities
Models
- GSLM
- AudioLM
- TWIST
- Spirit LM
- VoxtLM
- RecurrentGemma/Gemma-2B
Metrics
- auto-BLEU
- transcript perplexity (Gemma-2B)
- MMOS
- sStoryCloze
Datasets
- Common Voice
- Multilingual LibriSpeech
- VoxPopuli
Benchmarks
- Long-Range Arena (context for SSMs)

