SpeechSSM: a state-space spoken LM that generates coherent multi-minute speech

December 24, 2024 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Se Jin Park, Julian Salazar, Aren Jansen, Keisuke Kinoshita, Yong Man Ro, RJ Skerry-Ryan

Links

Abstract / PDF

Why It Matters For Business

SpeechSSM lets teams generate multi-minute, coherent speech with fixed memory and much faster decoding, lowering infrastructure and latency costs for audiobook, podcast, or voice-agent products.

Summary TLDR

The paper introduces SpeechSSM, a family of spoken language models built with hybrid state-space layers that generate many minutes of speech in a single decoding session without text intermediates. SpeechSSM pairs a semantic tokenizer (USM-v2) with a speaker-conditioned acoustic stage (SoundStorm + SoundStream) and windowed tokenization for continuous decoding. The models (2B and 9B) match prior spoken LMs on short clips and substantially outperform Transformer baselines on multi-minute coherence, while using constant memory at decode time and offering much higher throughput. The authors also release LibriSpeech-Long (long-form evaluation splits) and advocate embedding-based, LLM-judged, and time-stratified evaluation.

Problem Statement

Current spoken language models fail to keep semantic and speaker coherence for generations beyond tens of seconds because speech token rates are high, Transformers scale poorly to long token sequences, and memory and compute blow up at inference. The paper asks: can we build a spoken LM that (1) generates arbitrarily long speech in bounded memory, (2) extrapolates quality beyond training lengths, and (3) can be evaluated with suitable long-form benchmarks?

Main Contribution

SpeechSSM: first spoken LM family using hybrid state-space layers (Griffin mix of gated recurrences + local attention) to generate unbounded long-form speech in fixed memory.

Practical pipeline: USM-v2 semantic tokens + speaker-prompted SoundStorm→SoundStream acoustic stage and windowed tokenization/decoding.

LibriSpeech-Long: a 4min-target reprocessing of LibriSpeech dev/test to enable long-form reference evaluations.

New long-form evaluation suite: embedding-based semantic metrics (Gecko), LLM-as-judge side-by-side win-rates, and time-stratified metrics (N-MOST, SCL).

Extemporaneous variant SpeechSSM-X trained on 216k hours of informal monologues for spontaneous-style generation.
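
The time-stratified, embedding-based evaluation listed among the contributions can be illustrated with a minimal sketch. It assumes each transcript segment has already been embedded by a text-embedding model (Gecko in the paper; short lists of floats stand in here) and scores every successive segment against the prompt with cosine similarity. The helper names and the scoring rule are illustrative assumptions, not the paper's exact definition of SCL.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def coherence_over_time(prompt_vec, segment_vecs):
    """Similarity of each successive segment to the prompt (hypothetical helper)."""
    return [cosine(prompt_vec, seg) for seg in segment_vecs]

# Toy embeddings (not real Gecko outputs): the continuation drifts off-topic.
prompt = [1.0, 0.0, 0.0]
segments = [[0.9, 0.1, 0.0], [0.5, 0.5, 0.0], [0.0, 1.0, 0.0]]
scores = coherence_over_time(prompt, segments)
```

Plotting such per-segment scores against elapsed time shows whether a model stays on topic or drifts, which a single whole-clip score would hide.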

Key Findings

SpeechSSM-2B achieves better long-form transcript perplexity (ASR PPL) than baselines on 4min continuations.

Numbers: PPL 3.75 (SpeechSSM-2B) vs 4.74 (GSLM ⊞) on 4min test-clean

SpeechSSM wins more side-by-side transcript judgments versus other models on 4min continuations.

Numbers: 50.0% win rate against other models (SpeechSSM-2B)

SpeechSSM maintains perceived naturalness across minutes while many baselines degrade rapidly.

Numbers: N-MOS stays at ~4.12 across minutes for SpeechSSM-2B, versus a drop to ≈1.4–2.4 for some baselines

Decoding efficiency and memory are vastly improved versus a Transformer LM.

Numbers: >120× throughput advantage on long batches; 16.4k tokens (~10.9 min of audio) decoded in ~100s (real-time factor <0.2×)

Windowed extension (slide-and-prompt) of short-sequence models is a poor substitute for an SSM trained for long sequences.

Numbers: windowed baselines score lower on SCL and N-MOS by minute one and often collapse to noise or silence

Results

ASR PPL (4min continuations)

Value: 3.75 (SpeechSSM-2B)

Baseline: 4.74 (GSLM ⊞)

LLM-as-judge win rate vs baseline models (4min)

Value: 50.0% wins (SpeechSSM-2B)

Baseline: 24.0% (TWIST-7B ⊞, example)

Naturalness MOS over time (N-MOST)

Value: ~4.12 at each minute sampled (SpeechSSM-2B)

Baseline: drops to 1.36–2.43 by minutes 2–4 (TWIST-7B ⊞)

Throughput / decoding speed

Value: SpeechSSM reaches >120× the throughput of SpeechTransformer on long batches

Baseline: SpeechTransformer

16min unconditional PPL

Value: 3.39 (SpeechSSM-9B); 3.59 (SpeechSSM-2B)

Baseline: 4.45 (TWIST-7B ⊞)

What To Try In 7 Days

Download LibriSpeech-Long and rerun your long-continuation evaluations using Gecko embeddings and LLM side-by-side judging.

Prototype a hybrid SSM decoder (or use Griffin-like blocks) for long audio tokens to test throughput and memory benefits on your hardware.

Adopt windowed tokenization plus a speaker-conditioned acoustic stage (SoundStorm→SoundStream) to keep speaker identity when extending short models.
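
For the side-by-side judging suggested in the first item above, a win rate can be tallied from judge verdicts as sketched below. The tuple layout, the model labels, and the choice to count ties as non-wins are assumptions made for illustration; the verdicts are toy data, not results from the paper.

```python
from collections import Counter

def win_rate(judgments, model):
    """Percentage of side-by-side comparisons involving `model` that it won."""
    counts = Counter()
    for first, second, winner in judgments:
        if model in (first, second):
            counts["total"] += 1
            if winner == model:
                counts["wins"] += 1
    if counts["total"] == 0:
        return 0.0
    return 100.0 * counts["wins"] / counts["total"]

# Each tuple: (model shown first, model shown second, judge verdict).
# Showing pairs in both orders is a common guard against position bias.
judgments = [
    ("ssm", "baseline", "ssm"),
    ("baseline", "ssm", "ssm"),
    ("ssm", "baseline", "baseline"),
    ("baseline", "ssm", "tie"),
    ("ssm", "baseline", "ssm"),
]
rate = win_rate(judgments, "ssm")
```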

Agent Features

Memory

  • constant-size recurrent state at decode time
  • windowed tokenization to bound tokenizer/decoder memory
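
A toy comparison makes the memory claim concrete. The sketch below contrasts a full-attention KV cache, whose size grows with every decoded token, with a recurrent state that is overwritten in place. The recurrence is a deliberately simplified diagonal linear update with a fixed decay, chosen for illustration; it is not Griffin's gated recurrent unit, and all names and sizes are hypothetical.

```python
def kv_cache_entries(num_tokens, num_layers, width):
    """Entries held by a full-attention KV cache: grows linearly with tokens."""
    return 2 * num_tokens * num_layers * width  # keys plus values

def recurrent_step(state, x, decay):
    """One step of a toy diagonal recurrence: h_t = a * h_{t-1} + (1 - a) * x_t."""
    return [a * h + (1.0 - a) * xi for h, xi, a in zip(state, x, decay)]

width = 4
decay = [0.9] * width   # fixed decay; real models learn input-dependent gates
state = [0.0] * width
for t in range(1000):
    x = [float(t % 3)] * width          # stand-in for a token embedding
    state = recurrent_step(state, x, decay)

# After 1000 steps the state still has `width` entries, while the
# hypothetical KV cache has doubled between 1000 and 2000 tokens.
cache_1k = kv_cache_entries(1000, num_layers=2, width=width)
cache_2k = kv_cache_entries(2000, num_layers=2, width=width)
```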

Architectures

  • hybrid state-space model (SSM + local attention)
  • Griffin: gated LRUs interleaved with local multi-query attention
  • two-stage pipeline: semantic tokens → speaker-conditioned acoustic tokens
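
The hybrid layout can be sketched in a few lines: recurrent blocks interleaved with attention blocks whose receptive field is a fixed causal window. The two-recurrent-to-one-attention ratio and the window size used below are assumptions for illustration, and both function names are hypothetical.

```python
def local_causal_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j (causal, within window)."""
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]

def layer_pattern(num_layers, recurrent_per_attention=2):
    """Hypothetical interleaving: N recurrent blocks, then one local-attention block."""
    period = recurrent_per_attention + 1
    return ["local_attention" if i % period == period - 1 else "recurrent"
            for i in range(num_layers)]

mask = local_causal_mask(seq_len=6, window=3)
pattern = layer_pattern(6)
```

Because no query row ever attends to more than `window` keys, the attention cache is bounded by the window size rather than by the sequence length, which is what lets the hybrid keep decode memory fixed.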

Optimization Features

Token Efficiency

  • USM-v2 semantic tokens at 25Hz (fixed-rate pseudo-text, 32k vocab)

Infra Optimization

  • demonstrated decoding 16.4k tokens (~10.9 min) in ~100s on TPU v5e
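
The figures above can be cross-checked with simple arithmetic, using the 25Hz semantic token rate stated earlier. This is only a sanity check on the reported (approximate) numbers; the function names are illustrative.

```python
TOKEN_RATE_HZ = 25  # semantic tokens per second, as stated for USM-v2

def audio_seconds(num_tokens, rate_hz=TOKEN_RATE_HZ):
    """Audio duration represented by a number of fixed-rate tokens."""
    return num_tokens / rate_hz

def real_time_factor(decode_seconds, num_tokens, rate_hz=TOKEN_RATE_HZ):
    """Decode time divided by audio duration; below 1.0 is faster than real time."""
    return decode_seconds / audio_seconds(num_tokens, rate_hz)

duration_min = audio_seconds(16_400) / 60   # 656 s, about 10.9 min
rtf = real_time_factor(100.0, 16_400)       # about 0.15, consistent with <0.2x
```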

Model Optimization

  • initialize from text-pretrained RecurrentGemma to transfer LM knowledge
  • remove explicit positional encodings (NoPE) to help length extrapolation

System Optimization

  • higher throughput on TPU v5e with batch decoding (>120× vs Transformer at long lengths)

Training Optimization

  • train with subquadratic SSM layers to enable long-sequence training
  • segment training sequences to 30s/4min/16min to study length effects
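
At a 25Hz token rate, the three segment lengths translate into very different sequence lengths. The sketch below computes them and splits a token stream into consecutive segments. Non-overlapping chunking with a shorter final remainder is an assumption; the summary does not say how segment boundaries are chosen.

```python
TOKEN_RATE_HZ = 25  # semantic tokens per second

def tokens_for(seconds, rate_hz=TOKEN_RATE_HZ):
    """Number of tokens covering a given duration of audio."""
    return int(seconds * rate_hz)

def segment(tokens, max_seconds, rate_hz=TOKEN_RATE_HZ):
    """Split a token sequence into consecutive chunks of at most max_seconds of audio."""
    size = tokens_for(max_seconds, rate_hz)
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

lengths = {name: tokens_for(sec)
           for name, sec in [("30s", 30), ("4min", 240), ("16min", 960)]}
# lengths == {"30s": 750, "4min": 6000, "16min": 24000}

chunks = segment(list(range(20_000)), max_seconds=240)
```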

Inference Optimization

  • constant memory autoregressive decoding via recurrent SSM state
  • windowed non-autoregressive acoustic decoding for parallel synthesis
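
Windowed acoustic decoding can be pictured as cutting the semantic token stream into fixed-size, overlapping spans, decoding each span independently, and stitching the results. The sketch below uses an identity function in place of the acoustic decoder, and the window and overlap sizes are hypothetical. A real system may blend the overlapped region rather than simply dropping it.

```python
def windows(num_tokens, window, overlap):
    """Yield (start, end) token spans of fixed size that overlap their predecessor."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in [0, window)")
    step = window - overlap
    start = 0
    while True:
        end = min(start + window, num_tokens)
        yield start, end
        if end >= num_tokens:
            break
        start += step

def stitch(decoded_windows, overlap):
    """Concatenate per-window outputs, dropping the overlapped prefix of later windows."""
    out = list(decoded_windows[0])
    for chunk in decoded_windows[1:]:
        out.extend(chunk[overlap:])
    return out

tokens = list(range(25))
spans = list(windows(len(tokens), window=10, overlap=2))
decoded = [tokens[s:e] for s, e in spans]  # identity stands in for the acoustic decoder
restored = stitch(decoded, overlap=2)
```

Since each span is decoded without reference to the others, the spans can be processed in parallel, and peak memory depends on the window size rather than the total length.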

Reproducibility

License

  • LibriSpeech-Long released under CC-BY 4.0; model weights not released

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Model weights are not released, limiting reproducibility of model-level claims.
  • Experiments focus on English audiobook-style data; real-world conversational domains may differ.
  • Acoustic edge cases like extended silences or non-speech voiced sounds can still fool ASR and challenge evaluations.

When Not To Use

  • If you need exact text-conditioned generation with guaranteed text fidelity — text-interleaved or TTS cascades may be preferable.
  • If your use case requires released, off-the-shelf model weights for immediate deployment.

Failure Modes

  • Degeneration into noise or silence when models are not trained for the target generation length (observed for Transformer baselines).
  • Implicit end-of-sequence signals from non-causal tokenizers causing premature stops unless padded carefully.
  • ASR-based metrics can hide audio-native failures like long silences or voiced non-speech.

Core Entities

Models

  • SpeechSSM-2B
  • SpeechSSM-9B
  • SpeechSSM-X (extemporaneous)
  • Griffin (hybrid SSM architecture)
  • SpeechTransformer (matched Transformer baseline)

Metrics

  • ASR PPL
  • Gecko semantic similarity
  • SBERT (short-form)
  • Win% (LLM-as-judge side-by-side)
  • N-MOS / N-MOST
  • SCL (Semantic Coherence over Length)
  • SpkrSim
  • sWUGGY
  • sBLiMP

Datasets

  • LibriSpeech-Long
  • LibriLight unlab-60k
  • LibriSpeech

Benchmarks

  • LibriSpeech-Long

Context Entities

Models

  • GSLM
  • AudioLM
  • TWIST
  • Spirit LM
  • VoxtLM
  • RecurrentGemma/Gemma-2B

Metrics

  • auto-BLEU
  • transcript perplexity (Gemma-2B)
  • MMOS
  • sStoryCloze

Datasets

  • Common Voice
  • Multilingual LibriSpeech
  • VoxPopuli

Benchmarks

  • Long-Range Arena (context for SSMs)