SpeechSSM: a state-space spoken LM that generates coherent multi-minute speech

December 24, 2024 · 9 min

Overview

Decision Snapshot: Ready For Pilot

The paper provides quantitative long-form evaluations, throughput measurements on TPU hardware, and human ratings. This is strong evidence for practical gains, but no public model weights are available for replication.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

License: LibriSpeech-Long released under CC-BY 4.0; model weights not released

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Se Jin Park, Julian Salazar, Aren Jansen, Keisuke Kinoshita, Yong Man Ro, RJ Skerry-Ryan

Why It Matters For Business

SpeechSSM lets teams generate multi-minute, coherent speech with fixed memory and much faster decoding, lowering infrastructure and latency costs for audiobook, podcast, or voice-agent products.

Who Should Care

Summary TLDR

The paper introduces SpeechSSM, a family of spoken language models built with hybrid state-space layers to generate speech for many minutes in a single decoding session without text intermediates. SpeechSSM pairs a semantic tokenizer (USM-v2) with a speaker-conditioned acoustic stage (SoundStorm + SoundStream) and windowed tokenization for continuous decoding. The models (2B and 9B) match prior spoken LMs on short clips and substantially outperform Transformer baselines on multi-minute coherence, while using constant memory at decode time and offering much higher throughput. The authors also release LibriSpeech-Long (long-form evaluation splits) and advocate embedding-based and LLM-judged evaluation of long-form speech.

Problem Statement

Current spoken language models fail to keep semantic and speaker coherence for generations beyond tens of seconds because speech token rates are high, Transformers scale poorly to long token sequences, and memory and compute blow up at inference. The paper asks: can we build a spoken LM that (1) generates arbitrarily long speech in bounded memory, (2) extrapolates quality beyond training lengths, and (3) can be evaluated with suitable long-form benchmarks?

Main Contribution

SpeechSSM: first spoken LM family using hybrid state-space layers (Griffin mix of gated recurrences + local attention) to generate unbounded long-form speech in fixed memory.

Practical pipeline: USM-v2 semantic tokens + speaker-prompted SoundStorm→SoundStream acoustic stage and windowed tokenization/decoding.
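
To make the fixed-memory claim concrete, here is a toy sketch of a gated linear recurrence in plain Python. It is not the Griffin or SpeechSSM layer: the gate form, the state size, and the names `gated_recurrence_step` and `run_stream` are illustrative assumptions. It only shows that the state carried between tokens keeps the same size no matter how many tokens have been processed.

```python
import math
from typing import Iterable, List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_recurrence_step(state: Sequence[float], x: Sequence[float],
                          gate_logits: Sequence[float]) -> List[float]:
    """One step of h_t = a * h_{t-1} + (1 - a) * x_t, with a = sigmoid(gate).

    Simplified stand-in for a gated recurrent unit (assumption): real layers
    compute the gate from the input and use learned projections.
    """
    new_state = []
    for h, xi, g in zip(state, x, gate_logits):
        a = sigmoid(g)
        new_state.append(a * h + (1.0 - a) * xi)
    return new_state


def run_stream(inputs: Iterable[Sequence[float]],
               gate_logits: Sequence[float]) -> List[float]:
    """Consume a stream of input vectors while keeping only one state vector."""
    state = [0.0] * len(gate_logits)
    for x in inputs:
        state = gated_recurrence_step(state, x, gate_logits)
    return state


dim = 4
gates = [0.0] * dim  # sigmoid(0) = 0.5, a hypothetical fixed gate
short = run_stream(([1.0] * dim for _ in range(10)), gates)
long = run_stream(([1.0] * dim for _ in range(10_000)), gates)
print(len(short), len(long))  # state size is identical for both stream lengths
```

Because `run_stream` reads from a generator and overwrites `state` in place of appending to a history, memory use does not depend on the stream length.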

Key Findings

SpeechSSM-2B achieves better long-form transcript perplexity (ASR PPL) than baselines on 4-minute continuations.

Numbers: PPL 3.75 (SpeechSSM-2B) vs 4.74 (GSLM ⊞) on 4-minute test-clean

Practical Use: If you need more faithful long continuations, switch to an SSM-based speech LM like SpeechSSM rather than sliding-window Transformer extensions.

Evidence Ref: Table 3

SpeechSSM wins more side-by-side transcript judgments versus other models on 4-minute continuations.

Numbers: Win% vs other models: 50.0% (SpeechSSM-2B)

Practical Use: Use LLM-judged transcript side-by-sides to detect content-level quality gains from long-form SSM models.

Evidence Ref: Table 3
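
For readers setting up their own side-by-side judging, the sketch below shows one way to turn a list of judge verdicts into a win percentage. The verdict labels, the function name `win_rate`, and the choice to keep ties in the denominator are assumptions for illustration; the summary does not describe the paper's exact tallying protocol.

```python
from collections import Counter
from typing import Sequence


def win_rate(judgments: Sequence[str], model: str = "A") -> float:
    """Percentage of side-by-side comparisons won by `model`.

    Each judgment is "A", "B", or "tie". Ties stay in the denominator
    (an assumption; other protocols drop or split them).
    """
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments given")
    return 100.0 * counts[model] / total


# Hypothetical verdicts from an LLM judge comparing transcripts of two systems.
judgments = ["A", "B", "A", "tie", "A", "B", "A", "tie"]
print(f"A wins {win_rate(judgments, 'A'):.1f}% of comparisons")
print(f"B wins {win_rate(judgments, 'B'):.1f}% of comparisons")
```

With ties counted this way, the two win rates need not sum to 100%, so it is worth reporting the tie share alongside them.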

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
ASR PPL (4 min continuations) | SpeechSSM-2B: 3.75 | GSLM ⊞: 4.74 | −0.99 | LibriSpeech-Long test-clean (4 min) | Table 3 (PPL ↓) | Table 3
LLM-as-judge win rate vs baseline models (4 min) | SpeechSSM-2B: 50.0% wins | TWIST-7B ⊞: 24.0% (example) | +26.0 pts | LibriSpeech-Long test-clean (4 min) | Table 3 (Win% SSM-2B) | Table 3
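
The deltas in the table can be rechecked from the reported values with a few lines of arithmetic. The helper names below are illustrative, and the relative-change figure is derived here for convenience and is not reported in the summary.

```python
def delta(value: float, baseline: float) -> float:
    """Absolute difference (value minus baseline), rounded to two decimals."""
    return round(value - baseline, 2)


def relative_change_pct(value: float, baseline: float) -> float:
    """Change relative to the baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


ppl_delta = delta(3.75, 4.74)   # ASR PPL, lower is better
win_delta = delta(50.0, 24.0)   # win rate, in percentage points

print(f"PPL delta: {ppl_delta:+.2f}")
print(f"PPL relative change: {relative_change_pct(3.75, 4.74):+.1f}%")
print(f"Win-rate delta: {win_delta:+.1f} pts")
```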

What To Try In 7 Days

Download LibriSpeech-Long and rerun your long-continuation evaluations using Gecko embeddings and LLM side-by-side judging.

Prototype a hybrid SSM decoder (or use Griffin-like blocks) for long audio tokens to test throughput and memory benefits on your hardware.

Adopt windowed tokenization plus a speaker-conditioned acoustic stage (SoundStorm→SoundStream) to keep speaker identity when extending short models.
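
As a starting point for the windowed-tokenization experiment, the sketch below splits a long frame sequence into fixed-size windows with a small overlap, tokenizes each window separately, and discards the tokens that only served as left context. Everything here is an assumption for illustration: the function `windowed_tokenize`, the one-token-per-frame toy tokenizer, and the overlap handling are not taken from the paper, and a real tokenizer such as USM-v2 is context-dependent, so stitched output would only approximate full-sequence tokenization.

```python
from typing import Callable, List, Sequence


def windowed_tokenize(frames: Sequence[int],
                      tokenize: Callable[[Sequence[int]], List[int]],
                      window: int, overlap: int) -> List[int]:
    """Tokenize `frames` window by window so memory is bounded by `window`.

    Assumes the tokenizer emits exactly one token per input frame.
    """
    if not 0 <= overlap < window:
        raise ValueError("need 0 <= overlap < window")
    step = window - overlap
    tokens: List[int] = []
    start = 0
    while start < len(frames):
        chunk = frames[start:start + window]
        chunk_tokens = tokenize(chunk)
        # After the first window, the first `overlap` tokens repeat frames
        # that were already emitted; they act as context only.
        skip = overlap if start > 0 else 0
        tokens.extend(chunk_tokens[skip:])
        if start + window >= len(frames):
            break
        start += step
    return tokens


def toy_tokenizer(chunk: Sequence[int]) -> List[int]:
    """Hypothetical context-free tokenizer: one token id per frame."""
    return [frame % 32 for frame in chunk]


frames = list(range(1000))
stitched = windowed_tokenize(frames, toy_tokenizer, window=100, overlap=10)
print(len(stitched) == len(frames))
```

With a context-free toy tokenizer the stitched tokens match a single full pass exactly, which makes the function easy to sanity-check before swapping in a real tokenizer.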

Agent Features

Memory
constant-size recurrent state at decode time; windowed tokenization to bound tokenizer/decoder memory
Architectures
hybrid state-space model (SSM + local attention); Griffin: gated LRUs interleaved with local multi-query attention; two-stage pipeline: semantic tokens → speaker-conditioned acoustic tokens
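
The memory difference between a full-attention decoder and a hybrid of recurrent layers plus local attention can be illustrated with simple counting. All layer counts, widths, and the attention window below are hypothetical placeholders, not the SpeechSSM configuration; only the 25 Hz token rate and the 30 s / 4 min / 16 min lengths come from this summary.

```python
TOKEN_RATE_HZ = 25  # semantic token rate quoted in this summary


def full_attention_cache(tokens: int, layers: int, kv_dim: int) -> int:
    """Cached values for full attention: keys and values for every past token."""
    return 2 * layers * kv_dim * tokens


def hybrid_cache(tokens: int, recurrent_layers: int, state_dim: int,
                 attn_layers: int, kv_dim: int, window: int) -> int:
    """Cached values for recurrent layers plus windowed (local) attention."""
    recurrent = recurrent_layers * state_dim                 # fixed-size state
    local = 2 * attn_layers * kv_dim * min(tokens, window)   # capped at window
    return recurrent + local


# Hypothetical sizes chosen only to show the trend.
for minutes in (0.5, 4, 16):
    tokens = int(minutes * 60 * TOKEN_RATE_HZ)
    full = full_attention_cache(tokens, layers=24, kv_dim=256)
    hybrid = hybrid_cache(tokens, recurrent_layers=16, state_dim=2560,
                          attn_layers=8, kv_dim=256, window=2048)
    print(f"{minutes:>4} min  {tokens:>6} tokens  full={full:>12,}  hybrid={hybrid:>10,}")
```

The full-attention count grows linearly with the number of tokens, while the hybrid count stops growing once the sequence exceeds the local attention window.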

Optimization Features

Token Efficiency
USM-v2 semantic tokens at 25Hz (fixed-rate pseudo-text, 32k vocab)
Infra Optimization
demonstrated decoding 16.4k tokens (~10.9 min) in ~100s on TPU v5e
Model Optimization
initialize from text-pretrained RecurrentGemma to transfer LM knowledge; remove explicit positional encodings (NoPE) to help length extrapolation
System Optimization
higher throughput on TPU v5e with batch decoding (>120× vs Transformer at long lengths)
Training Optimization
train with subquadratic SSM layers to enable long-sequence training; segment training sequences to 30s/4min/16min to study length effects
Inference Optimization
constant memory autoregressive decoding via recurrent SSM state; windowed non-autoregressive acoustic decoding for parallel synthesis
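
The decoding figures quoted above (16.4k tokens, 25 Hz tokens, roughly 100 s of wall time) can be tied together with a short calculation. The function names are illustrative, and because the wall time is approximate and may reflect batched decoding, the derived speed-up over real time should be read as a rough estimate.

```python
TOKEN_RATE_HZ = 25  # semantic tokens per second of audio, from this summary


def audio_seconds(num_tokens: int, token_rate_hz: float = TOKEN_RATE_HZ) -> float:
    """Duration of audio represented by `num_tokens` semantic tokens."""
    return num_tokens / token_rate_hz


def real_time_factor(num_tokens: int, wall_seconds: float,
                     token_rate_hz: float = TOKEN_RATE_HZ) -> float:
    """Seconds of audio produced per second of wall-clock decoding."""
    return audio_seconds(num_tokens, token_rate_hz) / wall_seconds


tokens = 16_400
wall = 100.0  # approximate wall time quoted in this summary

duration = audio_seconds(tokens)
print(f"audio length: {duration:.0f} s ({duration / 60:.1f} min)")
print(f"decode rate: {tokens / wall:.0f} tokens/s")
print(f"speed-up over real time: {real_time_factor(tokens, wall):.2f}x")
```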

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: LibriSpeech-Long released under CC-BY 4.0; model weights not released

Risks & Boundaries

Limitations

Model weights are not released, limiting reproducibility of model-level claims.

Experiments focus on English audiobook-style data; real-world conversational domains may differ.

When Not To Use

If you need exact text-conditioned generation with guaranteed text fidelity, text-interleaved models or TTS cascades may be preferable.

If your use case requires released, off-the-shelf model weights for immediate deployment.

Failure Modes

Degeneration into noise or silence when models are not trained for the target generation length (observed for Transformer baselines).

Implicit end-of-sequence signals from non-causal tokenizers can cause premature stops unless inputs are padded carefully.

Core Entities

Models

SpeechSSM-2B, SpeechSSM-9B, SpeechSSM-X (extemporaneous), Griffin (hybrid SSM architecture), SpeechTransformer (matched Transformer baseline)

Metrics

ASR PPL, Gecko semantic similarity, SBERT (short-form), Win% (LLM-as-judge side-by-side), N-MOS / N-MOST, SCL (Semantic Coherence over Length), SpkrSim, sWUGGY, sBLiMP

Datasets

LibriSpeech-Long, LibriLight unlab-60k, LibriSpeech

Benchmarks

LibriSpeech-Long

Context Entities

Models

GSLM, AudioLM, TWIST, Spirit LM, VoxtLM, RecurrentGemma/Gemma-2B

Metrics

auto-BLEU, transcript perplexity (Gemma-2B), MMOS, sStoryCloze

Datasets

Common Voice, Multilingual LibriSpeech, VoxPopuli

Benchmarks

Long-Range Arena (context for SSMs)