SpeechSSM: a state-space spoken LM that generates coherent multi-minute speech

Overview

Decision Snapshot: Ready For Pilot

The paper provides quantitative long-form evaluations, throughput measurements on TPU hardware, and human ratings. This is strong evidence for practical gains, but the model weights are not public, which limits replication.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

License: LibriSpeech-Long released under CC-BY 4.0; model weights not released

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Se Jin Park, Julian Salazar, Aren Jansen, Keisuke Kinoshita, Yong Man Ro, RJ Skerry-Ryan

Links

Abstract / PDF / Data

Why It Matters For Business

SpeechSSM lets teams generate multi-minute, coherent speech with fixed memory and much faster decoding, lowering infrastructure and latency costs for audiobook, podcast, or voice-agent products.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

The paper introduces SpeechSSM, a family of spoken language models built with hybrid state-space layers to generate speech for many minutes in a single decoding session without text intermediates. SpeechSSM pairs a semantic tokenizer (USM-v2) with a speaker-conditioned acoustic stage (SoundStorm + SoundStream) and windowed tokenization for continuous decoding. The models (2B and 9B) match prior spoken LMs on short clips and substantially outperform Transformer baselines on multi-minute coherence, while using constant memory at decode time and offering much higher throughput. The authors also release LibriSpeech-Long (long-form evaluation splits) and advocate embedding-based and LLM-judged metrics for long-form evaluation.

Problem Statement

Current spoken language models fail to keep semantic and speaker coherence for generations beyond tens of seconds because speech token rates are high, Transformers scale poorly to long token sequences, and memory and compute blow up at inference. The paper asks: can we build a spoken LM that (1) generates arbitrarily long speech in bounded memory, (2) extrapolates quality beyond training lengths, and (3) can be evaluated with suitable long-form benchmarks?

Main Contribution

SpeechSSM: first spoken LM family using hybrid state-space layers (Griffin mix of gated recurrences + local attention) to generate unbounded long-form speech in fixed memory.

Practical pipeline: USM-v2 semantic tokens + speaker-prompted SoundStorm→SoundStream acoustic stage and windowed tokenization/decoding.

Key Findings

SpeechSSM-2B achieves better long-form transcript perplexity (ASR PPL) than baselines on 4min continuations.

Numbers: PPL 3.75 (SpeechSSM-2B) vs 4.74 (GSLM ⊞) on 4min test-clean

Practical Use: If you need more faithful long continuations, switch to an SSM-based speech LM like SpeechSSM rather than sliding-window Transformer extensions.

Evidence Ref: Table 3

SpeechSSM wins more side-by-side transcript judgments versus other models on 4min continuations.

Numbers: Win% vs other models: 50.0% (SpeechSSM-2B)

Practical Use: Use LLM-judged transcript side-by-sides to detect content-level quality gains from long-form SSM models.

Evidence Ref: Table 3
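
A win percentage of this kind can be tallied from pairwise judge verdicts. The sketch below is a minimal illustration, assuming each comparison is stored as a (model_a, model_b, winner) tuple where winner is a model name or "tie". The record format, the function name, and the choice to count ties as non-wins are all assumptions and are not taken from the paper.

```python
from collections import defaultdict

def win_rates(verdicts):
    """Return {model: percent of its comparisons that it won}.

    Ties count as non-wins for both sides (an assumption of this sketch).
    """
    wins = defaultdict(int)
    total = defaultdict(int)
    for model_a, model_b, winner in verdicts:
        total[model_a] += 1
        total[model_b] += 1
        if winner in (model_a, model_b):
            wins[winner] += 1
    return {model: 100.0 * wins[model] / total[model] for model in total}

# Toy verdicts with hypothetical model names.
verdicts = [
    ("ssm", "baseline", "ssm"),
    ("ssm", "baseline", "baseline"),
    ("ssm", "baseline", "ssm"),
    ("ssm", "baseline", "tie"),
]
rates = win_rates(verdicts)
print(rates)
```

In practice it may be worth presenting each pair to the judge in both orders and averaging, since LLM judges can favor one position.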

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ASR PPL (4min continuations)	SpeechSSM-2B: 3.75	GSLM ⊞: 4.74	−0.99	LibriSpeech-Long test-clean (4min)	Table 3 (PPL ↓)	Table 3
LLM-as-judge win rate vs baseline models (4min)	SpeechSSM-2B: 50.0% wins	TWIST-7B ⊞: 24.0% (example)	+26.0 pts	LibriSpeech-Long test-clean (4min)	Table 3 (Win% SSM-2B)	Table 3

What To Try In 7 Days

Download LibriSpeech-Long and rerun your long-continuation evaluations using Gecko embeddings and LLM side-by-side judging.

Prototype a hybrid SSM decoder (or use Griffin-like blocks) for long audio tokens to test throughput and memory benefits on your hardware.

Adopt windowed tokenization plus a speaker-conditioned acoustic stage (SoundStorm→SoundStream) to keep speaker identity when extending short models.
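
For the first item, one way to track semantic drift is to compare the embedding of the prompt transcript with embeddings of successive windows of the continuation transcript. The sketch below only shows the similarity bookkeeping. The vectors are toy stand-ins, and obtaining real embeddings (for example from Gecko) is left out because that API is not described here. The function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_over_time(prompt_vec, window_vecs):
    """Similarity between the prompt and each successive continuation window."""
    return [cosine(prompt_vec, w) for w in window_vecs]

# Toy 2-D embeddings standing in for real text embeddings.
prompt_vec = [1.0, 0.0]
window_vecs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
scores = similarity_over_time(prompt_vec, window_vecs)
print(scores)
```

A curve that falls quickly with window index suggests the continuation is losing the topic of the prompt.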

Agent Features

Memory

constant-size recurrent state at decode time

windowed tokenization to bound tokenizer/decoder memory
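
To see why a constant-size state matters at speech token rates, the sketch below compares how a full-attention key/value cache grows with sequence length against a recurrent state plus a fixed local-attention window. Only the 25 tokens per second rate comes from this summary. Every model dimension is a hypothetical value chosen for illustration and does not describe SpeechSSM's actual configuration.

```python
TOKENS_PER_SECOND = 25  # semantic token rate stated in this summary

def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Full-attention cache: keys and values for every past token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

def hybrid_state_bytes(seq_len, recurrent_layers, state_dim,
                       attn_layers, kv_heads, head_dim, window,
                       bytes_per_elem=2):
    """Fixed recurrent state plus a local-attention cache capped at `window` tokens."""
    recurrent = recurrent_layers * state_dim * bytes_per_elem
    local = kv_cache_bytes(min(seq_len, window), attn_layers,
                           kv_heads, head_dim, bytes_per_elem)
    return recurrent + local

# Hypothetical dimensions, for illustration only.
for minutes in (4, 16):
    n = minutes * 60 * TOKENS_PER_SECOND
    full = kv_cache_bytes(n, layers=24, kv_heads=8, head_dim=128)
    hybrid = hybrid_state_bytes(n, recurrent_layers=16, state_dim=2560,
                                attn_layers=8, kv_heads=1, head_dim=128,
                                window=2048)
    print(f"{minutes:>2} min: full={full / 2**20:.1f} MiB, "
          f"hybrid={hybrid / 2**20:.2f} MiB")
```

The full cache scales linearly with generated length, while the hybrid figure stops growing once the sequence exceeds the local window.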

Architectures

hybrid state-space model (SSM + local attention)

Griffin: gated LRUs interleaved with local multi-query attention

two-stage pipeline: semantic tokens → speaker-conditioned acoustic tokens
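
The idea behind a gated linear recurrence can be shown with a deliberately simplified update, h_t = a_t * h_(t-1) + (1 - a_t) * x_t, applied per channel. This is not Griffin's exact recurrent unit, whose gating and normalization differ. It is a minimal sketch meant only to show that the carried state keeps the same size however many steps are processed. All names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_recurrence_step(h_prev, x_t, gate_logits):
    """One per-channel update: keep a fraction a of the old state, mix in the input."""
    h_new = []
    for h, x, g in zip(h_prev, x_t, gate_logits):
        a = sigmoid(g)  # retention factor in (0, 1)
        h_new.append(a * h + (1.0 - a) * x)
    return h_new

def run_recurrence(inputs, gates, dim):
    """Scan over a sequence while carrying only a `dim`-sized state."""
    h = [0.0] * dim
    for x_t, g_t in zip(inputs, gates):
        h = gated_recurrence_step(h, x_t, g_t)
    return h

# 1000 steps of a constant input; the state stays 2 numbers wide throughout.
steps = 1000
final_state = run_recurrence([[1.0, 2.0]] * steps, [[0.0, 0.0]] * steps, dim=2)
print(len(final_state), final_state)
```

Because the update is linear in the state, layers of this kind can also be trained efficiently over long sequences, which is the property the hybrid design relies on.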

Optimization Features

Token Efficiency

USM-v2 semantic tokens at 25Hz (fixed-rate pseudo-text, 32k vocab)

Infra Optimization

demonstrated decoding 16.4k tokens (~10.9 min) in ~100s on TPU v5e
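
The figures above can be cross-checked with the 25 Hz token rate listed under Token Efficiency. The sketch below uses only numbers from this summary. The real-time factor is a derived quantity and inherits the approximation in the "~100s" timing.

```python
TOKENS_PER_SECOND = 25   # USM-v2 semantic token rate
tokens = 16_400          # decoded tokens reported above
decode_seconds = 100     # approximate wall-clock time reported above

audio_seconds = tokens / TOKENS_PER_SECOND       # seconds of speech represented
audio_minutes = audio_seconds / 60
decode_rate = tokens / decode_seconds            # tokens generated per second
realtime_factor = audio_seconds / decode_seconds # seconds of speech per second of compute

print(f"{audio_minutes:.1f} min of speech, {decode_rate:.0f} tokens/s, "
      f"{realtime_factor:.2f}x real time")
```

The token count and the stated duration agree: 16.4k tokens at 25 Hz is about 10.9 minutes of audio.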

Model Optimization

initialize from text-pretrained RecurrentGemma to transfer LM knowledge

remove explicit positional encodings (NoPE) to help length extrapolation

System Optimization

higher throughput on TPU v5e with batch decoding (>120× vs Transformer at long lengths)

Training Optimization

train with subquadratic SSM layers to enable long-sequence training

segment training sequences to 30s/4min/16min to study length effects
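
Segmenting by duration translates into fixed token counts at the 25 Hz rate. The sketch below is a minimal illustration with a hypothetical helper name. Keeping the final short segment instead of dropping or padding it is an assumption, since the summary does not say how remainders are handled.

```python
TOKENS_PER_SECOND = 25  # semantic token rate stated in this summary

def segment_tokens(tokens, segment_seconds):
    """Split a token sequence into consecutive chunks of a given duration.

    The last chunk may be shorter (kept as-is; an assumption of this sketch).
    """
    seg_len = int(segment_seconds * TOKENS_PER_SECOND)
    if seg_len <= 0:
        raise ValueError("segment_seconds must cover at least one token")
    return [tokens[i:i + seg_len] for i in range(0, len(tokens), seg_len)]

# A 10-minute recording is 15,000 tokens at 25 Hz.
recording = list(range(10 * 60 * TOKENS_PER_SECOND))
for seconds in (30, 4 * 60, 16 * 60):
    segments = segment_tokens(recording, seconds)
    print(seconds, [len(s) for s in segments][:3], len(segments))
```

The three settings correspond to 750, 6,000, and 24,000 tokens per training sequence.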

Inference Optimization

constant memory autoregressive decoding via recurrent SSM state

windowed non-autoregressive acoustic decoding for parallel synthesis
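
Windowed processing of a long token stream can be sketched as follows: split the stream into overlapping windows, run a per-window function, and stitch the outputs by discarding the overlapped prefix of every window after the first. The window size, the overlap, the stitching rule, and the stand-in function are all assumptions. The summary does not describe how the acoustic stage actually joins windows.

```python
def windowed_apply(tokens, window, overlap, fn):
    """Apply `fn` to overlapping windows and stitch results back together.

    `fn` must return one output per input token. Outputs for the overlapped
    prefix of each later window are dropped (an assumption of this sketch).
    """
    if not 0 <= overlap < window:
        raise ValueError("need 0 <= overlap < window")
    hop = window - overlap
    out = []
    start = 0
    while start < len(tokens):
        chunk = tokens[start:start + window]   # never longer than `window`
        result = fn(chunk)
        keep_from = 0 if start == 0 else overlap
        out.extend(result[keep_from:])
        if start + window >= len(tokens):
            break
        start += hop
    return out

# Stand-in for an acoustic decoder: doubles each token id.
stream = list(range(10))
decoded = windowed_apply(stream, window=4, overlap=1,
                         fn=lambda chunk: [t * 2 for t in chunk])
print(decoded)
```

Since each window is independent of the others, the per-window calls could also be batched or run in parallel, and peak memory depends on the window size, not on the total length.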

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: LibriSpeech-Long released under CC-BY 4.0; model weights not released

Data URLs

https://github.com/google-deepmind/librispeech-long/

https://google.github.io/tacotron/publications/speechssm/

Risks & Boundaries

Limitations

Model weights are not released, limiting reproducibility of model-level claims.

Experiments focus on English audiobook-style data; real-world conversational domains may differ.

When Not To Use

If you need exact text-conditioned generation with guaranteed text fidelity, text-interleaved or TTS cascades may be preferable.

If your use case requires released, off-the-shelf model weights for immediate deployment.

Failure Modes

Degeneration into noise or silence when models are not trained for the target generation length (observed for Transformer baselines).

Implicit end-of-sequence signals from non-causal tokenizers can cause premature stops unless inputs are padded carefully.

Core Entities

Models

SpeechSSM-2B, SpeechSSM-9B, SpeechSSM-X (extemporaneous), Griffin (hybrid SSM architecture), SpeechTransformer (matched Transformer baseline)

Metrics

ASR PPL, Gecko semantic similarity, SBERT (short-form), Win% (LLM-as-judge side-by-side), N-MOS / N-MOST, SCL (Semantic Coherence over Length), SpkrSim, sWUGGY, sBLiMP

Datasets

LibriSpeech-Long, LibriLight unlab-60k, LibriSpeech

Benchmarks

LibriSpeech-Long

Context Entities

Models

GSLM, AudioLM, TWIST, Spirit LM, VoxtLM, RecurrentGemma/Gemma-2B

Metrics

auto-BLEU, transcript perplexity (Gemma-2B), MMOS, sStoryCloze

Datasets

Common Voice, Multilingual LibriSpeech, VoxPopuli

Benchmarks

Long-Range Arena (context for SSMs)
