Audio-Language Models Papers — Parsed & Scored for Practitioners

STC connector + audio branch: stronger video and audio understanding for Video-LLMs

0.70

0.55

0.45

10

VideoLLaMA 2 improves video and audio understanding while keeping changes to the encoders and the LLM minimal; this lowers the data and compute needed to reach strong open-source performance and speeds integration into product pipelines.

Key finding

Adding the STC connector (RegStage + 3D conv) yields the best average video QA performance in the architecture sweep.

Numbers: Avg. acc. 45.1 (Table 1 green line)

Use an LLM to write a structured audio script, compile it to code, and run specialist audio models to generate narrated, mixed audio scenes.

0.60

0.70

0.50

6

WavJourney turns natural language briefs into finished mixed audio by chaining existing specialist models, reducing the need to build large unified audio models and enabling faster prototyping of audio content.

Key finding

WavJourney beats AudioGen and AudioLDM in human subjective scores on AudioCaps.

Numbers: OVL 3.75 vs AudioGen 3.56; REL 3.74 vs 3.52
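
The script-then-compile idea described in this card can be sketched as follows. This is a minimal illustration under stated assumptions: the JSON schema, the field names, the model labels in `MODEL_FOR`, and the `compile_script` function are all hypothetical and are not WavJourney's actual format. The sketch only resolves timing and produces a call plan; it does not run any audio model.

```python
import json

# Hypothetical audio script such as an LLM might emit; this schema is assumed.
SCRIPT_JSON = """
[
  {"type": "speech", "layout": "foreground", "text": "Welcome to the show.", "len": 3.0},
  {"type": "music", "layout": "background", "text": "soft piano", "begin": 0, "end": 1},
  {"type": "sound_effect", "layout": "foreground", "text": "door closes", "len": 1.5}
]
"""

# Placeholder labels for specialist models; no real model is invoked here.
MODEL_FOR = {
    "speech": "tts",
    "music": "text_to_music",
    "sound_effect": "text_to_audio",
}


def compile_script(script):
    """Compile a script into a list of timed generation calls.

    Foreground items play back to back. A background item spans the
    foreground items numbered begin..end (inclusive, zero-based).
    """
    # First pass: lay foreground items end to end on the timeline.
    spans, t = [], 0.0
    for item in script:
        if item["layout"] == "foreground":
            spans.append((t, t + item["len"]))
            t += item["len"]

    # Second pass: emit one timed call per item, in script order.
    plan, fg_index = [], 0
    for item in script:
        if item["layout"] == "foreground":
            start, end = spans[fg_index]
            fg_index += 1
        else:
            start, end = spans[item["begin"]][0], spans[item["end"]][1]
        plan.append({
            "model": MODEL_FOR[item["type"]],
            "prompt": item["text"],
            "start": start,
            "duration": end - start,
        })
    return plan


plan = compile_script(json.loads(SCRIPT_JSON))
for call in plan:
    print(call)
```

In a real pipeline each plan entry would be dispatched to the matching generator and the resulting clips mixed at their start offsets; that step is left out here.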

FAVOR: frame-level audio+visual fusion and causal Q-Former to help LLMs understand speech, sounds and video together

0.60

5

FAVOR enables LLMs to reason over speech, sounds and video together at frame level, improving video QA and matching tasks that power search, content moderation, AV indexing, and multimedia assistants.

Key finding

FAVOR substantially improves video QA accuracy on the evaluated AVEB split.

Numbers: FAVOR 13B Video QA 49.3% vs InstructBLIP 13B 21.0% (Table 2)

Audio Flamingo: an audio-aware LLM with few-shot learning, retrieval, and multi-turn chat

0.70

0.60

0.50

4

Audio Flamingo brings strong audio understanding and fast few-shot adaptation into a single, relatively small model—useful for building audio assistants, content moderation, audio search, and music tools without costly per-task fine-tuning.

Key finding

Audio Flamingo outperforms prior SOTA on many audio benchmarks (captioning, QA, classification).

Numbers: Clotho-v2 CIDEr 0.465 vs 0.441; ClothoAQA unanimous 86.9% vs 74.9%

WavLLM: dual-encoder LLaMA with prompt-aware LoRA for robust multi-task speech understanding

0.80

0.60

2

WavLLM shows a practical path to add robust speech understanding to chat LLMs without re-training big LLM weights; it delivers higher accuracy on multi-step speech tasks and better robustness to prompt variation, cutting error rates and reducing manual prompt engineering.

Key finding

WavLLM achieves state-of-the-art ASR among 7B speech-chat models on LibriSpeech.

Numbers: WER 2.0% (test-clean), 4.8% (test-other)
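
For readers less familiar with the metric, WER is the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is a plain implementation of that definition; the function name and the example sentences are illustrative, and published numbers usually also apply text normalization (casing, punctuation) that is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word)  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
print(f"{word_error_rate('the cat sat on the mat', 'the cat sit on mat'):.3f}")
```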

Step-Audio: production-ready unified speech-text model with dual-codebook audio tokens, synthetic TTS data engine, and real-time tool-calls

0.80

0.75

0.80

0

Step-Audio cuts voice data costs with a synthetic-data TTS engine and delivers a production-ready speech agent that supports real-time tool calls and fine-grained voice control—useful for voice assistants, contact centers, and localization pipelines.

Key finding

Dual-codebook tokenization reduces CER on the tested ASR sets.

Numbers: CER improved from 25.5% → 18.4% (3B ASR ablation)
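
When comparing error-rate improvements across cards, it helps to separate the absolute change (in percentage points) from the relative change. The short sketch below applies that arithmetic to the two CER figures quoted above; the helper name is illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error that was removed."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


cer_before, cer_after = 25.5, 18.4  # percent, from the card above
absolute = cer_before - cer_after
relative = relative_reduction(cer_before, cer_after)
print(f"{absolute:.1f} points absolute, {relative:.1%} relative")
# prints: 7.1 points absolute, 27.8% relative
```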

SpeechSSM: a state-space spoken LM that generates coherent multi-minute speech

0.60

0.70

0.60

0

SpeechSSM lets teams generate multi-minute, coherent speech with fixed memory and much faster decoding, lowering infrastructure and latency costs for audiobook, podcast, or voice-agent products.

Key finding

SpeechSSM-2B achieves better long-form transcript perplexity (ASR PPL) than baselines on 4-minute continuations.

Numbers: PPL 3.75 (SpeechSSM-2B) vs 4.74 (GSLM ⊞) on 4-minute test-clean
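
Perplexity is the exponential of the average negative log-likelihood per token, so lower is better. The sketch below shows only that formula, assuming per-token natural-log probabilities are already available from some scoring language model applied to the transcript; the scoring model and the transcription step are not shown and the function name is illustrative.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-probability (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Made-up example: a model that assigns probability 0.25 to every token
# is as uncertain as a uniform choice among 4 options.
print(perplexity([math.log(0.25)] * 4))
```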

Zero-shot end-to-end spoken medical QA that matches cascades while using far fewer resources

0.45

0.60

0.70

0

E2E spoken QA can cut deployment cost and model footprint while keeping similar zero‑shot accuracy, making it attractive for resource‑limited or edge medical apps and privacy‑sensitive deployments.

Key finding

The end‑to‑end zero‑shot entailment approach uses far fewer parameters than cascaded systems while matching their accuracy.

Numbers: Up to 14.7× fewer params; +0.5% avg accuracy

Fine-tuning HuBERT with sentence-level self-distillation produces clear syllable-like segments and faster unsupervised syllable discovery

0.60

0.65

0.45

0

SD-HuBERT provides better unsupervised syllable units and faster segmentation without labelled data, saving annotation cost and improving unit quality for speech products like spoken language models and TTS.

Key finding

SD-HuBERT improves unsupervised syllable boundary detection over baselines.

Numbers: Syllable F1 = 67% (HuBERT 35%, VG-HuBERT 64%)
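
A boundary F1 score of this kind is typically computed by counting a predicted boundary as correct when it falls within a small tolerance of an unmatched reference boundary. The sketch below assumes a 50 ms tolerance and a simple greedy one-to-one matching; both are assumptions for illustration and may differ from the scoring used in the paper. The timestamps are made up.

```python
def boundary_f1(ref_times, hyp_times, tol=0.05):
    """F1 over boundaries (seconds), one-to-one matching within +/- tol."""
    ref, hyp = sorted(ref_times), sorted(hyp_times)
    hits, i, j = 0, 0, 0
    # Walk both sorted lists; each boundary can be matched at most once.
    while i < len(ref) and j < len(hyp):
        if abs(ref[i] - hyp[j]) <= tol:
            hits += 1
            i += 1
            j += 1
        elif hyp[j] < ref[i]:
            j += 1  # spurious predicted boundary
        else:
            i += 1  # missed reference boundary
    precision = hits / len(hyp) if hyp else 0.0
    recall = hits / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# 2 of 4 predictions match 2 of 3 references: P = 0.5, R = 2/3.
print(f"{boundary_f1([0.20, 0.50, 0.90], [0.22, 0.60, 0.88, 1.30]):.3f}")
```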

MoST: a modality-aware Mixture-of-Experts that mixes speech and text in one LLM

0.70

0.60

0

MoST shows modality-aware MoE can improve speech/text quality while using only open data and sparse activation, enabling efficient production systems for ASR, TTS, and spoken QA.

Key finding

MoST delivers competitive ASR and TTS accuracy on standard English benchmarks.

Numbers: ASR WER LS-Clean 2.0%, LS-Other 3.7%; TTS WER LS-Clean 6.0%

OWL: teach audio LLMs room geometry (depth + RIR) so they localize sound better and explain why

0.60

0.75

0.50

0

Geometry-aware audio models give more accurate and explainable direction/distance estimates from binaural audio, which helps applications like robot navigation, AR audio placement, and multi-source monitoring.

Key finding

Geometry supervision improves angular localization.

Numbers: 11° reduction in mean angular error (DoA)
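
Angular error needs wrap-around handling: a prediction of 350° against a true direction of 10° is off by 20°, not 340°. The sketch below shows that calculation for azimuth-only angles in degrees; treating direction of arrival as a single azimuth angle is a simplifying assumption, and the function names and sample values are illustrative.

```python
def angular_error_deg(pred: float, true: float) -> float:
    """Smallest absolute difference between two azimuths, in [0, 180]."""
    diff = abs(pred - true) % 360.0
    return min(diff, 360.0 - diff)


def mean_angular_error(preds, trues):
    """Average wrap-aware error over paired predictions and ground truth."""
    if len(preds) != len(trues) or not preds:
        raise ValueError("need two non-empty lists of equal length")
    return sum(angular_error_deg(p, t) for p, t in zip(preds, trues)) / len(preds)


# Errors of 20 and 10 degrees average to 15 degrees.
print(mean_angular_error([350.0, 90.0], [10.0, 100.0]))
```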

Audio-aware LLMs (Gemini, GPT‑4o-audio) can judge speaking styles with human-like agreement

0.40

0.30

0.35

0

Automated ALLM judges can speed up speech-style QA and model comparisons and cut their cost, replacing many routine human labels for style-focused tests.

Key finding

Gemini judge correlates with human raters on voice-style instruction-following.

Numbers: Pearson's r = 0.640 (Gemini–human) vs human–human 0.596
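
Pearson's r measures how linearly two sets of scores move together, which is why it can be compared between judge-human and human-human pairs. The sketch below computes it from the definition; the function name and the sample ratings are made up for illustration and are not data from the paper.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


# Hypothetical 1-5 style ratings for the same six clips.
judge_scores = [4, 2, 5, 3, 1, 4]
human_scores = [5, 2, 4, 3, 2, 3]
print(f"r = {pearson_r(judge_scores, human_scores):.3f}")
```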

WhisperInject: covertly embed model-native harmful text into benign audio to jailbreak multimodal LLMs

1.00

0.80

0.60

0

Voice interfaces can be hijacked to make models produce harmful or policy-violating text without changing what humans hear, so companies must add audio-level safety, monitoring, and access controls.

Key finding

Stage 1 (RL-PGD) reliably finds model-native harmful payloads.

Numbers: AdvBench 86.7% avg, JailbreakBench 67.0% avg

LISTEN: use LLM-synthesized negative examples to cut audio hallucinations while training only a small audio adapter

0.70

0.60

0.70

0

Reducing false audio detections increases trust in systems (e.g., emergency alerts) while cutting data and compute needs by training only a small adapter instead of full LLMs.

Key finding

Including synthesized negative samples raises audio-hallucination accuracy from 66.0% to 77.5% on the evaluated benchmark.

Numbers: Acc: Qwen-Audio-Chat 66.0% → Ours (Positive+Negative) 77.5% (Table 3)