WavLLM: dual-encoder LLaMA with prompt-aware LoRA for robust multi-task speech understanding

March 31, 2024 · 8 min

Overview

Production Readiness

0.8

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

2

Authors

Shujie Hu, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Hongkun Hao, Jing Pan, Xunying Liu, Jinyu Li, Sunit Sivasankaran, Linquan Liu, Furu Wei

Links

Abstract / PDF

Why It Matters For Business

WavLLM shows a practical path to add robust speech understanding to a chat LLM without retraining the LLM's base weights. It delivers higher accuracy on multi-step speech tasks and is more robust to prompt variation, which cuts error rates and reduces manual prompt engineering.

Summary TLDR

WavLLM adds listening to a chat LLM by combining two frozen speech encoders (Whisper for semantics, WavLM for speaker/acoustic cues), modality adapters, and a prompt-aware LoRA adapter. Training uses a two-stage curriculum: mixed single-task fine-tuning then advanced multi-task training with GPT-4–generated prompts. Results on public speech benchmarks and in-house multi-task/CoT tests show state-of-the-art ASR (WER 2.0/4.8) and large gains in multi-task instruction following (IFR 92.5% vs. 24–58% for other 7B models). Code, models, and evaluation data are available on the project GitHub.
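The dual-encoder idea in the summary can be sketched as follows. This is a toy illustration, not the paper's implementation: mean pooling stands in for the learned modality adapters, frame-wise concatenation is an assumed fusion rule, and the names mean_pool and fuse are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_pool(frames: List[Vector], factor: int) -> List[Vector]:
    """Average every `factor` consecutive frames.

    Stand-in for a learned modality adapter (assumption, not the paper's layer).
    """
    pooled = []
    for start in range(0, len(frames), factor):
        chunk = frames[start:start + factor]
        dim = len(chunk[0])
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled


def fuse(semantic: List[Vector], acoustic: List[Vector]) -> List[Vector]:
    """Concatenate the two feature streams frame by frame."""
    if len(semantic) != len(acoustic):
        raise ValueError("streams must have the same number of frames")
    return [s + a for s, a in zip(semantic, acoustic)]


# Toy inputs: 8 frames each, 2-dim "semantic" and 1-dim "acoustic" features.
semantic = [[float(i), 1.0] for i in range(8)]
acoustic = [[10.0 * i] for i in range(8)]

# Downsample each stream by 4, then fuse into one sequence for the LLM.
fused = fuse(mean_pool(semantic, 4), mean_pool(acoustic, 4))
print(fused)
```

In a real system the two streams would come from the frozen Whisper and WavLM encoders, and the fused sequence would be projected into the LLM's embedding space.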

Problem Statement

Open speech-LLMs struggle to generalize to unseen or complex multi-task instructions. They are sensitive to prompt wording and often fail to decompose complex tasks into substeps (Chain-of-Thought). Prior LoRA tuning used a single scaling setting for all prompts, hurting multi-task performance and robustness.

Main Contribution

A dual-encoder architecture: Whisper for semantic content and WavLM for acoustic/speaker features.

A two-stage curriculum: mixed single-task fine-tuning, then advanced multi-task training with prompt-aware LoRA adaptation.

A prompt-aware LoRA weight adapter that produces prompt-dependent scaling of LoRA updates.

A large, GPT-4 augmented multi-task and SQA training set and an in-house CoT evaluation to measure multi-step speech reasoning.
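To make the prompt-aware LoRA contribution concrete, here is a minimal sketch of a LoRA layer whose low-rank update is scaled by a value computed from the prompt. The scalar sigmoid gate, the function names, and the toy matrices are all assumptions for illustration; the paper's adapter may produce a different (for example, per-dimension) scaling.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def prompt_gate(prompt_embedding: Vector, gate_weights: Vector, bias: float = 0.0) -> float:
    """Hypothetical gate: map a prompt embedding to a scale in (0, 1)."""
    z = sum(w * p for w, p in zip(gate_weights, prompt_embedding)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def lora_forward(x: Vector, W: Matrix, A: Matrix, B: Matrix, scale: float) -> Vector:
    """y = W x + scale * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (identity for the toy case)
A = [[1.0, 1.0]]               # rank-1 down-projection
B = [[1.0], [0.0]]             # rank-1 up-projection
x = [1.0, 2.0]

# Different prompts yield different gates, hence different effective LoRA strength.
scale = prompt_gate(prompt_embedding=[0.0, 0.0], gate_weights=[1.0, 1.0])
y = lora_forward(x, W, A, B, scale)
print(scale, y)
```

A fixed LoRA scale corresponds to passing the same constant for every prompt; the prompt-aware variant lets that value change with the instruction.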

Key Findings

State-of-the-art ASR for 7B speech-chat models on LibriSpeech.

Numbers: WER 2.0% (test-clean), 4.8% (test-other)

Large improvement in following multi-task instructions after advanced training.

Numbers: II-task IFR 92.5% (advanced) vs 26.25% (mixed stage)

Higher accuracy on multi-task completion vs prior open-source systems.

Numbers: II-task accuracy 62.44% vs 19.58–37.99% (other models)

Prompt-aware adapter improves single-task and multi-task metrics.

Numbers: ASR WER 2.1/5.1 -> 2.0/4.9; ST BLEU +0.3–0.6

Decoupling acoustic vs semantic encoding improves robustness.

Numbers: Relative ASR WER reduction ~13.0% / 11.1% (test-clean/test-other) with WavLM
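Relative reductions like the ones quoted in these findings are computed as (before - after) / before. The sketch below applies that formula to the prompt-adapter WERs listed above (2.1/5.1 -> 2.0/4.9). The absolute without-WavLM WERs are not given in this summary, so the 13.0% / 11.1% figures are not re-derived here.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction in percent: 100 * (before - after) / before."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# Prompt-adapter ablation numbers quoted in the findings above.
clean = relative_reduction(2.1, 2.0)   # test-clean
other = relative_reduction(5.1, 4.9)   # test-other
print(f"{clean:.1f}% / {other:.1f}%")
```

With these inputs the prompt adapter alone accounts for a relative WER reduction of roughly 4.8% on test-clean and 3.9% on test-other.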

Results

ASR WER (LibriSpeech)

Value: 2.0% (test-clean), 4.8% (test-other)

Baseline: Whisper+LLaMA baseline 2.7% / 5.2%

ST (En->De) BLEU

Value: 23.6 (CoVoST2), 21.7 (MuST-C)

Baseline: Qwen-Audio-Chat 23.2 / 18.4

II-Task (multi-task) instructional metrics

Value: Accuracy 62.44%, IFR 92.50%

Baseline: SALMONN/Qwen 19.58–37.99% acc, IFR 24.25–57.75%

CoT task (ASR+SUMM+En2De) (Gigaword)

Value: R-1 16.5, R-2 4.1, R-L 14.7, BERTScore 70.60

Baseline: SALMONN-7B R-1 11.9, R-2 2.4, R-L 10.7

Zero-shot SQA (English Listening Comprehension)

Value: 67.55% (strict) / 67.55% (semantically equivalent included)

Baseline: Whisper+LLaMA 59.30%; Qwen-Audio-Chat 25.50% (strict) / 54.25%

Who Should Care

What To Try In 7 Days

Prototype a dual-encoder input (semantic + acoustic) into your LLM pipeline.

Add a small prompt-aware adapter that scales LoRA per instruction style.

Build a small curriculum: single-task fine-tune then multi-task mixing with varied prompts (use GPT-4 to generate prompts).
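The curriculum step can be prototyped with a simple schedule generator like the one below. It is a sketch under stated assumptions: the task names, the two prompt templates (hand-written stand-ins for GPT-4-generated prompts), and the function curriculum are all hypothetical and do not come from the paper.

```python
import random
from typing import Dict, Iterator, List

# Hypothetical task identifiers; swap in whatever tasks your data supports.
SINGLE_TASKS = ["asr", "st", "sv", "er", "sqa"]

# Stand-ins for varied, GPT-4-generated multi-task prompts.
TEMPLATES = [
    "First do {first}, then do {second}.",
    "Please run {first} followed by {second} on this clip.",
]


def curriculum(rng: random.Random, stage_one_steps: int, stage_two_steps: int,
               templates: List[str]) -> Iterator[Dict]:
    """Yield training items: single-task first, then combined tasks with varied prompts."""
    for _ in range(stage_one_steps):
        task = rng.choice(SINGLE_TASKS)
        yield {"stage": 1, "tasks": [task], "prompt": f"Perform {task} on the audio."}
    for _ in range(stage_two_steps):
        first, second = rng.sample(SINGLE_TASKS, k=2)  # two distinct tasks
        prompt = rng.choice(templates).format(first=first, second=second)
        yield {"stage": 2, "tasks": [first, second], "prompt": prompt}


schedule = list(curriculum(random.Random(0), stage_one_steps=3, stage_two_steps=2,
                           templates=TEMPLATES))
for item in schedule:
    print(item)
```

The point of the second stage is prompt diversity: the same task pair should appear under several wordings so the model does not overfit to one phrasing.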

Optimization Features

Token Efficiency

  • Modality adapters downsample to 80 ms stride before LLM
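As a rough worked example of what an 80 ms stride means for sequence length, the sketch below counts speech positions for a clip of a given duration. Rounding up is an assumption; the exact padding and framing used by the model are not stated in this summary, and speech_tokens is a hypothetical helper.

```python
import math


def speech_tokens(duration_s: float, stride_ms: int = 80) -> int:
    """Approximate number of speech positions fed to the LLM at the given stride."""
    return math.ceil(duration_s * 1000 / stride_ms)


for seconds in (1, 10, 30):
    print(seconds, "s ->", speech_tokens(seconds), "positions")
```

Under this assumption a 30-second clip occupies about 375 positions of the LLM context, compared with 1,500 at a 20 ms stride.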

Model Optimization

  • LoRA

System Optimization

  • Keep the large backbones (Whisper, WavLM, LLaMA) frozen to reduce trainable params

Training Optimization

  • Two-stage curriculum learning: mixed single-task then advanced multi-task
  • GPT-4 generated diverse prompts for instruction variety

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Cannot autonomously convert a one-shot complex task into Chain-of-Thought steps.
  • Focuses mainly on English speech; cross-lingual coverage requires extra training.
  • Does not synthesize speech (no speech generation capability).
  • Continuous speech representations may increase adversarial vulnerability.

When Not To Use

  • If you need a model to generate speech audio (TTS); WavLLM does not synthesize speech.
  • If you require autonomous decomposition of arbitrary one-shot tasks into CoT without extra signals.
  • Where full training-data provenance and exact reproducibility of large-scale mixed datasets is mandatory.

Failure Modes

  • Occasionally confuses 'transcription' with 'translation', producing the wrong output type.
  • Verbatim repetition of instructions instead of completing them.
  • Partial task completion in multi-task instructions if trained without prompt adapter.
  • Potential adversarial prompts causing unsafe or incorrect responses.

Core Entities

Models

  • WavLLM (this paper, 7B)
  • Whisper-large-v2
  • WavLM-base
  • LLaMA-2-7B-chat
  • SALMONN
  • Qwen-Audio-Chat
  • Whisper+LLaMA (baseline)
  • LoRA

Metrics

  • WER
  • BLEU
  • Accuracy
  • ROUGE-1
  • ROUGE-2
  • ROUGE-L
  • BERTScore
  • IFR (Instruction-Following Rate)
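For readers less familiar with the first metric in this list, WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a plain-Python version of that standard definition; it does no text normalization (casing, punctuation), which real evaluations usually apply and which is not described in this summary.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


print(wer("the cat sat on the mat", "the cat sat on mat"))
```

One dropped word out of six reference words gives a WER of about 16.7% in this example.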

Datasets

  • LibriSpeech
  • CoVoST2
  • MuST-C
  • VoxCeleb
  • IEMOCAP
  • AMI
  • Fisher
  • Switchboard
  • Gigaword
  • Alpaca

Benchmarks

  • LibriSpeech test-clean/test-other
  • CoVoST2 (En->De)
  • MuST-C
  • VoxCeleb1
  • IEMOCAP
  • In-house SQA and II-Task
  • In-house CoT evaluation