Zero-shot end-to-end spoken medical QA that matches cascades while using far fewer resources

June 9, 2024 | 7 min

Overview

Production Readiness

0.45

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Yanis Labrak, Adel Moumen, Richard Dufour, Mickael Rouvier

Links

Abstract / PDF

Why It Matters For Business

E2E spoken QA can cut deployment cost and model footprint while keeping similar zero‑shot accuracy, making it attractive for resource‑limited or edge medical apps and privacy‑sensitive deployments.

Summary TLDR

This paper builds a synthetic medical spoken multiple‑choice benchmark (≈48 hours, 6,545 questions) and tests zero‑shot end‑to‑end (E2E) audio→text entailment methods against standard cascade systems (ASR → LLM). Under zero‑shot conditions, E2E models (Whisper, CLAP, Pengi, SpeechGPT) can match or slightly exceed the accuracy of similarly sized cascades while using far fewer parameters in some setups (up to 14.7× fewer). Results are based on synthetic TTS audio and the zero‑shot setting, so expect gaps versus tuned models and real conversational speech.

Problem Statement

Spoken QA in healthcare needs deep understanding across long audio. Typical pipelines transcribe speech (ASR) then run an LLM, which adds compute cost and compounds errors. The paper asks: can a single end‑to‑end speech model perform zero‑shot multiple‑choice medical QA with less compute and similar accuracy?

Main Contribution

An audio→text entailment method to do zero‑shot multiple‑choice SQA from speech.

A synthetic medical spoken QA benchmark (≈48h audio, 6,545 items) derived from MMLU, MedQA, MedMCQA.

A head‑to‑head zero‑shot comparison of 4 E2E audio models and cascade ASR+LLM systems.

An analysis showing where SQA information sits across encoder layers in common audio encoders.
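The audio→text entailment idea in the first contribution can be pictured as scoring each answer option against the spoken question and picking the best match. The sketch below is a minimal illustration only, assuming embeddings have already been computed in a shared audio-text space (the kind a contrastive model such as CLAP provides). The names `cosine` and `pick_option` and the toy vectors are hypothetical and not taken from the paper, whose exact scoring may differ per model.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pick_option(audio_emb: Sequence[float],
                option_embs: Sequence[Sequence[float]]) -> int:
    """Return the index of the answer option most similar to the audio.

    audio_emb: embedding of the spoken question (assumed precomputed).
    option_embs: one text embedding per answer option (assumed precomputed).
    """
    scores = [cosine(audio_emb, opt) for opt in option_embs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vectors standing in for real encoder outputs (hypothetical values).
audio = [0.9, 0.1, 0.0]
options = [
    [0.0, 1.0, 0.0],   # option A
    [1.0, 0.2, 0.0],   # option B
    [0.0, 0.0, 1.0],   # option C
    [-1.0, 0.0, 0.0],  # option D
]
predicted = pick_option(audio, options)
print("ABCD"[predicted])
```

No training step is involved: the choice comes entirely from similarity between pretrained representations, which is what makes the setup zero‑shot.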

Key Findings

End‑to‑end zero‑shot entailment uses far fewer parameters while matching the accuracy of similarly sized cascades.

Numbers: Up to 14.7× fewer params; +0.5% avg accuracy

Small contrastive audio model (CLAP) performs competitively despite tiny size.

Numbers: CLAP: 153M–193M params; 14.7×–44.3× smaller vs some cascades

Lower ASR word error rate (WER) does not always mean better SQA accuracy in cascades.

Numbers: Whisper avg WER: Small 8.53% | Medium 7.14% | Large‑V2 6.87%

Larger LLMs improve cascade zero‑shot QA accuracy.

Numbers: Up to 11.67% accuracy gap between small LLM (Phi 1.5) and LLaMa 2 13B (on Whisper Medium)

Results

Accuracy

Value: E2E models ~24–27.5% avg (varies by model and task)

Baseline: Cascade best (LLaMa 2 13B + Whisper Medium) 38.3% avg

Resource comparison (parameters)

Value: E2E up to 14.7× fewer params than a cascade of 1.3B LLM + 1.55B ASR

Baseline: Cascade: 1.3B LLM + 1.55B ASR
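The parameter ratios quoted in this summary can be reproduced with simple arithmetic. The sketch below is a back-of-the-envelope check, not the paper's own accounting: it assumes the E2E model in the ratio is the 193M CLAP variant, and that the 44.3× figure pairs the 1.55B ASR with a 7B LLM rounded to exactly 7,000M parameters. Both pairings are assumptions inferred from the numbers on this page.

```python
# Parameter counts in millions, taken from figures quoted in this summary.
SMALL_LLM = 1_300   # 1.3B LLM in the baseline cascade
ASR = 1_550         # 1.55B ASR in the baseline cascade
LARGE_LLM = 7_000   # assumption: a 7B LLM, rounded, behind the 44.3x figure
E2E_MODEL = 193     # assumption: the 193M CLAP variant


def param_ratio(cascade_parts, e2e_params):
    """How many times more parameters the cascade has than the E2E model."""
    return sum(cascade_parts) / e2e_params


small_cascade_ratio = param_ratio([SMALL_LLM, ASR], E2E_MODEL)
large_cascade_ratio = param_ratio([LARGE_LLM, ASR], E2E_MODEL)
print(f"{small_cascade_ratio:.2f}x, {large_cascade_ratio:.2f}x")
```

Under these assumptions the first ratio is about 14.77, which agrees with the reported 14.7× if that figure was truncated rather than rounded, and the second is about 44.30.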

ASR transcriptions (WER)

Value: Whisper Large V2 avg WER 6.87% (Medium 7.14%, Small 8.53%)
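For readers less familiar with the metric, WER is the word-level edit distance between the ASR output and the reference transcript, divided by the number of reference words. The sketch below is a minimal implementation under the assumption that both strings are already normalised (casing, punctuation); the normalisation the paper applied is not stated here, and the example sentence is made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one medical term split into two words by the ASR.
wer = word_error_rate(
    "the patient has acute renal failure",
    "the patient has a cute renal failure",
)
print(f"WER = {wer:.3f}")
```

A single mis-segmented term costs two edits here yet changes the clinical meaning entirely, which is one way a low average WER can coexist with poor downstream QA accuracy.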

Who Should Care

What To Try In 7 Days

Run CLAP entailment on a small set of medical audio MCQs to prototype low‑cost SQA.

Compare Whisper Medium E2E entailment vs your ASR+LLM cascade by measuring task accuracy not just WER.

Synthesize a small TTS medical QA set and test model sensitivity to speaker variety and audio length.
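When comparing an E2E system against a cascade, the quantity to track is multiple-choice accuracy on the same items. The sketch below shows one simple way to compute it; the function name `accuracy` and the answer lists are hypothetical placeholders for your own predictions and gold labels, not results from the paper.

```python
def accuracy(predictions, gold):
    """Fraction of items where the predicted option equals the gold option."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one item")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Hypothetical toy data: gold answers and two systems' predictions.
gold = ["A", "C", "B", "D", "A"]
e2e_preds = ["A", "C", "D", "D", "B"]
cascade_preds = ["A", "C", "B", "D", "B"]

e2e_acc = accuracy(e2e_preds, gold)
cascade_acc = accuracy(cascade_preds, gold)
print(f"E2E: {e2e_acc:.1%}  cascade: {cascade_acc:.1%}")
```

Scoring both systems on identical items keeps the comparison fair; with a small prototype set, differences of a few points are likely within noise.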

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Audio is synthetic TTS with limited speaker diversity; not equal to real clinical speech.
  • Zero‑shot evaluation only; no fine‑tuning experiments shown.
  • Simplified single‑turn MCQ format lacks real conversation dynamics.
  • No multilingual experiments; only English tested.

When Not To Use

  • For high‑stakes clinical QA where labeled real‑world speech and fine‑tuning are possible.
  • When conversational context or multi‑turn dialogue matters.
  • If you require best possible accuracy and can afford large ASR+LLM stacks.

Failure Modes

  • Performance may drop on real, noisy, or accent‑diverse speech, because the benchmark audio is synthetic TTS and models were only evaluated zero‑shot on it.
  • Encoder misalignment can cause SpeechGPT and similar models to underperform on speech inputs.
  • Task framing mismatch: simplified MCQ format may not reflect clinical question complexity.

Core Entities

Models

  • Whisper Small
  • Whisper Medium
  • Whisper Large V2
  • CLAP (base, fused, large general)
  • Pengi
  • SpeechGPT
  • HuBERT
  • wav2vec2
  • WavLM
  • Data2Vec
  • Phi 1.5
  • LLaMa 2 (7B, 13B)

Metrics

  • Accuracy
  • Word Error Rate (WER)

Datasets

  • SpokenMedicalQA (synthetic)
  • MMLU (subset of 6 healthcare subjects)
  • MedQA (test set)
  • MedMCQA (validation used as test)

Benchmarks

  • New medical spoken multiple‑choice benchmark (8 tasks, 6,545 items, 47h41m)

Context Entities

Models

  • Bloom
  • LLaMa 2
  • SpeechT5

Datasets

  • Clotho-AQA
  • Spoken-SQuAD
  • LibriSQA