Overview
Production Readiness
0.45
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
E2E spoken QA can cut deployment cost and model footprint while keeping similar zero‑shot accuracy, making it attractive for resource‑limited or edge medical apps and privacy‑sensitive deployments.
Summary TLDR
This paper builds a synthetic medical spoken multiple‑choice benchmark (≈48 hours, 6,545 questions) and tests zero‑shot end‑to‑end (E2E) audio→text entailment methods against standard cascade systems (ASR → LLM). E2E models (Whisper, CLAP, Pengi, SpeechGPT) can match or slightly exceed the accuracy of similarly sized cascades under zero‑shot conditions, and in some setups they do so with up to 14.7× fewer parameters. Results are based on synthetic TTS audio and the zero‑shot setting, so expect gaps versus tuned models and real conversational speech.
Problem Statement
Spoken question answering (SQA) in healthcare requires deep understanding across long audio. Typical pipelines transcribe speech with ASR and then run an LLM, which adds compute cost and compounds errors. The paper asks: can a single end‑to‑end speech model perform zero‑shot multiple‑choice medical QA with less compute and similar accuracy?
Main Contribution
An audio→text entailment method for zero‑shot multiple‑choice spoken QA.
A synthetic medical spoken QA benchmark (≈48 hours of audio, 6,545 items) derived from MMLU, MedQA, and MedMCQA.
A head‑to‑head zero‑shot comparison of 4 E2E audio models and cascade ASR+LLM systems.
An analysis showing where SQA information sits across encoder layers in common audio encoders.
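The entailment idea can be illustrated with a minimal sketch. It assumes a contrastive setup (in the spirit of CLAP) in which the spoken question and each textual answer option are embedded in a shared space, and the option with the highest cosine similarity is selected. The summary does not describe the paper's exact scoring function, so `pick_option` and the toy vectors below are hypothetical placeholders, not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pick_option(
    audio_emb: Sequence[float],
    option_embs: Sequence[Sequence[float]],
) -> Tuple[int, List[float]]:
    """Score every answer option against the audio and return the best index.

    Assumption: audio_emb and option_embs come from encoders that share
    an embedding space (hypothetical here; no real model is called).
    """
    scores = [cosine(audio_emb, emb) for emb in option_embs]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores


# Toy, hand-made embeddings standing in for real encoder outputs.
audio = [1.0, 0.0]
options = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
best, scores = pick_option(audio, options)
print("selected option index:", best)
```

In a real experiment the two embeddings would come from the audio and text towers of the chosen model. For generative models such as Whisper, a likelihood‑based score over each option may be a closer fit than cosine similarity.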
Key Findings
End‑to‑end zero‑shot entailment matches cascade accuracy for similarly sized systems while using far fewer parameters (up to 14.7× fewer in some setups).
The small contrastive audio model CLAP performs competitively despite its tiny size.
Lower ASR word error rate (WER) does not always mean better SQA accuracy in cascades.
Larger LLMs improve cascade zero‑shot QA accuracy.
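For readers unfamiliar with WER, the sketch below computes it as word‑level edit distance divided by reference length. Because WER weights every word equally, two transcripts can share the same WER even when only one of them corrupts a clinically important term. The sentences are invented for illustration and are not taken from the benchmark.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: both hypotheses contain exactly one substitution.
reference = "the patient takes warfarin daily"
filler_error = "a patient takes warfarin daily"      # error on a filler word
term_error = "the patient takes metformin daily"     # error on the drug name

wer_filler = word_error_rate(reference, filler_error)
wer_term = word_error_rate(reference, term_error)
print(wer_filler, wer_term)
```

Both hypotheses score the same WER, yet only the second changes the drug name a downstream LLM would rely on. This is one way a lower WER can coexist with worse QA accuracy; the summary does not state the paper's own explanation.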
Results
Accuracy
Resource comparison (parameters)
ASR transcriptions (WER)
Who Should Care
What To Try In 7 Days
Run CLAP entailment on a small set of medical audio MCQs to prototype low‑cost SQA.
Compare Whisper Medium E2E entailment vs your ASR+LLM cascade by measuring task accuracy not just WER.
Synthesize a small TTS medical QA set and test model sensitivity to speaker variety and audio length.
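A small harness for the comparison suggested above might look like the following sketch. It scores each system by task accuracy on the same gold answers; the system names and predictions are made‑up placeholders, not results from the paper.

```python
from typing import Dict, Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of multiple-choice items answered correctly."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("gold must not be empty")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def compare_systems(
    system_preds: Dict[str, Sequence[str]],
    gold: Sequence[str],
) -> Dict[str, float]:
    """Compute accuracy per system on the same gold answer key."""
    return {name: accuracy(preds, gold) for name, preds in system_preds.items()}


# Placeholder answer key and predictions (hypothetical, for illustration only).
gold = ["A", "C", "B", "D"]
system_preds = {
    "e2e_entailment": ["A", "C", "B", "A"],
    "asr_llm_cascade": ["A", "B", "B", "A"],
}
results = compare_systems(system_preds, gold)
for name, acc in results.items():
    print(f"{name}: {acc:.2f}")
```

Reporting accuracy alongside WER for the cascade makes it easier to see whether transcription quality is actually the bottleneck.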
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Audio is synthetic TTS with limited speaker diversity; not equal to real clinical speech.
- Zero‑shot evaluation only; no fine‑tuning experiments shown.
- Simplified single‑turn MCQ format lacks real conversation dynamics.
- No multilingual experiments; only English tested.
When Not To Use
- For high‑stakes clinical QA where labeled real‑world speech and fine‑tuning are possible.
- When conversational context or multi‑turn dialogue matters.
- If you require the best possible accuracy and can afford large ASR+LLM stacks.
Failure Modes
- Performance may drop on real, noisy, or accent‑diverse speech, because the models were evaluated only on synthetic TTS audio.
- Encoder misalignment can cause SpeechGPT and similar models to underperform on speech inputs.
- Task framing mismatch: simplified MCQ format may not reflect clinical question complexity.
Core Entities
Models
- Whisper Small
- Whisper Medium
- Whisper Large V2
- CLAP (base, fused, large general)
- Pengi
- SpeechGPT
- HuBERT
- wav2vec2
- WavLM
- Data2Vec
- Phi 1.5
- LLaMa 2 (7B, 13B)
Metrics
- Accuracy
- Word Error Rate (WER)
Datasets
- SpokenMedicalQA (synthetic)
- MMLU (6 healthcare subjects subset)
- MedQA (test set)
- MedMCQA (validation used as test)
Benchmarks
- New medical spoken multiple‑choice benchmark (8 tasks, 6,545 items, 47h41m)
Context Entities
Models
- Bloom
- LLaMa 2
- SpeechT5
Datasets
- Clotho-AQA
- Spoken-SQuAD
- LibriSQA

