Zero-shot end-to-end spoken medical QA that matches cascades while using far fewer resources

Overview

Decision Snapshot: Needs Validation

The idea is practical and lowers parameter cost, but evidence is zero‑shot on synthetic TTS data; results may change with real speech or fine‑tuning.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.78

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 45%

Novelty: 60%

Authors

Yanis Labrak, Adel Moumen, Richard Dufour, Mickael Rouvier

Links

Abstract / PDF / Code / Data

Why It Matters For Business

E2E spoken QA can cut deployment cost and model footprint while keeping similar zero‑shot accuracy, making it attractive for resource‑limited or edge medical apps and privacy‑sensitive deployments.

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist, Engineering Lead

Summary TLDR

This paper builds a synthetic medical spoken multiple‑choice benchmark (47h41m of audio, 6,545 questions) and tests zero‑shot end‑to‑end (E2E) audio→text entailment methods against standard cascade systems (ASR → LLM). E2E models (Whisper, CLAP, Pengi, SpeechGPT) can match or slightly exceed cascade accuracy for similarly sized systems while using up to 14.7× fewer parameters in some setups, all under zero‑shot conditions. Results are based on synthetic TTS audio and the zero‑shot setting, so expect gaps versus tuned models and real conversational speech.

Problem Statement

Spoken QA in healthcare needs deep understanding across long audio. Typical pipelines transcribe speech (ASR) then run an LLM, which adds compute cost and compounds errors. The paper asks: can a single end‑to‑end speech model perform zero‑shot multiple‑choice medical QA with less compute and similar accuracy?

Main Contribution

An audio→text entailment method for zero‑shot multiple‑choice spoken question answering (SQA) directly from speech.

A synthetic medical spoken QA benchmark (47h41m of audio, 6,545 items) derived from MMLU, MedQA, and MedMCQA.
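
The entailment idea can be illustrated with a minimal sketch, assuming a CLAP‑style model that embeds audio and text into a shared space and picks the answer option whose text embedding is closest to the spoken question. The vectors below are hand‑written toy values, and `cosine` and `pick_option` are hypothetical helper names, not the paper's code; the paper's actual scoring procedure may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pick_option(audio_embedding, option_embeddings):
    """Score every option against the audio and return the best label plus all scores."""
    scores = {
        label: cosine(audio_embedding, emb)
        for label, emb in option_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Toy vectors standing in for real encoder outputs (assumption, not real data).
audio_embedding = [0.9, 0.1, 0.2]
option_embeddings = {
    "A": [0.1, 0.9, 0.0],
    "B": [0.8, 0.2, 0.3],
    "C": [0.0, 0.1, 0.9],
    "D": [-0.5, 0.4, 0.1],
}

best, scores = pick_option(audio_embedding, option_embeddings)
print(best, {k: round(v, 3) for k, v in scores.items()})
```

In a real setup the toy vectors would be replaced by the audio and text embeddings of whichever encoder is under test; no fine‑tuning is involved, which is what makes the setting zero‑shot.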

Key Findings

End‑to‑end zero‑shot entailment uses far fewer parameters while matching the accuracy of similarly sized cascades.

Numbers: Up to 14.7× fewer params; +0.5% avg accuracy

Practical Use: Use E2E entailment to cut model size and hardware needs when you must deploy lightweight SQA systems without fine‑tuning.

Evidence Ref: Abstract; Section 4.2

Small contrastive audio model (CLAP) performs competitively despite tiny size.

Numbers: CLAP: 153M–193M params; 14.7×–44.3× smaller vs some cascades

Practical Use: Try compact contrastive audio encoders (CLAP) for low‑cost SQA prototypes before scaling to big LLMs.

Evidence Ref: Section 4.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	E2E models: ~24–27.5% avg (varies by model and task)	Cascade best (LLaMa 2 13B + Whisper Medium): 38.3% avg	Best E2E ≈ 11 percentage points below top cascade (38.3 − 27.5 ≈ 10.8); comparable to some smaller cascades	Across 8 medical SQA tasks (MMLU subsets, MedQA, MedMCQA)	Table 3; Table 1	Section 4.1; Section 4.2
Resource comparison (parameters)	E2E up to 14.7× fewer params than a cascade of 1.3B LLM + 1.55B ASR	Cascade: 1.3B LLM + 1.55B ASR	Up to 14.7× reduction	Overall benchmark	Abstract; Section 4.2	Abstract; Section 4.2
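
The parameter ratios can be sanity‑checked with a few lines of arithmetic. This is a sketch under assumptions: the summary does not say which E2E model the 14.7× figure refers to, so the 193M CLAP size is used as a guess, and the 7B + 1.55B cascade behind the 44.3× figure is likewise an assumed pairing. `param_ratio` is a hypothetical helper name.

```python
def param_ratio(cascade_params_m, e2e_params_m):
    """How many times larger the cascade is than the E2E model (sizes in millions)."""
    return sum(cascade_params_m) / e2e_params_m

# Assumption: the larger CLAP size quoted in Key Findings is the E2E model compared.
CLAP_LARGE_M = 193

# 1.3B LLM + 1.55B ASR, the cascade named in the table row above.
small_cascade = param_ratio([1300, 1550], CLAP_LARGE_M)

# Assumption: a nominal 7B LLM + 1.55B ASR; this pairing is not stated in the summary.
large_cascade = param_ratio([7000, 1550], CLAP_LARGE_M)

print(f"{small_cascade:.2f}x, {large_cascade:.2f}x")
```

Under these assumptions the first ratio lands just under 14.8 and the second at about 44.3, which is consistent with the reported 14.7×–44.3× range if the lower figure was truncated rather than rounded.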

What To Try In 7 Days

Run CLAP entailment on a small set of medical audio MCQs to prototype low‑cost SQA.

Compare Whisper Medium E2E entailment against your ASR+LLM cascade by measuring task accuracy, not just WER.

Synthesize a small TTS medical QA set and test model sensitivity to speaker variety and audio length.
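
For the comparison step, both metrics can be computed with the standard library alone. The sketch below is illustrative only: the predictions, gold labels, and transcripts are made‑up toy values, and `accuracy` and `word_error_rate` are hypothetical helper names rather than functions from any evaluation toolkit.

```python
def accuracy(predictions, gold):
    """Fraction of multiple-choice answers that match the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] is the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# Toy data (assumption): four questions answered by each system.
gold = ["A", "C", "B", "D"]
cascade_preds = ["A", "C", "D", "D"]
e2e_preds = ["A", "B", "B", "D"]

cascade_acc = accuracy(cascade_preds, gold)
e2e_acc = accuracy(e2e_preds, gold)
wer = word_error_rate(
    "the patient has acute renal failure",
    "the patient has a cute renal failure",
)
print(cascade_acc, e2e_acc, round(wer, 3))
```

WER only applies to the cascade, since the E2E path produces no transcript. Reporting it next to task accuracy helps show whether transcription errors are what actually costs the cascade answers.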

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://huggingface.co/SpokenMedicalQA

Data URLs

https://huggingface.co/SpokenMedicalQA

Risks & Boundaries

Limitations

Audio is synthetic TTS with limited speaker diversity; not equal to real clinical speech.

Zero‑shot evaluation only; no fine‑tuning experiments shown.

When Not To Use

For high‑stakes clinical QA where labeled real‑world speech and fine‑tuning are possible.

When conversational context or multi‑turn dialogue matters.

Failure Modes

Performance may drop on real, noisy, or accent‑diverse speech, because the benchmark audio is synthetic TTS and the models were evaluated zero‑shot on it.

Encoder misalignment can cause SpeechGPT and similar models to underperform on speech inputs.

Core Entities

Models

Whisper Small, Whisper Medium, Whisper Large V2, CLAP (base, fused, large general), Pengi, SpeechGPT, HuBERT, wav2vec2, WavLM, Data2Vec, Phi 1.5, LLaMa 2 (7B, 13B)

Metrics

Accuracy, Word Error Rate (WER)

Datasets

SpokenMedicalQA (synthetic), MMLU (6 healthcare subjects subset), MedQA (test set), MedMCQA (validation used as test)

Benchmarks

New medical spoken multiple‑choice benchmark (8 tasks, 6,545 items, 47h41m)

Context Entities

Models

Bloom, LLaMa 2, SpeechT5

Datasets

Clotho-AQA, Spoken-SQuAD, LibriSQA
