Zero-shot end-to-end spoken medical QA that matches cascades while using far fewer resources

June 9, 2024 · 7 min read

Overview

Decision Snapshot: Needs Validation

The idea is practical and lowers parameter cost, but evidence is zero‑shot on synthetic TTS data; results may change with real speech or fine‑tuning.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.78

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 45%

Novelty: 60%

Authors

Yanis Labrak, Adel Moumen, Richard Dufour, Mickael Rouvier

Why It Matters For Business

E2E spoken QA can cut deployment cost and model footprint while keeping similar zero‑shot accuracy, making it attractive for resource‑limited or edge medical apps and privacy‑sensitive deployments.

Summary TLDR

This paper builds a synthetic medical spoken multiple‑choice benchmark (≈48 hours of audio, 6,545 questions) and tests zero‑shot end‑to‑end (E2E) audio→text entailment methods against standard cascade systems (ASR → LLM). E2E models (Whisper, CLAP, Pengi, SpeechGPT) can match or slightly exceed the accuracy of similarly sized cascades while using far fewer parameters in some setups (up to 14.7× fewer) under zero‑shot conditions. Results are based on synthetic TTS audio and the zero‑shot setting, so expect gaps versus tuned models and real conversational speech.

Problem Statement

Spoken QA in healthcare needs deep understanding across long audio. Typical pipelines transcribe speech (ASR) then run an LLM, which adds compute cost and compounds errors. The paper asks: can a single end‑to‑end speech model perform zero‑shot multiple‑choice medical QA with less compute and similar accuracy?

Main Contribution

An audio→text entailment method for zero‑shot multiple‑choice spoken question answering (SQA).

A synthetic medical spoken QA benchmark (≈48h of audio, 6,545 items) derived from MMLU, MedQA, and MedMCQA.
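
The summary does not spell out how the entailment scoring works, so the following is only a minimal sketch of one plausible reading: a contrastive model (CLAP‑style) embeds the spoken question and each written answer option into a shared space, and the option whose embedding is closest to the audio is chosen. The embeddings are toy vectors, and `cosine` and `pick_option` are hypothetical helpers, not functions from the paper's code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pick_option(audio_emb, option_embs):
    """Score every answer option against the audio and return the best label.

    Assumption: audio_emb and the option embeddings come from encoders that
    share one embedding space (as in contrastive audio-text models).
    """
    scores = {label: cosine(audio_emb, emb) for label, emb in option_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings standing in for real encoder outputs (illustrative only).
audio_emb = [0.9, 0.1, 0.2]
option_embs = {
    "A": [0.1, 0.9, 0.0],
    "B": [0.8, 0.2, 0.3],
    "C": [0.0, 0.1, 0.9],
    "D": [-0.5, 0.4, 0.1],
}

best, scores = pick_option(audio_emb, option_embs)
print(best, {label: round(s, 3) for label, s in scores.items()})
```

No transcript is produced at any point in this sketch, which is the property that separates an E2E setup from an ASR → LLM cascade.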

Key Findings

End‑to‑end zero‑shot entailment uses far fewer parameters while matching the accuracy of similarly sized cascades.

Numbers: Up to 14.7× fewer params; +0.5% avg accuracy

Practical Use: Use E2E entailment to cut model size and hardware needs when you must deploy lightweight SQA systems without fine‑tuning.

Evidence Ref: Abstract; Section 4.2

Small contrastive audio model (CLAP) performs competitively despite tiny size.

Numbers: CLAP: 153M–193M params; 14.7×–44.3× smaller vs some cascades

Practical Use: Try compact contrastive audio encoders (CLAP) for low‑cost SQA prototypes before scaling to big LLMs.

Evidence Ref: Section 4.2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | E2E models: ~24–27.5% avg (varies by model and task) | Cascade best (LLaMa2 13B + Whisper Medium): 38.3% avg | E2E ≈ up to −11% vs top cascade; comparable to some smaller cascades | Across 8 medical SQA tasks (MMLU subsets, MedQA, MedMCQA) | Table 3; Table 1 | Section 4.1; Section 4.2 |
| Resource comparison (parameters) | E2E up to 14.7× fewer params than a cascade of 1.3B LLM + 1.55B ASR | Cascade: 1.3B LLM + 1.55B ASR | Up to 14.7× reduction | Overall benchmark | Abstract; Section 4.2 | Abstract; Section 4.2 |
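
The two deltas in the results can be checked with a few lines of arithmetic. This sketch uses only figures quoted in this summary. Pairing the 1.3B + 1.55B cascade with the 193M CLAP variant is an assumption; the summary does not say which E2E model the 14.7× figure refers to.

```python
# Accuracy figures quoted in the results (average over the 8 tasks).
e2e_best_acc = 27.5        # upper end of the reported E2E range, in %
cascade_best_acc = 38.3    # LLaMa2 13B + Whisper Medium, in %

# The gap can be read as absolute percentage points or as a relative drop.
gap_points = e2e_best_acc - cascade_best_acc
relative_gap = gap_points / cascade_best_acc * 100

# Parameter counts quoted in the results.
cascade_params = 1.3e9 + 1.55e9   # LLM + ASR
compact_params = 193e6            # assumption: largest CLAP variant

ratio = cascade_params / compact_params

print(f"accuracy gap: {gap_points:.1f} points ({relative_gap:.1f}% relative)")
print(f"parameter ratio: {ratio:.2f}x")
```

Under these assumptions the "−11%" delta corresponds to roughly 10.8 percentage points rather than a relative drop, and the parameter ratio comes out close to the reported 14.7×.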

What To Try In 7 Days

Run CLAP entailment on a small set of medical audio MCQs to prototype low‑cost SQA.

Compare Whisper Medium E2E entailment against your ASR+LLM cascade by measuring task accuracy, not just WER.

Synthesize a small TTS medical QA set and test model sensitivity to speaker variety and audio length.
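
For the comparison suggested above, the two metrics can be computed side by side. This is a minimal, dependency‑free sketch. `word_error_rate` and `accuracy` are illustrative helpers, and the sample sentences and answer labels are made up for the example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions)
    divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def accuracy(predictions, gold):
    """Fraction of multiple-choice questions answered correctly."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Made-up example: one transcription error that changes the medical meaning.
wer = word_error_rate("which drug treats hypertension",
                      "which drug treats hypotension")
acc = accuracy(["A", "C", "B", "D"], ["A", "B", "B", "D"])
print(f"WER = {wer:.2f}, accuracy = {acc:.2f}")
```

A single substituted word gives a low WER here but could flip the answer to a medical question, which is why task accuracy is the more informative number for this comparison.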

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Audio is synthetic TTS with limited speaker diversity; not equal to real clinical speech.

Zero‑shot evaluation only; no fine‑tuning experiments shown.

When Not To Use

For high‑stakes clinical QA where labeled real‑world speech and fine‑tuning are possible.

When conversational context or multi‑turn dialogue matters.

Failure Modes

Performance may drop on real, noisy, or accent‑diverse speech, because all results were measured on synthetic TTS audio.

Encoder misalignment can cause SpeechGPT and similar models to underperform on speech inputs.

Core Entities

Models

Whisper Small, Whisper Medium, Whisper Large V2, CLAP (base, fused, large general), Pengi, SpeechGPT, HuBERT, wav2vec2, WavLM, Data2Vec, Phi 1.5, LLaMa 2 (7B, 13B)

Metrics

Accuracy, Word Error Rate (WER)

Datasets

SpokenMedicalQA (synthetic), MMLU (6 healthcare subjects subset), MedQA (test set), MedMCQA (validation used as test)

Benchmarks

New medical spoken multiple‑choice benchmark (8 tasks, 6,545 items, 47h41m)

Context Entities

Models

Bloom, LLaMa 2, SpeechT5

Datasets

Clotho-AQA, Spoken-SQuAD, LibriSQA