Train LLMs to read 12‑lead ECGs and draft clinical reports using lightweight multimodal alignment

March 7, 2024 | 7 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

3

Authors

Zhongwei Wan, Che Liu, Xin Wang, Chaofan Tao, Hui Shen, Jing Xiong, Rossella Arcucci, Huaxiu Yao, Mi Zhang

Why It Matters For Business

MEIT can automate first-draft ECG reports and speed up clinician workflows. It adds little extra compute (LoRA plus a small ECG encoder) and relies on public datasets, so teams can prototype quickly.

Summary TLDR

MEIT is a practical pipeline that attaches a small ECG encoder and a lightweight concatenation fusion to existing open-source LLMs, then instruction-tunes them on paired ECG signals and reports. On two public ECG datasets (MIMIC-IV-ECG: 800K pairs, PTB-XL: 21K pairs) instruction-tuned LLMs beat smaller language models on automatic metrics, show better zero-shot transfer across datasets, maintain some robustness to added noise, and score reasonably against expert annotations. Code and benchmark are released.

Problem Statement

Writing clinical reports from 12‑lead ECG waveforms is time-consuming, and the task differs from image-to-report generation. Existing work focuses on classification rather than free-text report generation, and there is no standardized benchmark for comparing multimodal ECG→text methods.

Main Contribution

MEIT: a multimodal instruction-tuning pipeline that injects ECG embeddings into frozen LLMs via a concatenation-based attention fusion without adding new backbone parameters.
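As a rough illustration of what a concatenation-based attention fusion can look like, the sketch below prepends ECG embeddings to the keys and values of a single attention head, so text queries can attend to ECG tokens without any new weight matrices. This is only one plausible reading of the description above, not the authors' implementation. The function name `fused_attention`, the single-head layout, the missing causal mask, and the toy vectors are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fused_attention(text_q, text_k, text_v, ecg_k, ecg_v):
    """Single-head attention with ECG keys/values prepended to the text ones.

    Hypothetical sketch: no causal mask, no multi-head split, no projections.
    Each argument is a list of vectors (one vector per token).
    """
    keys = ecg_k + text_k        # concatenate along the sequence axis
    values = ecg_v + text_v
    scale = math.sqrt(len(text_q[0]))
    outputs = []
    for q in text_q:
        weights = softmax([dot(q, k) / scale for k in keys])
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs

# Toy example: 2 ECG tokens, 3 text tokens, embedding size 2 (made-up numbers).
ecg_tokens = [[0.5, 0.1], [0.2, 0.9]]
text_tokens = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
fused = fused_attention(text_tokens, text_tokens, text_tokens,
                        ecg_tokens, ecg_tokens)
print(len(fused), len(fused[0]))  # one output vector per text token
```

The output keeps the text sequence length, which is why this style of fusion can sit inside a frozen backbone: only the attention context grows, not the number of backbone parameters.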

A large ECG report benchmark and four evaluation tasks: report quality, zero-shot transfer, robustness to signal noise, and alignment to expert annotations.

Extensive experiments over ten open-source LLM backbones (2.7B–70B scale) on MIMIC-IV-ECG (800K pairs) and PTB-XL (21K pairs); public code and prompt generation process released.

Key Findings

Instruction-tuned LLMs substantially outperform small pretrained language models on report-generation metrics.

Numbers (example): LLaMA-3-Instruct BLEU-4 0.61 vs GPT2-Large 0.476 on MIMIC-IV-ECG (Table 1)

The concatenated-fusion (MEIT) alignment beats other fusion designs for ECG+text.

Numbers: BLEU-4: MEIT 0.543 > LLaVA 0.529 and Flamingo 0.527 (Table 5)

Instruction tuning on a large ECG dataset improves zero-shot transfer to a different hospital dataset.

Numbers: Zero-shot with instruction tuning on MIMIC → PTB-XL shows higher scores than no instruction tuning (Figure 2)

Models degrade with added ECG noise but some backbones are more robust.

Numbers: Performance drops as SNR decreases; the Mistral family keeps relatively higher ROUGE-L and METEOR under noise (Figure 3 and
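To make the noise-robustness setup concrete, here is a minimal sketch of corrupting a signal at a chosen signal-to-noise ratio. It assumes additive white Gaussian noise, which this summary does not specify; the function names and the synthetic sine wave standing in for an ECG lead are hypothetical.

```python
import math
import random

def add_noise_at_snr(signal, snr_db, seed=0):
    """Add zero-mean Gaussian noise so the result has roughly `snr_db` dB SNR.

    Assumption: white Gaussian noise; real ECG noise (baseline wander,
    muscle artifacts, powerline hum) has a different structure.
    """
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]

def measure_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean signal and its noisy version."""
    signal_energy = sum(c * c for c in clean)
    noise_energy = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10 * math.log10(signal_energy / noise_energy)

# Synthetic stand-in for one ECG lead: a 5000-sample sine wave.
clean_lead = [math.sin(2 * math.pi * 5 * t / 500) for t in range(5000)]
for target_db in (20, 10, 0):
    noisy_lead = add_noise_at_snr(clean_lead, target_db)
    print(target_db, round(measure_snr_db(clean_lead, noisy_lead), 2))
```

Sweeping `target_db` downward and re-scoring the generated reports at each level gives a robustness curve of the kind the finding describes.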

Generated reports score near clinicians on several human-judged axes but are not perfect.

Numbers: LLaMA-3-Instruct scores: Medical terminology 4.52, Logical consistency 4.38, Completeness 4.01, Diagnostic accuracy 3.98

Results

BLEU-4 (MIMIC-IV-ECG)

Value: 0.61

Baseline: GPT2-Large 0.476

BLEU-4 (PTB-XL)

Value: 0.467

Baseline: GPT2-Large 0.32

BERTScore F1 (MIMIC-IV-ECG)

Value: 0.771

Baseline: GPT2-Large 0.613

Diagnostic accuracy (human rating)

Value: 3.98

Baseline: LLaMA-2-Instruct 3.6

Fusion method BLEU-4

Value: 0.543

Baseline: Flamingo cross-attn 0.527
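The value and baseline pairs in this section are easier to compare as absolute and relative gains. The short script below does that arithmetic using only the numbers listed above; the dictionary keys are descriptive labels chosen here, not names from the paper.

```python
# (value, baseline) pairs copied from the Results section above.
results = {
    "BLEU-4 (MIMIC-IV-ECG)": (0.61, 0.476),
    "BLEU-4 (PTB-XL)": (0.467, 0.32),
    "BERTScore F1 (MIMIC-IV-ECG)": (0.771, 0.613),
    "Diagnostic accuracy (human rating)": (3.98, 3.6),
    "Fusion method BLEU-4": (0.543, 0.527),
}

def relative_gain(value, baseline):
    """Improvement over the baseline as a fraction of the baseline."""
    return (value - baseline) / baseline

for name, (value, baseline) in results.items():
    absolute = value - baseline
    print(f"{name}: +{absolute:.3f} absolute, "
          f"{relative_gain(value, baseline):.1%} relative")
```

Read this way, the gap between fusion designs (about 3% relative) is much smaller than the gap between instruction-tuned LLMs and GPT2-Large, which is worth keeping in mind when deciding where to spend engineering effort.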

Who Should Care

What To Try In 7 Days

Run MEIT code on a held-out subset of MIMIC-IV-ECG to reproduce paper metrics.

Attach the lightweight ECG encoder + concatenated-fusion to an open LLM (e.g., LLaMA-2-7B) and LoRA-finetune for a few epochs with bf16.

Evaluate generated drafts with clinicians on a small sample; compare the time spent editing drafts against the time spent writing reports manually.

Optimization Features

Token Efficiency

  • max sequence length 256 tokens for generation

Infra Optimization

  • A100 GPUs; the timing table reports training runs on 4×A100

Model Optimization

  • LoRA
  • freeze LLM backbone to reduce train cost
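To see why LoRA with a frozen backbone keeps training cheap, it helps to count parameters. For a frozen weight of shape d_out × d_in, LoRA trains two small matrices of rank r, adding r × (d_in + d_out) parameters. The hidden size of 4096 and rank of 8 below are illustrative assumptions, not settings taken from the paper.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_in + d_out)

hidden_size = 4096   # hypothetical projection width
rank = 8             # hypothetical LoRA rank

full_params = hidden_size * hidden_size
lora_params = lora_trainable_params(hidden_size, hidden_size, rank)
fraction = lora_params / full_params

print(f"frozen weight: {full_params:,} params")
print(f"LoRA factors:  {lora_params:,} params ({fraction:.2%} of the frozen weight)")
```

Under these assumptions each adapted projection trains well under 1% of the parameters it wraps, which is the main reason optimizer state and gradient memory stay small.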

System Optimization

  • DeepSpeed used for larger models

Training Optimization

  • mixed-precision bf16
  • linear LR schedule with warmup
  • LoRA
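A linear learning-rate schedule with warmup can be written in a few lines. The sketch below ramps linearly from zero to a peak rate and then decays linearly back to zero; the peak rate, step counts, and function name are placeholder assumptions, since this summary does not list the paper's hyperparameters.

```python
def linear_warmup_decay(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at `step`: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

# Hypothetical settings for illustration only.
TOTAL_STEPS, WARMUP_STEPS, PEAK_LR = 1000, 100, 2e-4
for step in (0, 50, 100, 550, 1000):
    lr = linear_warmup_decay(step, TOTAL_STEPS, WARMUP_STEPS, PEAK_LR)
    print(step, f"{lr:.2e}")
```

The warmup phase keeps early updates small while the newly attached adapter and encoder weights are still poorly scaled.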

Inference Optimization

  • freezing the backbone limits the extra memory needed; inference cost still grows with model size
  • suggested future use of quantization/compression

Reproducibility

Data Urls

  • MIMIC-IV-ECG (public subset)
  • PTB-XL (public)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Generated reports can hallucinate and are not fully explainable; the paper notes the need for external, verified knowledge to improve safety.
  • Diagnostic accuracy is below expert level; not ready for unsupervised clinical decisions.
  • Performance depends on training data scale and domain match; PTB-XL (small) shows lower scores.

When Not To Use

  • Do not use as the sole diagnostic tool or in high-risk clinical decisions without expert oversight.
  • Avoid deploying without local validation on devices/hospitals with different ECG protocols.

Failure Modes

  • Hallucinated diagnoses or incorrect causal claims in reports
  • Performance drop on noisy or out-of-distribution ECG recordings
  • Overconfidence in phrasing that may mislead non-expert readers

Core Entities

Models

  • LLaMA-1
  • LLaMA-2-Instruct
  • LLaMA-3-Instruct
  • Mistral
  • Mistral-Instruct
  • GPT-Neo
  • GPT-NeoX
  • GPT-J
  • BLOOM
  • OPT
  • GPT2-Medium
  • GPT2-Large
  • BART-Large
  • T5-Large

Metrics

  • BLEU-1
  • BLEU-2
  • BLEU-3
  • BLEU-4
  • METEOR
  • ROUGE-1
  • ROUGE-2
  • ROUGE-L
  • CIDEr-D
  • BERTScore
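For readers unfamiliar with the headline metric, here is a minimal sentence-level BLEU-4 sketch: the geometric mean of clipped 1- to 4-gram precisions times a brevity penalty. It assumes whitespace tokenization, a single reference, and no smoothing, so its scores will not match the paper's evaluation tooling exactly. The example report strings are invented.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    """Unsmoothed sentence-level BLEU-4 against a single reference."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precision_sum += 0.25 * math.log(clipped / total)
    # Brevity penalty punishes candidates shorter than the reference.
    if len(cand) >= len(ref):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision_sum)

# Invented example strings, for illustration only.
reference_report = "sinus rhythm normal ecg no acute changes"
generated_report = "sinus rhythm normal ecg no changes"
print(round(bleu4(generated_report, reference_report), 4))
```

Because BLEU only rewards n-gram overlap, a clinically wrong report can still score well, which is why the benchmark pairs it with semantic metrics and expert ratings.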

Datasets

  • MIMIC-IV-ECG
  • PTB-XL

Benchmarks

  • MEIT ECG report benchmark (4 tasks)