SimSUM: 10K simulated EHR encounters linking tabular features and LLM-written clinical notes for multimodal clinical information extraction (CIE)

September 13, 2024 · 8 min

Overview

Production Readiness

0.2

Novelty Score

0.6

Cost Impact Score

0.3

Citation Count

1

Authors

Paloma Rabaey, Stefan Heytens, Thomas Demeester

Why It Matters For Business

SimSUM provides a safe, fast playground to build and test multimodal clinical extraction methods without patient data; integrating text with tabular EHR features improves extraction accuracy for subtle symptoms.

Summary TLDR

SimSUM is a synthetic benchmark of 10,000 simulated primary-care patient encounters for respiratory disease. Each record pairs 16 expert-defined tabular variables (sampled from a Bayesian network) with two LLM-generated clinical notes (normal and compact) and span-level symptom annotations. The authors validate note quality with 5 GPs, report high LLM consistency and clinical accuracy, provide automated symptom-span extraction (precision ≈93.6–94.1%, recall ≈94.5–98.8%), and release baseline symptom predictors showing that text greatly boosts extraction F1 over tabular-only baselines. SimSUM is meant for research and prototyping, not for training production clinical models.

Problem Statement

Open EHR datasets rarely provide explicit, controllable links between encoded tabular features and concepts in clinical notes. That makes it hard to study methods that combine domain knowledge and text for clinical information extraction. SimSUM fills this gap with a controlled simulated dataset that links structured variables and textual symptom mentions via an expert-defined Bayesian network.

Main Contribution

A public simulated dataset (SimSUM) of 10,000 single-visit EHR records pairing 16 tabular features with two LLM-generated clinical notes per record.

A reproducible generation pipeline: expert-defined Bayesian network + GPT-4o prompts producing normal and compact notes.

Automatic span-level annotation of five symptoms (dyspnea, cough, pain, fever, nasal) using LLM-based extraction, plus human spot checks.

An expert evaluation (5 GPs, 30 notes) of consistency, realism and clinical accuracy, plus baseline models (BN, XGBoost, neural text) with reported F1 scores.
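The generation pipeline listed above (a Bayesian network for the tabular variables, then a prompt that turns each sampled record into a note) can be illustrated with a minimal sketch. The four-variable network, its probabilities, and the prompt wording below are all invented for illustration; they are not SimSUM's 16-variable expert network or the released GPT-4o prompts, and the step that sends the prompt to an LLM is left out.

```python
import random

# Hypothetical toy network: pneumonia -> fever, pneumonia -> cough, cold -> cough.
# Structure and probabilities are invented for illustration; they are NOT SimSUM's.
P_PNEUMONIA = 0.05
P_COLD = 0.30
P_FEVER_GIVEN = {True: 0.80, False: 0.05}  # P(fever | pneumonia)
P_COUGH_GIVEN = {                          # P(cough | pneumonia, cold)
    (True, True): 0.95,
    (True, False): 0.90,
    (False, True): 0.60,
    (False, False): 0.05,
}


def sample_encounter(rng):
    """Ancestral sampling: draw parent variables first, then children given parents."""
    pneumonia = rng.random() < P_PNEUMONIA
    cold = rng.random() < P_COLD
    fever = rng.random() < P_FEVER_GIVEN[pneumonia]
    cough = rng.random() < P_COUGH_GIVEN[(pneumonia, cold)]
    return {"pneumonia": pneumonia, "cold": cold, "fever": fever, "cough": cough}


def build_prompt(record, style="normal"):
    """Turn a sampled record into a note-writing prompt (wording is hypothetical)."""
    symptoms = ("fever", "cough")
    present = [s for s in symptoms if record[s]]
    absent = [s for s in symptoms if not record[s]]
    return (
        f"Write a {style} primary-care clinical note. "
        f"Symptoms present: {', '.join(present) or 'none'}. "
        f"Symptoms absent: {', '.join(absent) or 'none'}. "
        "Do not mention symptoms that are not listed."
    )


rng = random.Random(0)  # fixed seed so the sampled table is reproducible
records = [sample_encounter(rng) for _ in range(1000)]
prompt = build_prompt(records[0], style="compact")
print(prompt)
```

Because every note is generated from a known record, the link between a tabular value and a symptom mention in the text is known by construction, which is the property the dataset is built around.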

Key Findings

Dataset size and design

Numbers: 10,000 records; 16 tabular features per record

LLM note consistency and clinical accuracy

Numbers: Consistency mean 4.69/5; Clinical accuracy mean 4.92/5

Automated span extraction quality (normal notes)

Numbers: precision 94.06%; recall 98.78%

Automated span extraction quality (compact notes)

Numbers: precision 93.55%; recall 94.50%

Text strongly improves symptom extraction

Numbers: Neural-text F1 ≈0.96 (dyspnea/cough/nasal) vs BN-tab F1 ≈0.67–0.78

LLM consistency check and regeneration

Numbers: 52 of 10,000 notes flagged inconsistent and regenerated
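To make the span extraction precision and recall figures above concrete, here is a minimal sketch of span-level scoring. It assumes exact-match comparison of `(note_id, start, end, symptom)` tuples; this summary does not state the matching criterion the authors used, and the example spans are invented.

```python
def span_precision_recall(gold, predicted):
    """Exact-match span scoring over (note_id, start, end, symptom) tuples.

    Exact matching is an assumption made for this sketch; overlap-based
    matching would give different (usually higher) scores.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    return precision, recall


# Invented example spans: two predictions match the gold set, one is spurious,
# and one gold span is missed.
gold = [(1, 10, 25, "cough"), (1, 40, 52, "fever"), (2, 5, 18, "dyspnea")]
predicted = [(1, 10, 25, "cough"), (2, 5, 18, "dyspnea"), (2, 30, 41, "pain")]

precision, recall = span_precision_recall(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Precision penalizes spurious spans and recall penalizes missed ones, so the lower recall reported for compact notes means more annotated mentions were missed there.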

Results

Expert evaluation - consistency (normal notes)

Value: 4.69 / 5 mean

Expert evaluation - clinical accuracy

Value: 4.92 / 5 mean

Span extraction (normal notes)

Value: precision 94.06%; recall 98.78%

Span extraction (compact notes)

Value: precision 93.55%; recall 94.50%

Symptom extraction F1 (neural-text, normal notes)

Value: dyspnea 0.9617; cough 0.9603; nasal 0.9628; pain 0.8143; fever 0.9096

Baseline: BN-tab (all): dyspnea 0.737; cough 0.7816; nasal 0.7146; pain 0.2386; fever 0.4864
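The per-symptom F1 values above can be compared directly. The sketch below uses only the numbers listed in this section; the per-symptom gaps and the unweighted averages are derived here for illustration and are not figures reported by the authors.

```python
# F1 values copied from the Results section above.
neural_text = {"dyspnea": 0.9617, "cough": 0.9603, "nasal": 0.9628,
               "pain": 0.8143, "fever": 0.9096}
bn_tab = {"dyspnea": 0.737, "cough": 0.7816, "nasal": 0.7146,
          "pain": 0.2386, "fever": 0.4864}

# Absolute F1 gain of the text model over the tabular-only baseline, per symptom.
gaps = {s: round(neural_text[s] - bn_tab[s], 4) for s in neural_text}

# Unweighted mean over the five symptoms (derived here, not a reported metric).
mean_text = sum(neural_text.values()) / len(neural_text)
mean_tab = sum(bn_tab.values()) / len(bn_tab)

largest_gap = max(gaps, key=gaps.get)
for symptom, gap in sorted(gaps.items(), key=lambda kv: -kv[1]):
    print(f"{symptom:8s} +{gap:.4f}")
print(f"mean F1: text={mean_text:.3f} tabular={mean_tab:.3f}; largest gap: {largest_gap}")
```

On these numbers the gap is widest for pain and fever, the two symptoms where the tabular-only baseline is weakest.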

What To Try In 7 Days

Download SimSUM and run the provided baseline notebook from the GitHub repo.

Reproduce the neural-text baseline and compare F1 with/without tabular features.

Use the released prompts to generate compact notes and test robustness of your extractor.
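As a starting point for the robustness test in the last step, the sketch below runs one extractor over a normal-style and a compact-style rendering of the same encounter and compares recall. The keyword extractor, the keyword lists, and both example notes are hypothetical stand-ins; a real check would use your own extractor on notes from the dataset.

```python
import re

# Hypothetical keyword lists; a real extractor would be a trained model.
KEYWORDS = {
    "cough": ["cough"],
    "fever": ["fever", "febrile"],
    "dyspnea": ["dyspnea", "shortness of breath"],
}


def extract_symptoms(note):
    """Return the symptoms whose keywords appear as whole words in the note.

    Deliberately naive: no negation handling and no abbreviation expansion.
    """
    text = note.lower()
    return {
        symptom
        for symptom, keywords in KEYWORDS.items()
        if any(re.search(r"\b" + re.escape(k) + r"\b", text) for k in keywords)
    }


def recall(found, gold):
    """Fraction of gold symptoms that the extractor recovered."""
    return len(found & gold) / len(gold) if gold else 1.0


# Invented notes describing the same encounter in two styles.
gold = {"cough", "dyspnea", "fever"}
normal_note = ("Patient reports a dry cough and shortness of breath "
               "for three days, with fever since yesterday.")
compact_note = "3d dry cough + SOB, T 38.5 since yest."

normal_recall = recall(extract_symptoms(normal_note), gold)
compact_recall = recall(extract_symptoms(compact_note), gold)
print(f"normal recall={normal_recall:.2f} compact recall={compact_recall:.2f}")
```

In this toy case the extractor misses the abbreviated dyspnea mention and the temperature reading in the compact note, which is the kind of degradation the compact notes are meant to expose.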

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Fully synthetic: the tabular data come from a single expert-defined BN and the notes are generated by GPT-4o, so the dataset's patterns do not capture real EHR variability.
  • Not suitable for training production clinical models or for regulatory use.
  • Single-visit snapshots only; no longitudinal or temporal EHR dynamics.
  • Compact notes use abbreviations that reduce extraction reliability and readability.
  • Span annotations rely on GPT-4o extraction; reproducing their quality depends on having access to GPT-4o or a comparably strong LLM.

When Not To Use

  • To train models intended for live clinical deployment or patient care.
  • To evaluate models requiring real-world EHR noise, billing artifacts, or time series.
  • When regulatory-grade performance on patient-facing tasks is required.

Failure Modes

  • LLM may invent extra symptoms or irrelevant tests, violating prompt constraints.
  • Compact/abbreviated notes produce lower extraction recall and readability.
  • BN parameter choices reflect a single expert and local practice; may not generalize.
  • Span extraction quality drops if using weaker local LLMs instead of GPT-4o.

Core Entities

Models

  • GPT-4o
  • BioLORD-2023
  • XGBoost
  • Bayesian network (expert-defined)

Metrics

  • F1
  • precision
  • recall
  • macro F1

Datasets

  • SimSUM
  • MIMIC-III
  • MIMIC-IV
  • BioDEX
  • TCGA-Reports

Benchmarks

  • SimSUM

Context Entities

Models

  • Clinical sentence embeddings (BioLORD)