SimSUM: 10K simulated EHR encounters linking tabular features and LLM-written clinical notes for multimodal clinical information extraction (CIE)

Overview

Decision Snapshot: Needs Validation

SimSUM is a useful research playground with high internal validity; do not use its models or outputs for clinical decision-making.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 1/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 30%

Production readiness: 20%

Novelty: 60%

Authors

Paloma Rabaey, Stefan Heytens, Thomas Demeester

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SimSUM provides a safe, fast playground to build and test multimodal clinical extraction methods without patient data; integrating text with tabular EHR features improves extraction accuracy for subtle symptoms.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

SimSUM is a synthetic benchmark of 10,000 simulated primary-care patient encounters for respiratory disease. Each record pairs 16 expert-defined tabular variables (sampled from a Bayesian network) with two LLM-generated clinical notes (normal and compact) and span-level symptom annotations. The authors validate note quality with 5 GPs, report high LLM consistency and clinical accuracy, provide automated symptom-span extraction (precision ≈94–94.5%, recall ≈94.5–98.8%), and release baseline symptom predictors showing text greatly boosts extraction F1 versus tabular-only baselines. SimSUM is meant for research and prototyping, not for training production clinical models.
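The span-level precision and recall figures above can be illustrated with a minimal sketch. It assumes exact-match scoring over (start, end, label) spans, which may differ from the matching criterion the authors used; the `Span` type, the `span_prf` helper, and the toy spans are hypothetical and are not taken from the SimSUM repository.

```python
from typing import Set, Tuple

# Hypothetical span representation: (start_char, end_char, symptom_label).
Span = Tuple[int, int, str]


def span_prf(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over annotated spans."""
    true_pos = len(gold & predicted)  # spans that agree on offsets and label
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example (invented offsets, not SimSUM data).
gold = {(0, 5, "cough"), (20, 25, "fever"), (40, 48, "dyspnea")}
pred = {(0, 5, "cough"), (20, 25, "fever"), (60, 64, "pain")}

p, r, f = span_prf(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Two of the three predicted spans match a gold span, so precision, recall and F1 all come out at 2/3 in this toy case. A more lenient variant could count overlapping spans as matches instead of requiring identical offsets.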

Problem Statement

Open EHR datasets rarely provide explicit, controllable links between encoded tabular features and concepts in clinical notes. That makes it hard to study methods that combine domain knowledge and text for clinical information extraction. SimSUM fills this gap with a controlled simulated dataset that links structured variables and textual symptom mentions via an expert-defined Bayesian network.

Main Contribution

A public simulated dataset (SimSUM) of 10,000 single-visit EHR records pairing 16 tabular features with two LLM-generated clinical notes per record.

A reproducible generation pipeline: expert-defined Bayesian network + GPT-4o prompts producing normal and compact notes.
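The general shape of such a pipeline can be sketched as ancestral sampling from a Bayesian network followed by prompt construction. Everything below is an illustrative assumption: the three-variable network, its probabilities, the variable names, and the prompt wording are invented for this sketch and do not reproduce the authors' 16-variable network or their released GPT-4o prompts.

```python
import random
from typing import Dict

# Hypothetical toy network: pneumonia -> fever, pneumonia -> cough.
# All probabilities are made up for illustration.
P_PNEUMONIA = 0.05
P_FEVER_GIVEN = {True: 0.80, False: 0.10}
P_COUGH_GIVEN = {True: 0.90, False: 0.20}


def sample_record(rng: random.Random) -> Dict[str, bool]:
    """Ancestral sampling: draw the parent first, then each child given it."""
    pneumonia = rng.random() < P_PNEUMONIA
    fever = rng.random() < P_FEVER_GIVEN[pneumonia]
    cough = rng.random() < P_COUGH_GIVEN[pneumonia]
    return {"pneumonia": pneumonia, "fever": fever, "cough": cough}


def build_prompt(record: Dict[str, bool]) -> str:
    """Turn a sampled record into a note-writing instruction (invented wording)."""
    symptoms = ("fever", "cough")
    present = [name for name in symptoms if record[name]]
    absent = [name for name in symptoms if not record[name]]
    return (
        "Write a short primary-care note. "
        f"Symptoms present: {', '.join(present) or 'none'}. "
        f"Symptoms absent: {', '.join(absent) or 'none'}."
    )


rng = random.Random(0)  # fixed seed so the sample is repeatable
records = [sample_record(rng) for _ in range(1000)]
print(build_prompt(records[0]))
```

Because the note is written from the sampled record, the tabular values double as labels for what the text should mention, which is the property that makes the text and table alignable for extraction tasks.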

Key Findings

Dataset size and design

Numbers: 10,000 records; 16 tabular features per record

Practical Use: You can prototype multimodal methods at scale without using real patient data.

Evidence Ref: Section 2; Fig. 3

LLM note consistency and clinical accuracy

Numbers: Consistency mean 4.69/5; Clinical accuracy mean 4.92/5

Practical Use: LLM-generated notes reliably reflect the tabular prompt, so text and table labels are meaningfully aligned for extraction tasks.

Evidence Ref: Section 3.1; Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Expert evaluation - consistency (normal notes)	4.69 / 5 mean	n/a	n/a	30 notes, 5 GPs	Table 2; Section 3.1	Table 2
Expert evaluation - clinical accuracy	4.92 / 5 mean	n/a	n/a	30 notes, 5 GPs	Table 2; Section 3.1	Table 2

What To Try In 7 Days

Download SimSUM and run the provided baseline notebook from the GitHub repo.

Reproduce the neural-text baseline and compare F1 with/without tabular features.

Use the released prompts to generate compact notes and test robustness of your extractor.
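For the with/without-tabular comparison, a small scoring helper is enough to get started. The sketch below assumes one binary label per symptom per record and macro-averages the positive-class F1 across symptoms; the function names and the toy predictions are hypothetical and are not results from the paper or outputs of the released baselines.

```python
from typing import Dict, List


def binary_f1(y_true: List[int], y_pred: List[int]) -> float:
    """F1 of the positive class for one symptom."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold: Dict[str, List[int]], pred: Dict[str, List[int]]) -> float:
    """Unweighted mean of per-symptom F1 scores."""
    return sum(binary_f1(gold[s], pred[s]) for s in gold) / len(gold)


# Toy labels for four records (invented, not SimSUM data).
gold = {"fever": [1, 0, 1, 0], "cough": [1, 1, 0, 0]}
text_only = {"fever": [1, 0, 0, 0], "cough": [1, 1, 0, 1]}
text_plus_tabular = {"fever": [1, 0, 1, 0], "cough": [1, 1, 0, 1]}

f1_text = macro_f1(gold, text_only)
f1_both = macro_f1(gold, text_plus_tabular)
print(f"text only: {f1_text:.3f}, text + tabular: {f1_both:.3f}, "
      f"delta: {f1_both - f1_text:+.3f}")
```

Reporting the delta alongside both scores makes it easy to see whether the tabular features help on the symptoms you care about; the same helper can be applied separately to normal and compact notes.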

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/prabaey/SimSUM

Data URLs

https://github.com/prabaey/SimSUM

Risks & Boundaries

Limitations

Fully synthetic: the tabular data come from a single expert-defined Bayesian network and the notes are generated by GPT-4o, so the dataset does not capture the variability of real EHR data.

Not suitable for training production clinical models or for regulatory use.

When Not To Use

To train models intended for live clinical deployment or patient care.

To evaluate models requiring real-world EHR noise, billing artifacts, or time series.

Failure Modes

The LLM may invent extra symptoms or irrelevant tests that are absent from the tabular prompt, violating the prompt constraints.

Compact (abbreviated) notes are less readable and yield lower extraction recall.

Core Entities

Models

GPT-4o, BioLORD-2023, XGBoost, Bayesian network (expert-defined)

Metrics

F1, precision, recall, macro F1

Datasets

SimSUM, MIMIC-III, MIMIC-IV, BioDEX, TCGA-Reports

Benchmarks

SimSUM

Context Entities

Models

Clinical sentence embeddings (BioLORD)

