Overview
SimSUM is a useful research playground with high internal validity; do not use its models or outputs for clinical decision-making.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.82
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 1/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 30%
Production readiness: 20%
Novelty: 60%
Why It Matters For Business
SimSUM provides a safe, fast playground to build and test multimodal clinical extraction methods without patient data; integrating text with tabular EHR features improves extraction accuracy for subtle symptoms.
Summary TLDR
SimSUM is a synthetic benchmark of 10,000 simulated primary-care patient encounters for respiratory disease. Each record pairs 16 expert-defined tabular variables (sampled from a Bayesian network) with two LLM-generated clinical notes (normal and compact) and span-level symptom annotations. The authors validate note quality with 5 GPs, report high LLM consistency and clinical accuracy, provide automated symptom-span extraction (precision ≈94–94.5%, recall ≈94.5–98.8%), and release baseline symptom predictors showing text greatly boosts extraction F1 versus tabular-only baselines. SimSUM is meant for research and prototyping, not for training production clinical models.
Problem Statement
Open EHR datasets rarely provide explicit, controllable links between encoded tabular features and concepts in clinical notes. That makes it hard to study methods that combine domain knowledge and text for clinical information extraction. SimSUM fills this gap with a controlled simulated dataset that links structured variables and textual symptom mentions via an expert-defined Bayesian network.
Main Contribution
A public simulated dataset (SimSUM) of 10,000 single-visit EHR records pairing 16 tabular features with two LLM-generated clinical notes per record.
A reproducible generation pipeline: expert-defined Bayesian network + GPT-4o prompts producing normal and compact notes.
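The generation pipeline described above (sample tabular variables from a Bayesian network, then prompt an LLM for a note) can be sketched roughly as follows. This is a minimal illustration only: the three-node network, its probabilities, and the prompt wording are all hypothetical and are not the SimSUM expert-defined network or the released GPT-4o prompts. The LLM call itself is left out.

```python
import random

# Hypothetical toy network: NOT the SimSUM expert-defined network.
# Each node maps to (parents, CPT); the CPT gives P(node=True | parent values).
TOY_NETWORK = {
    "pneumonia": ((), {(): 0.05}),
    "cough": (("pneumonia",), {(True,): 0.90, (False,): 0.20}),
    "fever": (("pneumonia",), {(True,): 0.80, (False,): 0.05}),
}

def sample_record(network, rng):
    """Ancestral sampling: visit nodes in order and sample each given its parents."""
    record = {}
    # Assumption: the dict lists parents before children (topological order).
    for node, (parents, cpt) in network.items():
        parent_values = tuple(record[p] for p in parents)
        record[node] = rng.random() < cpt[parent_values]
    return record

def build_prompt(record, style="normal"):
    """Turn a sampled record into a note-generation prompt (illustrative wording)."""
    present = [name for name, value in record.items() if value]
    absent = [name for name, value in record.items() if not value]
    return (
        f"Write a {style} primary-care clinical note. "
        f"Present: {', '.join(present) or 'none'}. "
        f"Absent: {', '.join(absent) or 'none'}. "
        "Do not mention findings that are not listed."
    )

rng = random.Random(0)  # fixed seed so the sketch is repeatable
records = [sample_record(TOY_NETWORK, rng) for _ in range(1000)]
print(build_prompt(records[0], style="compact"))
```

Because every note is generated from a known record, the tabular values double as ground truth for what the text should and should not mention.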
Key Findings
Dataset size and design
LLM note consistency and clinical accuracy
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Expert evaluation - consistency (normal notes) | 4.69 / 5 mean | — | — | 30 notes, 5 GPs | Table 2; Section 3.1 | Table 2 |
| Expert evaluation - clinical accuracy | 4.92 / 5 mean | — | — | 30 notes, 5 GPs | Table 2; Section 3.1 | Table 2 |
What To Try In 7 Days
Download SimSUM and run the provided baseline notebook from the GitHub repo.
Reproduce the neural-text baseline and compare F1 with/without tabular features.
Use the released prompts to generate compact notes and test robustness of your extractor.
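For the F1 comparison step, a small scoring helper may be enough to get started. The sketch below is not the repository's evaluation code; the function name and the six made-up labels are assumptions for illustration, standing in for one symptom's gold labels and the predictions of two hypothetical models.

```python
def f1_score(gold, predicted):
    """Binary F1 for parallel lists of booleans; 0.0 when there are no true positives."""
    tp = sum(g and p for g, p in zip(gold, predicted))
    fp = sum((not g) and p for g, p in zip(gold, predicted))
    fn = sum(g and (not p) for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels for one symptom across six encounters (not SimSUM data).
gold            = [True, True, False, True, False, False]
text_only       = [True, False, False, True, True, False]
text_plus_table = [True, True, False, True, True, False]

for name, preds in [("text only", text_only), ("text + tabular", text_plus_table)]:
    print(f"{name}: F1 = {f1_score(gold, preds):.3f}")
# text only: F1 = 0.667
# text + tabular: F1 = 0.857
```

Scoring each symptom separately, rather than pooling all of them, makes it easier to see which symptoms benefit from the added tabular features.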
Risks & Boundaries
Limitations
Fully synthetic: the tabular data come from a single expert-defined Bayesian network and the notes are generated by GPT-4o, so its patterns do not reflect the variability of real EHR data.
Not suitable for training production clinical models or for regulatory use.
When Not To Use
To train models intended for live clinical deployment or patient care.
To evaluate models requiring real-world EHR noise, billing artifacts, or time series.
Failure Modes
LLM may invent extra symptoms or irrelevant tests, violating prompt constraints.
Compact (abbreviated) notes are harder to read and yield lower extraction recall.
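One way to screen for the first failure mode is to compare the symptoms a note asserts against the symptoms marked present in its tabular record. The sketch below is a rough heuristic under stated assumptions: the lexicon, the negation cues, and the field names are hypothetical, and the negation check only looks at the word immediately before a term, so a phrase such as "no cough or fever" would still flag fever.

```python
import re

# Hypothetical lexicon of surface forms per symptom; not part of SimSUM.
LEXICON = {
    "fever": ["fever", "febrile"],
    "cough": ["cough"],
    "dyspnea": ["dyspnea", "shortness of breath"],
}
# Negation cue that must sit directly before the term (a crude assumption).
NEGATION_BEFORE = re.compile(r"\b(?:no|denies|without)\s+$")

def asserted_symptoms(note):
    """Symptoms mentioned in the note without an immediately preceding negation cue."""
    text = note.lower()
    found = set()
    for symptom, terms in LEXICON.items():
        for term in terms:
            for match in re.finditer(r"\b" + re.escape(term), text):
                if not NEGATION_BEFORE.search(text[:match.start()]):
                    found.add(symptom)
    return found

def invented_symptoms(note, record):
    """Symptoms the note asserts that the tabular record does not mark as present."""
    present = {name for name, value in record.items() if value}
    return asserted_symptoms(note) - present

note = "Patient reports cough and shortness of breath. No fever."
record = {"fever": False, "cough": True, "dyspnea": False}
print(sorted(invented_symptoms(note, record)))  # ['dyspnea']
```

Flagged notes would still need manual review, since a keyword match cannot tell an invented symptom from a paraphrase the lexicon does not cover.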

