Slot-based Responsible Prompt Engine (RPE) for safer, explainable multimodal health digital twins

Overview

Decision Snapshot: Needs Validation

The approach is practically useful as a prompt-governance layer that improves safety and alignment without heavy retraining, but evidence relies on automated GPT-4 judging and prototype experiments.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 2/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 65%

Authors

Rahatara Ferdousi, M Anwar Hossain

Links

Abstract / PDF / Code / Data

Why It Matters For Business

RPE offers a low-cost way to make LLM-based health assistants safer, more explainable, and better aligned with user context — improving trust and reducing legal/ethical risk without full model retraining.

Who Should Care

CTO, Product Manager, ML Engineer, Founder

Summary TLDR

RHealthTwin is a modular framework for consumer-facing health "digital twins" that wraps an LLM with a Responsible Prompt Engine (RPE). RPE extracts structured slots (query, context, role, tone, filters, justification, examples) from multimodal inputs and builds system + user prompts that guide the LLM. In evaluations on four public datasets (mental health, clinical dialog, nutrition, wearable QA), RPE improved reference metrics (BLEU=0.41, ROUGE-L=0.63, BERTScore=0.89), scored high on factuality/context alignment (FS≈4.2/5, CAS≈4.1/5), and yielded strong ethical compliance (ICS>0.94, WRR>0.92), with GPT-4 serving as an automated judge. The system is a prototype: it helps reduce hallucination and enacts

Problem Statement

LLM-driven digital twins can aid everyday well-being but risk hallucination, bias, unclear reasoning, and unsafe advice. Existing digital-twin work focuses on clinical accuracy or simulation but lacks integrated ethical controls, multimodal grounding, and continuous personalization for consumer health. RHealthTwin aims to operationalize WHO ethical principles in a prompt governance layer that dynamically structures inputs, enforces safety filters, and grounds outputs for multimodal, consumer-facing use.

Main Contribution

RHealthTwin framework: a modular pipeline to build multimodal, personalized well-being digital twins with feedback-driven adaptation.

Responsible Prompt Engine (RPE): slot-based prompt construction (UQ, CP, J, ROLE, TONE, FILT, FE) that converts unstructured multimodal inputs into system + user prompts.
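The slot-to-prompt step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Slots` class, the `build_prompts` function, the default values, and the prompt wording are all hypothetical, and the expansion of each abbreviation (for example CP as context, FE as examples) is inferred from the slot list in the summary.

```python
from dataclasses import dataclass, field


@dataclass
class Slots:
    """Hypothetical container for the seven RPE slots (names assumed)."""
    uq: str                                          # UQ: user query
    cp: str = ""                                     # CP: context (assumed meaning)
    j: str = ""                                      # J: justification / evidence
    role: str = "well-being assistant"               # ROLE
    tone: str = "supportive"                         # TONE
    filt: list[str] = field(default_factory=list)    # FILT: safety filters
    fe: list[str] = field(default_factory=list)      # FE: examples (assumed meaning)


def build_prompts(s: Slots) -> tuple[str, str]:
    """Assemble a (system, user) prompt pair from filled slots."""
    # Behavioral slots go into the system prompt.
    system_lines = [f"You are a {s.role}.", f"Respond in a {s.tone} tone."]
    system_lines += [f"Constraint: {rule}" for rule in s.filt]
    system_lines += [f"Example: {ex}" for ex in s.fe]

    # Content slots go into the user prompt; empty slots are skipped.
    user_lines = []
    if s.cp:
        user_lines.append(f"Context: {s.cp}")
    if s.j:
        user_lines.append(f"Evidence: {s.j}")
    user_lines.append(f"Question: {s.uq}")
    return "\n".join(system_lines), "\n".join(user_lines)


system_prompt, user_prompt = build_prompts(Slots(
    uq="Why am I so tired in the afternoon?",
    cp="Average sleep of 5.5 h over the last 7 nights",
    filt=["Do not diagnose.", "Suggest a clinician for persistent symptoms."],
))
print(system_prompt)
print(user_prompt)
```

Splitting behavioral slots (role, tone, filters) from content slots (query, context, evidence) is one plausible way to map slots onto the system + user pair; the paper may divide them differently.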

Key Findings

RPE improves lexical and semantic reference scores on datasets with ground-truth responses.

Numbers: BLEU=0.41; ROUGE-L=0.63; BERTScore=0.89 (reported aggregate)

Practical Use: Use slot-based prompt engineering to boost alignment with human-written answers without model fine-tuning; this is a low-cost alternative to retraining.

Evidence Ref: Abstract; Section IV.F; Table VII

RPE greatly increases instruction-following and WHO-aligned ethical compliance versus baselines.

Numbers: ICS >0.94 and WRR >0.92 across datasets; MTS-Dialog: ICS=0.947 vs 0.816 (instruction-tuned)

Practical Use: Embed explicit role, tone, filter, and justification slots in prompts to enforce safety and tone constraints for consumer health assistants.

Evidence Ref: Abstract; Section IV.F (WRR/ICS results); Figure 10
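For the one comparison that reports both values (ICS on MTS-Dialog), the gap can be worked out directly. The deltas below are derived here from the two reported scores and are not figures stated in the source; the variable names are illustrative.

```python
rpe_ics = 0.947       # MTS-Dialog ICS with RPE (reported)
baseline_ics = 0.816  # MTS-Dialog ICS, instruction-tuned baseline (reported)

abs_delta = rpe_ics - baseline_ics      # gain in ICS points
rel_delta = abs_delta / baseline_ics    # gain relative to the baseline

print(f"absolute delta: {abs_delta:.3f}")  # 0.131
print(f"relative delta: {rel_delta:.1%}")  # 16.1%
```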

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
BLEU	0.41	—	—	Reference-enabled datasets (MentalChat16k, MTS-Dialog)	Aggregate reference-based result reported for RPE	Abstract; Section IV.F; Table VII
ROUGE-L	0.63	—	—	Reference-enabled datasets	Aggregate reference-based result reported for RPE	Abstract; Section IV.F; Table VII

What To Try In 7 Days

Implement slot-based prompt templates (query, context, role, tone, filters) around your existing LLM API and compare outputs to current prompts.

Add a lightweight justification/RAG step that prepends 1–3 retrieved evidence snippets to reduce hallucinations.

Log prompt slots and user feedback to drive incremental slot-template updates and track ethical compliance.
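The justification step from the list above can be prototyped without any retrieval infrastructure. The sketch below assumes a toy word-overlap ranker standing in for a real retriever; `retrieve`, `with_evidence`, and the corpus snippets are hypothetical placeholders, not vetted health sources or part of the RHealthTwin code.

```python
def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Rank snippets by word overlap with the query (toy stand-in for a retriever)."""
    q_words = set(query.lower().split())
    scored = [(len(q_words & set(doc.lower().split())), doc) for doc in corpus]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Keep at most k snippets, and only those sharing at least one word.
    return [doc for score, doc in ranked[:k] if score > 0]


def with_evidence(query: str, corpus: list[str], k: int = 3) -> str:
    """Prepend up to k numbered evidence snippets to the user question."""
    snippets = retrieve(query, corpus, k)
    if not snippets:
        return f"Question: {query}"
    evidence = "\n".join(f"[{i}] {s}" for i, s in enumerate(snippets, 1))
    return f"Evidence:\n{evidence}\n\nQuestion: {query}"


# Placeholder snippets for illustration only.
corpus = [
    "Adults generally need seven or more hours of sleep per night.",
    "Caffeine late in the day can delay sleep onset.",
    "Regular walking supports cardiovascular health.",
]
prompt = with_evidence("how many hours of sleep do adults need", corpus)
print(prompt)
```

Numbering the snippets lets the system prompt ask the model to cite `[1]`, `[2]` in its answer, which makes unsupported claims easier to spot when reviewing logs.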

Agent Features

Memory

Session-level chat history with slot templates

Feedback-driven slot updates (short-term personalization)

Tool Use

Third-party API actions via Multimodal Agent (e.g., reminders, search)

Frameworks

Responsible Prompt Engine (RPE)

Multimodal RAG

Optimization Features

System Optimization

Template-driven prompt generation for reproducible behavior

Training Optimization

Instruction-time tuning via structured system instructions (no full fine-tune required)

Inference Optimization

Slot-based prompt construction to reduce ambiguous queries

Optional RAG, trading added retrieval cost for fewer hallucinations

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/turna1/ResponsibleHealthTwin-RHT

https://huggingface.co/spaces/Rahatara/WellebeingDT

Data URLs

https://github.com/ChiaPatricia/MentalChat16K_Main

https://github.com/abachaa/MTS-Dialog

https://huggingface.co/datasets/dongx1997/NutriBench

https://github.com/benjamin-reichman/SensorQA

Risks & Boundaries

Limitations

Evaluation relies heavily on GPT-4 as an automated judge, which risks evaluator bias and may not replace human clinical review.

RPE depends on predefined templates and slot rules, limiting flexibility with ambiguous or novel user inputs.

When Not To Use

High-stakes clinical diagnosis, emergency medicine, or where regulatory medical advice is required.

Systems requiring formal clinical validation or legal liability guarantees.

Failure Modes

Template extraction failures or missing slot values leading to incomplete prompts and unsafe outputs.

Users crafting inputs that bypass filters (adversarial or unconstrained text).
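The first failure mode, missing slot values, can be caught with a check that runs before any prompt is sent to the model. This is a sketch under stated assumptions: the choice of which slots are required, the `missing_slots` and `guard` helpers, and the fallback wording are all hypothetical and do not come from the paper. It does nothing against the second failure mode (inputs crafted to bypass filters).

```python
# Assumed minimum set of slots that must be filled before calling the LLM.
REQUIRED_SLOTS = ("UQ", "ROLE", "TONE", "FILT")

FALLBACK = "I could not build a safe request from that input. Could you rephrase it?"


def is_empty(value: object) -> bool:
    """Treat None, empty containers, and whitespace-only strings as empty."""
    if isinstance(value, str):
        return not value.strip()
    return not value


def missing_slots(slots: dict[str, object]) -> list[str]:
    """Return the required slot names that are absent or empty."""
    return [name for name in REQUIRED_SLOTS if is_empty(slots.get(name))]


def guard(slots: dict[str, object]) -> tuple[bool, str]:
    """Return (ok, message); when ok is False the LLM call should be skipped."""
    missing = missing_slots(slots)
    if missing:
        return False, f"{FALLBACK} (missing: {', '.join(missing)})"
    return True, "ok"


ok, message = guard({"UQ": "How much water should I drink?", "ROLE": "coach",
                     "TONE": "  ", "FILT": []})
print(ok, message)
```

Failing closed, that is, returning a fallback message instead of sending an incomplete prompt, keeps an extraction error from silently dropping the safety filters.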

Core Entities

Models

GPT-4, Gemini Flash 2.5, Gemini Pro, LLaMA 4, QwenV2, BioMistral-7B, Asclepius-7B, LLaMA3-8B-Instruct, Qwen2-7B-Instruct, Qwen-VL, Mistral-7B, GPT-3.5

Metrics

BLEU, ROUGE-L, BERTScore, Factuality Score (FS), Contextual Appropriateness Score (CAS), Instructional Compliance Score (ICS), WHO-aligned Responsibility Rubric (WRR)

Datasets

MentalChat16k, MTS-Dialog v3, NutriBench v2, SensorQA

Benchmarks

MentalChat16k, MTS-Dialog v3, NutriBench v2, SensorQA

