Slot-based Responsible Prompt Engine (RPE) for safer, explainable multimodal health digital twins

June 10, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

The approach is practically useful as a prompt-governance layer that improves safety and alignment without heavy retraining, but evidence relies on automated GPT-4 judging and prototype experiments.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 2/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 65%

Authors

Rahatara Ferdousi, M Anwar Hossain

Why It Matters For Business

RPE offers a low-cost way to make LLM-based health assistants safer, more explainable, and better aligned with user context. This can improve trust and reduce legal and ethical risk without full model retraining.

Summary TLDR

RHealthTwin is a modular framework for consumer-facing health "digital twins" that wraps an LLM with a Responsible Prompt Engine (RPE). RPE extracts structured slots (query, context, role, tone, filters, justification, examples) from multimodal inputs and builds system and user prompts that guide the LLM. In evaluations on four public datasets (mental health, clinical dialog, nutrition, wearable QA), RPE improved reference metrics (BLEU=0.41, ROUGE-L=0.63, BERTScore=0.89), scored high on factuality and context alignment (FS≈4.2/5, CAS≈4.1/5), and yielded strong ethical compliance (ICS>0.94, WRR>0.92), with GPT-4 serving as an automated judge. The system is a prototype: it helps reduce hallucination and enacts

Problem Statement

LLM-driven digital twins can aid everyday well-being but risk hallucination, bias, unclear reasoning, and unsafe advice. Existing digital-twin work focuses on clinical accuracy or simulation but lacks integrated ethical controls, multimodal grounding, and continuous personalization for consumer health. RHealthTwin aims to operationalize WHO ethical principles in a prompt governance layer that dynamically structures inputs, enforces safety filters, and grounds outputs for multimodal, consumer-facing use.

Main Contribution

RHealthTwin framework: a modular pipeline to build multimodal, personalized well-being digital twins with feedback-driven adaptation.

Responsible Prompt Engine (RPE): slot-based prompt construction (UQ, CP, J, ROLE, TONE, FILT, FE) that converts unstructured multimodal inputs into system + user prompts.
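The slot-to-prompt step could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `PromptSlots` class, the `build_prompts` function, the default role and tone strings, and the prompt wording are all hypothetical, and the mapping of abbreviations to meanings (UQ = query, CP = context, J = justification, FILT = filters, FE = examples) is an assumption based on the slot names listed in the summary.

```python
from dataclasses import dataclass, field


@dataclass
class PromptSlots:
    """Hypothetical container for the seven RPE slots (names assumed)."""
    uq: str                                   # UQ: user query
    cp: str = ""                              # CP: context or profile summary
    j: str = ""                               # J: justification / supporting evidence
    role: str = "a supportive well-being assistant"   # ROLE (placeholder default)
    tone: str = "calm and non-judgmental"              # TONE (placeholder default)
    filt: list[str] = field(default_factory=list)     # FILT: safety filters
    fe: list[str] = field(default_factory=list)       # FE: few-shot examples


def build_prompts(slots: PromptSlots) -> tuple[str, str]:
    """Assemble a (system, user) prompt pair from the filled slots."""
    # Role, tone, filters, and examples go into the system prompt.
    system_lines = [f"You are {slots.role}.", f"Respond in a {slots.tone} tone."]
    system_lines += [f"Constraint: {rule}" for rule in slots.filt]
    system_lines += [f"Example: {example}" for example in slots.fe]

    # Context, evidence, and the query itself go into the user prompt.
    user_lines = []
    if slots.cp:
        user_lines.append(f"Context: {slots.cp}")
    if slots.j:
        user_lines.append(f"Evidence: {slots.j}")
    user_lines.append(f"Question: {slots.uq}")
    return "\n".join(system_lines), "\n".join(user_lines)
```

Keeping the slots in one typed object makes it straightforward to log, diff, and validate them before any text reaches the model.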

Key Findings

RPE improves lexical and semantic reference scores on datasets with ground-truth responses.

Numbers: BLEU=0.41; ROUGE-L=0.63; BERTScore=0.89 (reported aggregate)

Practical Use: Use slot-based prompt engineering to boost alignment with human-written answers without model fine-tuning; this is a low-cost alternative to retraining.

Evidence Ref: Abstract; Section IV.F; Table VII

RPE greatly increases instruction-following and WHO-aligned ethical compliance versus baselines.

Numbers: ICS >0.94 and WRR >0.92 across datasets; MTS-Dialog: ICS=0.947 vs 0.816 (instruction-tuned)

Practical Use: Embed explicit role, tone, filter, and justification slots in prompts to enforce safety and tone constraints for consumer health assistants.

Evidence Ref: Abstract; Section IV.F (WRR/ICS results); Figure 10

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
BLEU | 0.41 | - | - | Reference-enabled datasets (MentalChat16k, MTS-Dialog) | Aggregate reference-based result reported for RPE | Abstract; Section IV.F; Table VII
ROUGE-L | 0.63 | - | - | Reference-enabled datasets | Aggregate reference-based result reported for RPE | Abstract; Section IV.F; Table VII
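For readers unfamiliar with ROUGE-L, the sketch below shows the general idea: it scores a candidate answer by the longest common subsequence (LCS) it shares with a reference. It assumes lowercase whitespace tokenization and a plain F1 combination of precision and recall; the tokenization, weighting, and tooling behind the reported figures are not stated here, so this is illustrative only. The function names are hypothetical.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as an F1 score over whitespace tokens (assumed variant)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because the score rewards shared word order rather than meaning, it is usually read alongside a semantic metric such as BERTScore.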

What To Try In 7 Days

Implement slot-based prompt templates (query, context, role, tone, filters) around your existing LLM API and compare outputs to current prompts.

Add a lightweight justification/RAG step that prepends 1–3 retrieved evidence snippets to reduce hallucinations.

Log prompt slots and user feedback to drive incremental slot-template updates and track ethical compliance.
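The evidence-prepending and logging steps above could be prototyped along these lines. This is a rough sketch under stated assumptions: the word-overlap ranking is a deliberately naive stand-in for a real retriever, and the function names (`top_snippets`, `with_evidence`, `log_record`) and the JSON record layout are hypothetical.

```python
import json
import time


def top_snippets(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Rank snippets by naive word overlap with the query (placeholder retriever)."""
    q_words = set(query.lower().split())
    scored = [(len(q_words & set(doc.lower().split())), doc) for doc in corpus]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Keep at most k snippets and drop any with zero overlap.
    return [doc for score, doc in ranked[:k] if score > 0]


def with_evidence(query: str, corpus: list[str], k: int = 3) -> str:
    """Prepend up to k numbered evidence snippets to the user question."""
    snippets = top_snippets(query, corpus, k)
    if not snippets:
        return f"Question: {query}"
    evidence = "\n".join(f"[{i}] {s}" for i, s in enumerate(snippets, start=1))
    return f"Evidence:\n{evidence}\n\nQuestion: {query}"


def log_record(slots: dict, feedback: str) -> str:
    """Serialize one interaction as a JSON line for later template review."""
    record = {"ts": time.time(), "slots": slots, "feedback": feedback}
    return json.dumps(record, sort_keys=True)
```

Appending each `log_record` line to a file gives a simple audit trail that can be replayed when comparing slot templates against current prompts.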

Agent Features

Memory
Session-level chat history with slot templates
Feedback-driven slot updates (short-term personalization)
Tool Use
Third-party API actions via Multimodal Agent (e.g., reminders, search)
Frameworks
Responsible Prompt Engine (RPE)
Multimodal RAG

Optimization Features

System Optimization
Template-driven prompt generation for reproducible behavior
Training Optimization
Instruction-time tuning via structured system instructions (no full fine-tune required)
Inference Optimization
Slot-based prompt construction to reduce ambiguous queries
Optional RAG to trade retrieval for hallucination reduction

Risks & Boundaries

Limitations

Evaluation relies heavily on GPT-4 as an automated judge, which risks evaluator bias and may not replace human clinical review.

RPE depends on predefined templates and slot rules, limiting flexibility with ambiguous or novel user inputs.

When Not To Use

High-stakes clinical diagnosis, emergency medicine, or any setting where regulated medical advice is required.

Systems requiring formal clinical validation or legal liability guarantees.

Failure Modes

Template extraction failures or missing slot values leading to incomplete prompts and unsafe outputs.

Users crafting inputs that bypass filters (adversarial or unconstrained text).
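One way to guard against the missing-slot failure mode is to fail closed: check the slots before building any prompt and decline to call the model when a required one is empty. The sketch below assumes a plain dictionary of slots; which slots count as required, and the names `REQUIRED_SLOTS`, `missing_slots`, and `safe_to_send`, are illustrative assumptions rather than part of RPE as described.

```python
# Assumed minimum set of slots that must be filled before calling the LLM.
REQUIRED_SLOTS = ("query", "role", "tone", "filters")


def missing_slots(slots: dict) -> list[str]:
    """Return the names of required slots that are absent or empty."""
    return [name for name in REQUIRED_SLOTS if not slots.get(name)]


def safe_to_send(slots: dict) -> bool:
    """Fail closed: allow prompt construction only when every required slot is filled."""
    return not missing_slots(slots)
```

A check like this does not address adversarial inputs that bypass filters; it only prevents incomplete prompts from reaching the model silently.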

Core Entities

Models

GPT-4, Gemini Flash 2.5, Gemini Pro, LLaMA 4, QwenV2, BioMistral-7B, Asclepius-7B, LLaMA3-8B-Instruct, Qwen2-7B-Instruct, Qwen-VL, Mistral-7B, GPT-3.5

Metrics

BLEU, ROUGE-L, BERTScore, Factuality Score (FS), Contextual Appropriateness Score (CAS), Instructional Compliance Score (ICS), WHO-aligned Responsibility Rubric (WRR)

Datasets

MentalChat16k, MTS-Dialog v3, NutriBench v2, SensorQA

Benchmarks

MentalChat16k, MTS-Dialog v3, NutriBench v2, SensorQA