Slot-based Responsible Prompt Engine (RPE) for safer, explainable multimodal health digital twins

June 10, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.45

Citation Count

0

Authors

Rahatara Ferdousi, M Anwar Hossain

Why It Matters For Business

RPE offers a low-cost way to make LLM-based health assistants safer, more explainable, and better aligned with user context. This improves trust and reduces legal and ethical risk without full model retraining.

Summary TLDR

RHealthTwin is a modular framework for consumer-facing health "digital twins" that wraps an LLM with a Responsible Prompt Engine (RPE). RPE extracts structured slots (query, context, role, tone, filters, justification, examples) from multimodal inputs and builds the system and user prompts that guide the LLM. In evaluations on four public datasets (mental health, clinical dialog, nutrition, wearable QA), RPE improved reference metrics (BLEU=0.41, ROUGE-L=0.63, BERTScore=0.89), scored high on factuality and context alignment (FS≈4.2/5, CAS≈4.1/5), and yielded strong ethical compliance (ICS>0.94, WRR>0.92), with GPT-4 as an automated judge. The system is a prototype: it helps reduce hallucination and enacts

Problem Statement

LLM-driven digital twins can aid everyday well-being but risk hallucination, bias, unclear reasoning, and unsafe advice. Existing digital-twin work focuses on clinical accuracy or simulation but lacks integrated ethical controls, multimodal grounding, and continuous personalization for consumer health. RHealthTwin aims to operationalize WHO ethical principles in a prompt governance layer that dynamically structures inputs, enforces safety filters, and grounds outputs for multimodal, consumer-facing use.

Main Contribution

RHealthTwin framework: a modular pipeline to build multimodal, personalized well-being digital twins with feedback-driven adaptation.

Responsible Prompt Engine (RPE): slot-based prompt construction that converts unstructured multimodal inputs into system and user prompts. The slots are UQ (query), CP (context), J (justification), ROLE, TONE, FILT (filters), and FE (examples).

Operational evaluation: automated, model-agnostic tests on four public datasets (MentalChat16k, MTS-Dialog v3, NutriBench v2, SensorQA), scored with both reference metrics and GPT-4 as an automated judge.

Practical algorithms: concrete procedures for slot extraction, template wrapping, optional multimodal RAG, and a feedback-to-slot update loop.
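The slot-to-prompt step can be pictured with a minimal sketch. Everything here is an assumption for illustration: the `PromptSlots` class, the `build_prompts` function, the default slot values, and the choice of which slots go into the system prompt versus the user prompt are hypothetical, not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class PromptSlots:
    """Hypothetical container for the seven RPE slots (names are assumptions)."""
    uq: str                                         # UQ: user query
    cp: str = ""                                    # CP: context
    j: str = ""                                     # J: justification / evidence
    role: str = "a supportive well-being assistant" # ROLE
    tone: str = "calm and non-judgmental"           # TONE
    filt: list[str] = field(default_factory=list)   # FILT: safety filters
    fe: list[str] = field(default_factory=list)     # FE: examples


def build_prompts(slots: PromptSlots) -> tuple[str, str]:
    """Wrap the slots into a (system_prompt, user_prompt) pair."""
    # Assumed split: behavioural slots go to the system prompt.
    system_lines = [f"You are {slots.role}.", f"Respond in a {slots.tone} tone."]
    system_lines += [f"Rule: {rule}" for rule in slots.filt]
    system_lines += [f"Example: {ex}" for ex in slots.fe]

    # Assumed split: query-specific slots go to the user prompt.
    user_lines = []
    if slots.cp:
        user_lines.append(f"Context: {slots.cp}")
    if slots.j:
        user_lines.append(f"Evidence: {slots.j}")
    user_lines.append(f"Question: {slots.uq}")
    return "\n".join(system_lines), "\n".join(user_lines)


slots = PromptSlots(
    uq="Is it okay to skip breakfast?",
    cp="Adult user, average 6,000 steps per day.",
    filt=["Do not diagnose.", "Suggest seeing a clinician for persistent symptoms."],
)
system_prompt, user_prompt = build_prompts(slots)
print(system_prompt)
print(user_prompt)
```

The two strings can then be passed as the system and user messages of whichever chat-style LLM API is already in use. Empty slots are simply skipped, so the same template works with or without context and evidence.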

Key Findings

RPE improves lexical and semantic reference scores on datasets with ground-truth responses.

Numbers: BLEU=0.41; ROUGE-L=0.63; BERTScore=0.89 (reported aggregate)

RPE greatly increases instruction-following and WHO-aligned ethical compliance versus baselines.

Numbers: ICS >0.94 and WRR >0.92 across datasets; MTS-Dialog: ICS=0.947 vs 0.816 (instruction-tuned)

RPE improves perceived factuality and context grounding under GPT-4 evaluation.

Numbers: Factuality Score (FS) ≈4.2/5; Contextual Appropriateness Score (CAS) ≈4.1/5
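For readers unfamiliar with the reference metrics, ROUGE-L scores a response by the longest common subsequence (LCS) of tokens it shares with the reference. The sketch below is a plain-Python illustration under assumptions: it uses whitespace tokenization and an unweighted F1, and the function names are hypothetical. The summary does not state which ROUGE-L variant or tokenizer the paper used.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Unweighted ROUGE-L F1 over lowercase whitespace tokens (assumed variant)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1(
    "drink water before meals",
    "drink more water before your meals",
)
print(round(score, 2))  # LCS is 4 tokens: precision 4/4, recall 4/6, F1 0.8
```

Because the LCS respects word order but allows gaps, a response that keeps the reference's key terms in sequence scores well even when it drops filler words.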

Results

BLEU

Value: 0.41

ROUGE-L

Value: 0.63

BERTScore

Value: 0.89

Instructional Compliance Score (ICS)

Value: >0.94

Baseline: instruction-tuned, 0.816 (MTS-Dialog example)

WHO-aligned Responsibility Rubric (WRR)

Value: >0.92

Baseline: instruction-tuned, 0.775 (MTS-Dialog example)

Factuality Score (FS)

Value: ≈4.2/5

Contextual Appropriateness Score (CAS)

Value: ≈4.1/5

Who Should Care

What To Try In 7 Days

Implement slot-based prompt templates (query, context, role, tone, filters) around your existing LLM API and compare outputs to current prompts.

Add a lightweight justification/RAG step that prepends 1–3 retrieved evidence snippets to reduce hallucinations.

Log prompt slots and user feedback to drive incremental slot-template updates and track ethical compliance.
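The evidence-prepending step can be tried without any retrieval infrastructure. The sketch below is a stand-in under stated assumptions: `retrieve` ranks snippets by simple word overlap rather than a real vector search, the function names are hypothetical, and the corpus entries are illustrative placeholders, not vetted health evidence.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Rank snippets by word overlap with the query; keep up to k that overlap."""
    q = tokenize(query)
    scored = [(len(q & tokenize(doc)), doc) for doc in corpus]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def prepend_evidence(query: str, snippets: list[str]) -> str:
    """Build a user prompt with numbered evidence placed before the question."""
    if not snippets:
        return f"No supporting evidence was retrieved; say so if unsure.\n\nQuestion: {query}"
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(snippets, start=1))
    return f"Evidence:\n{numbered}\n\nQuestion: {query}"


# Placeholder snippets for illustration only.
corpus = [
    "Adults are generally advised to get at least 7 hours of sleep per night.",
    "Caffeine late in the day can delay sleep onset.",
    "Fiber intake supports digestive health.",
]
query = "Does caffeine affect my sleep?"
prompt = prepend_evidence(query, retrieve(query, corpus))
print(prompt)
```

Numbering the snippets lets the system prompt ask the model to cite `[1]`, `[2]`, and so on, which makes unsupported claims easier to spot when comparing against the current prompts.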

Agent Features

Memory

  • Session-level chat history with slot templates
  • Feedback-driven slot updates (short-term personalization)
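A feedback-to-slot update loop for short-term personalization might look like the following sketch. The `SessionSlots` fields, the feedback tags, and the `FEEDBACK_RULES` mapping are all hypothetical; the summary does not specify how the paper maps feedback onto slots.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SessionSlots:
    """Hypothetical per-session slot values that feedback may adjust."""
    tone: str = "neutral"
    detail: str = "standard"


# Assumed mapping from a feedback tag to (slot name, new value).
FEEDBACK_RULES = {
    "too_technical": ("detail", "simple"),
    "too_blunt": ("tone", "gentle"),
    "too_vague": ("detail", "detailed"),
}


def apply_feedback(slots: SessionSlots, tag: str, log: list[str]) -> SessionSlots:
    """Return updated slots and append a JSON log line for later review."""
    updated = SessionSlots(**asdict(slots))  # copy, so the input is not mutated
    rule = FEEDBACK_RULES.get(tag)
    if rule is not None:
        setattr(updated, rule[0], rule[1])
    log.append(json.dumps({"feedback": tag, "before": asdict(slots), "after": asdict(updated)}))
    return updated


log: list[str] = []
session = SessionSlots()
session = apply_feedback(session, "too_blunt", log)
session = apply_feedback(session, "too_technical", log)
print(session)
print(log[-1])
```

Unknown tags leave the slots unchanged but are still logged, so unexpected feedback remains visible when the templates are reviewed.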

Tool Use

  • Third-party API actions via Multimodal Agent (e.g., reminders, search)

Frameworks

  • Responsible Prompt Engine (RPE)
  • Multimodal RAG

Optimization Features

System Optimization

  • Template-driven prompt generation for reproducible behavior

Training Optimization

  • Instruction-time tuning via structured system instructions (no full fine-tune required)

Inference Optimization

  • Slot-based prompt construction to reduce ambiguous queries
  • Optional RAG that trades extra retrieval cost for fewer hallucinations

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation relies heavily on GPT-4 as an automated judge, which risks evaluator bias and may not replace human clinical review.
  • RPE depends on predefined templates and slot rules, limiting flexibility with ambiguous or novel user inputs.
  • Current tests focus on English datasets; cross-lingual and cultural generalization is untested.
  • Prototype-level system: not clinically validated for high-stakes medical decision-making.

When Not To Use

  • High-stakes clinical diagnosis, emergency medicine, or any setting where regulated medical advice is required.
  • Systems requiring formal clinical validation or legal liability guarantees.
  • Settings where users or regulators demand fully explainable audit trails beyond prompt-level citations.

Failure Modes

  • Template extraction failures or missing slot values that lead to incomplete prompts and unsafe outputs.
  • Users crafting inputs that bypass filters (adversarial or unconstrained text).
  • Hallucinations when RAG/evidence retrieval is disabled or returns low-quality snippets.
  • Over-reliance on automated GPT-4 evaluation could mask real-world safety gaps.
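One way to guard against the missing-slot failure mode is to validate slots before any prompt is built and fall back to a fixed safe message otherwise. This is a sketch under assumptions: the required-slot set, the fallback wording, and the function names are hypothetical and not taken from the paper.

```python
REQUIRED_SLOTS = ("query", "role", "filters")  # assumed minimum set

FALLBACK_MESSAGE = (
    "I can't answer that safely right now. "
    "Please rephrase, or contact a health professional for urgent concerns."
)


def missing_slots(slots: dict) -> list[str]:
    """Names of required slots that are absent or empty."""
    return [name for name in REQUIRED_SLOTS if not slots.get(name)]


def safe_prompt_or_fallback(slots: dict) -> tuple[bool, str]:
    """Return (ok, text): a prompt when slots are complete, else the fallback."""
    if missing_slots(slots):
        return False, FALLBACK_MESSAGE
    rules = "; ".join(slots["filters"])
    return True, f"You are {slots['role']}. Rules: {rules}\nQuestion: {slots['query']}"


complete = {"query": "How much water should I drink?", "role": "a wellness coach",
            "filters": ["Do not diagnose"]}
incomplete = {"query": "How much water should I drink?", "role": "a wellness coach",
              "filters": []}  # extraction produced no safety filters

print(safe_prompt_or_fallback(complete))
print(safe_prompt_or_fallback(incomplete))
```

Treating an empty filter list as a failure means the LLM is never called without safety rules, at the cost of refusing some queries that would have been harmless.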

Core Entities

Models

  • GPT-4
  • Gemini Flash 2.5
  • Gemini Pro
  • LLaMA 4
  • QwenV2
  • BioMistral-7B
  • Asclepius-7B
  • LLaMA3-8B-Instruct
  • Qwen2-7B-Instruct
  • Qwen-VL
  • Mistral-7B
  • GPT-3.5

Metrics

  • BLEU
  • ROUGE-L
  • BERTScore
  • Factuality Score (FS)
  • Contextual Appropriateness Score (CAS)
  • Instructional Compliance Score (ICS)
  • WHO-aligned Responsibility Rubric (WRR)

Datasets

  • MentalChat16k
  • MTS-Dialog v3
  • NutriBench v2
  • SensorQA

Benchmarks

  • MentalChat16k
  • MTS-Dialog v3
  • NutriBench v2
  • SensorQA