Overview
Production Readiness
0.6
Novelty Score
0.65
Cost Impact Score
0.45
Citation Count
0
Why It Matters For Business
The Responsible Prompt Engine (RPE) offers a low-cost way to make LLM-based health assistants safer, more explainable, and better aligned with user context, improving trust and reducing legal and ethical risk without full model retraining.
Summary TLDR
RHealthTwin is a modular framework for consumer-facing health "digital twins" that wraps an LLM with a Responsible Prompt Engine (RPE). RPE extracts structured slots (query, context, role, tone, filters, justification, examples) from multimodal inputs and builds system + user prompts that guide an LLM. In evaluations on four public datasets (mental health, clinical dialog, nutrition, wearable QA), RPE improved reference metrics (BLEU=0.41, ROUGE-L=0.63, BERTScore=0.89), scored high on factuality and context alignment (FS≈4.2/5, CAS≈4.1/5), and yielded strong ethical compliance (ICS>0.94, WRR>0.92), with GPT-4 as an automated judge. The system is a prototype: it helps reduce hallucination and enacts
Problem Statement
LLM-driven digital twins can aid everyday well-being but risk hallucination, bias, unclear reasoning, and unsafe advice. Existing digital-twin work focuses on clinical accuracy or simulation but lacks integrated ethical controls, multimodal grounding, and continuous personalization for consumer health. RHealthTwin aims to operationalize WHO ethical principles in a prompt governance layer that dynamically structures inputs, enforces safety filters, and grounds outputs for multimodal, consumer-facing use.
Main Contribution
RHealthTwin framework: a modular pipeline to build multimodal, personalized well-being digital twins with feedback-driven adaptation.
Responsible Prompt Engine (RPE): slot-based prompt construction (UQ, CP, J, ROLE, TONE, FILT, FE) that converts unstructured multimodal inputs into system + user prompts.
Operational evaluation: automated, model-agnostic tests on four public datasets (MentalChat16k, MTS-Dialog v3, NutriBench v2, SensorQA) using both reference metrics and GPT-4 as judge.
Practical algorithms: concrete slot-extraction, template wrapping, multimodal RAG optionality, and a feedback-to-slot update loop.
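The slot-based construction described above can be illustrated with a minimal sketch. The class name `PromptSlots`, the function `build_prompts`, the default role and tone strings, and the template wording are all assumptions made for illustration; the mapping of the abbreviations (UQ, CP, J, ROLE, TONE, FILT, FE) to query, context, justification, role, tone, filters, and examples is a reading of the summary, not the authors' published templates.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PromptSlots:
    """Hypothetical container for the seven RPE slots named in the summary."""
    user_query: str                                      # UQ
    context: str = ""                                    # CP
    justification: str = ""                              # J
    role: str = "a supportive well-being assistant"      # ROLE (assumed default)
    tone: str = "calm and plain-spoken"                  # TONE (assumed default)
    filters: List[str] = field(default_factory=list)     # FILT
    examples: List[str] = field(default_factory=list)    # FE


def build_prompts(slots: PromptSlots) -> Tuple[str, str]:
    """Wrap the slots into a (system_prompt, user_prompt) pair."""
    # Behavioural slots (role, tone, filters, examples) go to the system prompt.
    system_lines = [
        f"You are {slots.role}.",
        f"Respond in a {slots.tone} tone.",
    ]
    if slots.filters:
        system_lines.append("Safety rules:")
        system_lines.extend(f"- {rule}" for rule in slots.filters)
    if slots.examples:
        system_lines.append("Examples of acceptable answers:")
        system_lines.extend(f"- {ex}" for ex in slots.examples)

    # Content slots (context, justification, query) go to the user prompt.
    user_lines = []
    if slots.context:
        user_lines.append(f"Context: {slots.context}")
    if slots.justification:
        user_lines.append(f"Evidence: {slots.justification}")
    user_lines.append(f"Question: {slots.user_query}")
    return "\n".join(system_lines), "\n".join(user_lines)
```

The resulting pair can be passed to any chat-style LLM API that accepts separate system and user messages, which is one way to keep the approach model-agnostic.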
Key Findings
RPE improves lexical and semantic reference scores on datasets with ground-truth responses.
RPE raises instruction-following and WHO-aligned ethical compliance above baselines (ICS>0.94, WRR>0.92).
RPE improves perceived factuality and context grounding under GPT-4 evaluation.
Results
BLEU: 0.41
ROUGE-L: 0.63
BERTScore: 0.89
Instructional Compliance Score (ICS): >0.94
WHO-aligned Responsibility Rubric (WRR): >0.92
Factuality Score (FS): ≈4.2/5
Contextual Appropriateness Score (CAS): ≈4.1/5
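For readers unfamiliar with the reference metrics, ROUGE-L is built on the longest common subsequence (LCS) between a candidate response and a reference. The sketch below computes an LCS-based F1 score under simplifying assumptions: whitespace tokenization, lowercasing, and equal weighting of precision and recall. The paper's exact tokenizer and weighting are not stated in this summary, so scores from this sketch may differ from the reported ones.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """LCS-based F1 over whitespace tokens (assumed tokenization)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)   # share of candidate tokens in the LCS
    recall = lcs / len(ref)       # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)
```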
Who Should Care
What To Try In 7 Days
Implement slot-based prompt templates (query, context, role, tone, filters) around your existing LLM API and compare outputs to current prompts.
Add a lightweight justification/RAG step that prepends 1–3 retrieved evidence snippets to reduce hallucinations.
Log prompt slots and user feedback to drive incremental slot-template updates and track ethical compliance.
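The second and third steps above can be sketched as two small helpers. The names `prepend_evidence` and `make_log_record`, the `[n]` citation format, and the JSON record fields are assumptions for illustration; retrieval itself is left out, so the snippets are expected to come from whatever search or vector store is already in place.

```python
import json
import time
from typing import Dict, List


def prepend_evidence(user_prompt: str, snippets: List[str], max_snippets: int = 3) -> str:
    """Prepend up to max_snippets non-empty evidence snippets to a user prompt."""
    kept = [s.strip() for s in snippets if s.strip()][:max_snippets]
    if not kept:
        # No usable evidence: return the prompt unchanged rather than an empty header.
        return user_prompt
    numbered = [f"[{i}] {s}" for i, s in enumerate(kept, start=1)]
    return "Evidence:\n" + "\n".join(numbered) + "\n\n" + user_prompt


def make_log_record(slots: Dict[str, str], feedback: str) -> str:
    """Serialize one turn (slots plus user feedback) as a single JSON line."""
    record = {"ts": time.time(), "slots": slots, "feedback": feedback}
    return json.dumps(record)
```

Appending each record to a JSON Lines file is one simple way to keep the slot values that produced a given answer next to the feedback it received, so later template changes can be compared against earlier turns.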
Agent Features
Memory
- Session-level chat history with slot templates
- Feedback-driven slot updates (short-term personalization)
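A feedback-driven slot update can be as simple as a lookup from a feedback label to a slot change. The sketch below is an illustration only: the labels, the rule table `FEEDBACK_RULES`, and the function `apply_feedback` are hypothetical, and the summary does not specify how the authors' feedback-to-slot loop is implemented.

```python
from typing import Any, Dict

# Hypothetical mapping from a feedback label to (slot name, new value).
FEEDBACK_RULES = {
    "too_technical": ("tone", "simple, jargon-free"),
    "too_blunt": ("tone", "warm and encouraging"),
    "too_long": ("filters", "Keep answers under 120 words."),
}


def apply_feedback(slots: Dict[str, Any], label: str) -> Dict[str, Any]:
    """Return a copy of the slots with the rule for `label` applied, if any."""
    # Copy list values so the caller's slots are never mutated.
    updated = {k: (list(v) if isinstance(v, list) else v) for k, v in slots.items()}
    rule = FEEDBACK_RULES.get(label)
    if rule is None:
        return updated  # unknown feedback leaves the slots unchanged
    slot, value = rule
    if slot == "filters":
        filters = updated.setdefault("filters", [])
        if value not in filters:
            filters.append(value)  # filters accumulate; duplicates are skipped
    else:
        updated[slot] = value      # scalar slots such as tone are overwritten
    return updated
```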
Tool Use
- Third-party API actions via Multimodal Agent (e.g., reminders, search)
Frameworks
- Responsible Prompt Engine (RPE)
- Multimodal RAG
Optimization Features
System Optimization
- Template-driven prompt generation for reproducible behavior
Training Optimization
- Instruction-time tuning via structured system instructions (no full fine-tune required)
Inference Optimization
- Slot-based prompt construction to reduce ambiguous queries
- Optional RAG, trading added retrieval cost for fewer hallucinations
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation relies heavily on GPT-4 as an automated judge, which risks evaluator bias and may not replace human clinical review.
- RPE depends on predefined templates and slot rules, limiting flexibility with ambiguous or novel user inputs.
- Current tests focus on English datasets; cross-lingual and cultural generalization is untested.
- Prototype-level system: not clinically validated for high-stakes medical decision-making.
When Not To Use
- High-stakes clinical diagnosis, emergency medicine, or settings that require regulated medical advice.
- Systems requiring formal clinical validation or legal liability guarantees.
- Settings where users or regulators demand fully explainable audit trails beyond prompt-level citations.
Failure Modes
- Template extraction failures or missing slot values leading to incomplete prompts and unsafe outputs.
- Users crafting inputs that bypass filters (adversarial or unconstrained text).
- Hallucinations when RAG/evidence retrieval is disabled or returns low-quality snippets.
- Over-reliance on automated GPT-4 evaluation could mask real-world safety gaps.
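One way to guard against the first failure mode above is to check slots before any prompt is built and refuse to call the model when a required slot is empty. The sketch below assumes a hypothetical set of required slot names (`REQUIRED_SLOTS`) and a hypothetical `check_slots` helper; which slots are actually mandatory in RPE is not stated in this summary.

```python
from typing import Any, Dict, List, Tuple

# Assumed set of slots that must be non-empty before calling the LLM.
REQUIRED_SLOTS = ("query", "role", "tone", "filters")


def check_slots(slots: Dict[str, Any]) -> Tuple[bool, List[str]]:
    """Return (ok, missing) where missing lists required slots that are absent or empty."""
    missing = [name for name in REQUIRED_SLOTS if not slots.get(name)]
    return (not missing, missing)
```

When `ok` is false, a caller can return a fixed fallback message or ask the user a clarifying question instead of sending an incomplete prompt to the model.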
Core Entities
Models
- GPT-4
- Gemini Flash 2.5
- Gemini Pro
- LLaMA 4
- QwenV2
- BioMistral-7B
- Asclepius-7B
- LLaMA3-8B-Instruct
- Qwen2-7B-Instruct
- Qwen-VL
- Mistral-7B
- GPT-3.5
Metrics
- BLEU
- ROUGE-L
- BERTScore
- Factuality Score (FS)
- Contextual Appropriateness Score (CAS)
- Instructional Compliance Score (ICS)
- WHO-aligned Responsibility Rubric (WRR)
Datasets
- MentalChat16k
- MTS-Dialog v3
- NutriBench v2
- SensorQA
Benchmarks
- MentalChat16k
- MTS-Dialog v3
- NutriBench v2
- SensorQA

