Overview
Production Readiness: 0.6
Novelty Score: 0.6
Cost Impact Score: 0.4
Citation Count: 0
Why It Matters For Business
EmplifAI fills a Japanese, medically focused empathy data gap, enabling faster prototyping of patient-facing assistants that better recognize nuanced emotions during chronic care.
Summary TLDR
EmplifAI is a Japanese dataset of 280 medically contextual situations and 4,125 two-turn patient–supporter dialogues labeled with 28 emotions adapted from GoEmotions. The authors validate the emotion taxonomy via reverse-engineered emotion prediction with multiple LLMs (all BERTScore F1 ≥ 0.83), fine-tune small Japanese LLMs on the dataset (with gains as scored by an automated LLM judge), and compare LLM-as-a-Judge scores with human ratings to expose alignment gaps in subjective empathy evaluation.
Problem Statement
Existing Japanese empathy datasets miss medical context, underrepresent positive or subtle emotions, and use overlapping or imbalanced emotion labels. These gaps hinder building empathetic agents for long-term chronic care, where mixed and shifting emotions matter.
Main Contribution
EmplifAI dataset: 280 situation prompts and 4,125 two-turn patient–supporter dialogues in Japanese across 28 fine-grained emotions.
Japanese translation and validation of the GoEmotions taxonomy for medical contexts.
Evaluation pipelines: reverse-engineering emotion prediction (FastText + BERTScore), supervised fine-tuning baselines, and LLM-as-a-Judge vs human rater comparison.
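To make the embedding-based half of the reverse-engineering check concrete, here is a minimal sketch of scoring a predicted emotion label against the gold label by cosine similarity. It is an illustration under stated assumptions, not the authors' pipeline: the 3-dimensional vectors in `TOY_EMBEDDINGS` are hypothetical stand-ins for real FastText word vectors, and `label_alignment` is a name introduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors; in practice these would come from a
# pretrained embedding model such as FastText.
TOY_EMBEDDINGS = {
    "relief": [0.9, 0.1, 0.0],
    "gratitude": [0.8, 0.3, 0.1],
    "fear": [0.0, 0.2, 0.9],
}


def label_alignment(gold, predicted, embeddings=TOY_EMBEDDINGS):
    """Similarity between the gold emotion label and a predicted label."""
    return cosine_similarity(embeddings[gold], embeddings[predicted])


if __name__ == "__main__":
    for guess in ("relief", "gratitude", "fear"):
        print(guess, round(label_alignment("relief", guess), 3))
```

A near-miss prediction (here, "gratitude" for "relief") scores higher than an unrelated one ("fear"), which is why a similarity score can be more informative than exact-match accuracy for closely related emotions.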
Key Findings
Dataset scale and balance
Taxonomy validity via LLMs
Fine-tuning improves automated empathy metrics
LLM-as-a-Judge often correlates with humans but can fail
Results
Dataset size
Emotion prediction (BERTScore F1)
SFT
Human vs LLM-judge correlation (DeepSeek)
Who Should Care
What To Try In 7 Days
Load the 280 situations and sample 100 pairs to inspect style and labels.
Fine-tune a small Japanese LLM for 1–3 epochs and compare fluency/empathy with and without EmplifAI.
Run simple emotion-prediction tests with BERTScore and FastText to validate label alignment.
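The first step above could be sketched as follows. Because no dataset URL or schema is given, the record layout (`situation_id`, `patient`, `supporter`, `emotion`) and the synthetic placeholder records are assumptions made purely for illustration; `sample_pairs` and `label_counts` are hypothetical helper names.

```python
import random
from collections import Counter


def sample_pairs(records, k=100, seed=0):
    """Draw up to k records without replacement, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def label_counts(records, label_key="emotion"):
    """Count how often each emotion label appears in the records."""
    return Counter(r[label_key] for r in records)


# Synthetic placeholder records; the real field names are unknown.
EMOTIONS = ["relief", "fear", "gratitude"]
records = [
    {
        "situation_id": i % 280,
        "patient": f"patient utterance {i}",
        "supporter": f"supporter reply {i}",
        "emotion": EMOTIONS[i % len(EMOTIONS)],
    }
    for i in range(300)
]

sample = sample_pairs(records, k=100, seed=42)
for emotion, n in label_counts(sample).most_common():
    print(emotion, n)
```

Fixing the seed keeps the inspected sample stable across runs, so notes taken on style and labels stay tied to the same 100 pairs.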
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- No explicit public dataset or code URL is provided, so reuse may require contacting the authors.
- Data collected via imagined dialogues by crowdworkers; may not match real clinical conversations.
- Designed for chronic medical contexts in Japanese; limited generalization to other domains or cultures.
- Prompt and length differences: models produced longer replies than crowdworkers, which can bias evaluations.
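One way to check for the length difference noted in the last limitation is to compare reply lengths before scoring. The sketch below is an assumption-laden illustration: it measures length in characters (a reasonable unit for Japanese, which is not whitespace-delimited), and the example replies and helper names are invented for the demo.

```python
from statistics import mean, median


def length_summary(replies):
    """Character-length statistics for a non-empty list of replies."""
    lengths = [len(r) for r in replies]
    return {"n": len(lengths), "mean": mean(lengths), "median": median(lengths)}


def length_ratio(model_replies, human_replies):
    """Mean model reply length divided by mean human reply length."""
    return length_summary(model_replies)["mean"] / length_summary(human_replies)["mean"]


# Invented example replies, for illustration only.
human = ["それは辛いですね。", "お大事にしてください。"]
model = ["それは本当に辛いですね。無理をせず、少しずつ進めていきましょう。"]

print(length_summary(human))
print(length_summary(model))
print(round(length_ratio(model, human), 2))
```

A ratio well above 1 suggests truncating or length-matching replies before comparing judge scores, so that verbosity is not mistaken for empathy.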
When Not To Use
- If you need real clinical transcripts rather than imagined patient–supporter role-play.
- For non-Japanese languages or culturally different clinical settings without revalidation.
- When strict medical advice accuracy is required. EmplifAI focuses on empathetic style, not clinical correctness.
Failure Modes
- LLM-as-a-Judge can overrate 'correct' or solution-focused responses that humans find emotionally inappropriate.
- Fine-tuned models may produce overly solution-focused replies that increase patient pressure.
- Label ambiguity may remain for closely related emotions despite validation (some FastText scores were lower).
Core Entities
Models
- GPT-o3-pro
- DeepSeek-distilled-Qwen-32b
- LLM-jp-3.1-13b-instruct4
- Llama-3-Swallow-8b-Instruct-v0.1
- MedLlama3-JP
- Gemini-2.5-Flash
Metrics
- BERTScore
- FastText (cosine similarity)
- 5-point Likert metrics: content_comprehensibility, general_empathy, emotion_specific_empathy, consistency, fluency, harmlessness, sense_of_security
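For the 5-point Likert metrics listed above, agreement between human raters and an LLM judge can be checked with a rank correlation. The source does not say which coefficient was used, so Spearman's rho here is an assumption; the ratings are made-up numbers and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up general_empathy ratings for six responses.
human_scores = [5, 4, 2, 3, 1, 4]
judge_scores = [4, 5, 2, 3, 2, 5]
print(round(spearman(human_scores, judge_scores), 3))
```

Likert ratings produce many ties, which is why the ranks are averaged rather than assigned arbitrarily. Computing the coefficient per metric (for example, separately for general_empathy and sense_of_security) helps locate where the judge and humans diverge.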
Datasets
- EmplifAI
- GoEmotions
- EmpatheticDialogues
- STUDIES
- CALLS
- KokoroChat

