Overview
Dataset is a clear, domain-specific contribution with strong internal validation, but no public release or external replication is documented.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.86
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 1/4
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 40%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
EmplifAI fills a Japanese, medically focused empathy data gap, enabling faster prototyping of patient-facing assistants that better recognize nuanced emotions during chronic care.
Summary TLDR
EmplifAI is a Japanese dataset of 280 medically contextual situations and 4,125 two-turn patient–supporter dialogues labeled with 28 emotions (adapted from GoEmotions). The authors validate the emotion taxonomy by reverse-engineering emotion labels with five LLMs (all reach BERTScore F1 ≥ 0.83), fine-tune small Japanese LLMs on the dataset (with observable gains as scored by an automated LLM judge), and compare LLM-as-a-judge scores with human ratings, exposing alignment gaps in subjective empathy evaluation.
Problem Statement
Existing Japanese empathy datasets lack medical context, underrepresent positive or subtle emotions, and use overlapping or imbalanced emotion labels. This gap hinders building empathetic agents for long-term chronic care, where mixed and shifting emotions matter.
Main Contribution
EmplifAI dataset: 280 situation prompts and 4,125 two-turn patient–supporter dialogues in Japanese across 28 fine-grained emotions.
Japanese translation and validation of GoEmotions taxonomy for medical contexts.
Key Findings
Dataset scale and balance
Taxonomy validity via LLMs
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Dataset size | 280 situations; 4,125 dialogues | — | — | EmplifAI (all) | Two rounds of crowdsourcing and manual review produced 280 situations and 4,125 two-turn dialogues | Section 3.5 |
| Emotion prediction (BERTScore F1) | ≥ 0.83 | — | — | Reverse-engineering on 4,125 pairs | All five evaluated LLMs achieved BERTScore F1 ≥ 0.83 | Section 4.2; Table 2 |
What To Try In 7 Days
Load the 280 situations and sample 100 pairs to inspect style and labels.
Fine-tune a small Japanese LLM for 1–3 epochs and compare fluency and empathy with and without EmplifAI fine-tuning.
Run simple emotion-prediction tests with BERTScore and FastText to validate label alignment.
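As a starting point for the first step above, here is a minimal sketch of drawing a reproducible 100-pair sample and tallying its emotion labels. No public release is documented, so the record layout (`id`, `emotion`, `patient`, `supporter`) and the synthetic stand-in data are assumptions for illustration, not the dataset's actual schema.

```python
import random
from collections import Counter


def sample_pairs(pairs, k=100, seed=0):
    """Draw up to k dialogue pairs without replacement, reproducibly."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-inspected
    return rng.sample(pairs, min(k, len(pairs)))


def label_counts(pairs, key="emotion"):
    """Count how often each emotion label appears in the given pairs."""
    return Counter(p[key] for p in pairs)


# Synthetic stand-in data (hypothetical schema, placeholder text):
# 4,125 pairs spread over 28 placeholder emotion names.
EMOTIONS = [f"emotion_{i:02d}" for i in range(28)]
pairs = [
    {"id": i, "emotion": EMOTIONS[i % 28], "patient": "...", "supporter": "..."}
    for i in range(4125)
]

sample = sample_pairs(pairs)
counts = label_counts(sample)

# Print the five most frequent labels in the sample for a quick balance check.
for emotion, n in counts.most_common(5):
    print(emotion, n)
```

With real data, the same two helpers could be pointed at whatever structure the authors provide; only the field names would need to change.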
Risks & Boundaries
Limitations
No public dataset or code URL is provided, so reuse requires contacting the authors.
Data were collected as imagined dialogues written by crowdworkers and may not match real clinical conversations.
When Not To Use
If you need real clinical transcripts rather than imagined patient–supporter role-play.
For non-Japanese languages or culturally different clinical settings without revalidation.
Failure Modes
LLM-as-a-Judge can overrate 'correct' or solution-focused responses that humans find emotionally inappropriate.
Fine-tuned models may produce overly solution-focused replies that increase patient pressure.
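One way to watch for the judge-overrating failure mode is to compare judge and human scores for the same responses and flag large positive gaps. The sketch below assumes both scores sit on the same numeric scale. The field names (`judge`, `human`), the threshold, and the ratings are hypothetical values made up for illustration, not figures from the paper.

```python
from statistics import mean


def judge_human_gaps(items, threshold=1.0):
    """Return the mean (judge - human) gap and the ids where the judge
    exceeds the human rating by at least `threshold`."""
    gaps = [(it["id"], it["judge"] - it["human"]) for it in items]
    flagged = [item_id for item_id, gap in gaps if gap >= threshold]
    return mean(gap for _, gap in gaps), flagged


# Made-up ratings on an assumed shared scale, for illustration only.
ratings = [
    {"id": "r1", "judge": 4.5, "human": 4.0},
    {"id": "r2", "judge": 4.0, "human": 2.0},  # judge far above human
    {"id": "r3", "judge": 3.0, "human": 3.5},
]

mean_gap, flagged = judge_human_gaps(ratings)
print(f"mean gap: {mean_gap:.2f}, flagged: {flagged}")
```

Flagged responses are the ones worth reading by hand, since a high judge score paired with a low human score is the pattern described above.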

