EmplifAI: 4,125 Japanese medical two‑turn dialogues labeled with 28 fine-grained emotions

January 15, 2026 · 6 min

Overview

Decision Snapshot: Needs Validation

Dataset is a clear, domain-specific contribution with strong internal validation, but no public release or external replication is documented.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.86

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 60%

Authors

Wan Jou She, Lis Kanashiro Pereira, Fei Cheng, Sakiko Yahata, Panote Siriaraya, Eiji Aramaki

Why It Matters For Business

EmplifAI fills a Japanese, medically focused empathy data gap, enabling faster prototyping of patient-facing assistants that better recognize nuanced emotions during chronic care.

Summary TLDR

EmplifAI is a Japanese dataset of 280 medically contextual situations and 4,125 two-turn patient–supporter dialogues labeled with 28 emotions (adapted from GoEmotions). The authors validate the emotion taxonomy with a reverse-engineering (emotion prediction) test across multiple LLMs (all BERTScore F1 ≥ 0.83), fine-tune small Japanese LLMs on the dataset (with gains observed under automated LLM-as-a-judge scoring), and compare LLM-as-a-judge scores with human ratings to expose alignment gaps in subjective empathy evaluation.

Problem Statement

Existing Japanese empathy datasets miss medical context, underrepresent positive or subtle emotions, and use overlapping or imbalanced emotion labels. This gap limits building empathetic agents for long-term chronic care where mixed and shifting emotions matter.

Main Contribution

EmplifAI dataset: 280 situation prompts and 4,125 two-turn patient–supporter dialogues in Japanese across 28 fine-grained emotions.

Japanese translation and validation of GoEmotions taxonomy for medical contexts.

Key Findings

Dataset scale and balance

Numbers: 280 situations; 4,125 two-turn dialogues; ~14 to 15 dialogues per emotion-situation

Practical Use: You can fine-tune compact Japanese LLMs on a medically focused, relatively balanced empathy dataset without needing large synthetic data immediately.

Evidence Ref: Section 3.5, dataset stats
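The per-situation figure follows directly from the two reported totals. A minimal check, with illustrative variable names:

```python
# Totals as reported in the summary above.
num_situations = 280
num_dialogues = 4125

# Average number of two-turn dialogues collected per situation.
avg_per_situation = num_dialogues / num_situations
print(f"{avg_per_situation:.2f} dialogues per situation on average")
```

This is only an average; the summary does not report the per-emotion distribution, so balance across the 28 labels should be checked on the data itself.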

Taxonomy validity via LLMs

Numbers: BERTScore F1 ≥ 0.83 across five LLMs

Practical Use: The 28 adapted emotion labels are semantically distinguishable in model tests, making them suitable as training targets for emotion-aware dialogue models.

Evidence Ref: Section 4.2; Table 2
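To make the BERTScore F1 figure easier to interpret, the sketch below reimplements the metric's core greedy-matching step on toy vectors. It is a simplified illustration, not the authors' evaluation code: real BERTScore uses contextual token embeddings from a pretrained model and optionally IDF weighting and baseline rescaling, all omitted here. The function names `cosine` and `greedy_f1` and the 2-d vectors are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_f1(cand_vecs, ref_vecs):
    """BERTScore-style F1: each token is matched to its most similar
    token on the other side (no IDF weighting, no baseline rescaling)."""
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy 2-d "token embeddings" (made up for illustration).
reference = [[1.0, 0.0], [0.0, 1.0]]
identical = [[1.0, 0.0], [0.0, 1.0]]
partial = [[1.0, 0.0], [1.0, 0.0]]  # covers only one reference token

print(greedy_f1(identical, reference))
print(greedy_f1(partial, reference))
```

In this toy case the identical candidate scores 1.0, while the partial candidate keeps perfect precision but loses recall, which lowers F1 to about 0.67.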

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Dataset size | 280 situations; 4,125 dialogues | - | - | EmplifAI (all) | Two rounds of crowdsourcing and manual review produced 280 situations and 4,125 two-turn dialogues | Section 3.5
Emotion prediction (BERTScore F1) | ≥ 0.83 | - | - | Reverse-engineering on 4,125 pairs | All five evaluated LLMs achieved BERTScore F1 ≥ 0.83 | Section 4.2; Table 2

What To Try In 7 Days

Load the 280 situations and sample 100 pairs to inspect style and labels.

Fine-tune a small Japanese LLM for 1–3 epochs and compare fluency/empathy with and without EmplifAI.

Run simple emotion-prediction tests with BERTScore and FastText to validate label alignment.
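The inspection and label-alignment steps above can be sketched as follows. Since no public data release is documented, the records, field names (`id`, `emotion`), and the 3-d label vectors are all hypothetical stand-ins; in practice the vectors would come from a FastText model for Japanese.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical records; the real dataset schema is not documented here.
records = [
    {"id": i, "emotion": "gratitude" if i % 2 == 0 else "fear"}
    for i in range(4125)
]

# Seeded sampling without replacement, so the inspected subset is reproducible.
rng = random.Random(0)
sample = rng.sample(records, 100)

# Toy label vectors standing in for FastText embeddings of the label words.
label_vecs = {
    "gratitude": [0.9, 0.1, 0.0],
    "relief": [0.8, 0.2, 0.1],
    "fear": [0.0, 0.1, 0.9],
}

def label_similarity(predicted, gold):
    """Score a predicted emotion label against the gold label."""
    return cosine(label_vecs[predicted], label_vecs[gold])

print(len(sample))
print(label_similarity("relief", "gratitude") > label_similarity("relief", "fear"))
```

A similarity-based score gives partial credit for near-miss labels (here, "relief" for "gratitude"), which exact-match accuracy would count as a plain error.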

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

No explicit public dataset or code URL provided, so reuse requires contacting authors.

Data collected via imagined dialogues by crowdworkers; may not match real clinical conversations.

When Not To Use

If you need real clinical transcripts rather than imagined patient–supporter role-play.

For non-Japanese languages or culturally different clinical settings without revalidation.

Failure Modes

LLM-as-a-Judge can overrate 'correct' or solution-focused responses that humans find emotionally inappropriate.

Fine-tuned models may produce overly solution-focused replies that increase patient pressure.
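One simple way to surface the judge-overrating failure mode is to compare judge and human scores on the same responses and flag large positive gaps. The sketch below assumes paired 1 to 5 Likert ratings; the ratings, field names, and the `min_gap` threshold are invented for illustration and are not taken from the paper.

```python
from statistics import mean

# Hypothetical 1-5 Likert ratings for the same responses (made-up values).
ratings = [
    {"id": "r1", "judge": 5, "human": 2},
    {"id": "r2", "judge": 4, "human": 4},
    {"id": "r3", "judge": 3, "human": 4},
    {"id": "r4", "judge": 5, "human": 3},
]

def overrated(items, min_gap=2):
    """Return ids where the judge score exceeds the human score by at least min_gap."""
    return [it["id"] for it in items if it["judge"] - it["human"] >= min_gap]

# Positive mean gap: the judge scores higher than humans on average.
mean_gap = mean(it["judge"] - it["human"] for it in ratings)
flagged = overrated(ratings)
print(mean_gap, flagged)
```

Flagged responses are candidates for manual review, for example to check whether they are solution-focused replies that humans found emotionally inappropriate.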

Core Entities

Models

GPT-o3-pro, DeepSeek-distilled-Qwen-32b, LLM-jp-3.1-13b-instruct4, Llama-3-Swallow-8b-Instruct-v0.1, MedLlama3-JP, Gemini-2.5-Flash

Metrics

BERTScore, FastText (cosine similarity), 5-point Likert metrics: content_comprehensibility, general_empathy, emotion_specific_empathy, consistency, fluency, harmlessness, sense_of_security

Datasets

EmplifAI, GoEmotions, EmpatheticDialogues, STUDIES, CALLS, KokoroChat