EmplifAI: 4,125 Japanese medical two‑turn dialogues labeled with 28 fine-grained emotions

Overview

Decision Snapshot: Needs Validation

The dataset is a clear, domain-specific contribution with strong internal validation, but no public release or external replication is documented.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.86

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 60%

Authors

Wan Jou She, Lis Kanashiro Pereira, Fei Cheng, Sakiko Yahata, Panote Siriaraya, Eiji Aramaki

Links

Abstract / PDF

Why It Matters For Business

EmplifAI fills a Japanese, medically focused empathy data gap, enabling faster prototyping of patient-facing assistants that better recognize nuanced emotions during chronic care.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO

Summary TLDR

EmplifAI is a Japanese dataset of 280 medically contextualized situations and 4,125 two-turn patient–supporter dialogues labeled with 28 emotions adapted from GoEmotions. The authors validate the emotion taxonomy by having multiple LLMs reverse-engineer the emotion from the dialogues (all BERTScore F1 ≥ 0.83), fine-tune small Japanese LLMs on the dataset (with gains observed under automated LLM-as-a-judge scoring), and compare LLM-as-a-judge scores with human ratings to expose alignment gaps in subjective empathy evaluation.

Problem Statement

Existing Japanese empathy datasets miss medical context, underrepresent positive or subtle emotions, and use overlapping or imbalanced emotion labels. This gap limits building empathetic agents for long-term chronic care where mixed and shifting emotions matter.

Main Contribution

EmplifAI dataset: 280 situation prompts and 4,125 two-turn patient–supporter dialogues in Japanese across 28 fine-grained emotions.

Japanese translation and validation of the GoEmotions taxonomy for medical contexts.

Key Findings

Dataset scale and balance

Numbers: 280 situations; 4,125 two-turn dialogues; ~14–15 dialogues per emotion-situation

Practical Use: You can fine-tune compact Japanese LLMs on a medically focused, relatively balanced empathy dataset without needing large synthetic data immediately.

Evidence Ref: Section 3.5, dataset stats
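The per-situation figure above can be cross-checked with simple arithmetic. The following is a minimal sketch that uses only the counts reported in this summary. The per-emotion numbers assume that situations and dialogues are split evenly across the 28 emotions, which this summary does not state.

```python
# Counts as reported in this summary.
NUM_SITUATIONS = 280
NUM_DIALOGUES = 4125
NUM_EMOTIONS = 28

# Average number of two-turn dialogues collected per situation.
dialogues_per_situation = NUM_DIALOGUES / NUM_SITUATIONS

# Assumption (not stated in the summary): an even split across emotions.
situations_per_emotion = NUM_SITUATIONS / NUM_EMOTIONS
dialogues_per_emotion = NUM_DIALOGUES / NUM_EMOTIONS

print(f"dialogues per situation: {dialogues_per_situation:.2f}")
print(f"situations per emotion (if even): {situations_per_emotion:.0f}")
print(f"dialogues per emotion (if even): {dialogues_per_emotion:.1f}")
```

The first ratio comes out to about 14.7, which matches the reported range of roughly 14 to 15 dialogues per situation.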

Taxonomy validity via LLMs

Numbers: BERTScore F1 ≥ 0.83 across five LLMs

Practical Use: The 28 adapted emotion labels are semantically distinguishable in model tests, making them suitable as training targets for emotion-aware dialogue models.

Evidence Ref: Section 4.2; Table 2
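To make the BERTScore F1 figure easier to interpret, the sketch below illustrates the greedy-matching idea behind the metric: each token on one side is matched to its most similar token on the other side, and precision and recall are combined into F1. This is a simplified illustration, not the authors' evaluation code. It omits IDF weighting and baseline rescaling, and the 2-dimensional vectors are hypothetical stand-ins for the contextual embeddings a real BERT model would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_f1(cand_vecs, ref_vecs):
    """BERTScore-style F1 via greedy matching (no IDF, no rescaling)."""
    # Precision: match every candidate token to its closest reference token.
    precision = sum(
        max(cosine(c, r) for r in ref_vecs) for c in cand_vecs
    ) / len(cand_vecs)
    # Recall: match every reference token to its closest candidate token.
    recall = sum(
        max(cosine(r, c) for c in cand_vecs) for r in ref_vecs
    ) / len(ref_vecs)
    return 2 * precision * recall / (precision + recall)


# Hypothetical token embeddings for a predicted and a gold emotion label.
predicted_label_vecs = [[1.0, 0.0], [0.0, 1.0]]
gold_label_vecs = [[1.0, 0.0], [0.6, 0.8]]

f1 = greedy_f1(predicted_label_vecs, gold_label_vecs)
print(f"greedy-matching F1: {f1:.2f}")
```

With these toy vectors, precision and recall are both 0.9, so F1 is 0.9. A score at or above 0.83 therefore indicates that predicted and gold labels are close in embedding space, not that they are string-identical.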

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Dataset size	280 situations; 4,125 dialogues	-	-	EmplifAI (all)	Two rounds of crowdsourcing and manual review produced 280 situations and 4,125 two-turn dialogues	Section 3.5
Emotion prediction (BERTScore F1)	≥ 0.83	-	-	reverse-engineering on 4,125 pairs	All five evaluated LLMs achieved BERTScore F1 ≥ 0.83	Section 4.2; Table 2

What To Try In 7 Days

Load the 280 situations and sample 100 pairs to inspect style and labels.

Fine-tune a small Japanese LLM for 1–3 epochs and compare fluency and empathy with and without EmplifAI.

Run simple emotion-prediction tests, scored with BERTScore and FastText cosine similarity, to validate label alignment.
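The first step in the list above (sampling 100 pairs for inspection) can be sketched as follows. Because no public release or schema is documented, the record fields (`emotion`, `patient`, `supporter`), the four emotion names, and the synthetic records are all hypothetical placeholders. Only the seeded sampling and the label tally are the point of the example.

```python
import random
from collections import Counter


def sample_pairs(records, k=100, seed=0):
    """Draw up to k records without replacement, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


# Hypothetical stand-in data; the real schema is not documented here.
EMOTIONS = ["joy", "fear", "gratitude", "sadness"]
records = [
    {
        "emotion": EMOTIONS[i % len(EMOTIONS)],
        "patient": f"patient utterance {i}",
        "supporter": f"supporter reply {i}",
    }
    for i in range(400)
]

sampled = sample_pairs(records, k=100, seed=0)

# Tally labels in the sample to spot obvious imbalance before reading.
label_counts = Counter(r["emotion"] for r in sampled)
for emotion, count in label_counts.most_common():
    print(emotion, count)
```

Fixing the seed means a second reviewer can inspect exactly the same 100 pairs.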

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

No public dataset or code URL is provided, so reuse requires contacting the authors.

The data consists of dialogues imagined by crowdworkers, which may not match real clinical conversations.

When Not To Use

If you need real clinical transcripts rather than imagined patient–supporter role-play.

For non-Japanese languages or culturally different clinical settings without revalidation.

Failure Modes

LLM-as-a-Judge can overrate 'correct' or solution-focused responses that humans find emotionally inappropriate.

Fine-tuned models may produce overly solution-focused replies that increase patient pressure.
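One way to check for the first failure mode in your own evaluation is to compare mean LLM-judge and human ratings per criterion. The sketch below is a minimal illustration under stated assumptions: the ratings are made up, the 1.0-point flagging threshold is arbitrary, and the function is not taken from the paper. The criterion names follow the 5-point Likert metrics listed in this summary.

```python
from statistics import mean

CRITERIA = ["general_empathy", "emotion_specific_empathy", "sense_of_security"]


def mean_gap(judge_scores, human_scores, criteria):
    """Per-criterion mean(judge) minus mean(human) on a 5-point scale."""
    return {
        c: mean(judge_scores[c]) - mean(human_scores[c]) for c in criteria
    }


# Hypothetical ratings for four responses (not from the paper).
judge = {
    "general_empathy": [5, 4, 5, 4],
    "emotion_specific_empathy": [4, 4, 3, 5],
    "sense_of_security": [4, 5, 4, 5],
}
human = {
    "general_empathy": [3, 3, 4, 2],
    "emotion_specific_empathy": [3, 4, 3, 4],
    "sense_of_security": [4, 4, 5, 4],
}

gaps = mean_gap(judge, human, CRITERIA)

# Arbitrary illustrative threshold: flag criteria the judge overrates by >= 1 point.
overrated = [c for c, gap in gaps.items() if gap >= 1.0]
print(gaps)
print("possibly overrated by the judge:", overrated)
```

A large positive gap on an empathy criterion is a signal to review those responses manually before trusting the automated score.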

Core Entities

Models

GPT-o3-pro, DeepSeek-distilled-Qwen-32b, LLM-jp-3.1-13b-instruct4, Llama-3-Swallow-8b-Instruct-v0.1, MedLlama3-JP, Gemini-2.5-Flash

Metrics

BERTScore, FastText (cosine similarity), 5-point Likert metrics: content_comprehensibility, general_empathy, emotion_specific_empathy, consistency, fluency, harmlessness, sense_of_security

Datasets

EmplifAI, GoEmotions, EmpatheticDialogues, STUDIES, CALLS, KokoroChat

