EmplifAI: 4,125 Japanese medical two‑turn dialogues labeled with 28 fine-grained emotions

January 15, 20266 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

0

Authors

Wan Jou She, Lis Kanashiro Pereira, Fei Cheng, Sakiko Yahata, Panote Siriaraya, Eiji Aramaki

Links

Abstract / PDF

Why It Matters For Business

EmplifAI fills a Japanese, medically focused empathy data gap, enabling faster prototyping of patient-facing assistants that better recognize nuanced emotions during chronic care.

Summary TLDR

EmplifAI is a Japanese dataset of 280 medically contextual situations and 4,125 two-turn patient–supporter dialogues labeled with 28 emotions (adapted from GoEmotions). The authors validate the emotion taxonomy via reverse-engineering with multiple LLMs (all BERTScore F1 ≥ 0.83), fine-tune small Japanese LLMs on the dataset (observable gains by automated LLM-judge), and compare LLM-as-a-judge scores with human ratings to expose alignment gaps in subjective empathy evaluation.

Problem Statement

Existing Japanese empathy datasets miss medical context, underrepresent positive or subtle emotions, and use overlapping or imbalanced emotion labels. This gap limits building empathetic agents for long-term chronic care where mixed and shifting emotions matter.

Main Contribution

EmplifAI dataset: 280 situation prompts and 4,125 two-turn patient–supporter dialogues in Japanese across 28 fine-grained emotions.

Japanese translation and validation of GoEmotions taxonomy for medical contexts.

Evaluation pipelines: reverse-engineering emotion prediction (FastText + BERTScore), supervised fine-tuning baselines, and LLM-as-a-Judge vs human rater comparison.

Key Findings

Dataset scale and balance

Numbers280 situations; 4,125 two-turn dialogues; ~14–15 dialogues per emotion-situation

Taxonomy validity via LLMs

NumbersBERTScore F1 ≥ 0.83 across five LLMs

Fine-tuning improves automated empathy metrics

NumbersLLM-judge content score for LLM-jp rose from 1.00 → 2.46 (LLM-judge)

LLM-as-a-Judge often correlates with humans but can fail

NumbersDeepSeek Pearson correlations ≈ 0.59–0.79 (p<0.01); GPT judge showed large MADs and weak correlations

Results

Dataset size

Value280 situations; 4,125 dialogues

Emotion prediction (BERTScore F1)

Value>= 0.83

SFT

Value1.00 → 2.46

BaselineLLM-jp zero-shot

Human vs LLM-judge correlation (DeepSeek)

ValuePearson ≈ 0.59–0.79 on metrics (p<0.01)

Who Should Care

What To Try In 7 Days

Load the 280 situations and sample 100 pairs to inspect style and labels.

Fine-tune a small Japanese LLM on 1–3 epochs and compare fluency/empathy with and without EmplifAI.

Run simple emotion-prediction tests with BERTScore and FastText to validate label alignment.

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • No explicit public dataset or code URL provided, so reuse requires contacting authors.
  • Data collected via imagined dialogues by crowdworkers; may not match real clinical conversations.
  • Designed for chronic medical contexts in Japanese; limited generalization to other domains or cultures.
  • Prompt and length differences: models produced longer replies than crowdworkers, which can bias evaluations.

When Not To Use

  • If you need real clinical transcripts rather than imagined patient–supporter role-play.
  • For non-Japanese languages or culturally different clinical settings without revalidation.
  • When strict medical advice accuracy is required—EmplifAI focuses on empathetic style, not clinical correctness.

Failure Modes

  • LLM-as-a-Judge can overrate 'correct' or solution-focused responses that humans find emotionally inappropriate.
  • Fine-tuned models may produce overly solution-focused replies that increase patient pressure.
  • Label ambiguity may remain for closely related emotions despite validation (some FastText scores were lower).

Core Entities

Models

  • GPT-o3-pro
  • DeepSeek-distilled-Qwen-32b
  • LLM-jp-3.1-13b-instruct4
  • Llama-3-Swallow-8b-Instruct-v0.1
  • MedLlama3-JP
  • Gemini-2.5-Flash

Metrics

  • BERTScore
  • FastText (cosine similarity)
  • 5-point Likert metrics: content_comprehensibility
  • general_empathy
  • emotion_specific_empathy
  • consistency
  • fluency
  • harmlessness
  • sense_of_security

Datasets

  • EmplifAI
  • GoEmotions
  • EmpatheticDialogues
  • STUDIES
  • CALLS
  • KokoroChat