Fine-tuned Chinese LLM that answers mental-health Q&A using a CBT (therapeutic) response structure

March 24, 2024 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.5

Cost Impact Score

0.2

Citation Count

6

Authors

Hongbin Na

Why It Matters For Business

Fine-tuning LLMs with therapy‑structured prompts creates more structured, CBT‑aligned replies for Chinese mental‑health Q&A; useful for building triage assistants and clinician support tools but not a replacement for professionals.

Summary TLDR

The paper builds a CBT-guided dataset and fine-tunes Chinese LLMs to produce Cognitive Behavioral Therapy (CBT) style responses for mental-health Q&A. The authors generated a 22,327-entry CBT QA dataset by prompting ChatGPT with a CBT prompt, then instruction-tuned several backbones with LoRA on that data. Automatic metrics and a small human study (100 examples) show modest gains in structured, therapy-aligned answers. The system is not a clinical tool and has clear limits: single-turn replies, only partial annotation of cognitive distortions, and a risk of false positives in distortion detection.

Problem Statement

Public mental-health Q&A data are noisy and are not grounded in concrete therapy methods. That makes it hard for LLMs to produce answers that follow the established CBT steps (empathy, identify thought, challenge, strategy, encouragement). The paper aims to create CBT-structured data and fine-tune models so that outputs better match therapy practice, while acknowledging clinical limits.

Main Contribution

A CBT-specific prompt that structures single-turn replies into five CBT steps (empathy, identify thought, challenge, strategy, encouragement).

A CBT QA dataset of 22,327 Chinese question-description-answer triples generated by ChatGPT using the CBT prompt (derived from PsyQA).

Instruction fine-tuning (LoRA) of multiple Chinese LLM backbones to create CBT-LLM variants and a mixed automatic + human evaluation showing modest improvements.
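
To make the five-step structure concrete, here is a minimal sketch of how such a single-turn prompt could be assembled. The template wording, the `CBT_STEPS` hints, and the `build_cbt_prompt` helper are all hypothetical illustrations; the paper's actual prompt text is not reproduced in this summary.

```python
# Hypothetical five-step CBT prompt builder. The step names come from the
# summary above; the hint wording and template are illustrative assumptions,
# not the paper's actual prompt.
CBT_STEPS = [
    ("Empathy", "Acknowledge the asker's feelings and validate their experience."),
    ("Identify thought", "Point out the key thought or possible cognitive distortion."),
    ("Challenge", "Gently question that thought and offer a more balanced view."),
    ("Strategy", "Suggest one or two concrete, practical coping strategies."),
    ("Encouragement", "Close with supportive, hopeful words."),
]


def build_cbt_prompt(question: str, description: str) -> str:
    """Compose a single-turn prompt that asks for a reply following the five steps in order."""
    step_lines = [
        f"{i}. {name}: {hint}" for i, (name, hint) in enumerate(CBT_STEPS, start=1)
    ]
    return "\n".join(
        [
            "You are a supportive counselor using Cognitive Behavioral Therapy (CBT).",
            "Reply in one message that follows these steps in order:",
            *step_lines,
            "",
            f"Question: {question}",
            f"Description: {description}",
        ]
    )


if __name__ == "__main__":
    print(build_cbt_prompt("I keep failing exams.", "I feel like I will never succeed."))
```

Keeping the steps in a list rather than in one hard-coded string makes it easy to reorder or reword them when comparing prompt variants.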

Key Findings

Created a CBT QA dataset with 22,327 entries.

Numbers: 22,327 entries (Table 1)

Over half of generated CBT responses contain at least one cognitive distortion label.

Numbers: 12,136 / 22,327 = 54.4% (Table 1)

CBT prompt-based detection of distortions finds most real cases but produces false positives.

Numbers: Accuracy 0.69, Recall 0.93, F1 0.65 on 500 annotated samples (Table 3)
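
The false-positive claim can be checked against the reported numbers. The sketch below solves the F1 definition for precision, assuming the reported F1 is the standard harmonic mean of precision and recall for the positive (distortion present) class. Under that assumption the implied precision is roughly 0.50, meaning about half of the flagged distortions would be false alarms. The helper name `implied_precision` is illustrative.

```python
def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, giving P = F1 * R / (2R - F1)."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("F1 and recall are inconsistent with the F1 definition")
    return f1 * recall / denom


# Values reported for the 500 annotated samples (Table 3 in the paper).
precision = implied_precision(f1=0.65, recall=0.93)
print(f"implied precision ~ {precision:.2f}")
```

If the paper's F1 is averaged differently (for example, a macro average over classes), this back-of-the-envelope figure would not apply directly.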

Fine-tuning with Baichuan-7B backbone gave the top automatic scores among tested backbones.

Numbers: Baichuan BLEU 0.2648 vs LLaMA-Chinese 0.2412 (+0.0236) (Table 4)

Human raters gave fairly high CBT-structure scores and moderate helpfulness scores (scale 0–2).

Numbers: Baichuan structure 1.644, helpfulness 1.432 (Table 5)

Results

BLEU (Baichuan-7B backbone)

Value: 0.2648

Baseline: LLaMA-Chinese-7B 0.2412

METEOR (Baichuan-7B backbone)

Value: 0.4031

Baseline: LLaMA-Chinese-7B 0.3758

BERTScore (Baichuan-7B backbone)

Value: 0.7841

Baseline: LLaMA-Chinese-7B 0.7793

Human relevance (Baichuan-7B)

Value: 1.734 / 2.0

Baseline: Alpaca-Chinese-7B 1.732

Human CBT structure (Baichuan-7B)

Value: 1.644 / 2.0

Baseline: Alpaca-Chinese-7B 1.508

Human helpfulness (Baichuan-7B)

Value: 1.432 / 2.0

Baseline: Alpaca-Chinese-7B 1.408
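
To put the differences in perspective, the short sketch below computes absolute and relative gains from the figures listed above. The `RESULTS` dictionary and `gains` helper are illustrative names. Note that the listed baseline differs by metric group (LLaMA-Chinese-7B for the automatic metrics, Alpaca-Chinese-7B for the human ratings), so the rows are not directly comparable to each other.

```python
# Scores copied from the Results section: (Baichuan-7B backbone, listed baseline).
# Automatic metrics use LLaMA-Chinese-7B as the baseline; human ratings use
# Alpaca-Chinese-7B, as listed in this summary.
RESULTS = {
    "BLEU": (0.2648, 0.2412),
    "METEOR": (0.4031, 0.3758),
    "BERTScore": (0.7841, 0.7793),
    "Human relevance": (1.734, 1.732),
    "Human CBT structure": (1.644, 1.508),
    "Human helpfulness": (1.432, 1.408),
}


def gains(results):
    """Return {metric: (absolute_gain, relative_gain_percent)}."""
    out = {}
    for metric, (value, baseline) in results.items():
        absolute = value - baseline
        out[metric] = (absolute, 100.0 * absolute / baseline)
    return out


for metric, (absolute, relative) in gains(RESULTS).items():
    print(f"{metric:<20} {absolute:+.4f} ({relative:+.2f}%)")
```

Among the human ratings, CBT structure shows by far the largest gain, while relevance is nearly unchanged, which matches the summary's reading that the main benefit is in reply structure.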

Who Should Care

What To Try In 7 Days

Run the Hugging Face CBT-LLM on 50 representative queries to inspect CBT structure adherence.

Compare a baseline LLM vs CBT-LLM for relevance and structure on your domain data.

Prototype LoRA instruction-tuning with 1k CBT-style pairs to observe quick gains in reply format and tone control.
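
For the LoRA prototype, the CBT-style pairs first need to be serialized into an instruction-tuning format. Below is a minimal sketch that turns question-description-answer triples into JSONL records. The `instruction` / `input` / `output` field names follow the common Alpaca-style convention and are an assumption here, as is the instruction wording; adapt both to whatever your training framework expects.

```python
import json

# Hypothetical task directive; the wording is illustrative, not taken from the paper.
INSTRUCTION = (
    "Answer the mental-health question with a reply structured as: "
    "empathy, identify thought, challenge, strategy, encouragement."
)


def to_instruction_record(question: str, description: str, answer: str) -> dict:
    """Map one question-description-answer triple to an Alpaca-style record (assumed schema)."""
    return {
        "instruction": INSTRUCTION,
        "input": f"{question}\n{description}".strip(),
        "output": answer,
    }


def to_jsonl(triples) -> str:
    """Serialize triples to JSONL, keeping non-ASCII (e.g. Chinese) text readable."""
    return "\n".join(
        json.dumps(to_instruction_record(q, d, a), ensure_ascii=False)
        for q, d, a in triples
    )


if __name__ == "__main__":
    sample = [("我总是失眠怎么办？", "最近工作压力很大。", "听起来你最近很辛苦……")]
    print(to_jsonl(sample))
```

Passing `ensure_ascii=False` keeps Chinese text human-readable in the file, which makes spot-checking the 1k pairs much easier.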

Optimization Features

Infra Optimization

  • NVIDIA V100 (32 GB)

Model Optimization

  • LoRA
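
As a rough illustration of why LoRA is cheap to train: for a weight matrix of shape d_out × d_in, a rank-r adapter learns two small matrices with r × (d_out + d_in) parameters in total, and the original weights stay frozen. The sketch below computes that ratio. The hidden size of 4096 and rank of 8 are assumptions chosen for illustration; the rank used in the paper is not stated in this summary.

```python
def lora_added_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in a LoRA adapter: B is (d_out x rank), A is (rank x d_in)."""
    return rank * (d_out + d_in)


d = 4096  # assumed hidden size, typical of 7B-class models
r = 8     # assumed LoRA rank; not stated in the summary

full = d * d                       # parameters in one full d x d projection
added = lora_added_params(d, d, r)  # trainable parameters LoRA adds for it
print(f"full: {full:,}  lora: {added:,}  ratio: {100 * added / full:.2f}%")
```

Under these assumptions the adapter trains well under one percent of the parameters of each adapted projection, which is what makes fine-tuning a 7B model on a single 32 GB GPU plausible.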

System Optimization

  • Cosine learning rate scheduler

Training Optimization

  • Instruction tuning (task directives)
  • Gradient accumulation (steps=4)
  • 16-bit mixed precision
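
The scheduler and accumulation settings above can be sketched in plain Python. This is a generic cosine-decay formula without warmup, not necessarily the exact variant the authors used; the base learning rate, step count, and per-device batch size in the usage lines are assumptions, and only the accumulation value of 4 comes from the summary.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr at step 0 down to min_lr at total_steps (no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def effective_batch_size(per_device_batch: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Samples contributing to each optimizer update when gradients are accumulated."""
    return per_device_batch * accumulation_steps * num_devices


if __name__ == "__main__":
    # Assumed values for illustration only; accumulation_steps=4 is from the summary.
    for step in (0, 250, 500, 750, 1000):
        print(step, f"{cosine_lr(step, total_steps=1000, base_lr=1e-4):.2e}")
    print("effective batch:", effective_batch_size(per_device_batch=8, accumulation_steps=4))
```

Gradient accumulation trades wall-clock time for memory: with 4 accumulation steps, each optimizer update sees four times the per-device batch without holding it all in GPU memory at once.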

Reproducibility

Data Urls

  • PsyQA source: https://www.xinli001.com/qa
  • CBT QA dataset: to be publicly released for research (paper statement)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • No full ground-truth annotation for cognitive distortions (only 500 samples were annotated); generated labels may be inaccurate.
  • Single-turn Q&A format; not a multi-turn counseling dialogue.
  • Evaluation uses a small human sample (100 examples) and synthetic CBT answers.
  • Model and dataset intended for research only, not clinical use.

When Not To Use

  • Do not use as a standalone clinical diagnosis or therapy tool.
  • Avoid deploying without clinician oversight and safety controls.
  • Not suitable where verifiable, evidence‑based medical advice is required.

Failure Modes

  • False positives when identifying cognitive distortions (the prompt is sensitive but noisy: recall 0.93, accuracy 0.69).
  • Hallucinated or inappropriate therapy suggestions if prompts extrapolate beyond data.
  • User overwhelm from dense single-turn CBT responses that include many steps at once.

Core Entities

Models

  • LLaMA-Chinese-7B
  • Alpaca-Chinese-7B
  • Qwen-7B
  • Baichuan-7B
  • CBT-LLM (fine-tuned variants)

Metrics

  • BLEU
  • METEOR
  • chrF
  • BLEURT
  • BERTScore
  • Human relevance (0-2)
  • Human CBT structure (0-2)
  • Human helpfulness (0-2)

Datasets

  • PsyQA
  • CBT QA dataset (this paper, 22,327 entries)