Fine-tuned Chinese LLM that answers mental-health Q&A using a CBT (therapeutic) response structure

Overview

Decision Snapshot: Needs Validation

The paper shows a practical pipeline: CBT prompt → synthetic CBT dataset → LoRA fine-tuning, yielding modest metric and human-evaluation gains on a Chinese test set. Results are promising for prototyping but not clinical deployment; more labels and multi-turn clinical validation are needed.

Citations: 6

Evidence Strength: 0.60

Confidence: 0.70

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 20%

Production readiness: 30%

Novelty: 50%

Authors

Hongbin Na

Why It Matters For Business

Fine-tuning LLMs with therapy‑structured prompts creates more structured, CBT‑aligned replies for Chinese mental‑health Q&A; useful for building triage assistants and clinician support tools but not a replacement for professionals.

Who Should Care

Product Manager, ML Engineer, CTO, Founder

Summary TLDR

The paper builds a CBT-guided dataset and fine-tunes Chinese LLMs to produce Cognitive Behavioral Therapy (CBT) style responses for mental-health Q&A. The authors generated a 22,327-entry CBT QA dataset by prompting ChatGPT with a CBT prompt, then instruction-tuned models with LoRA on that data. Automatic metrics and a small human study show modest gains in structured, therapy-aligned answers. The system is not a clinical tool and has limits: single-turn replies, unverified distortion labels, and a risk of false positives in distortion detection.

Problem Statement

Public mental-health Q&A data are noisy and lack grounding in concrete therapy methods. That makes it hard for LLMs to produce answers that follow established CBT steps (empathize, identify thoughts, challenge, strategy, encouragement). The paper aims to create CBT-structured data and fine-tune models so outputs better match therapy practice while acknowledging clinical limits.

Main Contribution

A CBT-specific prompt that structures single-turn replies into five CBT steps (empathy, identify thought, challenge, strategy, encouragement).

A CBT QA dataset of 22,327 Chinese question-description-answer triples generated by ChatGPT using the CBT prompt (derived from PsyQA).
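
The five-step structure can be made concrete as a prompt template. The sketch below is a minimal illustration, not the paper's actual prompt: the step names come from this summary, while the instruction wording, the English phrasing (the paper targets Chinese), and the function name `build_cbt_prompt` are assumptions.

```python
# Hypothetical sketch: step names follow the summary above;
# the hint text for each step is illustrative, not the paper's prompt.
CBT_STEPS = [
    ("Empathy", "Acknowledge the person's feelings and validate their experience."),
    ("Identify thought", "Point out the key thought or possible cognitive distortion."),
    ("Challenge", "Gently question the thought and offer a more balanced view."),
    ("Strategy", "Suggest one or two practical coping strategies."),
    ("Encouragement", "Close with supportive, encouraging words."),
]


def build_cbt_prompt(question: str, description: str) -> str:
    """Build a single-turn prompt asking for a five-step CBT-style reply."""
    steps = "\n".join(
        f"{i}. {name}: {hint}" for i, (name, hint) in enumerate(CBT_STEPS, start=1)
    )
    return (
        "You are a supportive counselor using Cognitive Behavioral Therapy (CBT).\n"
        "Answer in one reply that follows these steps in order:\n"
        f"{steps}\n\n"
        f"Question: {question}\n"
        f"Description: {description}\n"
        "Answer:"
    )
```

Keeping the steps in a list makes it easy to reorder or reword them when comparing prompt variants.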

Key Findings

Created a CBT QA dataset with 22,327 entries.

Numbers: 22,327 entries (Table 1)

Practical Use: You can fine-tune Chinese LLMs on ~22k CBT-style pairs to teach therapy-shaped reply structure.

Evidence Ref: Table 1

Over half of generated CBT responses contain at least one cognitive distortion label.

Numbers: 12,136 / 22,327 = 54.4% (Table 1)

Practical Use: Training data embeds common therapy targets, so models learn to detect and address distorted thoughts in many cases.

Evidence Ref: Table 1
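
The share quoted for this finding can be rechecked from the two counts. This is a small arithmetic sketch; the variable names are illustrative, and the count of entries without a label is derived here rather than quoted from the paper.

```python
total_entries = 22_327      # CBT QA dataset size (Table 1)
with_distortion = 12_136    # entries with at least one distortion label (Table 1)

share = with_distortion / total_entries
without_distortion = total_entries - with_distortion  # derived, not from the paper

print(f"with a distortion label: {share:.1%}")
print(f"without a distortion label: {without_distortion}")
```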

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
BLEU (Baichuan-7B backbone)	0.2648	LLaMA-Chinese-7B 0.2412	+0.0236	CBT QA test split	Automatic eval (Table 4)	Table 4
METEOR (Baichuan-7B backbone)	0.4031	LLaMA-Chinese-7B 0.3758	+0.0273	CBT QA test split	Automatic eval (Table 4)	Table 4
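
The Delta column is the plain difference between Value and Baseline. The sketch below recomputes it and also derives a relative gain; the relative figures are not reported in this summary and are shown only as a derived convenience. The function name `deltas` is an assumption.

```python
# Rows copied from the table above: (metric, value, baseline).
rows = [
    ("BLEU", 0.2648, 0.2412),
    ("METEOR", 0.4031, 0.3758),
]


def deltas(rows):
    """Return {metric: (absolute delta, relative gain in percent)}."""
    out = {}
    for metric, value, baseline in rows:
        absolute = round(value - baseline, 4)
        relative = round(100 * (value - baseline) / baseline, 1)  # derived figure
        out[metric] = (absolute, relative)
    return out


for metric, (absolute, relative) in deltas(rows).items():
    print(f"{metric}: +{absolute:.4f} absolute, +{relative}% relative")
```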

What To Try In 7 Days

Run the Hugging Face CBT-LLM on 50 representative queries to inspect CBT structure adherence.

Compare a baseline LLM vs CBT-LLM for relevance and structure on your domain data.

Prototype LoRA instruction-tuning with 1k CBT-style pairs to observe quick gains in reply format and tone control.
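
For the LoRA prototype, the CBT-style pairs first need to be in an instruction-tuning format. The sketch below assumes the Alpaca-style `instruction` / `input` / `output` layout that LLaMA-Factory can read; the default instruction text and the helper names (`to_record`, `build_dataset`, `dump_dataset`) are hypothetical and not taken from the paper.

```python
import json

# Hypothetical default instruction; not the paper's task directive.
DEFAULT_INSTRUCTION = (
    "Answer the following mental-health question with a CBT-structured reply."
)


def to_record(question, description, answer, instruction=DEFAULT_INSTRUCTION):
    """Map one question-description-answer triple to an Alpaca-style record."""
    return {
        "instruction": instruction,
        "input": f"{question}\n{description}".strip(),
        "output": answer,
    }


def build_dataset(triples, limit=1000):
    """Convert up to `limit` triples into records for a quick prototype."""
    return [to_record(q, d, a) for q, d, a in triples[:limit]]


def dump_dataset(records):
    """Serialize as a JSON array; ensure_ascii=False keeps Chinese text readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)
```

The `limit` default of 1000 mirrors the 1k-pair prototype suggested above; raise it once the reply format looks right.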

Optimization Features

Infra Optimization

NVIDIA V100 32G

Model Optimization

LoRA
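
As a reminder of why LoRA is cheap to train: the base weight W stays frozen and only two low-rank factors A and B are learned, giving an output of W x + (alpha / r) * B (A x). The pure-Python sketch below illustrates that update and the parameter count. The layer size, rank, and alpha used in the example are illustrative assumptions; this summary does not report the paper's LoRA settings.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Frozen base output plus the scaled low-rank update.

    W: d_out x d_in (frozen), A: r x d_in, B: d_out x r (both trainable).
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_params(d_in, d_out, rank):
    """Number of trainable parameters in A and B for one adapted matrix."""
    return rank * (d_in + d_out)


# Illustrative sizes only (assumed, not from the paper).
d_in = d_out = 4096
rank = 8
fraction = lora_trainable_params(d_in, d_out, rank) / (d_in * d_out)
print(f"trainable fraction for one matrix: {fraction:.2%}")
```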

System Optimization

Cosine learning rate scheduler

Training Optimization

Instruction tuning (task directives)

Gradient accumulation (steps=4)

16-bit mixed precision
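
Two of the settings listed in this section are easy to express numerically. The sketch below shows a plain cosine learning-rate decay (no warmup) and the effective batch size under gradient accumulation. Only `steps=4` comes from the summary; the base learning rate, per-device batch size, and total step count in the example are hypothetical.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


def effective_batch_size(per_device_batch, accumulation_steps=4, num_devices=1):
    """Samples contributing to each optimizer update."""
    return per_device_batch * accumulation_steps * num_devices


# Hypothetical values for illustration only.
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step, total_steps=1000, base_lr=1e-4))
print("effective batch:", effective_batch_size(per_device_batch=4))
```

With accumulation, a single 32 GB GPU can use a small per-device batch while still updating on a larger effective batch.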

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/hiyouga/LLaMA-Factory (used for fine-tuning)

https://huggingface.co/Hongbin37/CBT-LLM (model release)

Data URLs

PsyQA source: https://www.xinli001.com/qa

CBT QA dataset: to be publicly released for research (paper statement)

Risks & Boundaries

Limitations

No ground-truth annotation for cognitive distortions; generated labels may be inaccurate.

Single-turn Q&A format; not a multi-turn counseling dialogue.

When Not To Use

Do not use as a standalone clinical diagnosis or therapy tool.

Avoid deploying without clinician oversight and safety controls.

Failure Modes

False positives when identifying cognitive distortions (prompt is sensitive but noisy).

Hallucinated or inappropriate therapy suggestions if prompts extrapolate beyond data.

Core Entities

Models

LLaMA-Chinese-7B

Alpaca-Chinese-7B

Qwen-7B

Baichuan-7B

CBT-LLM (fine-tuned variants)

Metrics

BLEU

METEOR

chrF

BLEURT

BERTScore

Human relevance (0-2)

Human CBT structure (0-2)

Human helpfulness (0-2)

Datasets

PsyQA

CBT QA dataset (this paper, 22,327 entries)
