Overview
Production Readiness
0.3
Novelty Score
0.5
Cost Impact Score
0.2
Citation Count
6
Why It Matters For Business
Fine-tuning LLMs on data generated with therapy-structured prompts yields more structured, CBT-aligned replies for Chinese mental-health Q&A. This is useful for building triage assistants and clinician-support tools, but it is not a replacement for professionals.
Summary TLDR
The paper builds a CBT-guided dataset and fine-tunes Chinese LLMs to produce Cognitive Behavioral Therapy (CBT) style responses for mental-health Q&A. The authors generated a 22,327-entry CBT QA dataset by prompting ChatGPT with a CBT-structured prompt, then instruction-tuned several models on that data using LoRA. Automatic metrics and a small human study show modest gains in structured, therapy-aligned answers. The system is not a clinical tool and has clear limits: single-turn replies, partial annotation, and a risk of false positives in distortion detection.
Problem Statement
Public mental-health Q&A data are noisy and lack grounding in concrete therapy methods. This makes it hard for LLMs to produce answers that follow established CBT steps (empathy, identify thought, challenge, strategy, encouragement). The paper aims to create CBT-structured data and fine-tune models so that outputs better match therapy practice, while acknowledging clinical limits.
Main Contribution
A CBT-specific prompt that structures single-turn replies into five CBT steps (empathy, identify thought, challenge, strategy, encouragement).
A CBT QA dataset of 22,327 Chinese question-description-answer triples generated by ChatGPT using the CBT prompt (derived from PsyQA).
Instruction fine-tuning (LoRA) of multiple Chinese LLM backbones to create CBT-LLM variants, plus a combined automatic and human evaluation showing modest improvements.
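The five-step structure named in the first contribution can be sketched as a prompt template. This is only an illustration: the paper's actual prompt wording is not reproduced in this summary, so the English instruction text and the names `CBT_STEPS` and `build_cbt_prompt` are assumptions.

```python
# Hypothetical sketch: the step names come from the summary above; the
# instruction wording and all identifiers are assumptions for illustration.
CBT_STEPS = [
    ("Empathy", "Acknowledge the person's feelings and validate them."),
    ("Identify thought", "Point out the key thought behind the distress."),
    ("Challenge", "Question whether that thought is accurate or helpful."),
    ("Strategy", "Suggest a concrete coping strategy."),
    ("Encouragement", "Close with supportive encouragement."),
]


def build_cbt_prompt(question: str, description: str) -> str:
    """Assemble a single-turn prompt that asks for all five CBT steps in order."""
    step_lines = [
        f"{i}. {name}: {hint}" for i, (name, hint) in enumerate(CBT_STEPS, start=1)
    ]
    return "\n".join(
        [
            "Answer the question below in one reply that follows these steps:",
            *step_lines,
            "",
            f"Question: {question}",
            f"Description: {description}",
        ]
    )
```

Keeping the steps in a list makes it easy to reorder or reword them when testing how sensitive the generated answers are to the prompt.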
Key Findings
Created a CBT QA dataset with 22,327 entries.
Over half of generated CBT responses contain at least one cognitive distortion label.
CBT prompt-based detection of distortions finds most real cases but produces false positives.
Fine-tuning with the Baichuan-7B backbone gave the top automatic scores among the tested backbones.
Human raters gave relatively strong scores for CBT structure and moderate scores for helpfulness (scale 0–2).
Results
BLEU (Baichuan-7B backbone)
METEOR (Baichuan-7B backbone)
BERTSCORE (Baichuan-7B backbone)
Human relevance (Baichuan-7B)
Human CBT structure (Baichuan-7B)
Human helpfulness (Baichuan-7B)
Who Should Care
What To Try In 7 Days
Run the CBT-LLM model from Hugging Face on 50 representative queries to inspect CBT structure adherence.
Compare a baseline LLM against CBT-LLM for relevance and structure on your domain data.
Prototype LoRA instruction-tuning with 1k CBT-style pairs to observe quick gains in reply format and tone control.
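For the LoRA prototype, the first practical step is turning question-description-answer triples into instruction-tuning records. The sketch below assumes an Alpaca-style `instruction`/`input`/`output` layout and a fixed English directive; neither is confirmed by the paper, and the function names are hypothetical.

```python
import json
import random

# Assumed task directive; the paper's actual instruction text is not given here.
DIRECTIVE = "Respond to the user's concern following the CBT steps."


def to_instruction_record(question: str, description: str, answer: str) -> dict:
    """Map one (question, description, answer) triple to an instruction record."""
    return {
        "instruction": DIRECTIVE,
        "input": f"{question}\n{description}".strip(),
        "output": answer,
    }


def sample_records(triples, k, seed=0):
    """Draw up to k triples reproducibly and convert them to records."""
    rng = random.Random(seed)
    chosen = rng.sample(triples, min(k, len(triples)))
    return [to_instruction_record(*t) for t in chosen]


def to_jsonl(records) -> str:
    """Serialize records as JSON Lines, keeping Chinese text unescaped."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Calling `sample_records(triples, 1000)` would give the roughly 1k pairs suggested above; the resulting JSONL can then be fed to whichever fine-tuning toolkit is in use.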
Optimization Features
Infra Optimization
- NVIDIA V100 (32 GB)
Model Optimization
- LoRA
System Optimization
- Cosine learning rate scheduler
Training Optimization
- Instruction tuning (task directives)
- Gradient accumulation (steps=4)
- 16-bit mixed precision
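Two of the settings above reduce to simple arithmetic, sketched here in plain Python. The cosine schedule is shown without warmup, and the base learning rate, step counts, and per-device batch size are placeholder assumptions; only the accumulation value of 4 comes from the list above.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr at step 0 to min_lr at total_steps (no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def effective_batch_size(per_device_batch: int, accumulation_steps: int = 4,
                         num_devices: int = 1) -> int:
    """Number of examples contributing to each optimizer update."""
    return per_device_batch * accumulation_steps * num_devices
```

With gradient accumulation at 4 steps, a hypothetical per-device batch of 8 on a single GPU behaves like a batch of 32 for each optimizer update, which is one way to fit 7B-scale LoRA training into 32 GB of memory.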
Reproducibility
Code Urls
Data Urls
- PsyQA source: https://www.xinli001.com/qa
- CBT QA dataset: to be publicly released for research (paper statement)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- No ground-truth annotation for cognitive distortions; generated labels may be inaccurate.
- Single-turn Q&A format; not a multi-turn counseling dialogue.
- Evaluation uses a small human sample (100 examples) and synthetic CBT answers.
- Model and dataset intended for research only, not clinical use.
When Not To Use
- Do not use as a standalone clinical diagnosis or therapy tool.
- Avoid deploying without clinician oversight and safety controls.
- Not suitable where verifiable, evidence‑based medical advice is required.
Failure Modes
- False positives when identifying cognitive distortions (the prompt is sensitive but imprecise).
- Hallucinated or inappropriate therapy suggestions if prompts extrapolate beyond data.
- User overwhelm from dense single-turn CBT responses that include many steps at once.
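The false-positive failure mode can be quantified as precision and recall on a small hand-labeled sample. The sketch below assumes such a sample exists as two parallel lists of booleans (distortion predicted, distortion actually present); the paper itself reports no ground-truth annotation, so the data and the function name are hypothetical.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for parallel lists of boolean labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # correctly flagged
    fp = sum(1 for p, a in pairs if p and not a)    # flagged but not present
    fn = sum(1 for p, a in pairs if not p and a)    # present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

A detector that finds most real cases but over-flags would show high recall with noticeably lower precision, which is the pattern described in the first failure mode.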
Core Entities
Models
- LLaMA-Chinese-7B
- Alpaca-Chinese-7B
- Qwen-7B
- Baichuan-7B
- CBT-LLM (fine-tuned variants)
Metrics
- BLEU
- METEOR
- CHRF
- BLEURT
- BERTSCORE
- Human relevance (0–2)
- Human CBT structure (0–2)
- Human helpfulness (0–2)
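To give a feel for what the overlap metrics measure, here is a minimal sketch of clipped n-gram precision, the building block of BLEU, computed over characters (a common choice for unsegmented Chinese text). It is a simplification and not the paper's evaluation script: full BLEU combines several n-gram orders and a brevity penalty, and the function names are assumptions.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def clipped_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Share of candidate n-grams also in the reference, with counts clipped."""
    cand = char_ngrams(candidate, n)
    ref = char_ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total
```

Because the reference answers here are themselves ChatGPT-generated, a high overlap score indicates similarity to the synthetic CBT answers rather than clinical quality, which is why the human ratings are reported alongside.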
Datasets
- PsyQA
- CBT QA dataset (this paper, 22,327 entries)

