Psy-LLM: fine-tuned Chinese LLM for scalable online mental-health Q&A

July 22, 2023 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.35

Cost Impact Score

0.6

Citation Count

34

Authors

Tin Lai, Yukun Shi, Zicong Du, Jiajie Wu, Ken Fu, Yichao Dou, Ziqi Wang

Links

Abstract / PDF

Why It Matters For Business

An LLM-based Chinese Q&A assistant can reduce counsellor load, provide fast triage and scale service availability cheaply, but should be used under human supervision.

Summary TLDR

Psy-LLM is a prototype system that fine-tunes Chinese LLMs (PanGu 350M and WenZhong variants) on domain Q&A and large web-crawled psychological text to provide online mental-health question-answering. The team assembled ~400k crawled samples plus PsyQA (22k questions, 56k answers), fine-tuned the models, deployed a web front end and collected human ratings (6 psychology students, 200 Q-A). PanGu 350M outperformed WenZhong on perplexity (34.56 vs 38.40), ROUGE-L (28.18 vs 23.56) and human scores. The system is a practical assistive tool for screening and immediate support but is not a replacement for trained counsellors and faces data, ethical, and deployment limits.

Problem Statement

There are far too few Chinese-speaking mental-health professionals to meet growing demand. The paper aims to build an AI-assisted Q&A system that gives timely, domain-aware responses and supports counsellors or users when human help is unavailable.

Main Contribution

Defined Psy-LLM: a pipeline to fine-tune Chinese LLMs for psychological Q&A and deploy a web front end

Collected and cleaned a domain corpus (~400k samples) and used PsyQA (22k questions, 56k answers) for fine-tuning

Fine-tuned and compared PanGu 350M vs WenZhong variants and reported intrinsic metrics and human evaluation results

Built a deployed prototype (React front end, EC2 backend) that logs user ratings for iterative improvement

Key Findings

PanGu 350M produced lower perplexity than WenZhong on the evaluation data.

Numbers: Perplexity of 34.56 (PanGu) vs 38.40 (WenZhong)

Generated answers from PanGu were judged better by humans across four quality axes.

Numbers: Helpfulness in the AI-only comparison was 3.87 (PanGu) vs 3.56 (WenZhong) on a 1–5 scale

Ground-truth human answers score substantially higher than model outputs.

Numbers: Ground-truth Helpfulness 4.52 vs PanGu 3.54 (Table 7)

Training data and scale: a large mixed dataset assembled for domain tuning.

Numbers: ~400k crawled samples; PsyQA 22k questions / 56k answers; 2.85 GB psychology corpus

Prototype was deployed as a web service and supports real-time interaction.

Numbers: Users can get responses 'within seconds' via a Flask backend on EC2 and a React front end

Results

Perplexity

Value: PanGu 34.56; WenZhong 38.40

ROUGE-L

Value: PanGu 28.18; WenZhong 23.56

Distinct-1 / Distinct-2

Value: PanGu 4.57 / 12.74; WenZhong 3.55 / 9.67

Human Helpfulness (AI-only comparison)

Value: PanGu 3.87; WenZhong 3.56 (1–5 scale)

Human Helpfulness (comparison with ground truth)

Value: Ground truth 4.52; PanGu 3.54; WenZhong 3.45

Baseline: Ground-truth answers
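
For readers who want to compute the intrinsic metrics on their own model outputs, here is a minimal sketch of how perplexity and Distinct-n are commonly defined. It is not the paper's evaluation code: the function names, the toy token lists, and the assumption that the model exposes natural-log token probabilities are all illustrative. The reported Distinct values exceed 1, so they are presumably scaled; this sketch returns the raw ratio between 0 and 1.

```python
import math
from typing import List, Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token).

    `token_log_probs` holds the natural-log probability the model assigned
    to each reference token (an assumed input format).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


def distinct_n(responses: List[List[str]], n: int) -> float:
    """Distinct-n = unique n-grams / total n-grams over all responses."""
    unique = set()
    total = 0
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


# Toy, hypothetical data (not taken from the paper).
log_probs = [math.log(0.25)] * 4
responses = [["我", "很", "好"], ["我", "很", "累"]]

print(perplexity(log_probs))
print(distinct_n(responses, 1))
print(distinct_n(responses, 2))
```

Lower perplexity means the model found the reference text less surprising, while higher Distinct-n means less repetitive output, which is why the two are usually reported together.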

Who Should Care

What To Try In 7 Days

Fine-tune a mid-size Chinese LLM on a curated subset of PsyQA to validate domain answers

Deploy a minimal web front end (React + Flask on EC2) and collect user ratings

Run a small human evaluation (20–50 Q-A pairs) with domain experts to detect major failure modes
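
The human-evaluation step above could be summarised with something like the following sketch. It assumes each rating is a record with a system label and 1–5 scores on the four axes named in this summary (Helpfulness, Fluency, Relevance, Logic). The record layout, the system names, and the "any axis at 2 or below" flagging threshold are hypothetical choices, not the paper's protocol.

```python
from collections import defaultdict
from statistics import mean

AXES = ("helpfulness", "fluency", "relevance", "logic")

# Hypothetical ratings on a 1-5 scale; not data from the paper.
ratings = [
    {"system": "model_a", "helpfulness": 4, "fluency": 5, "relevance": 4, "logic": 3},
    {"system": "model_a", "helpfulness": 2, "fluency": 4, "relevance": 3, "logic": 3},
    {"system": "reference", "helpfulness": 5, "fluency": 5, "relevance": 5, "logic": 4},
    {"system": "reference", "helpfulness": 4, "fluency": 5, "relevance": 4, "logic": 4},
]


def mean_scores(rows):
    """Average each axis per system."""
    by_system = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for axis in AXES:
            by_system[row["system"]][axis].append(row[axis])
    return {
        system: {axis: mean(scores[axis]) for axis in AXES}
        for system, scores in by_system.items()
    }


def low_rated(rows, threshold=2):
    """Return rows where any axis is at or below the (assumed) threshold."""
    return [row for row in rows if min(row[axis] for axis in AXES) <= threshold]


summary = mean_scores(ratings)
flagged = low_rated(ratings)
print(summary)
print(len(flagged))
```

Reading the flagged rows by hand is usually more informative than the averages at this sample size, since 20–50 pairs are too few for small score differences to be meaningful.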

Optimization Features

Infra Optimization

  • Used a single V100 (PanGu) and an RTX 3060 (WenZhong) due to resource limits

System Optimization

  • Distributed crawler used for data collection

Training Optimization

  • Two-stage training: training on the general crawled corpus first, then fine-tuning on PsyQA
  • Early stopping used to avoid overfitting
  • Batch size 8; trained ~100k iterations (PanGu)
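
The summary does not say how early stopping was configured, so the following is only a generic sketch of the technique: stop once validation loss has failed to improve for a fixed number of consecutive checks. The `patience` and `min_delta` parameters, the class name, and the loss values are assumptions for illustration.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` checks."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_checks = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember it and reset the counter.
            self.best_loss = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


# Hypothetical validation losses, one per evaluation checkpoint.
val_losses = [3.9, 3.7, 3.6, 3.65, 3.62, 3.61, 3.5]

stopper = EarlyStopping(patience=3)
stopped_at = None
for step, loss in enumerate(val_losses):
    if stopper.should_stop(loss):
        stopped_at = step
        break

print(stopped_at, stopper.best_loss)
```

In this toy run the loop halts before the final, better checkpoint is ever seen, which illustrates the usual trade-off: a small patience saves compute but can stop too early on a noisy loss curve.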

Inference Optimization

  • Deployed as Flask API on EC2 for low-latency responses
  • Front end makes asynchronous calls to mask model latency
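
A minimal sketch of what such a Flask backend might look like is shown below. The route paths, the JSON field names, the in-memory rating store, and the `generate_answer` stub are all hypothetical; the stub stands in for the fine-tuned model, whose serving code is not described here. The example exercises the app through Flask's built-in test client rather than starting a server.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory stand-in for a real datastore (assumption for illustration).
RATINGS = []


def generate_answer(question: str) -> str:
    """Hypothetical stub; a real deployment would call the fine-tuned model."""
    return f"[model output for: {question}]"


@app.route("/api/answer", methods=["POST"])
def answer():
    payload = request.get_json(silent=True) or {}
    question = str(payload.get("question", "")).strip()
    if not question:
        return jsonify({"error": "question is required"}), 400
    return jsonify({"question": question, "answer": generate_answer(question)})


@app.route("/api/rating", methods=["POST"])
def rating():
    payload = request.get_json(silent=True) or {}
    score = payload.get("score")
    # Reject booleans explicitly, since bool is a subclass of int in Python.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        return jsonify({"error": "score must be an integer from 1 to 5"}), 400
    RATINGS.append({"question": payload.get("question"), "score": score})
    return jsonify({"stored": len(RATINGS)})


# Exercise the endpoints without binding to a network port.
with app.test_client() as client:
    resp = client.post("/api/answer", json={"question": "最近总是失眠怎么办？"})
    print(resp.status_code, resp.get_json())
```

Keeping generation behind a single function makes it easy to swap the stub for a real model call, and the front end can call `/api/answer` asynchronously while showing a loading state.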

Reproducibility

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Data quality: web-crawled data is noisy and only partially reviewed by experts
  • No nonverbal cues: text-only models miss key counselling signals
  • Not a replacement for trained counsellors; model answers trail ground-truth human answers by about 1 point in Helpfulness on the 1–5 scale (4.52 vs 3.54)
  • Compute and resource constraints limited model scale and tuning
  • Autoregressive LLMs have limited bidirectional context and suffer from exposure bias
  • No public code or dataset release reported

When Not To Use

  • As sole responder for suicidal or high-risk emergencies
  • As a clinical diagnostic or legal decision tool without clinician oversight
  • In contexts requiring strict medical-grade accuracy or documentation

Failure Modes

  • Hallucinations or irrelevant outputs that fail to answer the question
  • Overfitting to noisy web data leading to incoherent logic
  • Underperformance due to limited compute/tuning
  • Privacy or security gaps if deployment is misconfigured

Core Entities

Models

  • PanGu 350M
  • PanGu (other sizes mentioned)
  • WenZhong 3.5B
  • WenZhong-GPT2-110M

Metrics

  • Perplexity
  • ROUGE-L
  • Distinct-1
  • Distinct-2
  • Human ratings: Helpfulness, Fluency, Relevance, Logic
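
Of the metrics listed, ROUGE-L is the least obvious to compute, so a minimal sketch follows. It scores a candidate against one reference using the longest common subsequence (LCS) and reports the balanced F1 form. Treating each Chinese character as a token and weighting precision and recall equally are assumptions; the paper's exact tokenisation and ROUGE configuration are not stated in this summary.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: Sequence[str], reference: Sequence[str]) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences; each character is treated as one token.
candidate = list("多和朋友聊聊")
reference = list("试着和朋友聊一聊")

print(rouge_l_f1(candidate, reference))
```

Because LCS rewards tokens that appear in the same order without requiring them to be adjacent, ROUGE-L tolerates small insertions such as the extra "一" in the reference above.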

Datasets

  • PsyQA (22k Q, 56k A)
  • Crawled corpus from Tianya
  • Crawled corpus from Zhihu
  • Crawled corpus from Yixinli