Psy-LLM: fine-tuned Chinese LLM for scalable online mental-health Q&A

July 22, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The system is a functioning prototype with deployable UI and human evaluation, but limited by data quality, compute resources and ethical constraints; suitable as an assistive tool under supervision.

Citations: 34

Evidence Strength: 0.60

Confidence: 0.70

Risk Signals: 13

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 35%

Authors

Tin Lai, Yukun Shi, Zicong Du, Jiajie Wu, Ken Fu, Yichao Dou, Ziqi Wang

Why It Matters For Business

An LLM-based Chinese Q&A assistant can reduce counsellor load, provide fast triage and scale service availability cheaply, but should be used under human supervision.

Summary TLDR

Psy-LLM is a prototype system that fine-tunes Chinese LLMs (PanGu 350M and WenZhong variants) on domain Q&A and large web-crawled psychological text to provide online mental-health question-answering. The team assembled ~400k crawled samples plus PsyQA (22k questions, 56k answers), fine-tuned the models, deployed a web front end and collected human ratings (6 psychology students, 200 Q-A). PanGu 350M outperformed WenZhong on perplexity (34.56 vs 38.40), ROUGE-L (28.18 vs 23.56) and human scores. The system is a practical assistive tool for screening and immediate support but is not a replacement for trained counsellors and faces data, ethical, and deployment limits.

Problem Statement

There are far too few Chinese-speaking mental-health professionals for growing demand. The paper aims to build an AI-assisted Q&A system that gives timely, domain-aware responses and supports counsellors or users when human help is unavailable.

Main Contribution

Defined Psy-LLM: a pipeline to fine-tune Chinese LLMs for psychological Q&A and deploy a web front end

Collected and cleaned a domain corpus (~400k samples) and used PsyQA (22k questions, 56k answers) for fine-tuning

Key Findings

PanGu 350M produced lower perplexity than WenZhong on the evaluation data.

Numbers: Perplexity 34.56 (PanGu) vs 38.40 (WenZhong)

Practical Use: Prefer PanGu 350M for Chinese psychological Q&A when compute allows; it predicts tokens better on the authors' data.

Evidence Ref: Table 5 (Intrinsic evaluation)
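For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood per token, so a lower value means the model assigns higher probability to the held-out text. The sketch below shows the arithmetic only. The function name and the input log-probabilities are illustrative assumptions, and the paper's tokenization and evaluation split are not described in this summary.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative natural-log probability per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical inputs, not values from the paper: a model that gives every
# token probability 0.25 is as uncertain as a uniform 4-way choice.
uniform_guess = [math.log(0.25)] * 8
print(perplexity(uniform_guess))
```

Read this way, the reported 34.56 vs 38.40 says PanGu was, on average, somewhat less uncertain per token than WenZhong on the authors' evaluation data.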

Generated answers from PanGu were judged better by humans across four quality axes.

Numbers: Human-rated helpfulness 3.87 (PanGu) vs 3.56 (WenZhong) on a 1–5 scale

Practical Use: Raters scored fine-tuned PanGu as more helpful and fluent than WenZhong; use PanGu-style models for initial deployment.

Evidence Ref: Table 6 (Human evaluation)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Perplexity | PanGu 34.56; WenZhong 38.40 | - | PanGu −3.84 vs WenZhong | Evaluation set from PsyQA / crawled data (authors) | Table 5 intrinsic evaluation | Table 5
ROUGE-L | PanGu 28.18; WenZhong 23.56 | - | PanGu +4.62 | Evaluation set from PsyQA / crawled data (authors) | Table 5 intrinsic evaluation | Table 5
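ROUGE-L scores a generated answer by the longest common subsequence (LCS) it shares with a reference answer. The sketch below is a minimal version, not the authors' evaluation script. It assumes character-level tokens (a common choice for Chinese) and the balanced F1 form of the score; the paper's tokenizer and F-measure weighting are not given in this summary. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L as the harmonic mean of LCS precision and LCS recall (0 to 1)."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical strings; each character is treated as one token.
print(rouge_l_f1("我最近睡不好", "我最近总是睡不好"))
```

The function returns a value between 0 and 1. The table's 28.18 and 23.56 look like the same quantity scaled by 100, although the summary does not state the scale.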

What To Try In 7 Days

Fine-tune a mid-size Chinese LLM on a curated subset of PsyQA to validate domain answers

Deploy a minimal web front end (React + Flask on EC2) and collect user ratings

Run a small human-evaluation (20–50 Q-A) with domain experts to detect major failure modes
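For the small human evaluation in the last step, one simple way to compare models is to average each rater's scores per model and per quality axis. The sketch below assumes the four axes named in this summary (Helpfulness, Fluency, Relevance, Logic) and a 1 to 5 scale. The record layout, field names, and sample scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean

AXES = ("helpfulness", "fluency", "relevance", "logic")


def mean_scores(ratings):
    """Average 1-5 ratings per model and axis.

    `ratings` is an iterable of dicts with a 'model' key and one score per axis.
    """
    by_model = defaultdict(lambda: defaultdict(list))
    for record in ratings:
        for axis in AXES:
            score = record[axis]
            if not 1 <= score <= 5:
                raise ValueError(f"{axis} score {score} is outside the 1-5 scale")
            by_model[record["model"]][axis].append(score)
    return {
        model: {axis: mean(scores) for axis, scores in axes.items()}
        for model, axes in by_model.items()
    }


# Hypothetical ratings, not data from the paper.
sample = [
    {"model": "PanGu", "helpfulness": 4, "fluency": 5, "relevance": 4, "logic": 3},
    {"model": "PanGu", "helpfulness": 3, "fluency": 4, "relevance": 4, "logic": 4},
    {"model": "WenZhong", "helpfulness": 3, "fluency": 4, "relevance": 3, "logic": 3},
]
for model, axes in mean_scores(sample).items():
    print(model, axes)
```

With only 20 to 50 rated pairs, the per-axis means are best read as a way to spot obvious failure modes rather than as a statistically firm ranking.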

Optimization Features

Infra Optimization
Used a single V100 (PanGu) and an RTX 3060 (WenZhong) due to resource limits
System Optimization
Distributed crawler used for data collection
Training Optimization
Two-stage training: general crawl pretrain then PsyQA fine-tune
Early stopping used to avoid overfitting
Batch size 8; trained ~100k iterations (PanGu)
Inference Optimization
Deployed as Flask API on EC2 for low-latency responses
Frontend asynchronous calls to mask model latency
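The summary notes that early stopping was used during training but gives no criterion. A common form, shown here as an assumption rather than the authors' setup, stops training once the validation loss has failed to improve for a fixed number of consecutive checks. The class name, the `patience` and `min_delta` parameters, and the loss values are all hypothetical.

```python
class EarlyStopper:
    """Signal a stop after `patience` consecutive checks without improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_checks = 0

    def step(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


def stop_index(losses, patience=3):
    """Index of the check at which training would stop, or None if it never does."""
    stopper = EarlyStopper(patience=patience)
    for i, loss in enumerate(losses):
        if stopper.step(loss):
            return i
    return None


# Hypothetical validation losses: improvement stalls after the third check.
losses = [3.90, 3.70, 3.60, 3.61, 3.65, 3.62, 3.50]
print(stop_index(losses))
```

In practice the checkpoint with the best validation loss would be kept, so stopping late costs compute but not model quality.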

Reproducibility

Code Available: No
Data Available: No
Open Source Status: No
License: Unknown

Risks & Boundaries

Limitations

Data quality: web-crawled data is noisy and only partially reviewed by experts

No nonverbal cues: text-only models miss key counselling signals

When Not To Use

As sole responder for suicidal or high-risk emergencies

As a clinical diagnostic or legal decision tool without clinician oversight

Failure Modes

Hallucinations or irrelevant outputs that fail to answer the question

Overfitting to noisy web data leading to incoherent logic

Core Entities

Models

PanGu 350M
PanGu (other sizes mentioned)
WenZhong 3.5B
WenZhong-GPT2-110M

Metrics

Perplexity
ROUGE-L
Distinct-1
Distinct-2
Human ratings: Helpfulness, Fluency, Relevance, Logic
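Distinct-1 and Distinct-2 measure how varied a set of generated answers is: the number of unique unigrams (or bigrams) divided by the total number produced. A minimal sketch follows, assuming character-level tokens and pooling n-grams across all responses. The paper's exact tokenization is not stated in this summary, and the function name and sample strings are hypothetical.

```python
def distinct_n(texts, n):
    """Ratio of unique n-grams to total n-grams across all token sequences."""
    total = 0
    unique = set()
    for tokens in texts:
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0


# Hypothetical responses; each character is treated as one token.
responses = ["你可以试试深呼吸", "你可以试试写日记"]
print(distinct_n(responses, 1), distinct_n(responses, 2))
```

Values near 1 indicate little repetition, while values near 0 suggest the model keeps producing the same phrases.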

Datasets

PsyQA (22k Q, 56k A)
Crawled corpus from Tianya
Crawled corpus from Zhihu
Crawled corpus from Yixinli