Psy-LLM: fine-tuned Chinese LLM for scalable online mental-health Q&A

Overview

Decision Snapshot: Needs Validation

The system is a functioning prototype with a deployable UI and a human evaluation, but it is limited by data quality, compute resources and ethical constraints. It is suitable as an assistive tool under supervision.

Citations: 34

Evidence Strength: 0.60

Confidence: 0.70

Risk Signals: 13

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 35%

Authors

Tin Lai, Yukun Shi, Zicong Du, Jiajie Wu, Ken Fu, Yichao Dou, Ziqi Wang

Links

Abstract / PDF

Why It Matters For Business

An LLM-based Chinese Q&A assistant can reduce counsellor load, provide fast triage and scale service availability cheaply, but should be used under human supervision.

Who Should Care

Product Manager, ML Engineer, CTO, Founder, Data Scientist

Summary TLDR

Psy-LLM is a prototype system that fine-tunes Chinese LLMs (PanGu 350M and WenZhong variants) on domain Q&A data and a large web-crawled psychological corpus to provide online mental-health question answering. The team assembled ~400k crawled samples plus PsyQA (22k questions, 56k answers), fine-tuned the models, deployed a web front end and collected human ratings (6 psychology students, 200 Q-A pairs). PanGu 350M outperformed WenZhong on perplexity (34.56 vs 38.40, lower is better), ROUGE-L (28.18 vs 23.56) and human scores. The system is a practical assistive tool for screening and immediate support, but it is not a replacement for trained counsellors and faces data, ethical, and deployment limits.

Problem Statement

There are far too few Chinese-speaking mental-health professionals for growing demand. The paper aims to build an AI-assisted Q&A system that gives timely, domain-aware responses and supports counsellors or users when human help is unavailable.

Main Contribution

Defined Psy-LLM: a pipeline to fine-tune Chinese LLMs for psychological Q&A and deploy a web front end

Collected and cleaned a domain corpus (~400k samples) and used PsyQA (22k questions, 56k answers) for fine-tuning

Key Findings

PanGu 350M produced lower perplexity than WenZhong on the evaluation data.

Numbers: Perplexity of 34.56 (PanGu) vs 38.40 (WenZhong)

Practical Use: Prefer PanGu 350M for Chinese psych Q&A when compute allows; it predicts tokens better on the authors' data.

Evidence Ref: Table 5 (Intrinsic evaluation)
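For readers unfamiliar with the metric, the following is a minimal sketch of the standard perplexity definition (the exponential of the mean negative log-likelihood per token). The summary does not say how the authors computed their figures, so the function name and the example log-probabilities here are illustrative assumptions, not the paper's code.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative natural-log likelihood) over the tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical log-probabilities for a 4-token answer (illustrative only):
# a model that assigns probability 0.25 to every token.
example = [math.log(0.25)] * 4
print(perplexity(example))
```

A lower value means the model assigns higher probability to the held-out text, which is why the smaller PanGu figure is the better one.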

Generated answers from PanGu were judged better by humans across four quality axes.

Numbers: Human ratings (Helpfulness): PanGu 3.87 vs WenZhong 3.56 (scale 1–5)

Practical Use: Fine-tuning PanGu improved perceived usefulness and fluency versus WenZhong; use PanGu-style models for initial deployment.

Evidence Ref: Table 6 (Human evaluation)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Perplexity	PanGu 34.56; WenZhong 38.40	—	PanGu −3.84 vs WenZhong	Evaluation set from PsyQA / crawled data (authors)	Table 5 intrinsic evaluation	Table 5
ROUGE-L	PanGu 28.18; WenZhong 23.56	—	PanGu +4.62	Evaluation set from PsyQA / crawled data (authors)	Table 5 intrinsic evaluation	Table 5
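As a reference for the ROUGE-L row, here is a minimal sketch of an LCS-based ROUGE-L F-measure. The summary does not state which ROUGE-L variant, tokenizer, or precision/recall weighting the authors used; the character-level tokens and the balanced F1 below are assumptions for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Balanced F-measure over LCS-based precision and recall."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy example with character tokens (assumed tokenization, not the paper's).
print(rouge_l_f1(list("abcd"), list("abxd")))
```

Scores from this sketch lie in the range 0 to 1; published tables often multiply by 100.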

What To Try In 7 Days

Fine-tune a mid-size Chinese LLM on a curated subset of PsyQA to validate domain answers

Deploy a minimal web front end (React + Flask on EC2) and collect user ratings

Run a small human evaluation (20–50 Q-A pairs) with domain experts to detect major failure modes
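For the human-evaluation step, a minimal sketch of how per-model, per-axis mean ratings could be tallied is shown below. The axis names and the 1 to 5 scale come from the summary; the record layout, the function name, and the sample ratings are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_ratings(ratings):
    """Average 1-5 scores per (model, axis) pair.

    ratings: iterable of (model, axis, score) tuples.
    """
    buckets = defaultdict(list)
    for model, axis, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        buckets[(model, axis)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Made-up ratings, for illustration only.
sample = [
    ("model_a", "Helpfulness", 4),
    ("model_a", "Helpfulness", 3),
    ("model_b", "Helpfulness", 2),
]
print(mean_ratings(sample))
```

With only 20 to 50 items the means will be noisy, so reading the low-scored answers individually is likely more informative than the averages.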

Optimization Features

Infra Optimization

Used single V100 (PanGu) and RTX3060 (WenZhong) due to resource limits

System Optimization

Distributed crawler used for data collection

Training Optimization

Two-stage training: general crawl pretrain then PsyQA fine-tune

Early stopping used to avoid overfitting

Batch size 8; trained ~100k iterations (PanGu)
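The early-stopping idea mentioned above can be sketched as follows. The summary does not give the authors' stopping criterion, so the patience value, the `min_delta` threshold, and the class and function names are all assumptions chosen for illustration.

```python
class EarlyStopping:
    """Signal a stop once validation loss fails to improve `patience` times in a row."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # assumed value, not from the paper
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best = float("inf")
        self.bad_checks = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


def stop_index(losses, patience=3):
    """Index of the validation check at which training would stop, or None."""
    stopper = EarlyStopping(patience=patience)
    for i, loss in enumerate(losses):
        if stopper.should_stop(loss):
            return i
    return None


# Hypothetical validation losses: improvement stalls after the second check.
print(stop_index([3.0, 2.5, 2.6, 2.7, 2.8, 2.0]))
```

In a real training loop the model checkpoint with the best validation loss would typically be saved and restored after stopping.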

Inference Optimization

Deployed as Flask API on EC2 for low-latency responses

Frontend asynchronous calls to mask model latency

Reproducibility

Code Available: No

Data Available: No

Open Source Status: No

License: Unknown

Risks & Boundaries

Limitations

Data quality: web-crawled data is noisy and only partially reviewed by experts

No nonverbal cues: text-only models miss key counselling signals

When Not To Use

As sole responder for suicidal or high-risk emergencies

As a clinical diagnostic or legal decision tool without clinician oversight

Failure Modes

Hallucinations or irrelevant outputs that fail to answer the question

Overfitting to noisy web data leading to incoherent logic

Core Entities

Models

PanGu 350M

PanGu (other sizes mentioned)

WenZhong 3.5B

WenZhong-GPT2-110M

Metrics

Perplexity

ROUGE-L

Distinct-1

Distinct-2

Human ratings: Helpfulness, Fluency, Relevance, Logic
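Distinct-1 and Distinct-2 are commonly defined as the ratio of unique unigrams or bigrams to the total number of unigrams or bigrams in the generated text. A minimal sketch under that common definition follows; whether the authors pooled n-grams across all responses or averaged per response is not stated in this summary, and pooling is assumed here.

```python
def distinct_n(texts, n):
    """Unique n-grams divided by total n-grams, pooled over all texts.

    texts: iterable of token sequences (lists of tokens or plain strings).
    """
    total = 0
    unique = set()
    for tokens in texts:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


# Toy example with character tokens (assumed tokenization).
responses = [list("abab")]
print(distinct_n(responses, 1))
print(distinct_n(responses, 2))
```

Higher values indicate less repetitive output, which is why the metric is often reported alongside perplexity and ROUGE-L for generation tasks.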

Datasets

PsyQA (22k Q, 56k A)

Crawled corpus from Tianya

Crawled corpus from Zhihu

Crawled corpus from Yixinli
