Overview
Production Readiness
0.4
Novelty Score
0.35
Cost Impact Score
0.6
Citation Count
34
Why It Matters For Business
An LLM-based Chinese Q&A assistant can reduce counsellor load, provide fast triage and scale service availability cheaply, but should be used under human supervision.
Summary TLDR
Psy-LLM is a prototype system that fine-tunes Chinese LLMs (PanGu 350M and WenZhong variants) on a domain Q&A dataset and a large web-crawled corpus of psychological text to provide online mental-health question answering. The team assembled ~400k crawled samples plus PsyQA (22k questions, 56k answers), fine-tuned the models, deployed a web front end, and collected human ratings from six psychology students on 200 Q&A pairs. PanGu 350M outperformed WenZhong on perplexity (34.56 vs 38.40, lower is better), ROUGE-L (28.18 vs 23.56, higher is better), and human scores. The system is a practical assistive tool for screening and immediate support, but it is not a replacement for trained counsellors and faces data, ethical, and deployment limits.
Problem Statement
There are far too few Chinese-speaking mental-health professionals to meet growing demand. The paper aims to build an AI-assisted Q&A system that gives timely, domain-aware responses and supports counsellors or users when human help is unavailable.
Main Contribution
Defined Psy-LLM: a pipeline to fine-tune Chinese LLMs for psychological Q&A and deploy a web front end
Collected and cleaned a domain corpus (~400k samples) and used PsyQA (22k questions, 56k answers) for fine-tuning
Fine-tuned and compared PanGu 350M vs WenZhong variants and reported intrinsic and human evaluation
Built a deployed prototype (React front end, EC2 backend) that logs user ratings for iterative improvement
Key Findings
PanGu 350M produced lower perplexity than WenZhong on the evaluation data.
Generated answers from PanGu were judged better by humans across four quality axes.
Ground-truth human answers score substantially higher than model outputs.
Training data and scale: ~400k web-crawled samples were combined with PsyQA (22k questions, 56k answers) for domain tuning.
Prototype was deployed as a web service and supports real-time interaction.
Results
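Perplexity: 34.56 (PanGu 350M) vs 38.40 (WenZhong); lower is better
ROUGE-L: 28.18 (PanGu 350M) vs 23.56 (WenZhong); higher is better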
Distinct-1 / Distinct-2
Human Helpfulness (AI-only comparison)
Human vs Ground Truth (Helpfulness)
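For readers unfamiliar with ROUGE-L, the following is a minimal sketch of the standard longest-common-subsequence (LCS) formulation. It assumes character-level tokens, a common choice for Chinese text; the summary does not say which tokenisation or ROUGE implementation the authors used, so the function names and the `beta` default here are illustrative only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-score between two strings, in the range 0..1.

    Assumption: each character is one token. Swap in a word
    segmenter if word-level scoring is wanted.
    """
    cand, ref = list(candidate), list(reference)
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```

Scores reported on a 0 to 100 scale, as in this summary, correspond to this value multiplied by 100.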
Who Should Care
What To Try In 7 Days
Fine-tune a mid-size Chinese LLM on a curated subset of PsyQA to validate domain answers
Deploy a minimal web front end (React + Flask on EC2) and collect user ratings
Run a small human-evaluation (20–50 Q-A) with domain experts to detect major failure modes
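For the small human-evaluation step, a sketch of how per-model mean scores could be aggregated across the four rating axes named in this summary (Helpfulness, Fluency, Relevance, Logic). The record layout, the `model` key, and the function name are assumptions made for illustration, not the authors' logging schema.

```python
from collections import defaultdict
from statistics import mean

# The four axes named in the summary; lowercase keys are an assumption.
AXES = ("helpfulness", "fluency", "relevance", "logic")


def summarize_ratings(ratings):
    """Return {model: {axis: mean score}} from a list of rating records.

    Each record is assumed to be a dict with a 'model' key and one
    numeric score per axis.
    """
    by_model = defaultdict(lambda: {axis: [] for axis in AXES})
    for record in ratings:
        for axis in AXES:
            by_model[record["model"]][axis].append(record[axis])
    return {
        model: {axis: mean(scores) for axis, scores in axes.items()}
        for model, axes in by_model.items()
    }
```

Comparing these per-axis means between models, and against ratings of the human-written reference answers, gives a quick view of where a fine-tuned model falls short.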
Optimization Features
Infra Optimization
- Used a single V100 (PanGu) and an RTX 3060 (WenZhong) due to resource limits
System Optimization
- Distributed crawler used for data collection
Training Optimization
- Two-stage training: continued training on the crawled corpus, then fine-tuning on PsyQA
- Early stopping used to avoid overfitting
- Batch size 8; trained ~100k iterations (PanGu)
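The early-stopping item above can be illustrated with a generic patience-based rule. This is a sketch under assumptions: the summary does not report the authors' stopping criterion, so `patience`, `min_delta`, and the class and function names are hypothetical.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` evaluations."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # hypothetical value, not from the paper
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, val_loss):
        """Record one validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


def stopping_step(val_losses, patience=3):
    """Index of the evaluation at which training would stop, or None."""
    stopper = EarlyStopping(patience=patience)
    for i, loss in enumerate(val_losses):
        if stopper.step(loss):
            return i
    return None
```

In a real training loop, `step` would be called after each validation pass, and the checkpoint with the best validation loss would be kept for deployment.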
Inference Optimization
- Deployed as Flask API on EC2 for low-latency responses
- Frontend asynchronous calls to mask model latency
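The request handling behind such an answer endpoint can be sketched independently of the web framework. The function below validates a JSON body and calls a model; in a Flask app it would be invoked from a route handler. The `question` field, the `max_chars` limit, and `placeholder_generate` are assumptions for illustration, since the summary does not describe the actual API contract.

```python
import json
from typing import Callable, Tuple


def placeholder_generate(question: str) -> str:
    """Hypothetical stand-in for the fine-tuned model's generation call."""
    return f"(model answer to: {question})"


def handle_answer_request(
    raw_body: bytes,
    generate: Callable[[str], str] = placeholder_generate,
    max_chars: int = 500,
) -> Tuple[int, dict]:
    """Return (HTTP status, JSON-serialisable payload) for one request."""
    try:
        payload = json.loads(raw_body.decode("utf-8"))
    except ValueError:  # covers both bad UTF-8 and malformed JSON
        return 400, {"error": "body must be UTF-8 encoded JSON"}

    question = payload.get("question") if isinstance(payload, dict) else None
    if not isinstance(question, str) or not question.strip():
        return 400, {"error": "missing 'question' string"}

    # Truncate very long inputs before they reach the model.
    question = question.strip()[:max_chars]
    return 200, {"answer": generate(question)}
```

Keeping validation separate from the framework makes the endpoint easy to unit-test without starting a server, and the front end can call it asynchronously as described above.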
Reproducibility
Open Source Status
- no
Risks & Boundaries
Limitations
- Data quality: web-crawled data is noisy and only partially reviewed by experts
- No nonverbal cues: text-only models miss key counselling signals
- Not a replacement for trained counsellors; model outputs score about 1 point below human answers in the human evaluation
- Compute and resource constraints limited model scale and tuning
- Autoregressive LLMs condition only on preceding tokens (no bidirectional context) and are prone to exposure bias
- No public code or dataset release reported
When Not To Use
- As sole responder for suicidal or high-risk emergencies
- As a clinical diagnostic or legal decision tool without clinician oversight
- In contexts requiring strict medical-grade accuracy or documentation
Failure Modes
- Hallucinations or irrelevant outputs that fail to answer the question
- Overfitting to noisy web data leading to incoherent logic
- Underperformance due to limited compute/tuning
- Privacy or security gaps if deployment is misconfigured
Core Entities
Models
- PanGu 350M
- PanGu (other sizes mentioned)
- WenZhong 3.5B
- WenZhong-GPT2-110M
Metrics
- Perplexity
- ROUGE-L
- Distinct-1
- Distinct-2
- Human ratings: Helpfulness, Fluency, Relevance, Logic
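Two of the automatic metrics listed above have compact standard definitions, sketched here. Distinct-n is the ratio of unique n-grams to total n-grams across the generated texts, and perplexity is the exponential of the mean negative log-likelihood per token. Character-level tokens and natural-log probabilities are assumptions; the summary does not specify the authors' tokenisation or evaluation code.

```python
import math


def distinct_n(texts, n):
    """Unique n-grams divided by total n-grams over all texts (0..1).

    Assumption: each character is one token.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = list(text)
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0


def perplexity(token_log_probs):
    """exp of the mean negative log-probability (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

Higher Distinct-1 and Distinct-2 indicate more varied output, while lower perplexity indicates the model assigns higher probability to the evaluation text.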
Datasets
- PsyQA (22k Q, 56k A)
- Crawled corpus from Tianya
- Crawled corpus from Zhihu
- Crawled corpus from Yixinli

