Overview
The system is a functioning prototype with a deployable web UI and a human evaluation, but it is limited by data quality, compute resources, and ethical constraints. It is suitable as an assistive tool under human supervision.
Citations: 34
Evidence Strength: 0.60
Confidence: 0.70
Risk Signals: 13
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: No open assets linked
Open source: No
At A Glance
Cost impact: 60%
Production readiness: 40%
Novelty: 35%
Why It Matters For Business
An LLM-based Chinese Q&A assistant can reduce counsellor load, provide fast triage and scale service availability cheaply, but should be used under human supervision.
Summary TLDR
Psy-LLM is a prototype system that fine-tunes Chinese LLMs (PanGu 350M and WenZhong variants) on domain Q&A and large web-crawled psychological text to provide online mental-health question-answering. The team assembled ~400k crawled samples plus PsyQA (22k questions, 56k answers), fine-tuned the models, deployed a web front end and collected human ratings (6 psychology students, 200 Q-A). PanGu 350M outperformed WenZhong on perplexity (34.56 vs 38.40), ROUGE-L (28.18 vs 23.56) and human scores. The system is a practical assistive tool for screening and immediate support but is not a replacement for trained counsellors and faces data, ethical, and deployment limits.
Problem Statement
There are far too few Chinese-speaking mental-health professionals to meet growing demand. The paper aims to build an AI-assisted Q&A system that gives timely, domain-aware responses and supports counsellors or users when human help is unavailable.
Main Contribution
Defined Psy-LLM: a pipeline to fine-tune Chinese LLMs for psychological Q&A and deploy a web front end
Collected and cleaned a domain corpus (~400k samples) and used PsyQA (22k questions, 56k answers) for fine-tuning
Key Findings
PanGu 350M produced lower perplexity than WenZhong on the evaluation data (34.56 vs 38.40) and a higher ROUGE-L (28.18 vs 23.56).
Human raters (6 psychology students, 200 Q-A) scored PanGu's generated answers higher than WenZhong's across four quality axes.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Perplexity | PanGu 34.56; WenZhong 38.40 | — | PanGu −3.84 vs WenZhong | Evaluation set from PsyQA / crawled data (authors) | Table 5 intrinsic evaluation | Table 5 |
| ROUGE-L | PanGu 28.18; WenZhong 23.56 | — | PanGu +4.62 | Evaluation set from PsyQA / crawled data (authors) | Table 5 intrinsic evaluation | Table 5 |
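For readers who want to see how the two intrinsic metrics are typically computed, the following is a minimal sketch using only the Python standard library. It assumes perplexity is the exponential of the mean negative log-likelihood per token and that ROUGE-L is reported as an LCS-based F1 score. The summary does not state the authors' tokenization or ROUGE variant, so the function names and those choices are assumptions made for illustration.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 (assumed variant, beta = 1) over two token sequences."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# Deltas as listed in the table: PanGu minus WenZhong.
ppl_delta = round(34.56 - 38.40, 2)
rouge_delta = round(28.18 - 23.56, 2)
```

Lower perplexity is better and higher ROUGE-L is better, so both deltas favor PanGu. For Chinese text, the token sequences could be characters or segmented words; which one is used changes the absolute scores, so comparisons are only meaningful under the same tokenization.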
What To Try In 7 Days
Fine-tune a mid-size Chinese LLM on a curated subset of PsyQA to validate domain answers
Deploy a minimal web front end (React + Flask on EC2) and collect user ratings
Run a small human evaluation (20–50 Q-A) with domain experts to detect major failure modes
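The ratings collected in the steps above could be summarized per model and per quality axis with a short script like the one below. This is a sketch under stated assumptions: the record layout, the `mean_scores` helper, and the placeholder names `model_a`, `model_b`, and `axis_1` are hypothetical, and the sample scores are made up for illustration rather than taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def mean_scores(ratings):
    """Average ratings per (model, axis).

    ratings: iterable of dicts with keys 'model', 'axis', 'rater', 'score'.
    Returns {model: {axis: mean score rounded to 2 decimals}}.
    """
    buckets = defaultdict(list)
    for r in ratings:
        buckets[(r["model"], r["axis"])].append(r["score"])

    result = defaultdict(dict)
    for (model, axis), scores in buckets.items():
        result[model][axis] = round(mean(scores), 2)
    return {model: dict(axes) for model, axes in result.items()}


# Made-up ratings for illustration only; not data from the paper.
sample = [
    {"model": "model_a", "axis": "axis_1", "rater": "r1", "score": 4},
    {"model": "model_a", "axis": "axis_1", "rater": "r2", "score": 3},
    {"model": "model_b", "axis": "axis_1", "rater": "r1", "score": 2},
    {"model": "model_b", "axis": "axis_1", "rater": "r2", "score": 3},
]
summary = mean_scores(sample)
```

With only 20–50 Q-A pairs, per-axis means are noisy, so it may help to also read the lowest-rated answers individually rather than rely on averages alone.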
Risks & Boundaries
Limitations
Data quality: web-crawled data is noisy and only partially reviewed by experts
No nonverbal cues: text-only models miss key counselling signals
When Not To Use
As sole responder for suicidal or high-risk emergencies
As a clinical diagnostic or legal decision tool without clinician oversight
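One way to respect these boundaries in a deployment is to check each incoming message before it reaches the model and hand high-risk messages to a human. The sketch below is purely illustrative: the `route_message` function, the `Routing` type, and the trigger phrases are hypothetical and are not described in the paper. Keyword matching misses paraphrases and produces false positives, so it would only be a first-line safeguard alongside clinician-reviewed criteria.

```python
from dataclasses import dataclass

# Hypothetical, non-exhaustive trigger phrases (assumption, not from the paper).
# A real deployment would need clinician-reviewed criteria, not a keyword list.
HIGH_RISK_TERMS = ("自杀", "轻生", "不想活", "suicide", "kill myself")


@dataclass
class Routing:
    escalate: bool  # True means: skip the model and notify a human counsellor
    reason: str


def route_message(text: str) -> Routing:
    """Decide whether a user message should bypass the model and go to a human."""
    lowered = text.lower()
    for term in HIGH_RISK_TERMS:
        if term in lowered:
            return Routing(True, f"matched high-risk term: {term}")
    return Routing(False, "no high-risk term matched")
```

A caller would generate a model answer only when `route_message(text).escalate` is false, and otherwise surface crisis resources and alert the supervising counsellor.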
Failure Modes
Hallucinations or irrelevant outputs that fail to answer the question
Overfitting to noisy web data leading to incoherent logic

