Overview
Production Readiness
0.25
Novelty Score
0.6
Cost Impact Score
0.55
Citation Count
59
Why It Matters For Business
Finetuning small open LLMs on a variety of labeled mental-health texts can yield classifiers that match or beat much larger models, reducing inference cost and vendor dependence while preserving multi-task flexibility.
Summary TLDR
The authors benchmarked Alpaca, Alpaca-LoRA, FLAN-T5, LLaMA2, GPT-3.5 and GPT-4 on six Reddit-based mental-health classification tasks. Zero- and few-shot prompting give useful but limited results. Instruction finetuning a small open model on multiple mental-health datasets (Mental-Alpaca, Mental-FLAN-T5) raised balanced accuracy by ~15–23% over zero-shot and outperformed the best zero/few-shot GPT-3.5 and GPT-4 on average, matching a task-specific state-of-the-art model. The models still make reasoning errors, show bias risks, and are not ready for clinical deployment.
Problem Statement
Can general-purpose LLMs read online text and reliably predict mental-health states? If not, which simple interventions (prompting, few-shot, instruction finetuning) improve performance across many tasks without training one task-specific model per label?
Main Contribution
A broad empirical evaluation of six LLMs on six mental-health classification tasks using multiple prompt strategies and finetuning setups.
Instruction-finetuned open models (Mental-Alpaca, Mental-FLAN-T5) released for multi-task mental-health prediction.
Practical guidelines on when to use prompts, few-shot, or instruction finetuning and on data-efficiency trade-offs.
A short case study of model reasoning and a discussion of ethical risks including demographic bias and unsafe explanations.
Key Findings
Instruction finetuning markedly improves performance over prompting.
Small finetuned models can beat much larger closed models on these tasks.
Few-shot prompting gives modest but useful gains over zero-shot.
Prompt 'context enhancement' is the most reliably helpful prompt strategy.
Reasoning outputs remain brittle and can be misleading.
Finetuning needs surprisingly little data, provided the data is varied.
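To make the prompting findings concrete, here is a minimal sketch of how a zero-shot prompt, a 'context enhancement' prompt, and a few-shot prompt might be assembled. The function name `build_prompt`, the `CONTEXT` sentence, and the `QUESTION` wording are all hypothetical; the summary does not give the authors' exact prompt text.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical wording: the paper's exact prompt text is not given in this summary.
CONTEXT = "This post was written on social media by a person who may be describing their mental state."
QUESTION = "Is the poster stressed? Answer yes or no."


def build_prompt(post: str,
                 context: Optional[str] = None,
                 examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Assemble a classification prompt.

    - No context, no examples: plain zero-shot.
    - With context: zero-shot plus 'context enhancement'.
    - With examples: few-shot (labeled posts shown before the query).
    """
    parts = []
    if context:
        parts.append(context)
    for example_post, label in examples:
        parts.append(f"Post: {example_post}\n{QUESTION}\nAnswer: {label}")
    # The query post comes last, with the answer left open for the model.
    parts.append(f"Post: {post}\n{QUESTION}\nAnswer:")
    return "\n\n".join(parts)


zero_shot = build_prompt("I can't sleep before exams.")
enhanced = build_prompt("I can't sleep before exams.", context=CONTEXT)
few_shot = build_prompt(
    "I can't sleep before exams.",
    context=CONTEXT,
    examples=[("Deadlines are crushing me.", "yes"),
              ("Had a calm weekend hiking.", "no")],
)
```

Keeping the three variants behind one function makes it easier to compare them on the same held-out posts with only the prompt changing.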
Results
Accuracy
Few-shot vs zero-shot
Who Should Care
What To Try In 7 Days
Run zero-shot prompts with 'context enhancement' on held-out social posts and measure balanced accuracy versus a simple baseline.
Try few-shot prompts (1 example per class) for a priority binary task and compare the lift over zero-shot.
If you have a few hundred labeled examples, finetune a small open LLM on multiple tasks to test cross-task gains.
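The first step above asks for balanced accuracy against a simple baseline. A minimal sketch follows, assuming balanced accuracy is taken as the mean of per-class recall and that the "simple baseline" is a majority-class predictor; both helper names and the toy labels are illustrative and not from the paper.

```python
from collections import Counter, defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    total = Counter(y_true)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            correct[truth] += 1
    return sum(correct[label] / count for label, count in total.items()) / len(total)


def majority_baseline(y_reference, n):
    """Predict the most frequent label in y_reference for all n items."""
    label = Counter(y_reference).most_common(1)[0][0]
    return [label] * n


# Toy labels for illustration only.
y_true = ["yes", "yes", "yes", "no"]
model_pred = ["yes", "yes", "no", "no"]
baseline_pred = majority_baseline(y_true, len(y_true))

model_score = balanced_accuracy(y_true, model_pred)
baseline_score = balanced_accuracy(y_true, baseline_pred)
```

A majority-class predictor scores 1/k on balanced accuracy for k classes regardless of class imbalance, which is why this metric is a fairer yardstick than plain accuracy on skewed mental-health labels.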
Optimization Features
Infra Optimization
- Used 8x A100 GPUs for finetuning
Training Optimization
- Instruction finetuning across datasets
- LoRA
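Instruction finetuning across datasets amounts to casting every labeled example from every task into one shared text-to-text format and training on the mixture. The sketch below shows one way to do that with Alpaca-style `instruction` / `input` / `output` fields. The `TASKS` templates and helper names are hypothetical; only the dataset names come from this summary.

```python
import random

# Hypothetical task templates. Dataset names are from the source; the
# instruction wording and label strings are assumptions for illustration.
TASKS = {
    "Dreaddit": ("Decide whether the poster is stressed.", ["no", "yes"]),
    "SDCNL": ("Decide whether the post shows suicidal ideation.", ["no", "yes"]),
}


def to_instruction_record(dataset, text, label_id):
    """Convert one labeled example into an instruction-tuning record."""
    instruction, labels = TASKS[dataset]
    return {
        "instruction": f"{instruction} Answer with one of: {', '.join(labels)}.",
        "input": text,
        "output": labels[label_id],
    }


def build_mixture(rows, seed=0):
    """rows: iterable of (dataset, text, label_id) tuples.

    Returns records from all tasks, shuffled so that training batches
    interleave tasks instead of seeing them one after another.
    """
    records = [to_instruction_record(*row) for row in rows]
    random.Random(seed).shuffle(records)
    return records


mixture = build_mixture([
    ("Dreaddit", "Work has been overwhelming lately.", 1),
    ("SDCNL", "Looking forward to the weekend.", 0),
])
```

The resulting records can then be fed to whichever finetuning setup is in use (full finetuning or LoRA adapters); that training step is deliberately left out here.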
Reproducibility
Code Urls
Data Urls
- Dreaddit
- DepSeverity
- SDCNL
- CSSRS-Suicide
- Red-Sam
- Twt-60Users
- SAD
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Datasets are mainly Reddit; cross-platform coverage is limited, though external tests were run on Red-Sam, Twt-60Users, and SAD.
- Finetuning experiments exclude closed-source GPT-3.5/GPT-4 due to cost, so comparisons pit finetuned open models against prompted closed models.
- Reasoning evaluation is exploratory and not systematically quantified.
- Lack of demographic labels prevents a thorough fairness analysis.
When Not To Use
- Do not use these models for clinical diagnosis or unsupervised intervention without expert oversight.
- Avoid deploying un-audited reasoning outputs directly to users because explanations can be plausible but wrong.
- Do not rely on zero-shot prompts alone for high-stakes suicide-risk decisions.
Failure Modes
- False positives from literal descriptions of past anxiety or hypothetical scenarios.
- Plausible-sounding but incorrect reasoning (hallucinated causal links).
- Bias across demographic groups, which is not measured here, may cause unequal performance.
- Finetuned classification-only models can lose the ability to generate useful explanations.
Core Entities
Models
- Alpaca (7B)
- Alpaca-LoRA
- FLAN-T5-XXL (11B)
- LLaMA2 (70B)
- GPT-3.5 (gpt-3.5-turbo, 175B)
- GPT-4 (gpt-4-0613, ~1700B)
- Mental-Alpaca (finetuned)
- Mental-FLAN-T5 (finetuned)
Metrics
- Balanced accuracy
Datasets
- Dreaddit
- DepSeverity
- SDCNL
- CSSRS-Suicide
- Red-Sam (external)
- Twt-60Users (external)
- SAD (external)
Benchmarks
- Task #1 Binary stress (Dreaddit)
- Task #2 Binary depression (DepSeverity)
- Task #3 Four-level depression (DepSeverity)
- Task #4 Binary suicide ideation (SDCNL)
- Task #5 Binary user-level suicide risk (CSSRS)
- Task #6 Five-level suicide risk (CSSRS)
Context Entities
Models
- Mental-RoBERTa (task-specific baseline)
- BERT (baseline)
Metrics
- Balanced accuracy
Datasets
- Public Reddit and Twitter mental-health datasets with human annotations

