Instruction finetuning small open LLMs (Alpaca, FLAN-T5) boosts mental-health prediction to match or beat much larger models

Overview

Decision Snapshot: Needs Validation

The paper shows clear empirical gains from instruction finetuning small LLMs, but models still make risky errors and show bias, so they are not production-ready without safety work.

Citations: 59

Evidence Strength: 0.80

Confidence: 0.89

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 55%

Production readiness: 25%

Novelty: 60%

Authors

Xuhai Xu, Bingsheng Yao, Yuanzhe Dong, Saadia Gabriel, Hong Yu, James Hendler, Marzyeh Ghassemi, Anind K. Dey, Dakuo Wang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Finetuning small open LLMs on a variety of labeled mental-health texts can yield classifiers that match or beat much larger models, reducing inference cost and vendor dependence while preserving multi-task flexibility.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

The authors benchmarked Alpaca, Alpaca-LoRA, FLAN-T5, LLaMA2, GPT-3.5 and GPT-4 on six Reddit-based mental-health classification tasks. Zero- and few-shot prompting give useful but limited results. Instruction finetuning a small open model on multiple mental-health datasets (Mental-Alpaca, Mental-FLAN-T5) raised balanced accuracy by ~15–23% over zero-shot and outperformed the best zero/few-shot GPT-3.5 and GPT-4 on average, matching a task-specific state-of-the-art model. The models still make reasoning errors, show bias risks, and are not ready for clinical deployment.

Problem Statement

Can general-purpose LLMs read online text and reliably predict mental-health states? If not, which simple interventions (prompting, few-shot, instruction finetuning) improve performance across many tasks without training one task-specific model per label?

Main Contribution

A broad empirical evaluation of six LLMs on six mental-health classification tasks using multiple prompt strategies and finetuning setups.

Instruction-finetuned open models (Mental-Alpaca, Mental-FLAN-T5) released for multi-task mental-health prediction.

Key Findings

Instruction finetuning markedly improves performance over prompting.

Numbers: Alpaca finetuned, +23.4% balanced accuracy vs Alpaca zero-shot

Practical Use: If you can finetune, train open LLMs on multiple mental-health datasets to get much better classifiers than prompt-only approaches.

Evidence Ref: Sec. 5.3 / Table 7

Small finetuned models can beat much larger closed models on these tasks.

Numbers: Mental-Alpaca and Mental-FLAN-T5 beat GPT-3.5 best prompts by ~10.9% and GPT-4 best prompts by ~4.8% (balanced accuracy)

Practical Use: Finetuning a small open model can be more cost-effective than using huge closed LLMs via prompting for classification tasks.

Evidence Ref: Intro / Sec. 5.3
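The gains above are reported in balanced accuracy. As a reminder of what that metric measures, here is a minimal sketch that computes it as the mean of per-class recall. The function name and the toy labels are illustrative assumptions, not taken from the paper or its code release.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one example")
    total = defaultdict(int)    # examples per true class
    correct = defaultdict(int)  # correctly predicted examples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy labels, invented for illustration only.
y_true = ["stress", "stress", "stress", "no_stress"]
always_stress = ["stress"] * 4

# Plain accuracy would be 0.75 here; balanced accuracy is 0.5.
print(balanced_accuracy(y_true, always_stress))
```

Because every class contributes equally, a classifier that always predicts the majority label scores no better than chance, which is why the metric suits imbalanced mental-health datasets.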

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Balanced accuracy	Mental-Alpaca > GPT-3.5 best by ~10.9% (avg over tasks)	GPT-3.5 best zero/few-shot	+10.9%	Average over six tasks (see Sec. 5.3 / Table 4)	Instruction finetuning on multiple datasets improved Alpaca to Mental-Alpaca; text reports ~10.9% avg gain vs GPT-3.5 best prompts	Sec. 5.3 / Table 4
Balanced accuracy	Mental-Alpaca and Mental-FLAN-T5 > GPT-4 best by ~4.8% (avg over tasks)	GPT-4 best zero/few-shot	+4.8%	Average over six tasks (see Intro and Sec. 5.3)	Paper reports Mental-Alpaca and Mental-FLAN-T5 outperform GPT-4 best prompts by 4.8% on balanced accuracy on evaluated tasks	Intro / Sec. 5.3

What To Try In 7 Days

Run zero-shot prompts with 'context enhancement' on held-out social posts and measure balanced accuracy versus a simple baseline.

Try few-shot prompts (1 example per class) for a priority binary task and compare performance lift.

If you have a few hundred labeled examples, finetune a small open LLM on multiple tasks to test cross-task gains.
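The first two experiments above can be prototyped with one small prompt builder. The sketch below is an assumption-laden illustration: "context enhancement" is taken here to mean prepending a sentence that describes where the text comes from, and `CONTEXT`, `QUESTION`, the example posts, and the `build_prompt` helper are all hypothetical, not the paper's actual templates.

```python
def build_prompt(post, question, examples=(), context=None):
    """Assemble a zero-shot or few-shot classification prompt as plain text."""
    parts = []
    if context:
        parts.append(context)  # optional context-enhancement sentence
    for ex_post, ex_label in examples:
        # Each few-shot example repeats the question with its gold answer.
        parts.append(f"Post: {ex_post}\n{question}\nAnswer: {ex_label}")
    # The target post ends with an open answer slot for the model to fill.
    parts.append(f"Post: {post}\n{question}\nAnswer:")
    return "\n\n".join(parts)


# Hypothetical wording, invented for illustration.
CONTEXT = "The following post was written on social media."
QUESTION = "Is the poster stressed? Reply yes or no."

target = "I can't sleep and my exams start tomorrow."

zero_shot = build_prompt(target, QUESTION, context=CONTEXT)

# One example per class, as suggested for the binary task.
few_shot = build_prompt(
    target,
    QUESTION,
    examples=[
        ("Deadlines keep piling up and I feel like I'm drowning.", "yes"),
        ("Had a relaxed weekend hiking with friends.", "no"),
    ],
    context=CONTEXT,
)

print(few_shot)
```

Keeping the question and answer slot identical across zero-shot and few-shot variants makes the measured lift attributable to the added examples rather than to wording changes.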

Optimization Features

Infra Optimization

Used 8x A100 GPUs for finetuning

Training Optimization

Instruction finetuning across datasets, LoRA
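Instruction finetuning across datasets starts by converting each labeled example into an instruction record and pooling the records from all tasks. The sketch below uses the instruction/input/output field layout popularized by the original Alpaca data; whether the released code uses exactly this layout is an assumption, and the task names, instructions, and example posts are invented for illustration.

```python
import json


def to_instruction_record(text, label, instruction):
    """Wrap one labeled example in an instruction/input/output record."""
    return {"instruction": instruction, "input": text, "output": label}


def build_multitask_set(datasets):
    """Pool records from several labeled datasets into one training list."""
    records = []
    for task_name, spec in datasets.items():
        for text, label in spec["examples"]:
            record = to_instruction_record(text, label, spec["instruction"])
            record["task"] = task_name  # kept for per-task evaluation later
            records.append(record)
    return records


# Hypothetical tasks and examples, not drawn from the real datasets.
toy_datasets = {
    "binary_stress": {
        "instruction": "Decide whether the poster is stressed. Reply yes or no.",
        "examples": [
            ("I have three deadlines tomorrow and can't breathe.", "yes"),
            ("Just got back from a calm walk in the park.", "no"),
        ],
    },
    "binary_depression": {
        "instruction": "Decide whether the poster shows signs of depression. Reply yes or no.",
        "examples": [
            ("Nothing feels worth doing anymore.", "yes"),
        ],
    },
}

records = build_multitask_set(toy_datasets)
print(json.dumps(records[0], indent=2))
```

Pooling the tasks into one list is what lets a single finetuned model serve every task, instead of training one classifier per label.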

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/neuhai/Mental-LLM

Data URLs

Dreaddit, DepSeverity, SDCNL, CSSRS-Suicide, Red-Sam, Twt-60Users, SAD

Risks & Boundaries

Limitations

The datasets come mainly from Reddit; cross-platform coverage is limited, although tests on external datasets were run.

The finetuning experiments exclude closed-source GPT-3.5 and GPT-4 due to cost, so the comparisons pit finetuned open models against prompted closed models.

When Not To Use

Do not use these models for clinical diagnosis or unsupervised intervention without expert oversight.

Avoid deploying un-audited reasoning outputs directly to users because explanations can be plausible but wrong.

Failure Modes

False positives from literal descriptions of past anxiety or hypothetical scenarios.

Plausible-sounding but incorrect reasoning (hallucinated causal links).

Core Entities

Models

Alpaca (7B), LoRA, FLAN-T5-XXL (11B), LLaMA2 (70B), GPT-3.5 (gpt-3.5-turbo, 175B), GPT-4 (gpt-4-0613, ~1700B), Mental-Alpaca (finetuned), Mental-FLAN-T5 (finetuned)

Metrics

Accuracy

Datasets

Dreaddit, DepSeverity, SDCNL, CSSRS-Suicide, Red-Sam (external), Twt-60Users (external), SAD (external)

Benchmarks

Task #1 Binary stress (Dreaddit)

Task #2 Binary depression (DepSeverity)

Task #3 Four-level depression (DepSeverity)

Task #4 Binary suicide ideation (SDCNL)

Task #5 Binary user-level suicide risk (CSSRS)

Task #6 Five-level suicide risk (CSSRS)

Context Entities

Models

Mental-RoBERTa (task-specific baseline), BERT (baseline)

Metrics

Accuracy

Datasets

Public Reddit and Twitter mental-health datasets with human annotations
