Instruction finetuning small open LLMs (Alpaca, FLAN-T5) boosts mental-health prediction to match or beat much larger models

July 26, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper shows clear empirical gains from instruction finetuning small LLMs, but models still make risky errors and show bias, so they are not production-ready without safety work.

Citations: 59

Evidence Strength: 0.80

Confidence: 0.89

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 55%

Production readiness: 25%

Novelty: 60%

Authors

Xuhai Xu, Bingsheng Yao, Yuanzhe Dong, Saadia Gabriel, Hong Yu, James Hendler, Marzyeh Ghassemi, Anind K. Dey, Dakuo Wang

Why It Matters For Business

Finetuning small open LLMs on a variety of labeled mental-health texts can yield classifiers that match or beat much larger models, reducing inference cost and vendor dependence while preserving multi-task flexibility.

Summary TLDR

The authors benchmarked Alpaca, Alpaca-LoRA, FLAN-T5, LLaMA2, GPT-3.5 and GPT-4 on six Reddit-based mental-health classification tasks. Zero- and few-shot prompting give useful but limited results. Instruction finetuning a small open model on multiple mental-health datasets (Mental-Alpaca, Mental-FLAN-T5) raised balanced accuracy by ~15–23% over zero-shot and outperformed the best zero/few-shot GPT-3.5 and GPT-4 on average, matching a task-specific state-of-the-art model. The models still make reasoning errors, show bias risks, and are not ready for clinical deployment.

Problem Statement

Can general-purpose LLMs read online text and reliably predict mental-health states? If not, which simple interventions (prompting, few-shot, instruction finetuning) improve performance across many tasks without training one task-specific model per label?

Main Contribution

A broad empirical evaluation of six LLMs on six mental-health classification tasks using multiple prompt strategies and finetuning setups.

Instruction-finetuned open models (Mental-Alpaca, Mental-FLAN-T5) released for multi-task mental-health prediction.
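
The multi-task instruction finetuning described above can be pictured as merging labeled examples from several tasks into one pool of instruction records. The following is a minimal sketch, assuming the Alpaca-style record layout (instruction, input, output); the task names, instruction wording, and posts are hypothetical placeholders, and the paper's actual preprocessing may differ.

```python
import random


def to_instruction_record(instruction, post, label):
    """Wrap one labeled post as an Alpaca-style instruction record."""
    return {"instruction": instruction, "input": post, "output": label}


def build_multitask_set(task_examples, task_instructions, seed=0):
    """Merge labeled examples from several tasks into one shuffled training pool."""
    records = []
    for task, examples in task_examples.items():
        for post, label in examples:
            records.append(
                to_instruction_record(task_instructions[task], post, label)
            )
    # Shuffle so that batches mix tasks instead of seeing one task at a time.
    random.Random(seed).shuffle(records)
    return records


# Hypothetical instructions and toy posts, for illustration only.
task_instructions = {
    "stress": "Is the poster of this text stressed? Answer yes or no.",
    "depression": "Is the poster of this text depressed? Answer yes or no.",
}
task_examples = {
    "stress": [("Deadlines keep piling up.", "yes"), ("Nice walk today.", "no")],
    "depression": [("Nothing feels worth doing.", "yes"), ("Great weekend.", "no")],
}

for record in build_multitask_set(task_examples, task_instructions):
    print(record)
```

A single pool like this is what lets one finetuned model serve several tasks, instead of training one classifier per label.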

Key Findings

Instruction finetuning markedly improves performance over prompting.

Numbers: Alpaca finetuned: +23.4% balanced accuracy vs Alpaca zero-shot

Practical Use: If you can finetune, train open LLMs on multiple mental-health datasets to get much better classifiers than prompt-only approaches.

Evidence Ref: Sec. 5.3 / Table 7

Small finetuned models can beat much larger closed models on these tasks.

Numbers: Mental-Alpaca and Mental-FLAN-T5 beat GPT-3.5 best prompts by ~10.9% and GPT-4 best prompts by ~4.8% (balanced accuracy)

Practical Use: Finetuning a small open model can be more cost-effective than using huge closed LLMs via prompting for classification tasks.

Evidence Ref: Intro / Sec. 5.3

Results

Metric: Accuracy
Value: Mental-Alpaca > GPT-3.5_best by ~10.9% (avg over tasks)
Baseline: GPT-3.5 best zero/few-shot
Delta: +10.9%
Split / Dataset: Average over six tasks (see Sec. 5.3 / Table 4)
Evidence: Instruction-finetuning on multiple datasets improved Alpaca to Mental-Alpaca; text reports ~10.9% avg gain vs GPT-3.5 best prompts
Evidence Ref: Sec. 5.3 / Table 4

Metric: Accuracy
Value: Mental-* models > GPT-4_best by ~4.8% (avg over tasks)
Baseline: GPT-4 best zero/few-shot
Delta: +4.8%
Split / Dataset: Average over six tasks (see Intro and Sec. 5.3)
Evidence: Paper reports Mental-Alpaca and Mental-FLAN-T5 outperform GPT-4 best prompts by 4.8% on balanced accuracy on evaluated tasks
Evidence Ref: Intro / Sec. 5.3
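
The deltas above are reported on balanced accuracy, which is the mean of per-class recall. A minimal sketch of the metric in plain Python follows; the labels are toy values made up for illustration, not data from the paper.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes weigh as much as common ones."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy imbalanced example: 8 negatives, 2 positives, model always predicts 0.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain)                              # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The always-negative predictor looks acceptable on plain accuracy but sits at chance level on balanced accuracy, which is why the latter is the more informative number for skewed mental-health labels.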

What To Try In 7 Days

Run zero-shot prompts with 'context enhancement' on held-out social posts and measure balanced accuracy versus a simple baseline.

Try few-shot prompts (1 example per class) for a priority binary task and compare performance lift.

If you have a few hundred labeled examples, finetune a small open LLM on multiple tasks to test cross-task gains.
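
The first two experiments above come down to how the prompt is assembled. The sketch below shows one possible way to build a zero-shot prompt with an added context sentence and a few-shot prompt with one example per class. The template wording, the context sentence, and the example posts are assumptions for illustration; they are not the paper's exact prompts.

```python
def zero_shot_prompt(post, question, context=None):
    """Build a zero-shot prompt, optionally prefixed with a context sentence."""
    parts = []
    if context:
        parts.append(context)
    parts.append(f"Post: {post}")
    parts.append(question)
    return "\n".join(parts)


def few_shot_prompt(examples, post, question, context=None):
    """Build a few-shot prompt from (post, label) pairs, e.g. one per class."""
    blocks = []
    if context:
        blocks.append(context)
    for example_post, example_label in examples:
        blocks.append(f"Post: {example_post}\n{question}\nAnswer: {example_label}")
    # The final block is left open so the model completes the answer.
    blocks.append(f"Post: {post}\n{question}\nAnswer:")
    return "\n\n".join(blocks)


# Hypothetical wording and toy posts.
context = "The following post was written on social media."
question = "Is the poster stressed? Answer yes or no."
examples = [
    ("I can't sleep, everything is due tomorrow.", "yes"),
    ("Had a relaxed afternoon in the park.", "no"),
]

zs = zero_shot_prompt("Work has been overwhelming lately.", question, context)
fs = few_shot_prompt(examples, "Work has been overwhelming lately.", question, context)
print(zs)
print(fs)
```

Keeping the question string identical across the zero-shot and few-shot variants makes the measured lift attributable to the examples rather than to a wording change.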

Optimization Features

Infra Optimization
Used 8x A100 GPUs for finetuning
Training Optimization
Instruction finetuning across datasets
LoRA
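
To see why LoRA reduces training cost, it helps to count parameters. LoRA freezes a weight matrix of shape d_out x d_in and trains a low-rank update made of two factors, of shapes d_out x r and r x d_in. The arithmetic is sketched below; the layer size and rank are illustrative assumptions, not settings reported in the paper.

```python
def full_params(d_in, d_out):
    """Parameters in a dense weight matrix of shape d_out x d_in."""
    return d_in * d_out


def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters in the two low-rank factors (d_out x r and r x d_in)."""
    return rank * (d_in + d_out)


# Assumed values for illustration: a 4096 x 4096 projection with rank 8.
d, r = 4096, 8
full = full_params(d, d)
lora = lora_trainable_params(d, d, r)

print(full)         # 16777216
print(lora)         # 65536
print(lora / full)  # 0.00390625
```

Under these assumed sizes, the adapter trains under 0.4% of the parameters of the layer it modifies, which is the source of the memory and compute savings.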

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Dreaddit
DepSeverity
SDCNL
CSSRS-Suicide
Red-Sam
Twt-60Users
SAD

Risks & Boundaries

Limitations

Datasets come mainly from Reddit; cross-platform coverage is limited, although tests on external datasets (Red-Sam, Twt-60Users, SAD) were run.

Finetuning experiments exclude closed-source GPT-3.5/GPT-4 due to cost, so comparisons mix finetuned open models vs prompted closed models.

When Not To Use

Do not use these models for clinical diagnosis or unsupervised intervention without expert oversight.

Avoid deploying un-audited reasoning outputs directly to users because explanations can be plausible but wrong.

Failure Modes

False positives from literal descriptions of past anxiety or hypothetical scenarios.

Plausible-sounding but incorrect reasoning (hallucinated causal links).

Core Entities

Models

Alpaca (7B)
LoRA
FLAN-T5-XXL (11B)
LLaMA2 (70B)
GPT-3.5 (gpt-3.5-turbo, 175B)
GPT-4 (gpt-4-0613, ~1700B)
Mental-Alpaca (finetuned)
Mental-FLAN-T5 (finetuned)

Metrics

Accuracy

Datasets

Dreaddit
DepSeverity
SDCNL
CSSRS-Suicide
Red-Sam (external)
Twt-60Users (external)
SAD (external)

Benchmarks

Task #1 Binary stress (Dreaddit)
Task #2 Binary depression (DepSeverity)
Task #3 Four-level depression (DepSeverity)
Task #4 Binary suicide ideation (SDCNL)
Task #5 Binary user-level suicide risk (CSSRS)
Task #6 Five-level suicide risk (CSSRS)

Context Entities

Models

Mental-RoBERTa (task-specific baseline)
BERT (baseline)

Metrics

Accuracy

Datasets

Public Reddit and Twitter mental-health datasets with human annotations