Instruction finetuning small open LLMs (Alpaca, FLAN-T5) boosts mental-health prediction to match or beat much larger models

July 26, 2023 · 8 min

Overview

Production Readiness

0.25

Novelty Score

0.6

Cost Impact Score

0.55

Citation Count

59

Authors

Xuhai Xu, Bingsheng Yao, Yuanzhe Dong, Saadia Gabriel, Hong Yu, James Hendler, Marzyeh Ghassemi, Anind K. Dey, Dakuo Wang

Why It Matters For Business

Finetuning small open LLMs on a variety of labeled mental-health texts can yield classifiers that match or beat much larger models, reducing inference cost and vendor dependence while preserving multi-task flexibility.

Summary TLDR

The authors benchmarked Alpaca, Alpaca-LoRA, FLAN-T5, LLaMA2, GPT-3.5 and GPT-4 on six Reddit-based mental-health classification tasks. Zero- and few-shot prompting give useful but limited results. Instruction finetuning a small open model on multiple mental-health datasets (Mental-Alpaca, Mental-FLAN-T5) raised balanced accuracy by ~15–23% over zero-shot and outperformed the best zero/few-shot GPT-3.5 and GPT-4 on average, matching a task-specific state-of-the-art model. The models still make reasoning errors, show bias risks, and are not ready for clinical deployment.

Problem Statement

Can general-purpose LLMs read online text and reliably predict mental-health states? If not, which simple interventions (prompting, few-shot, instruction finetuning) improve performance across many tasks without training one task-specific model per label?

Main Contribution

A broad empirical evaluation of six LLMs on six mental-health classification tasks using multiple prompt strategies and finetuning setups.

Instruction-finetuned open models (Mental-Alpaca, Mental-FLAN-T5) released for multi-task mental-health prediction.

Practical guidelines on when to use prompts, few-shot, or instruction finetuning and on data-efficiency trade-offs.

A short case study of model reasoning and a discussion of ethical risks including demographic bias and unsafe explanations.

Key Findings

Instruction finetuning markedly improves performance over prompting.

Numbers: Alpaca finetuned: +23.4% balanced accuracy vs Alpaca zero-shot

Small finetuned models can beat much larger closed models on these tasks.

Numbers: Mental-Alpaca and Mental-FLAN-T5 beat GPT-3.5 best prompts by ~10.9% and GPT-4 best prompts by ~4.8% (balanced accuracy)

Few-shot prompting gives modest but useful gains over zero-shot.

Numbers: Few-shot vs zero-shot: +4.1% balanced accuracy on evaluated datasets

Prompt 'context enhancement' is the most reliably helpful prompt strategy.

Numbers: Context prompts improved several models (e.g., Alpaca +2.1% across tasks)

Reasoning outputs remain brittle and can be misleading.

Numbers: GPT-4 shows high-quality reasoning in the case-study examples, but models also produced plausible-sounding but incorrect justifications

Finetuning needs surprisingly little data, provided the data is varied.

Numbers: With 1%–5% of the training data (a few hundred samples), finetuned models already exceed zero-shot performance on most tasks
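The data-efficiency finding above can be probed with a small stratified subsample of a labeled training set. The sketch below is only an illustration under stated assumptions: `stratified_subsample` is a hypothetical helper, the toy posts and the 30% positive rate are made up, and the source does not say how the authors drew their 1%–5% subsets.

```python
import random
from collections import defaultdict

def stratified_subsample(examples, fraction, seed=0):
    """Keep roughly `fraction` of the examples of each label (at least one per label)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    subset = []
    for label in sorted(by_label):
        items = by_label[label]
        keep = max(1, round(len(items) * fraction))
        subset.extend(rng.sample(items, keep))
    rng.shuffle(subset)  # avoid grouping the subset by label
    return subset

# Toy data (assumption): 1,000 placeholder posts, 30% labeled "yes".
toy = [(f"post {i}", "yes" if i % 10 < 3 else "no") for i in range(1000)]
small = stratified_subsample(toy, fraction=0.05)
print(len(small), sum(1 for _, label in small if label == "yes"))
```

Sampling per label keeps the class ratio of the full set, so a score measured on the small subset is comparable to one from the full run.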

Results

Balanced accuracy

Value: Mental-Alpaca > GPT-3.5_best by ~10.9% (avg over tasks)

Baseline: GPT-3.5 best zero/few-shot

Balanced accuracy

Value: Mental-* models > GPT-4_best by ~4.8% (avg over tasks)

Baseline: GPT-4 best zero/few-shot

Balanced accuracy

Value: FLAN-T5_ZS > Alpaca_ZS by 10.9% (avg)

Baseline: Alpaca zero-shot

Few-shot vs zero-shot

Value: Few-shot improves balanced accuracy by ~4.1% on evaluated datasets

Baseline: Zero-shot prompts
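The figures above are reported as balanced accuracy. A minimal sketch of that metric follows, assuming the standard definition (the unweighted mean of per-class recall); the source does not spell out the formula, and `balanced_accuracy` is an illustrative helper rather than the authors' evaluation code.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally regardless of its size."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)

# Imbalanced toy example (assumption): 8 negatives, 2 positives.
y_true = [0] * 8 + [1] * 2
always_negative = [0] * 10
print(balanced_accuracy(y_true, always_negative))  # 0.5, although plain accuracy is 0.8
```

The toy case shows why the metric suits skewed label distributions: a classifier that never predicts the minority class scores no better than chance.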

Who Should Care

What To Try In 7 Days

Run zero-shot prompts with 'context enhancement' on held-out social posts and measure balanced accuracy versus a simple baseline.

Try few-shot prompts (1 example per class) for a priority binary task and compare performance lift.

If you have a few hundred labeled examples, finetune a small open LLM on multiple tasks to test cross-task gains.
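The first two experiments above can share one prompt builder. The sketch below is a rough illustration only: `build_prompt`, the context sentence, and the question wording are all hypothetical and are not the prompts used in the paper.

```python
def build_prompt(post, question, context=None, examples=()):
    """Assemble a zero-shot or few-shot classification prompt as plain text."""
    parts = []
    if context:
        parts.append(context)  # optional "context enhancement" sentence
    for ex_post, ex_label in examples:  # optional few-shot demonstrations
        parts.append(f"Post: {ex_post}\nQuestion: {question}\nAnswer: {ex_label}")
    parts.append(f"Post: {post}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)

CONTEXT = "The following text was posted on social media."  # hypothetical wording
QUESTION = "Is the poster stressed? Reply yes or no."        # hypothetical wording

zero_shot = build_prompt("I can't sleep before exams.", QUESTION, context=CONTEXT)
few_shot = build_prompt(
    "I can't sleep before exams.",
    QUESTION,
    context=CONTEXT,
    # One demonstration per class, as suggested above.
    examples=[("Deadlines are crushing me.", "yes"), ("Had a calm weekend hiking.", "no")],
)
print(few_shot)
```

Keeping the context sentence and the demonstrations as optional arguments makes it easy to measure each one's contribution separately on the same held-out posts.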

Optimization Features

Infra Optimization

  • Used 8x A100 GPUs for finetuning

Training Optimization

  • Instruction finetuning across datasets
  • LoRA
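Instruction finetuning across datasets amounts to merging several labeled tasks into one pool of instruction-style records. The sketch below assumes Alpaca-style `instruction`/`input`/`output` fields plus an extra `task` tag added here for bookkeeping; the helper name, the instruction wording, and the toy rows are hypothetical, and the authors' exact data format is not given in the source.

```python
import json
import random

def to_instruction_records(task_name, instruction, rows):
    """Turn (text, label) rows of one task into instruction/input/output records."""
    return [
        {"task": task_name, "instruction": instruction, "input": text, "output": label}
        for text, label in rows
    ]

# Hypothetical toy rows; real rows would come from the labeled datasets.
stress_rows = [("Deadlines are crushing me.", "yes"), ("Had a calm weekend.", "no")]
depression_rows = [("Nothing feels worth doing anymore.", "yes")]

records = to_instruction_records(
    "stress", "Decide whether the poster is stressed. Reply yes or no.", stress_rows
) + to_instruction_records(
    "depression", "Decide whether the poster shows signs of depression. Reply yes or no.",
    depression_rows,
)
random.Random(0).shuffle(records)  # interleave tasks so training batches mix them

jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```

Writing one JSON object per line keeps the merged set easy to stream, filter by `task`, or subsample before training.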

Reproducibility

Data Urls

  • Dreaddit
  • DepSeverity
  • SDCNL
  • CSSRS-Suicide
  • Red-Sam
  • Twt-60Users
  • SAD

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Datasets come mainly from Reddit; cross-platform coverage is limited, although tests on external datasets were run.
  • Finetuning experiments exclude closed-source GPT-3.5/GPT-4 due to cost, so comparisons pit finetuned open models against prompted closed models.
  • Reasoning evaluation is exploratory and not systematically quantified.
  • Lack of demographic labels prevents a thorough fairness analysis.

When Not To Use

  • Do not use these models for clinical diagnosis or unsupervised intervention without expert oversight.
  • Avoid deploying un-audited reasoning outputs directly to users because explanations can be plausible but wrong.
  • Do not rely on zero-shot prompts alone for high-stakes suicide-risk decisions.

Failure Modes

  • False positives from literal descriptions of past anxiety or hypothetical scenarios.
  • Plausible-sounding but incorrect reasoning (hallucinated causal links).
  • Bias across demographic groups, which was not measured here, may cause unequal performance.
  • Finetuned classification-only models can lose the ability to generate useful explanations.

Core Entities

Models

  • Alpaca (7B)
  • Alpaca-LoRA
  • FLAN-T5-XXL (11B)
  • LLaMA2 (70B)
  • GPT-3.5 (gpt-3.5-turbo, 175B)
  • GPT-4 (gpt-4-0613, ~1700B)
  • Mental-Alpaca (finetuned)
  • Mental-FLAN-T5 (finetuned)
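To make "much larger" concrete, the parameter counts listed above can be turned into size ratios. This is plain arithmetic on the figures as listed; the GPT-4 count is marked approximate in the list, so its ratios are rough estimates.

```python
# Parameter counts in billions, copied from the list above (GPT-4 is approximate).
PARAMS_B = {"Alpaca": 7, "FLAN-T5-XXL": 11, "GPT-3.5": 175, "GPT-4": 1700}

def size_ratio(big, small):
    """How many times more parameters `big` has than `small`."""
    return PARAMS_B[big] / PARAMS_B[small]

for big in ("GPT-3.5", "GPT-4"):
    for small in ("Alpaca", "FLAN-T5-XXL"):
        print(f"{big} is about {size_ratio(big, small):.0f}x the size of {small}")
# GPT-3.5 is about 25x the size of Alpaca
# GPT-3.5 is about 16x the size of FLAN-T5-XXL
# GPT-4 is about 243x the size of Alpaca
# GPT-4 is about 155x the size of FLAN-T5-XXL
```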

Metrics

  • Balanced accuracy

Datasets

  • Dreaddit
  • DepSeverity
  • SDCNL
  • CSSRS-Suicide
  • Red-Sam (external)
  • Twt-60Users (external)
  • SAD (external)

Benchmarks

  • Task #1 Binary stress (Dreaddit)
  • Task #2 Binary depression (DepSeverity)
  • Task #3 Four-level depression (DepSeverity)
  • Task #4 Binary suicide ideation (SDCNL)
  • Task #5 Binary user-level suicide risk (CSSRS-Suicide)
  • Task #6 Five-level suicide risk (CSSRS-Suicide)

Context Entities

Models

  • Mental-RoBERTa (task-specific baseline)
  • BERT (baseline)

Metrics

  • Balanced accuracy

Datasets

  • Public Reddit and Twitter mental-health datasets with human annotations