MolecularGPT — instruction‑tuned LLM that predicts molecular properties with zero‑ and few‑shot prompts

June 18, 2024 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

10

Authors

Yuyan Liu, Sirui Ding, Sheng Zhou, Wenqi Fan, Qiaoyu Tan

Why It Matters For Business

MolecularGPT lets teams try new property predictions with as few as two labeled examples instead of costly dataset labeling. This speeds up early drug and material candidate screening and reduces the need to retrain task‑specific models.

Summary TLDR

MolecularGPT fine-tunes an open LLM (LLaMA2-7B-chat) on a large set of natural-language instructions built from SMILES strings and structure-aware few-shot demonstrations. The tuned model performs zero‑ and few‑shot molecular property prediction without further task-specific training. On a suite of MoleculeNet, CYP450, and QM9 benchmarks it achieves top average ranks in zero- and few-shot settings and beats the base LLaMA model by a large margin (reported ~15.7% average uplift on classification). With two in-context examples it matches or exceeds supervised GNNs on several classification tasks. Code is published.

Problem Statement

Molecular property models need many labeled molecules and often fail to generalize to unseen tasks. Labeling molecules is costly. The field lacks an LLM that (a) understands molecular inputs, (b) keeps zero‑shot ability, and (c) supports few‑shot in‑context learning for new property tasks without further fine‑tuning.

Main Contribution

MolecularGPT: first instruction‑tuned LLM for generic molecular property prediction that supports zero‑ and few‑shot in‑context learning (ICL) without task finetuning.

Structure‑aware few‑shot instructions: the top‑K most similar molecules (MACCS fingerprints, Tanimoto similarity) are retrieved and inserted into the prompt as labeled demonstrations.

Hybrid instruction set: mix of 0–4 shot instructions spanning ~1,000 tasks (3.5 GB tokens) to balance zero‑shot and few‑shot abilities.

Extensive evaluation on 10 downstream datasets (MoleculeNet, CYP450, QM9) showing strong zero/few‑shot performance and robustness to prompt phrasing and fingerprint choices.
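The structure‑aware retrieval step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: fingerprints are represented as Python sets of on‑bit indices, the toy bit sets are made up, and the helper names `tanimoto` and `top_k_similar` are hypothetical. In practice the bit sets would come from a cheminformatics toolkit's MACCS key implementation.

```python
from typing import List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def top_k_similar(
    query: Set[int],
    pool: List[Tuple[str, Set[int], int]],
    k: int = 2,
) -> List[Tuple[str, int, float]]:
    """Return the k pool molecules most similar to the query.

    Each pool entry is (smiles, fingerprint_bits, label); each result is
    (smiles, label, similarity), sorted from most to least similar.
    """
    scored = [(smiles, label, tanimoto(query, bits)) for smiles, bits, label in pool]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:k]


# Toy data: the bit sets are invented for illustration, not real MACCS keys.
pool = [
    ("CCO", {1, 2, 3, 4}, 1),
    ("CCN", {1, 2, 9, 10}, 0),
    ("c1ccccc1", {7, 8, 11}, 0),
]
query_bits = {1, 2, 3, 5}
demos = top_k_similar(query_bits, pool, k=2)
```

The retrieved `(smiles, label)` pairs are what would be placed in the prompt as labeled demonstrations.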

Key Findings

MolecularGPT ranks top on average for few‑shot prediction across evaluated datasets.

Numbers: 2‑shot average rank = 1.1; 8‑shot = 2.1 (Tab.1)

With two in‑context examples MolecularGPT beats supervised GNNs on several classification tasks.

Numbers: Outperforms supervised GNNs on 4 of 7 classification datasets in 2‑shot

Compared to base LLaMA, MolecularGPT shows large average gains in zero‑shot prediction.

Numbers: ≈15.7% avg improvement on classification; 17.9 decrease in regression RMSE vs base LLaMA (reported)

Hybrid tuning (mix of 0–4 shot instructions) balances zero‑shot and few‑shot abilities better than tuning on a single shot setting.

Numbers: 0&4‑shot and 0–4‑shot models outperform pure 0‑shot or 4‑shot in 0/2‑shot inference (Fig.3)

Results

Few‑shot average rank

Value: 2‑shot avg rank = 1.1; 8‑shot avg rank = 2.1

Baseline: Other few‑shot methods (GIMLET, Galactica1.3B, etc.)
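For readers unfamiliar with the average‑rank metric, here is a minimal sketch of how it can be computed. The method names and scores are made up for illustration, and the tie rule (a method's rank is one plus the number of strictly better methods) is an assumption, not necessarily the rule used in the paper.

```python
from typing import Dict, List


def average_ranks(
    scores: Dict[str, List[float]], higher_is_better: bool = True
) -> Dict[str, float]:
    """Average each method's per-dataset rank (1 = best).

    scores maps a method name to one score per dataset, in the same order.
    Rank on a dataset = 1 + number of methods that are strictly better.
    """
    methods = list(scores)
    n_datasets = len(next(iter(scores.values())))
    totals = {m: 0.0 for m in methods}
    for d in range(n_datasets):
        for m in methods:
            mine = scores[m][d]
            if higher_is_better:
                better = sum(1 for o in methods if scores[o][d] > mine)
            else:
                better = sum(1 for o in methods if scores[o][d] < mine)
            totals[m] += 1 + better
    return {m: totals[m] / n_datasets for m in methods}


# Invented ROC-AUC values on three datasets (higher is better).
toy_auc = {
    "method_a": [0.80, 0.70, 0.90],
    "method_b": [0.75, 0.72, 0.85],
    "method_c": [0.60, 0.65, 0.70],
}
ranks = average_ranks(toy_auc)
```

For regression datasets scored by RMSE, the same function would be called with `higher_is_better=False`.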

2‑shot classification wins vs supervised GNNs

Value: Wins on 4 out of 7 classification tasks

Baseline: Supervised GNNs (GCN/GAT/GIN/Graphormer)

Zero‑shot improvement vs base LLaMA

Value: ≈15.7% avg increase (classification); 17.9 decrease in regression RMSE

Baseline: Base LLaMA

Zero‑shot vs GIMLET

Value: Outperforms GIMLET on 5 of 10 datasets; avg improvement noted on some classification/regression sets

Baseline: GIMLET

What To Try In 7 Days

Run MolecularGPT (public code) on a small domain‑specific task with 2 labeled examples and compare its predictions to an existing GNN baseline.

Build hybrid prompts: include a short property description plus top‑2 similar molecule demos (MACCS fingerprints) and measure AUC/RMSE.

For quick screening, replace internal prototype re‑training with zero‑ or two‑shot prompts and track the candidate triage time saved.
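A hybrid prompt of the kind described above (property description plus a few labeled demonstrations) might be assembled as in the sketch below. The template wording, the `SMILES:` / `Answer:` field names, and the function name `build_prompt` are all hypothetical; the template MolecularGPT was actually tuned on may differ, so check the published code before relying on this layout.

```python
from typing import List, Tuple


def build_prompt(
    task_description: str,
    demos: List[Tuple[str, str]],
    query_smiles: str,
) -> str:
    """Assemble a zero- or few-shot prompt.

    demos is a list of (smiles, label_text) pairs; an empty list yields a
    zero-shot prompt containing only the description and the query.
    """
    lines = [task_description.strip(), ""]
    for smiles, label in demos:
        lines.append(f"SMILES: {smiles}")
        lines.append(f"Answer: {label}")
        lines.append("")
    lines.append(f"SMILES: {query_smiles}")
    lines.append("Answer:")
    return "\n".join(lines)


# Two-shot example with an invented task description and invented labels.
prompt = build_prompt(
    "Does this molecule cross the blood-brain barrier? Reply Yes or No.",
    [("CCO", "Yes"), ("CC(=O)O", "No")],
    "CCN",
)
zero_shot = build_prompt("Predict the property.", [], "CCN")
```

Keeping the zero‑shot and few‑shot prompts on the same template makes it easy to compare the two settings on the same query molecules.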

Optimization Features

Token Efficiency

  • Maximum input length of 512 tokens; at most 4 demonstrations per instruction
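One way to respect both limits when assembling prompts is to add demonstrations in order until either the token budget or the demonstration cap is reached. The sketch below assumes a caller‑supplied token counter; the whitespace counter shown is only a stand‑in (a real LLaMA tokenizer will typically split SMILES strings into many more tokens), and the function names are hypothetical.

```python
from typing import Callable, List


def fit_demos(
    header: str,
    demos: List[str],
    query: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 512,
    max_demos: int = 4,
) -> List[str]:
    """Keep demonstrations, in order, until a token or count limit is hit."""
    used = count_tokens(header) + count_tokens(query)
    kept: List[str] = []
    for demo in demos[:max_demos]:
        cost = count_tokens(demo)
        if used + cost > max_tokens:
            break  # the next demonstration would overflow the budget
        kept.append(demo)
        used += cost
    return kept


def whitespace_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


demo_texts = ["SMILES: CCO Answer: Yes"] * 6
kept_default = fit_demos("Task header", demo_texts, "SMILES: CCN", whitespace_tokens)
kept_tight = fit_demos(
    "Task header", demo_texts, "SMILES: CCN", whitespace_tokens, max_tokens=12
)
```

If the demonstrations are already sorted by similarity, this keeps the most similar ones and drops the rest.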

Infra Optimization

  • Training done on 4×A800‑80G GPUs; inference on 1×RTX3090

Model Optimization

  • LoRA

Training Optimization

  • DeepSpeed ZeRO stage 2
  • FlashAttention‑2
  • bfloat16 mixed precision
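As a rough illustration, the ZeRO stage 2 and bfloat16 settings listed above correspond to a DeepSpeed configuration along these lines. The batch size and gradient accumulation values are placeholders assumed for the example, not values from the paper. LoRA and FlashAttention‑2 are set up in the model code rather than in this file.

```python
import json

# Sketch of a DeepSpeed config matching the listed training optimizations.
ds_config = {
    "zero_optimization": {"stage": 2},  # ZeRO stage 2: shard optimizer state and gradients
    "bf16": {"enabled": True},  # bfloat16 mixed precision
    "train_micro_batch_size_per_gpu": 4,  # assumed placeholder value
    "gradient_accumulation_steps": 8,  # assumed placeholder value
}

# DeepSpeed reads its configuration as JSON, so serialize the dict.
config_json = json.dumps(ds_config, indent=2)
```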

Reproducibility

Data URLs

  • Public datasets used: MoleculeNet, CYP450, QM9 (cited in paper)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • SMILES input ignores 3D geometry; limits capturing spatial molecular features.
  • Focuses only on property prediction; not evaluated for molecule generation or optimization.
  • Regression numeric accuracy lags supervised GNNs on some tasks; generating precise numbers is still challenging.

When Not To Use

  • When 3D geometric information (conformation) is critical.
  • When high‑precision numeric regression is required and large labeled sets are available.
  • When you need tasks beyond property prediction (generation, optimization, captioning).

Failure Modes

  • Model can learn shortcuts from demonstration labels if tuned heavily on few‑shot sets, harming zero‑shot generalization.
  • Adding many retrieval examples introduces noise and can degrade performance past ~2 demonstrations.
  • Numeric outputs (regression) can be unstable or less accurate than finetuned GNNs.

Core Entities

Models

  • MolecularGPT
  • LLaMA2-7B-chat
  • GIMLET
  • Galactica1.3B
  • LLaMA

Metrics

  • ROC‑AUC
  • RMSE
  • Average rank
  • Top‑1 dataset wins
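For quick sanity checks without extra dependencies, ROC‑AUC and RMSE can be computed as in the sketch below. This is a plain‑Python illustration with made‑up inputs, not the evaluation code used in the paper. The pairwise ROC‑AUC is quadratic in the number of samples, so a library implementation is preferable for large sets.

```python
import math
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Invented example values.
auc = roc_auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])
err = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```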

Datasets

  • MoleculeNet
  • CYP450
  • QM9
  • BACE
  • HIV
  • MUV
  • Tox21
  • ToxCast
  • BBBP
  • ESOL
  • FreeSolv
  • Lipo

Benchmarks

  • Few‑shot molecular property prediction
  • Zero‑shot molecular property prediction