MolecularGPT — instruction‑tuned LLM that predicts molecular properties with zero‑ and few‑shot prompts

June 18, 2024 · 7 min read

Overview

Decision Snapshot: Needs Validation

The approach is practical: instruction tuning plus nearest‑neighbor demos yields reliable zero/few‑shot prediction on public benchmarks, but performance varies by task and numeric regression remains harder.

Citations: 10

Evidence Strength: 0.70

Confidence: 0.75

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 70%

Authors

Yuyan Liu, Sirui Ding, Sheng Zhou, Wenqi Fan, Qiaoyu Tan

Why It Matters For Business

MolecularGPT lets teams try new property predictions with as few as two labeled examples instead of labeling a full dataset. This speeds early drug and material candidate screening and reduces the need to retrain task-specific models.

Who Should Care

Summary TLDR

MolecularGPT fine-tunes an open LLM (LLaMA2-7B-chat) on a large set of natural-language instructions built from SMILES strings and structure-aware few-shot demonstrations. The tuned model performs zero- and few-shot molecular property prediction without further task-specific training. Across a suite of MoleculeNet, CYP450, and QM9 benchmarks, it achieves the top average rank in zero- and few-shot settings and beats LLaMA baselines by a large margin (a reported average uplift of about 15.7% on classification). With two in-context examples it outperforms supervised GNNs on 4 of 7 classification datasets. Code is published.

Problem Statement

Molecular property models need many labeled molecules and often fail to generalize to unseen tasks. Labeling molecules is costly. The field lacks an LLM that (a) understands molecular inputs, (b) keeps zero‑shot ability, and (c) supports few‑shot in‑context learning for new property tasks without further fine‑tuning.

Main Contribution

MolecularGPT: first instruction-tuned LLM for generic molecular property prediction that supports zero- and few-shot in-context learning (ICL) without task-specific fine-tuning.

Structure-aware few-shot instructions: the top-K most similar molecules, retrieved by Tanimoto similarity over MACCS fingerprints, are inserted into the prompt as labeled demonstrations.
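The retrieval step can be sketched as follows. This is a minimal illustration, not the authors' implementation: fingerprints are represented as sets of on-bit indices, the bit sets and labels are made up, and `tanimoto` and `top_k_similar` are hypothetical helper names. In practice the MACCS keys would be computed with a cheminformatics toolkit.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def top_k_similar(
    query: FrozenSet[int],
    pool: Dict[str, Tuple[FrozenSet[int], str]],
    k: int = 2,
) -> List[Tuple[str, str, float]]:
    """Return the k labeled molecules most similar to the query, best first."""
    scored = [
        (smiles, label, tanimoto(query, fp))
        for smiles, (fp, label) in pool.items()
    ]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:k]


# Hypothetical on-bit sets and labels; these are NOT real MACCS keys or measurements.
pool = {
    "CCO": (frozenset({1, 5, 9}), "No"),
    "CCN": (frozenset({1, 5, 12, 30}), "Yes"),
    "c1ccccc1": (frozenset({20, 21, 22}), "No"),
}
query_fp = frozenset({1, 5, 9, 12})

demos = top_k_similar(query_fp, pool, k=2)
for smiles, label, score in demos:
    print(f"{smiles}\t{label}\t{score:.2f}")
```

The selected molecules and their labels would then be placed in the prompt ahead of the query molecule.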

Key Findings

MolecularGPT ranks top on average for few‑shot prediction across evaluated datasets.

Numbers: 2-shot average rank = 1.1; 8-shot = 2.1 (Table 1)

Practical Use: Use MolecularGPT as a drop-in few-shot predictor when labeled examples per new task are very limited.

Evidence Ref: Section 4.2; Table 1

With two in‑context examples MolecularGPT beats supervised GNNs on several classification tasks.

Numbers: Outperforms supervised GNNs on 4 of 7 classification datasets in 2-shot

Practical Use: When labeling budget is tiny (≈2 examples), prefer MolecularGPT over re-training GNNs for fast prototyping.

Evidence Ref: Abstract; Section 4.2; Table 1

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Few-shot average rank | 2-shot avg rank = 1.1; 8-shot avg rank = 2.1 | Other few-shot methods (GIMLET, Galactica1.3B, etc.) | Best average rank across compared models | Aggregated across classification downstream datasets (Table 1) | Table 1; Section 4.2 | Table 1 |
| 2-shot classification wins vs supervised GNNs | Wins on 4 out of 7 classification tasks | Supervised GNNs (GCN/GAT/GIN/Graphormer) | Outperforms per-dataset on 4/7 | Selected classification datasets (Table 1) | Section 4.2; Table 1 | Table 1 |
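For readers unfamiliar with the average-rank metric, the sketch below shows one common way to compute it: rank the models on each dataset, then average each model's ranks. The scores are invented for illustration, `average_ranks` is a hypothetical helper, and the tie handling (ties broken by input order) is an assumption, since this summary does not state the paper's convention.

```python
from typing import Dict, List


def average_ranks(
    scores: Dict[str, List[float]], higher_is_better: bool = True
) -> Dict[str, float]:
    """Average per-dataset rank for each model (rank 1 = best).

    scores maps a model name to one score per dataset, in the same order.
    Ties are broken by input order, which is an assumption of this sketch.
    """
    models = list(scores)
    n_datasets = len(next(iter(scores.values())))
    totals = {model: 0.0 for model in models}
    for d in range(n_datasets):
        ordered = sorted(
            models, key=lambda m: scores[m][d], reverse=higher_is_better
        )
        for rank, model in enumerate(ordered, start=1):
            totals[model] += rank
    return {model: totals[model] / n_datasets for model in models}


# Invented ROC-AUC values on three datasets; not results from the paper.
example_scores = {
    "model_a": [0.80, 0.70, 0.90],
    "model_b": [0.75, 0.72, 0.85],
    "model_c": [0.60, 0.65, 0.70],
}
ranks = average_ranks(example_scores)
print(ranks)
```

For error metrics such as RMSE, pass `higher_is_better=False` so that the lowest value receives rank 1.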

What To Try In 7 Days

Run MolecularGPT (public code) on a small domain-specific task with 2 labeled examples and compare predictions to an existing GNN baseline.

Build hybrid prompts: include a short property description plus top‑2 similar molecule demos (MACCS fingerprints) and measure AUC/RMSE.

For quick screening, replace internal prototype retraining with zero- or two-shot prompts, and track the candidate triage time saved.
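A hybrid prompt of the kind suggested above could be assembled roughly as follows. The template wording, the `build_prompt` name, and the example labels are assumptions made for illustration; the actual instruction format used by MolecularGPT should be taken from the published code.

```python
from typing import List, Tuple


def build_prompt(
    task_description: str,
    demos: List[Tuple[str, str]],
    query_smiles: str,
) -> str:
    """Assemble a prompt: property description, labeled demos, then the query.

    An empty demos list yields a zero-shot prompt.
    """
    lines = [task_description.strip(), ""]
    for smiles, label in demos:
        lines.append(f"SMILES: {smiles}")
        lines.append(f"Answer: {label}")
        lines.append("")
    lines.append(f"SMILES: {query_smiles}")
    lines.append("Answer:")
    return "\n".join(lines)


description = "Does this molecule penetrate the blood-brain barrier? Answer Yes or No."
# Placeholder labels for illustration only; not real measurements.
two_shot = build_prompt(description, [("CCO", "Yes"), ("CCN", "No")], "CCC")
zero_shot = build_prompt(description, [], "CCC")
print(two_shot)
```

Keeping the zero-shot and two-shot prompts on the same template makes it easier to attribute any metric change to the demonstrations alone.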

Optimization Features

Token Efficiency
512 token max input length used; at most 4 examples in instruction
Infra Optimization
Training done on 4×A800‑80G GPUs; inference on 1×RTX3090
Model Optimization
LoRA
Training Optimization
DeepSpeed ZeRO stage 2; FlashAttention-2; bfloat16 mixed precision
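The limits listed under Token Efficiency (512-token input, at most 4 demonstrations) can be enforced when assembling prompts. The sketch below is an assumption-laden illustration: `fit_demos` is a hypothetical helper, token counts use whitespace splitting as a stand-in for the model's real tokenizer, and demonstrations are assumed to arrive sorted from most to least similar.

```python
from typing import Callable, List


def whitespace_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: count whitespace-separated pieces."""
    return len(text.split())


def fit_demos(
    instruction: str,
    demos: List[str],
    query: str,
    max_tokens: int = 512,
    max_demos: int = 4,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[str]:
    """Keep demos in order until the demo cap or the token budget is reached."""
    used = count_tokens(instruction) + count_tokens(query)
    kept: List[str] = []
    for demo in demos[:max_demos]:
        cost = count_tokens(demo)
        if used + cost > max_tokens:
            break
        kept.append(demo)
        used += cost
    return kept


# Six candidate demos with placeholder labels, assumed sorted by similarity.
candidates = [f"SMILES: C{'C' * i} Answer: Yes" for i in range(6)]
kept = fit_demos("Predict the property.", candidates, "SMILES: CCO Answer:")
print(len(kept))
```

Swapping `count_tokens` for the model tokenizer's length function would give the true budget; the whitespace count will generally underestimate it for SMILES strings.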

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Public datasets used: MoleculeNet, CYP450, QM9 (cited in paper)

Risks & Boundaries

Limitations

SMILES input ignores 3D geometry, which limits the model's ability to capture spatial molecular features.

Focuses only on property prediction; not evaluated for molecule generation or optimization.

When Not To Use

When 3D geometric information (conformation) is critical.

When high‑precision numeric regression is required and large labeled sets are available.

Failure Modes

Model can learn shortcuts from demonstration labels if tuned heavily on few‑shot sets, harming zero‑shot generalization.

Adding more retrieved examples introduces noise and can degrade performance beyond about 2 demonstrations.

Core Entities

Models

MolecularGPT, LLaMA2-7B-chat, GIMLET, Galactica1.3B, LLaMA

Metrics

ROC-AUC, RMSE, Average rank, Top-1 dataset wins
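As a reference for the two per-dataset metrics, here is a dependency-free sketch of ROC-AUC (classification) and RMSE (regression). It uses the pairwise definition of AUC, which is quadratic in the number of samples and intended only for small checks. The function names and inputs are illustrative, and a standard metrics library would normally be used instead.

```python
import math
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between targets and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Toy inputs for illustration; not values from the paper.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
err = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(auc, err)
```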

Datasets

MoleculeNet, CYP450, QM9, BACE, HIV, MUV, Tox21, ToxCast, BBBP, ESOL, FreeSolv, Lipo

Benchmarks

Few-shot molecular property prediction, Zero-shot molecular property prediction