MolecularGPT — instruction‑tuned LLM that predicts molecular properties with zero‑ and few‑shot prompts

Overview

Decision Snapshot: Needs Validation

The approach is practical: instruction tuning plus nearest‑neighbor demos yields reliable zero/few‑shot prediction on public benchmarks, but performance varies by task and numeric regression remains harder.

Citations: 10

Evidence Strength: 0.70

Confidence: 0.75

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 70%

Authors

Yuyan Liu, Sirui Ding, Sheng Zhou, Wenqi Fan, Qiaoyu Tan

Links

Abstract / PDF / Code / Data

Why It Matters For Business

MolecularGPT lets teams try new property predictions with two labeled examples instead of labeling a full dataset, which speeds early drug and material candidate screening and reduces the need to retrain task‑specific models.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO, Founder

Summary TLDR

MolecularGPT fine-tunes an open LLM (LLaMA2-7B-chat) with a large set of natural-language instructions built from SMILES strings and structure-aware few-shot demonstrations. The tuned model runs zero‑ and few‑shot molecular property prediction without further task-specific training. On a suite of MoleculeNet/CYP450/QM9 benchmarks it achieves top average ranks for zero- and few-shot settings, beats LLaMA baselines by large margins (reported ~15.7% avg. uplift on classification vs LLaMA) and with 2-shot matches or exceeds supervised GNNs on several classification tasks. Code is published.

Problem Statement

Molecular property models need many labeled molecules and often fail to generalize to unseen tasks. Labeling molecules is costly. The field lacks an LLM that (a) understands molecular inputs, (b) keeps zero‑shot ability, and (c) supports few‑shot in‑context learning for new property tasks without further fine‑tuning.

Main Contribution

MolecularGPT: first instruction‑tuned LLM for generic molecular property prediction that supports zero‑ and few‑shot in‑context learning (ICL) without task finetuning.

Structure‑aware few‑shot instructions: retrieval of top‑K similar molecules (MACCS fingerprints, Tanimoto) inserted as labeled demonstrations in prompts.
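The retrieval step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: fingerprints are represented as plain sets of on-bit indices, the bit sets in the toy pool are hypothetical, and the helper names `tanimoto` and `top_k_similar` are invented for this example. In practice the MACCS bits would come from a cheminformatics toolkit.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def top_k_similar(query: Set[int], pool: Dict[str, Set[int]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k pool molecules most similar to the query, best first."""
    scored = [(smiles, tanimoto(query, bits)) for smiles, bits in pool.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical on-bit sets, chosen only to make the arithmetic easy to follow.
pool = {
    "CCO": {1, 4, 9, 20},
    "CCN": {1, 4, 11, 20},
    "c1ccccc1": {2, 30, 31, 40},
}
query_bits = {1, 4, 9, 21}
neighbors = top_k_similar(query_bits, pool, k=2)
```

The retrieved neighbors, together with their known labels, are what would be placed in the prompt as demonstrations.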

Key Findings

MolecularGPT ranks top on average for few‑shot prediction across evaluated datasets.

Numbers: 2‑shot average rank = 1.1; 8‑shot = 2.1 (Table 1)

Practical Use: Use MolecularGPT as a drop‑in few‑shot predictor when labeled examples per new task are very limited.

Evidence Ref: Section 4.2; Table 1

With two in‑context examples MolecularGPT beats supervised GNNs on several classification tasks.

Numbers: Outperforms supervised GNNs on 4 of 7 classification datasets in 2‑shot

Practical Use: When labeling budget is tiny (≈2 examples), prefer MolecularGPT over re‑training GNNs for fast prototyping.

Evidence Ref: Abstract; Section 4.2; Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Few‑shot average rank	2‑shot avg rank = 1.1; 8‑shot avg rank = 2.1	Other few‑shot methods (GIMLET, Galactica1.3B, etc.)	Best average rank across compared models	Aggregated across classification downstream datasets (Table 1)	Table 1; Section 4.2	Table 1
2‑shot classification wins vs supervised GNNs	Wins on 4 out of 7 classification tasks	Supervised GNNs (GCN/GAT/GIN/Graphormer)	Outperforms per‑dataset on 4/7	Selected classification datasets (Table 1)	Section 4.2; Table 1	Table 1
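To make the "average rank" metric concrete, the sketch below ranks models per dataset and averages the ranks. The exact ranking procedure used in the paper is not given here, so the tie handling is an assumption, and the model names and scores are invented toy values.

```python
from typing import Dict, List


def average_ranks(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Average per-dataset rank for each model; higher score is better, rank 1 is best.

    Tied models share the best rank in the tie (an assumption made for simplicity).
    """
    models = list(scores)
    n_datasets = len(next(iter(scores.values())))
    totals = {m: 0.0 for m in models}
    for d in range(n_datasets):
        column = [scores[m][d] for m in models]
        for m in models:
            # rank = 1 + number of models strictly better on this dataset
            totals[m] += 1 + sum(1 for s in column if s > scores[m][d])
    return {m: totals[m] / n_datasets for m in models}


# Made-up ROC-AUC values for three hypothetical models on three datasets.
toy_scores = {
    "model_a": [0.80, 0.70, 0.90],
    "model_b": [0.75, 0.72, 0.85],
    "model_c": [0.60, 0.65, 0.70],
}
ranks = average_ranks(toy_scores)
```

An average rank close to 1 means a model is the best or near-best on almost every dataset, which is how the 1.1 figure for the 2‑shot setting should be read.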

What To Try In 7 Days

Run MolecularGPT (public code) on a small, domain task with 2 labeled examples and compare predictions to an existing GNN baseline.

Build hybrid prompts: include a short property description plus top‑2 similar molecule demos (MACCS fingerprints) and measure AUC/RMSE.

Replace internal prototype re‑training for quick screening by deploying zero‑ or two‑shot prompts and track candidate triage time saved.
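A hybrid prompt of the kind suggested above might be assembled as follows. The template wording, the `build_prompt` helper, and the demonstration labels are all placeholders for illustration; the instruction format actually used by MolecularGPT should be taken from the public code.

```python
from typing import List, Tuple


def build_prompt(description: str, demos: List[Tuple[str, str]], query_smiles: str) -> str:
    """Assemble a few-shot prompt: task description, labeled demos, then the query."""
    blocks = [f"Task: {description}"]
    for smiles, label in demos:
        blocks.append(f"SMILES: {smiles}\nAnswer: {label}")
    # The query comes last with an empty answer slot for the model to complete.
    blocks.append(f"SMILES: {query_smiles}\nAnswer:")
    return "\n\n".join(blocks)


prompt = build_prompt(
    "Does this molecule cross the blood-brain barrier? Answer Yes or No.",
    [("CCO", "Yes"), ("CC(=O)O", "No")],  # placeholder labels, not real measurements
    "CCN",
)
```

Keeping prompt assembly in one function makes it easy to compare zero‑shot (empty `demos`) against two‑shot prompts on the same queries.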

Optimization Features

Token Efficiency

Maximum input length of 512 tokens; at most 4 examples per instruction
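One simple way to respect both limits when building prompts is to add demonstrations greedily until either cap is reached. The sketch below is an assumption about how this could be done, not the authors' procedure; `fit_demos` is a hypothetical helper, and whitespace splitting stands in for the model's real tokenizer, which would count more tokens per SMILES string.

```python
from typing import Callable, List


def fit_demos(
    query: str,
    demos: List[str],
    max_tokens: int = 512,
    max_demos: int = 4,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> List[str]:
    """Keep demos in the given order until the demo cap or the token budget is hit."""
    kept: List[str] = []
    used = count_tokens(query)
    for demo in demos[:max_demos]:
        cost = count_tokens(demo)
        if used + cost > max_tokens:
            break  # the next demo would overflow the input window
        kept.append(demo)
        used += cost
    return kept


# Six candidate demos are available, but at most four are kept.
kept = fit_demos("SMILES: CCN Answer:", ["SMILES: CCO Answer: Yes"] * 6)
```

Passing the real tokenizer's length function as `count_tokens` gives an accurate budget without changing the rest of the logic.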

Infra Optimization

Training done on 4×A800‑80G GPUs; inference on 1×RTX3090

Model Optimization

LoRA

Training Optimization

DeepSpeed ZeRO stage 2, FlashAttention‑2, bfloat16 mixed precision
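For readers who want to reproduce a similar setup, the two DeepSpeed settings named above map onto a small configuration like the one below. This is a sketch containing only what is stated here: batch size, learning rate, and LoRA hyperparameters are not given and are deliberately left out. FlashAttention‑2 is usually switched on when the model is loaded rather than in this file; whether the authors did it that way is an assumption.

```python
import json

# Minimal DeepSpeed-style config covering only ZeRO stage 2 and bfloat16.
# All other training settings are unknown here and intentionally omitted.
ds_config = {
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": True},
}

# Serialized form, as it would be written to a JSON config file.
config_json = json.dumps(ds_config, indent=2)
```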

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/NYUSHCS/MolecularGPT

Data URLs

Public datasets used: MoleculeNet, CYP450, QM9 (cited in paper)

Risks & Boundaries

Limitations

SMILES input ignores 3D geometry, which limits the model's ability to capture spatial molecular features.

Focuses only on property prediction; not evaluated for molecule generation or optimization.

When Not To Use

When 3D geometric information (conformation) is critical.

When high‑precision numeric regression is required and large labeled sets are available.

Failure Modes

Model can learn shortcuts from demonstration labels if tuned heavily on few‑shot sets, harming zero‑shot generalization.

Adding many retrieval examples introduces noise and can degrade performance past ~2 demonstrations.

Core Entities

Models

MolecularGPT, LLaMA2-7B-chat, GIMLET, Galactica1.3B, LLaMA

Metrics

ROC‑AUC, RMSE, Average rank, Top‑1 dataset wins
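For a quick local comparison against a baseline, the two core metrics can be computed without extra dependencies. The functions below are a plain-Python sketch using the standard definitions (pairwise ROC‑AUC with ties counted as half, and root mean squared error); they are not taken from the paper's evaluation code.

```python
import math
from typing import List


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rmse(y_true: List[float], y_pred: List[float]) -> float:
    """Root mean squared error between targets and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Toy values for illustration only.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
err = rmse([1.0, 2.0], [1.0, 4.0])
```

The pairwise AUC is quadratic in the number of samples, which is fine for small screening sets; a library implementation is preferable for large ones.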

Datasets

MoleculeNet, CYP450, QM9, BACE, HIV, MUV, Tox21, ToxCast, BBBP, ESOL, FreeSolv, Lipo

Benchmarks

Few‑shot molecular property prediction, Zero‑shot molecular property prediction
