Overview
Production Readiness
0.5
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
10
Why It Matters For Business
MolecularGPT lets teams try new property predictions with only two labeled examples instead of costly dataset labeling, speeding early drug and material candidate screening and reducing the need to retrain task‑specific models.
Summary TLDR
MolecularGPT fine-tunes an open LLM (LLaMA2-7B-chat) on a large set of natural-language instructions built from SMILES strings and structure-aware few-shot demonstrations. The tuned model performs zero‑ and few‑shot molecular property prediction without further task-specific training. On a suite of MoleculeNet, CYP450, and QM9 benchmarks it achieves the top average rank in both zero- and few-shot settings and beats LLaMA baselines by large margins (a reported ~15.7% average uplift on classification). With two in-context examples it matches or exceeds supervised GNNs on several classification tasks. Code is published.
Problem Statement
Molecular property models need many labeled molecules and often fail to generalize to unseen tasks. Labeling molecules is costly. The field lacks an LLM that (a) understands molecular inputs, (b) keeps zero‑shot ability, and (c) supports few‑shot in‑context learning for new property tasks without further fine‑tuning.
Main Contribution
MolecularGPT: the first instruction‑tuned LLM for general molecular property prediction that supports zero‑ and few‑shot in‑context learning (ICL) without task‑specific fine‑tuning.
Structure‑aware few‑shot instructions: the top‑K most similar molecules, retrieved by Tanimoto similarity over MACCS fingerprints, are inserted into prompts as labeled demonstrations.
Hybrid instruction set: mix of 0–4 shot instructions spanning ~1,000 tasks (3.5 GB tokens) to balance zero‑shot and few‑shot abilities.
Extensive evaluation on 10 downstream datasets (MoleculeNet, CYP450, QM9) showing strong zero/few‑shot performance and robustness to prompt phrasing and fingerprint choices.
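The structure‑aware retrieval step can be sketched as follows. This is a minimal illustration, not the authors' implementation: fingerprints are modeled as sets of on‑bit indices, and the toy bit sets below are made up. In practice MACCS keys would typically come from a cheminformatics toolkit such as RDKit (`MACCSkeys.GenMACCSKeys`); the helper names `tanimoto` and `top_k_similar` are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Fingerprint = FrozenSet[int]  # indices of the fingerprint bits that are set


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity |a & b| / |a | b|; defined here as 0.0 for two empty sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def top_k_similar(
    query: Fingerprint,
    pool: List[Tuple[str, Fingerprint, str]],
    k: int = 2,
) -> List[Tuple[str, str, float]]:
    """Return (smiles, label, similarity) for the k labeled molecules closest to the query.

    `pool` holds (smiles, fingerprint, label) triples from the labeled set.
    """
    scored = [(tanimoto(query, fp), smiles, label) for smiles, fp, label in pool]
    scored.sort(key=lambda item: item[0], reverse=True)  # most similar first
    return [(smiles, label, score) for score, smiles, label in scored[:k]]


# Toy data: these bit sets are illustrative, not real MACCS keys.
pool = [
    ("CCO", frozenset({1, 2, 3}), "Yes"),
    ("CCN", frozenset({1, 2, 4}), "No"),
    ("c1ccccc1", frozenset({7, 8, 9}), "No"),
]
query_fp = frozenset({1, 2, 3, 5})
demos = top_k_similar(query_fp, pool, k=2)
```

The returned molecules and their labels are what would be placed in the prompt as demonstrations, most similar first.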
Key Findings
MolecularGPT ranks top on average for few‑shot prediction across evaluated datasets.
With two in‑context examples MolecularGPT beats supervised GNNs on several classification tasks.
Compared to base LLaMA, MolecularGPT shows large average gains in zero‑shot prediction.
Hybrid tuning (a mix of 0–4 shot instructions) balances zero‑shot and few‑shot abilities better than tuning on a single shot count.
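One way to picture the hybrid mix is to give every training instruction its own demonstration count between 0 and 4. The sketch below assumes uniform sampling purely for illustration; the proportions actually used for MolecularGPT are not stated in this summary, and `assign_shot_counts` is a hypothetical helper.

```python
import random
from collections import Counter
from typing import List


def assign_shot_counts(n_examples: int, max_shots: int = 4, seed: int = 0) -> List[int]:
    """Assign each training instruction a demonstration count in 0..max_shots.

    Uniform sampling is an assumption made for this sketch.
    """
    rng = random.Random(seed)  # seeded so the mix is reproducible
    return [rng.randint(0, max_shots) for _ in range(n_examples)]


shots = assign_shot_counts(1000)
mix = Counter(shots)  # how many instructions ended up with 0, 1, 2, 3, 4 demos
```

A set tuned on one shot count would correspond to a constant list, for example `[0] * 1000` for zero‑shot only.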
Results
Few‑shot average rank
2‑shot classification wins vs supervised GNNs
Zero‑shot improvement vs base LLaMA
Zero‑shot vs GIMLET
Who Should Care
What To Try In 7 Days
Run MolecularGPT (public code) on a small domain‑specific task with 2 labeled examples and compare its predictions to an existing GNN baseline.
Build hybrid prompts: include a short property description plus top‑2 similar molecule demos (MACCS fingerprints) and measure AUC/RMSE.
Replace prototype re‑training in quick internal screening with zero‑ or two‑shot prompts, and track the candidate triage time saved.
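A hybrid prompt of the kind described above could be assembled as in this sketch. The template wording (`Task:`, `SMILES:`, `Answer:`) and the helper names are assumptions for illustration; the exact instruction format used by MolecularGPT should be taken from the published code.

```python
from typing import List, Optional, Tuple


def build_prompt(task_description: str, query_smiles: str,
                 demos: List[Tuple[str, str]]) -> str:
    """Assemble a k-shot prompt; an empty `demos` list yields a zero-shot prompt."""
    lines = [f"Task: {task_description}"]
    for smiles, label in demos:  # labeled demonstrations, most similar first
        lines.append(f"SMILES: {smiles}")
        lines.append(f"Answer: {label}")
    lines.append(f"SMILES: {query_smiles}")
    lines.append("Answer:")  # left open for the model to complete
    return "\n".join(lines)


def parse_yes_no(generated: str) -> Optional[int]:
    """Map the first word of the reply to 1 (Yes), 0 (No), or None if it is neither."""
    words = generated.strip().lower().split()
    first = words[0].strip(".,!:;") if words else ""
    if first == "yes":
        return 1
    if first == "no":
        return 0
    return None


prompt = build_prompt(
    "Does this molecule cross the blood-brain barrier? Answer Yes or No.",
    "CC(=O)Oc1ccccc1C(=O)O",
    [("CCO", "Yes"), ("CCN", "No")],  # placeholder demonstrations and labels
)
```

Parsed answers can then be scored against held‑out labels with ROC‑AUC for classification, or RMSE when the model is asked for a number.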
Optimization Features
Token Efficiency
- Maximum input length of 512 tokens; at most 4 demonstrations per instruction
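Staying inside these limits can be sketched as a simple budget check that keeps demonstrations in order until the prompt would exceed the cap. This is an illustrative helper, not code from the paper; `fit_demos` is a hypothetical name, and the character count used in the example is only a crude stand‑in for the model tokenizer.

```python
from typing import Callable, List


def fit_demos(base_tokens: int, demos: List[str],
              count_tokens: Callable[[str], int],
              max_tokens: int = 512, max_demos: int = 4) -> List[str]:
    """Keep demonstrations in order (most similar first) while the prompt still fits."""
    kept: List[str] = []
    used = base_tokens  # tokens already spent on the task text and the query
    for demo in demos[:max_demos]:
        cost = count_tokens(demo)
        if used + cost > max_tokens:
            break  # stop at the first demo that would overflow the budget
        kept.append(demo)
        used += cost
    return kept


demo_texts = [
    "SMILES: CCO\nAnswer: Yes\n",
    "SMILES: CCN\nAnswer: No\n",
]
# len() counts characters; swap in the real tokenizer's token count in practice.
kept = fit_demos(base_tokens=480, demos=demo_texts, count_tokens=len)
```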
Infra Optimization
- Training done on 4×A800‑80G GPUs; inference on 1×RTX3090
Model Optimization
- LoRA
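For reference, the standard LoRA formulation keeps the pretrained weight matrix frozen and learns a low‑rank update on top of it. The rank r and scaling factor α used for MolecularGPT are not given in this summary, so they stay symbolic here.

```latex
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

Only A and B are trained, so each adapted matrix adds r(d + k) trainable parameters instead of the d·k of a full update.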
Training Optimization
- DeepSpeed ZeRO stage 2
- FlashAttention‑2
- bfloat16 mixed precision
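A DeepSpeed configuration matching the ZeRO stage and precision listed above might look like the sketch below. The batch size and gradient accumulation values are placeholders, not the authors' settings. FlashAttention‑2 is not a DeepSpeed option; with Hugging Face Transformers it is usually requested when the model is loaded (`attn_implementation="flash_attention_2"`), and whether the authors enabled it that way is an assumption.

```python
import json

# Minimal DeepSpeed config: ZeRO stage 2 with bfloat16 mixed precision.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,  # placeholder value
    "gradient_accumulation_steps": 8,     # placeholder value
    "zero_optimization": {"stage": 2},    # shard optimizer states and gradients
    "bf16": {"enabled": True},            # bfloat16 mixed precision
}

# DeepSpeed reads this as JSON, for example from a file passed to the launcher.
config_text = json.dumps(ds_config, indent=2)
```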
Reproducibility
Data Urls
- Public datasets used: MoleculeNet, CYP450, QM9 (cited in paper)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- SMILES input ignores 3D geometry, which limits the model's ability to capture spatial molecular features.
- Focuses only on property prediction; not evaluated for molecule generation or optimization.
- Regression numeric accuracy lags supervised GNNs on some tasks; generating precise numbers is still challenging.
When Not To Use
- When 3D geometric information (conformation) is critical.
- When high‑precision numeric regression is required and large labeled sets are available.
- When you need tasks beyond property prediction (generation, optimization, captioning).
Failure Modes
- Model can learn shortcuts from demonstration labels if tuned heavily on few‑shot sets, harming zero‑shot generalization.
- Adding many retrieval examples introduces noise and can degrade performance past ~2 demonstrations.
- Numeric outputs (regression) can be unstable or less accurate than fine‑tuned GNNs.
Core Entities
Models
- MolecularGPT
- LLaMA2-7B-chat
- GIMLET
- Galactica-1.3B
- LLaMA
Metrics
- ROC‑AUC
- RMSE
- Average rank
- Top‑1 dataset wins
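These metrics can be computed with a few lines of plain Python, as in the sketch below. The function names are hypothetical and the scores in the example are toy values, not results from the paper. Ties between models are not given special treatment in the ranking helpers.

```python
import math
from typing import Dict, List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between targets and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def average_rank(results: Dict[str, Dict[str, float]],
                 higher_is_better: bool = True) -> Dict[str, float]:
    """results[dataset][model] = score; returns each model's mean rank (1 = best)."""
    ranks: Dict[str, List[int]] = {}
    for scores in results.values():
        ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
        for rank, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(rank)
    return {model: sum(r) / len(r) for model, r in ranks.items()}


def top1_wins(results: Dict[str, Dict[str, float]],
              higher_is_better: bool = True) -> Dict[str, int]:
    """Count the datasets on which each model has the best score."""
    pick = max if higher_is_better else min
    wins: Dict[str, int] = {}
    for scores in results.values():
        best = pick(scores, key=scores.get)
        wins[best] = wins.get(best, 0) + 1
    return wins


# Toy scores for two hypothetical models on two hypothetical datasets.
toy = {"d1": {"model_a": 0.9, "model_b": 0.8},
       "d2": {"model_a": 0.7, "model_b": 0.75}}
ranks = average_rank(toy)
wins = top1_wins(toy)
```

For RMSE, where lower is better, pass `higher_is_better=False` to the two ranking helpers.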
Datasets
- MoleculeNet
- CYP450
- QM9
- BACE
- HIV
- MUV
- Tox21
- ToxCast
- BBBP
- ESOL
- FreeSolv
- Lipo
Benchmarks
- Few‑shot molecular property prediction
- Zero‑shot molecular property prediction

