Overview
The approach is practical: instruction tuning plus nearest‑neighbor demos yields reliable zero/few‑shot prediction on public benchmarks, but performance varies by task and numeric regression remains harder.
Citations: 10
Evidence Strength: 0.70
Confidence: 0.75
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 50%
Novelty: 70%
Why It Matters For Business
MolecularGPT lets teams try new property predictions with as few as two labeled examples instead of labeling a full dataset. This speeds up early drug and material candidate screening and reduces the need to retrain task-specific models.
Summary TLDR
MolecularGPT fine-tunes an open LLM (LLaMA2-7B-chat) on a large set of natural-language instructions built from SMILES strings and structure-aware few-shot demonstrations. The tuned model performs zero- and few-shot molecular property prediction without further task-specific training. On a suite of MoleculeNet, CYP450, and QM9 benchmarks it achieves the top average rank in zero- and few-shot settings, beats LLaMA baselines by a large margin (a reported ~15.7% average uplift on classification), and with two in-context examples matches or exceeds supervised GNNs on several classification tasks. Code is published.
Problem Statement
Molecular property models need many labeled molecules and often fail to generalize to unseen tasks. Labeling molecules is costly. The field lacks an LLM that (a) understands molecular inputs, (b) keeps zero‑shot ability, and (c) supports few‑shot in‑context learning for new property tasks without further fine‑tuning.
Main Contribution
MolecularGPT: the first instruction-tuned LLM for generic molecular property prediction that supports zero- and few-shot in-context learning (ICL) without task-specific fine-tuning.
Structure-aware few-shot instructions: the top-K most similar molecules, retrieved by Tanimoto similarity over MACCS fingerprints, are inserted into the prompt as labeled demonstrations.
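The retrieval step can be illustrated with a minimal sketch. It assumes each molecule's fingerprint is already available as a set of on-bit indices; in practice these would come from a cheminformatics toolkit's MACCS keys, which is not used here. The function names, the toy pool, and the bit indices are hypothetical and are not taken from the MolecularGPT code.

```python
from typing import Dict, FrozenSet, List, Tuple

# A binary fingerprint represented by the indices of its "on" bits.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity |A & B| / |A | B| on sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def top_k_similar(
    query: Fingerprint,
    pool: Dict[str, Tuple[Fingerprint, str]],
    k: int = 2,
) -> List[Tuple[str, str, float]]:
    """Return (smiles, label, similarity) for the k pool molecules closest to the query."""
    scored = [
        (smiles, label, tanimoto(query, fp))
        for smiles, (fp, label) in pool.items()
    ]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:k]


# Toy pool: bit indices and labels are placeholders, not real MACCS keys or measurements.
pool = {
    "CCO": (frozenset({1, 2, 3}), "Yes"),
    "CCN": (frozenset({1, 2, 4}), "No"),
    "c1ccccc1": (frozenset({7, 8, 9}), "No"),
}
query_fp = frozenset({1, 2, 3, 5})
demos = top_k_similar(query_fp, pool, k=2)
```

The returned molecules and labels would then be written into the prompt as demonstrations, most similar first.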
Key Findings
MolecularGPT achieves the best average rank for few-shot prediction across the evaluated classification datasets (2-shot average rank 1.1).
With two in-context examples, MolecularGPT outperforms supervised GNNs (GCN/GAT/GIN/Graphormer) on 4 of 7 classification tasks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Few‑shot average rank | 2‑shot avg rank = 1.1; 8‑shot avg rank = 2.1 | Other few‑shot methods (GIMLET, Galactica1.3B, etc.) | Best average rank across compared models | Aggregated across classification downstream datasets (Table 1) | Table 1; Section 4.2 | Table 1 |
| 2‑shot classification wins vs supervised GNNs | Wins on 4 out of 7 classification tasks | Supervised GNNs (GCN/GAT/GIN/Graphormer) | Outperforms per‑dataset on 4/7 | Selected classification datasets (Table 1) | Section 4.2; Table 1 | Table 1 |
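For readers unfamiliar with the "average rank" metric in the table, here is a minimal sketch of one common way to compute it. The source does not specify the ranking procedure, so the tie handling (tied models share the mean of their ranks) is an assumption, and the model names and scores are made up for illustration.

```python
from typing import Dict, List


def average_ranks(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Average per-dataset rank of each model (higher score is better, rank 1 is best).

    Tied models share the mean of the rank positions they occupy.
    """
    models = list(scores)
    n_datasets = len(next(iter(scores.values())))
    totals = {m: 0.0 for m in models}
    for d in range(n_datasets):
        ordered = sorted(models, key=lambda m: scores[m][d], reverse=True)
        i = 0
        while i < len(ordered):
            # Extend j to cover every model tied with ordered[i] on this dataset.
            j = i
            while j + 1 < len(ordered) and scores[ordered[j + 1]][d] == scores[ordered[i]][d]:
                j += 1
            shared_rank = ((i + 1) + (j + 1)) / 2  # mean of ranks i+1 .. j+1
            for m in ordered[i:j + 1]:
                totals[m] += shared_rank
            i = j + 1
    return {m: totals[m] / n_datasets for m in models}


# Made-up AUC-like scores for three hypothetical models on three datasets.
toy_scores = {
    "model_a": [0.80, 0.70, 0.90],
    "model_b": [0.75, 0.72, 0.85],
    "model_c": [0.60, 0.65, 0.85],
}
ranks = average_ranks(toy_scores)
```

For error metrics such as RMSE, where lower is better, the scores would need to be negated (or the sort order flipped) before ranking.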
What To Try In 7 Days
Run MolecularGPT (public code) on a small domain-specific task with 2 labeled examples and compare its predictions to an existing GNN baseline.
Build hybrid prompts that combine a short property description with the two most similar labeled molecules (retrieved via MACCS fingerprints), then measure AUC for classification or RMSE for regression.
For quick screening, replace prototype retraining with zero- or two-shot prompts and track the candidate triage time saved.
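The hybrid prompt described above could be assembled along these lines. This is a sketch under stated assumptions: the template wording, the `build_prompt` helper, and the placeholder property and labels are hypothetical and do not reproduce the instruction format used by the released MolecularGPT code.

```python
from typing import List, Tuple


def build_prompt(
    task_description: str,
    demos: List[Tuple[str, str]],
    query_smiles: str,
) -> str:
    """Assemble a prompt: property description, labeled demos, then the unlabeled query."""
    lines = [task_description.strip(), ""]
    for smiles, label in demos:
        lines.append(f"SMILES: {smiles}")
        lines.append(f"Answer: {label}")
        lines.append("")
    lines.append(f"SMILES: {query_smiles}")
    lines.append("Answer:")
    return "\n".join(lines)


# Placeholder property and labels; not real measurements.
prompt = build_prompt(
    "Does this molecule have property X? Answer Yes or No.",
    [("CCO", "Yes"), ("CC(=O)O", "No")],
    "CCN",
)
print(prompt)
```

Passing an empty `demos` list yields the zero-shot variant of the same prompt, which makes it straightforward to compare zero-shot and two-shot results on the same queries.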
Risks & Boundaries
Limitations
SMILES input ignores 3D geometry, which limits the model's ability to capture spatial molecular features.
The work focuses only on property prediction; it is not evaluated for molecule generation or optimization.
When Not To Use
When 3D geometric information (conformation) is critical.
When high‑precision numeric regression is required and large labeled sets are available.
Failure Modes
The model can learn shortcuts from demonstration labels if tuned heavily on few-shot sets, which harms zero-shot generalization.
Adding many retrieved examples introduces noise and can degrade performance beyond ~2 demonstrations.

