Chemistry foundation models power structure-focused multimodal RAG inside hierarchical multi-agent workflows

August 21, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

3

Authors

Nathaniel H. Park, Tiffany J. Callahan, James L. Hedrick, Tim Erdmann, Sara Capponi

Links

Abstract / PDF

Why It Matters For Business

Structure-aware embeddings let search and agents find chemical analogs and spectra faster, cutting researcher time for design and analysis and enabling automated, multimodal retrieval inside lab-facing agent workflows.

Summary TLDR

This paper shows that a chemistry foundation model (MoLFormer) can serve as an embedding model for structure-focused retrieval across small molecules, polymers, and reactions. The authors build large Milvus vector stores (~2.5M small molecules, ~2.5M polymers, ~2M reactions) and show that vector math (addition, subtraction, scaling, averaging) and scalar weighting (molecular weight, Mn, dispersity) change search behavior. They also pair MoLFormer embeddings with OpenCLIP image embeddings to search spectra images. The vector stores are exposed as tools inside a hierarchical, self-reflective multi-agent RAG system (LangGraph + LangChain) that answers chemistry queries. Code and data will be released on publication.

Problem Statement

Standard RAG in chemistry relies on text embeddings and fingerprints, which struggle to retrieve information by chemical structure or from images such as spectra. Researchers need semantic, structure-aware retrieval that covers molecules, polymers, reaction SMILES, and characterization images, and that plugs into agent workflows.

Main Contribution

Demonstrate that MoLFormer embeddings enable structure-focused semantic retrieval for small molecules, polymers, and reactions.

Show vector arithmetic (add/sub/average) and scalar weighting (molecular weight, Mn, dispersity) steer retrieval results toward functional or property-based analogs.

Combine MoLFormer (structure) with OpenCLIP (images) to enable multimodal searches of characterization images like NMR.

Integrate vector stores as retrievers inside a hierarchical multi-agent, self-reflective RAG pipeline (LangGraph) with specialized worker agents.

Assemble large embedded collections used for evaluation: ~2.5M small molecules, ~2.5M polymers, ~2M reactions, and >1M synthetic polymers.
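
The vector arithmetic described above can be illustrated with a small sketch. The 4-dimensional vectors below are hand-made stand-ins, not real MoLFormer embeddings, and the helper names (add, sub, average, rank) are hypothetical. The toy data is deliberately constructed so that "phenol - benzene + toluene" lands on "cresol"; real embeddings will not behave this cleanly.

```python
import math

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def average(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# ASSUMPTION: toy vectors, not output from any embedding model.
toy_embeddings = {
    "benzene": [1.0, 0.0, 0.0, 0.0],
    "phenol": [1.0, 1.0, 0.0, 0.0],
    "toluene": [1.0, 0.0, 1.0, 0.0],
    "cresol": [1.0, 1.0, 1.0, 0.0],
}

def rank(query, store):
    """Return (name, cosine) pairs sorted from most to least similar."""
    scored = [(name, cosine(query, vec)) for name, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Subtract the shared core, then add it back on a different scaffold.
analogy_query = add(
    sub(toy_embeddings["phenol"], toy_embeddings["benzene"]),
    toy_embeddings["toluene"],
)
# Averaging two embeddings asks for a "hybrid" of both.
hybrid_query = average([toy_embeddings["phenol"], toy_embeddings["toluene"]])

analogy_hits = rank(analogy_query, toy_embeddings)
hybrid_hits = rank(hybrid_query, toy_embeddings)
```

In practice the query vector would be sent to the vector store rather than ranked in a Python loop; the loop only makes the arithmetic visible.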

Key Findings

MoLFormer embeddings retrieve structurally close small-molecule analogs even when fingerprint metrics disagree.

Numbers: 2.5M small-molecule collection; cosine similarity up to 1.00 for identical hits

Vector arithmetic (add/sub/avg) on MoLFormer embeddings yields meaningful functional-group or hybrid analogues.

Numbers: Top hits often show cosine similarity >=0.87 in illustrative queries

Embedding polymers as weighted combinations of component embeddings lets you bias retrieval toward structure or properties.

Numbers: >1M synthetic polymers; different embedding formulas gave high cosine but divergent Euclidean similarity

Multimodal retrieval of characterization images works by cross-referencing MoLFormer (structure) and OpenCLIP (image) embeddings.

Numbers: NMR example top-match L2 distance = 0.0109

A hierarchical, self-reflective multi-agent RAG system can use those vector stores as tools and produce validated reports.

Numbers: Demonstrations used 250k-entry subsets for agent demos; outputs checked by a domain expert
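
One way to see why two polymer embedding formulas can agree on cosine similarity yet diverge on Euclidean distance is the sketch below. Both formulas are assumptions made for illustration and are not taken from the paper: the first is a mole-fraction-weighted sum of monomer vectors, the second multiplies that result by a scalar property (a hypothetical Mn in kDa). The monomer vectors are toy values.

```python
import math

def weighted_sum(vectors, weights):
    """Combine component vectors using one weight per component."""
    return [
        sum(w * v[i] for v, w in zip(vectors, weights))
        for i in range(len(vectors[0]))
    ]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# ASSUMPTION: toy 3-dimensional monomer vectors and mole fractions.
monomer_a = [0.2, 0.8, 0.1]
monomer_b = [0.6, 0.1, 0.7]
mole_fractions = [0.75, 0.25]

# Formula 1 (hypothetical): structure only.
structure_only = weighted_sum([monomer_a, monomer_b], mole_fractions)

# Formula 2 (hypothetical): the same vector scaled by a scalar property.
mn_kda = 12.0
property_scaled = [mn_kda * x for x in structure_only]

cos_between = cosine(structure_only, property_scaled)
l2_between = math.dist(structure_only, property_scaled)
```

Scaling changes the vector's length but not its direction, so cosine stays near 1 while the L2 distance grows with the scalar. The choice of metric therefore decides whether a scalar weight influences ranking at all.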

Results

collection_size

Value: 2.5M small molecules

top_match_cosine

Value: 1.00 (identical compound)

nmr_image_match_l2

Value: 0.0109 (top rank L2)

polymer_synthetic_dataset_size

Value: >1M synthetic polymers

agent_demo_collection

Value: 250k entries (demo subset)

Who Should Care

What To Try In 7 Days

Embed a small chemical subset with MoLFormer, index it in Milvus, and compare the retrieved neighbors against fingerprint-based similarity rankings.

Test vector math (add/sub/average) on embeddings to find hybrid functional-group analogs.

Embed a few spectra images with OpenCLIP and link them to structure embeddings for multimodal lookup.
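
For the first experiment, a simple way to quantify how much embedding retrieval and fingerprint retrieval agree is top-k overlap. The sketch below assumes you already have similarity scores per candidate from both methods; the candidate names, the scores, and the helper names top_k and overlap_at_k are all hypothetical.

```python
def top_k(scores, k):
    """Return the k candidate names with the highest scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

def overlap_at_k(scores_a, scores_b, k):
    """Fraction of the top-k candidates that both rankings share."""
    shared = set(top_k(scores_a, k)) & set(top_k(scores_b, k))
    return len(shared) / k

# ASSUMPTION: made-up scores for one query molecule.
embedding_scores = {"cand-1": 0.97, "cand-2": 0.91, "cand-3": 0.88, "cand-4": 0.52}
fingerprint_scores = {"cand-1": 0.81, "cand-2": 0.35, "cand-3": 0.40, "cand-4": 0.66}

agreement_at_2 = overlap_at_k(embedding_scores, fingerprint_scores, 2)
```

A low overlap is not a verdict on either method; it flags queries worth inspecting by hand.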

Agent Features

Memory

  • retrieval memory via external Milvus vector collections
  • cross-referenced metadata linking structure and image vectors
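
The cross-referencing idea can be sketched with two in-memory dictionaries standing in for the structure and image collections. The schema (a structure_id field on each image record), the IDs, and the vectors are assumptions for illustration; a real deployment would store these as metadata fields in the vector database.

```python
import math

# ASSUMPTION: toy vectors; structure and image spaces have different sizes,
# as they would when produced by two different embedding models.
structure_store = {
    "mol-1": [0.9, 0.1, 0.0],
    "mol-2": [0.1, 0.9, 0.2],
}
image_store = {
    "img-a": {"vector": [0.2, 0.7], "structure_id": "mol-1"},
    "img-b": {"vector": [0.8, 0.3], "structure_id": "mol-2"},
    "img-c": {"vector": [0.75, 0.35], "structure_id": "mol-2"},
}

def nearest(query, vectors):
    """Return the key whose vector has the smallest L2 distance to query."""
    return min(vectors, key=lambda key: math.dist(query, vectors[key]))

def images_for_structure(structure_query):
    """Find the nearest structure, then follow metadata links to its images."""
    structure_id = nearest(structure_query, structure_store)
    return sorted(
        image_id
        for image_id, record in image_store.items()
        if record["structure_id"] == structure_id
    )

def structure_for_image(image_query):
    """Find the nearest image, then follow its link back to a structure."""
    image_vectors = {key: rec["vector"] for key, rec in image_store.items()}
    return image_store[nearest(image_query, image_vectors)]["structure_id"]
```

The two embedding spaces are never compared directly; the shared ID is what connects them.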

Planning

  • adaptive query analysis (routing)
  • iterative retrieval and critique loops

Tool Use

  • vector-store retrievers (Milvus) as agent tools
  • embedding models (MoLFormer, OpenCLIP) called by agents

Frameworks

  • LangGraph
  • LangChain

Is Agentic

true

Architectures

  • hierarchical supervisor-worker multi-agent
  • self-reflective RAG worker agents

Collaboration

  • supervisor routes tasks to specialized worker agents
  • workers exchange intermediate checks and finalized answers
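
The supervisor-worker pattern with a retrieve-and-critique loop can be sketched without any agent framework. This is not LangGraph or LangChain code: keyword routing stands in for an LLM routing decision, and toy_retrieve and toy_grade are hypothetical stubs replacing a vector-store retriever and an LLM critic.

```python
def route(query):
    """Supervisor stub: pick a worker by keyword instead of an LLM call."""
    text = query.lower()
    if "nmr" in text or "spectrum" in text:
        return "spectra"
    if "polymer" in text:
        return "polymers"
    return "small_molecules"

def self_reflective_worker(query, retrieve, grade, max_rounds=3):
    """Retrieve, critique, and widen the search until the critic accepts."""
    k = 1
    hits = []
    for round_number in range(1, max_rounds + 1):
        hits = retrieve(query, k)
        if grade(query, hits):
            return {"hits": hits, "rounds": round_number, "accepted": True}
        k *= 2  # ask for more candidates on the next round
    return {"hits": hits, "rounds": max_rounds, "accepted": False}

def supervisor(query, workers):
    """Route the query to one worker and tag the result with its name."""
    name = route(query)
    return {"worker": name, **workers[name](query)}

# ASSUMPTION: stub retriever and critic for demonstration only.
corpus = ["hit-1", "hit-2", "hit-3", "hit-4"]

def toy_retrieve(query, k):
    return corpus[:k]

def toy_grade(query, hits):
    return len(hits) >= 2

def run_worker(query):
    return self_reflective_worker(query, toy_retrieve, toy_grade)

workers = {"spectra": run_worker, "polymers": run_worker, "small_molecules": run_worker}
report = supervisor("Find an NMR spectrum for this monomer", workers)
```

Capping the loop with max_rounds keeps a worker from retrying indefinitely when the critic never accepts.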

Optimization Features

Token Efficiency

  • Use vector retrievers to reduce LLM context needs

System Optimization

  • Select Milvus indices (HNSW or IVF_FLAT) per collection
  • L2-normalize embeddings where appropriate
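
A short sketch of why L2 normalization matters when choosing a distance metric: for unit-length vectors, squared L2 distance equals 2 - 2 * cosine similarity, so ranking by L2 and ranking by cosine give the same order. The example vectors and the helper names are arbitrary.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length; reject the zero vector."""
    norm = math.hypot(*vector)
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

unit_a = l2_normalize(a)
unit_b = l2_normalize(b)

squared_l2 = math.dist(unit_a, unit_b) ** 2
from_cosine = 2.0 - 2.0 * cosine(a, b)
```

Without normalization the identity does not hold, which is why vector length (for example, from scalar weighting) can change L2 rankings while leaving cosine rankings untouched.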

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • MoLFormer was pretrained on SMILES <200 tokens; very large SMILES/macromolecules may be poorly represented
  • Polymer SMILES modeling uses simplified repeat-unit notation and ignores stochastic topology and end-groups
  • Demo multi-agent runs used reduced subsets (250k) rather than full collections
  • No large-scale quantitative benchmark vs. fingerprints across diverse tasks presented

When Not To Use

  • For detailed 3D-conformer-sensitive property predictions requiring explicit geometry
  • For polymers where stochastic sequence, branching, or full topology must be encoded
  • When regulatory traceability and reproducible code/data release are required before publication

Failure Modes

  • Fingerprint metrics can disagree with embedding similarity, causing ambiguous relevance judgments
  • Vector arithmetic may fail for rare or out-of-distribution chemotypes
  • Reaction SMILES ordering affects results; order sensitivity may mislead queries
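
One possible mitigation for order sensitivity, offered here as an assumption rather than something the paper describes, is to sort the components of a reaction SMILES before embedding it. The hypothetical helper below only reorders the dot-separated strings on each side of the reaction; it does not canonicalize the individual SMILES, which would need a cheminformatics toolkit.

```python
def normalize_reaction_order(reaction_smiles):
    """Sort dot-separated components within reactants, agents, and products."""
    sections = reaction_smiles.split(">")
    if len(sections) != 3:
        raise ValueError("expected 'reactants>agents>products'")
    return ">".join(".".join(sorted(section.split("."))) for section in sections)

# The same esterification written with the reactants in two different orders.
first = normalize_reaction_order("CCO.CC(=O)O>>CCOC(C)=O")
second = normalize_reaction_order("CC(=O)O.CCO>>CCOC(C)=O")
```

After normalization both inputs produce the same string, so they would map to the same embedding. Two different but equivalent SMILES for the same molecule would still be treated as distinct.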

Core Entities

Models

  • ibm/MoLFormer-XL-both-10pct (MoLFormer)
  • OpenCLIP ViT-g-14 (laion2b_s34b_b88k)
  • GPT-4o-mini (supervisor)
  • llava-7b (worker)
  • Llama3.1-8b (worker)

Metrics

  • cosine similarity
  • Euclidean similarity / L2 distance
  • Tanimoto (Morgan fingerprints)
  • RDKit similarity
  • MACCS similarity
  • Dice similarity
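
For reference, Tanimoto and Dice similarity on fingerprints can be sketched with Python sets, where each set holds the indices of the "on" bits. The bit indices below are arbitrary, and returning 0.0 for two empty fingerprints is a convention chosen for this sketch.

```python
import math

def tanimoto(bits_a, bits_b):
    """Shared on-bits divided by all on-bits present in either fingerprint."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0

def dice(bits_a, bits_b):
    """Twice the shared on-bits divided by the total on-bit count."""
    total = len(bits_a) + len(bits_b)
    return 2 * len(bits_a & bits_b) / total if total else 0.0

# ASSUMPTION: arbitrary on-bit indices, not real Morgan or MACCS fingerprints.
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8, 9, 12}

t = tanimoto(fp_a, fp_b)
d = dice(fp_a, fp_b)
```

The two metrics are monotonically related (Dice = 2T / (1 + T)), so they rank candidates identically even though the raw values differ.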

Datasets

  • ~2.5M small-molecule SMILES (open + historical)
  • ~2.5M polymer SMILES (open + historical)
  • ~2M reaction SMILES (USPTO + historical)
  • >1M synthetic polymers (enumerated with Mn, DPn, dispersity)
  • Labeled NMR image set (small, used for multimodal tests)