A multi-agent system uses LLM planning, retrieval, and large-scale simulation to design peptide/protein binders for disordered proteins on a

Overview

Decision Snapshot: Needs Validation

The system is a functioning end-to-end prototype with concrete scaling and case-study results. Evidence is strong for in-silico performance and HPC scaling. Wet-lab validation and public release of orchestration code/data remain limited, reducing immediate production readiness.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 6/7

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Matthew Sinclair, Moeen Meigooni, Archit Vasan, Ozan Gokdemir, Xinran Lian, Heng Ma, Yadu Babuji, Alexander Brace, Khalid Hossain, Carlo Siebenschuh, Thomas Brettin, Kyle Chard, Christopher Henry, Venkatram Vishwanath, Rick L. Stevens, Ian T. Foster, Arvind Ramanathan

Links

Abstract / PDF / Code

Why It Matters For Business

Automating the end-to-end design loop lets teams generate and triage thousands of candidate biologics quickly. This cuts the early discovery cycle time and lets experimental teams focus on a smaller, higher-quality set for wet-lab testing. The system also shows how to map compute cost vs. value by filtering cheaply and

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist, Founder

Summary TLDR

StructBioReasoner is a multi-agent pipeline that combines retrieval-augmented LLM planning, structure prediction, molecular dynamics, and iterative binder design to target intrinsically disordered proteins (IDPs). On two case studies it produced large pools of in silico-validated binders: for Der f 21, 787 validated designs with 50.98% outperforming a literature reference by MM-PBSA; for NMNAT-2 it produced 97,066 validated binders and identified three binding modes including NMNAT-2:p53. The system runs at scale on the Aurora supercomputer, with MD sampling ≈26.6 µs/hour and multi-agent throughput measured in thousands of MM-PBSA calculations and ~15k peptides/hour design generation. Key I/

Problem Statement

Intrinsically disordered proteins (IDPs) lack a single stable 3D structure, so conventional design methods, which assume a fixed target structure, often fail. Practitioners need an autonomous, scalable way to choose tools, reason across conformational ensembles, and run expensive simulations to produce candidate biologics at scale.

Main Contribution

A tournament-style multi-agent architecture (StructBioReasoner) that lets specialized agents compete and refine binder hypotheses in parallel.

An integrated stack combining retrieval-augmented literature (HiPerRAG), LLM-driven planning, structure prediction, molecular dynamics, MM-PBSA scoring, and iterative binder design.

Key Findings

Der f 21: 50.98% of 787 in-silico validated designs had more favorable MM-PBSA binding free energy than the literature reference.

Numbers: 50.98% of 787 designs; 'more favorable' = mean ≤ -145.25 kcal/mol

Practical Use: You can use an agentic pipeline to generate hundreds of high-quality candidates quickly; expect a substantial fraction to beat an existing in-silico reference, but follow up with wet-lab validation because MM-PBSA is an/

Evidence Ref: Section 4.1; Figure 3B-C
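The 50.98% figure is a threshold count over per-design mean MM-PBSA energies, and a minimal sketch of that count follows. The cutoff of -145.25 kcal/mol happens to equal the reference mean minus one standard deviation (-135.00 - 10.25) from the Results table; reading the cutoff that way is an assumption, not something the summary states. The function name and the sample energies are hypothetical.

```python
def fraction_more_favorable(mean_energies, ref_mean=-135.00, ref_std=10.25):
    """Fraction of designs whose mean MM-PBSA energy is at or below
    ref_mean - ref_std (more negative means more favorable binding)."""
    if not mean_energies:
        raise ValueError("need at least one design energy")
    cutoff = ref_mean - ref_std  # -145.25 kcal/mol with the defaults
    hits = sum(1 for energy in mean_energies if energy <= cutoff)
    return hits / len(mean_energies)


# Hypothetical per-design mean energies in kcal/mol (not data from the paper).
example_energies = [-160.2, -150.0, -145.25, -140.1, -120.7, -98.3]
example_fraction = fraction_more_favorable(example_energies)
```

Three of the six made-up values sit at or below the cutoff, so the example fraction is one half. Applied to the real 787 per-design means, the same count would yield the reported percentage.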

NMNAT-2: 97,066 binders passed sequence and structural QC; analysis revealed three major binding modes, including a NMNAT-2:p53 interface.

Numbers: 97,066 validated binders (out of 266,606 generated); three binding modes

Practical Use: Agentic, interactome-driven searches can find biologically relevant interfaces (e.g., p53) and produce large candidate sets for downstream screening and experimental follow-up.

Evidence Ref: Section 4.2; Figure 4A-D

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Der f 21 reference binding energy (MM-PBSA)	-135.00 ± 10.25 kcal/mol	—	—	Der f 21; reference binder 10	12 replicates, 600 ns total simulation	Section 4.1
Validated designs for Der f 21	787 designs passed QC and structural checks	842 total designed	93.47% pass rate	Der f 21 design campaign	Two design cycles; sequence+structure QC	Section 4.1
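The pass rates follow directly from the counts reported above. The short check below recomputes them; the helper name is hypothetical, and the NMNAT-2 counts are taken from the Key Findings entry rather than from this table.

```python
def pass_rate(passed, total):
    """Percentage of generated designs that survived QC, rounded to 2 places."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * passed / total, 2)


der_f_21_rate = pass_rate(787, 842)        # counts from the Results table
nmnat2_rate = pass_rate(97_066, 266_606)   # counts from Key Findings
```

The first value reproduces the 93.47% in the table. The second works out to roughly 36.41%, which suggests the NMNAT-2 campaign filtered far more aggressively than the Der f 21 campaign.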

What To Try In 7 Days

Build a small domain vector store (10–100 papers) and attach a RAG layer to an LLM to ground planning for a single target.

Prototype a short agent loop: design 100 binders with an existing design tool, run quick MD (10 ns) and compute cheap interaction energies to triage top 10.

Benchmark a key analysis stage (MM-PBSA or surrogate) on available hardware to find I/O vs compute limits before scaling.
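The first item above can be prototyped without any external service. The sketch below uses a toy bag-of-words cosine similarity as a stand-in for a real embedding model and vector database; it is not how HiPerRAG works, and every name and abstract in it is a hypothetical placeholder.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real setup would use a learned model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, store, k=2):
    """Return the k document ids most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda d: cosine(q, embed(store[d])), reverse=True)
    return ranked[:k]


def build_prompt(query, store, k=2):
    """Prepend retrieved text to the task so the planner is grounded in it."""
    context = "\n".join(f"[{d}] {store[d]}" for d in retrieve(query, store, k))
    return f"Context:\n{context}\n\nTask: {query}"


# Hypothetical abstracts; placeholders, not real papers.
papers = {
    "doc_a": "disordered protein binder design with molecular dynamics",
    "doc_b": "supercomputer scheduling and file system throughput",
    "doc_c": "peptide binder scoring with binding free energy",
}
top = retrieve("binder design for a disordered protein", papers, k=2)
prompt = build_prompt("binder design for a disordered protein", papers, k=2)
```

Swapping `embed` for a sentence-embedding model and `papers` for the 10 to 100 papers on a single target gives the one-week version described above.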

Agent Features

Memory

short-term memory trimming for context management

long-term memory for persistent findings
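One way to picture the two memory tiers is a bounded buffer of recent messages next to a store that trimming never touches. The sketch below assumes a word count as a crude proxy for a token budget; the class, its methods, and the budget are hypothetical and are not taken from StructBioReasoner.

```python
from collections import deque


class AgentMemory:
    """Short-term buffer trimmed to a budget, plus a persistent findings store."""

    def __init__(self, max_words=50):
        self.max_words = max_words   # crude proxy for a token budget
        self.short_term = deque()    # recent messages, oldest on the left
        self.long_term = {}          # findings that survive trimming

    def add_message(self, text):
        self.short_term.append(text)
        # Drop the oldest messages until the buffer fits; always keep the newest.
        while len(self.short_term) > 1 and self._words() > self.max_words:
            self.short_term.popleft()

    def remember(self, key, finding):
        self.long_term[key] = finding

    def _words(self):
        return sum(len(message.split()) for message in self.short_term)


memory = AgentMemory(max_words=6)
memory.add_message("designed binder batch one")
memory.remember("best_mode", "hypothetical interface A")
memory.add_message("ran short MD on batch")  # pushes the buffer over budget
```

After the second message the oldest entry is dropped from the short-term buffer, while the stored finding remains available for later planning steps.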

Planning

LLM planning with structured state summaries

cross-hypothesis learning to form soft constraints

Tool Use

on-demand tool invocation (structure prediction, MD, MM-PBSA, design)

human-in-the-loop checkpoints

Frameworks

Academy (execution layer)

HiPerRAG (RAG layer)

Is Agentic

Yes

Architectures

tournament-based multi-agent

LLM-driven planner-reasoner

Collaboration

agents compete in tournaments and share a knowledge graph
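The summary does not say how the tournaments are structured, so the sketch below assumes a single-elimination bracket purely for illustration. Hypotheses are compared pairwise on a score where lower is better, and each result is appended to a shared list that stands in for the knowledge graph. All names and scores are hypothetical.

```python
def run_tournament(hypotheses, score):
    """Single-elimination bracket: pair hypotheses, keep the better-scoring
    one of each pair, and repeat until one remains. Lower score wins."""
    if not hypotheses:
        raise ValueError("need at least one hypothesis")
    shared_notes = []  # stand-in for the shared knowledge graph
    pool = list(hypotheses)
    while len(pool) > 1:
        next_pool = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            winner, loser = (a, b) if score(a) <= score(b) else (b, a)
            shared_notes.append((winner, loser))  # visible to every agent
            next_pool.append(winner)
        if len(pool) % 2 == 1:
            next_pool.append(pool[-1])  # odd one out gets a bye
        pool = next_pool
    return pool[0], shared_notes


# Hypothetical hypothesis scores (e.g. a binding energy in kcal/mol).
scores = {"h1": -120.0, "h2": -150.0, "h3": -135.0, "h4": -110.0, "h5": -160.0}
champion, notes = run_tournament(list(scores), score=scores.get)
```

A round-robin or rating-based tournament would fit the same interface; the point is only that competition produces both a winner and a record that other agents can learn from.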

Optimization Features

Infra Optimization

Parsl and Globus Compute for federated execution; node scaling tuned to avoid I/O saturation

System Optimization

node-level parallelism, staging considerations called out
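Tuning node counts against I/O saturation comes down to watching parallel efficiency fall as nodes are added. The sketch below uses the usual definition, observed throughput divided by ideal linear scaling from a baseline. The throughput numbers are invented for illustration and are not measurements from Aurora.

```python
def parallel_efficiency(base_nodes, base_rate, nodes, rate):
    """Observed throughput at `nodes` as a percentage of ideal linear
    scaling extrapolated from the baseline measurement."""
    if base_nodes <= 0 or nodes <= 0 or base_rate <= 0:
        raise ValueError("node counts and baseline rate must be positive")
    ideal_rate = base_rate * nodes / base_nodes
    return 100.0 * rate / ideal_rate


# Hypothetical throughput (tasks/hour) at increasing node counts.
measurements = [(1, 100.0), (2, 190.0), (4, 320.0), (8, 400.0)]
base_nodes, base_rate = measurements[0]
efficiencies = {
    nodes: parallel_efficiency(base_nodes, base_rate, nodes, rate)
    for nodes, rate in measurements
}
```

In this made-up series efficiency drops from 95% at two nodes to 50% at eight, the kind of curve that would suggest a shared resource such as the file system has become the bottleneck.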

Training Optimization

direct preference optimization (DPO) for fine-tuning generative policy

Inference Optimization

multi-stage filtering to avoid expensive MM-PBSA on all candidates
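The idea is to order stages by cost so that the expensive scorer only ever sees survivors of the cheap ones. A minimal sketch follows, assuming two simple sequence checks as the cheap stage. The checks, the sequences, and the scorer are hypothetical; `fake_mmpbsa` is a placeholder that only records calls and is not an energy model.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def valid_alphabet(seq):
    """Cheap check: only the 20 standard amino acids."""
    return set(seq) <= STANDARD_AA


def reasonable_length(seq):
    """Cheap check: hypothetical peptide length window."""
    return 8 <= len(seq) <= 30


def staged_filter(candidates, cheap_checks, expensive_score, keep):
    """Apply cheap boolean checks first, then score only the survivors.
    Returns the `keep` lowest-scoring survivors and the number scored."""
    survivors = [c for c in candidates if all(check(c) for check in cheap_checks)]
    scored = sorted(((expensive_score(c), c) for c in survivors), key=lambda p: p[0])
    return [c for _, c in scored[:keep]], len(survivors)


expensive_calls = []


def fake_mmpbsa(seq):
    """Placeholder for an expensive scorer; NOT a real energy calculation."""
    expensive_calls.append(seq)
    return -float(len(seq))


pool = ["ACDEFGHIKL", "ACD", "ACDEFGHIKLMNPQ", "ACDEXGHIKL", "MKTAYIAKQRQ"]
best, n_scored = staged_filter(
    pool, [valid_alphabet, reasonable_length], fake_mmpbsa, keep=2
)
```

Two of the five candidates are rejected before scoring, so the expensive function runs three times instead of five. At the scale of hundreds of thousands of generated binders, that ratio is what keeps MM-PBSA affordable.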

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/msinclair-py/molecularsimulations https://github.com/joaomdmoura/crewAI (referenced frameworks)

Risks & Boundaries

Limitations

All biological efficacy claims are in silico and rest on MM-PBSA approximations. Experimental validation is required before any therapeutic claim can be made.

MM-PBSA is approximate and sensitive to simulation length and forcefield choices.

When Not To Use

When you need immediate wet-lab validated candidates without further experiments.

If you lack access to large HPC resources or cannot tolerate high I/O demands.

Failure Modes

MD crashes due to bad inputs (NaN coordinates, segmentation faults) requiring diagnostic agents.

Silent file-format corruption during CIF→PDB conversion causing downstream failures.
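Both failure modes above can often be caught by a pre-flight check on the converted PDB text before any MD job is launched. The sketch below is an assumption about what such a check could look like, not the paper's diagnostic agents. It relies on the fixed-width PDB layout in which columns 31 to 54 of ATOM/HETATM records hold x, y, z; `expected_atoms` is a hypothetical count taken from the source CIF.

```python
import math


def check_pdb_text(pdb_text, expected_atoms=None):
    """Return a list of problems found in the ATOM/HETATM records."""
    problems = []
    n_atoms = 0
    for lineno, line in enumerate(pdb_text.splitlines(), start=1):
        if not line.startswith(("ATOM", "HETATM")):
            continue
        n_atoms += 1
        if len(line) < 54:
            problems.append(f"line {lineno}: record too short for coordinates")
            continue
        try:
            # Fixed-width columns 31-38, 39-46, 47-54 hold x, y, z.
            xyz = [float(line[30:38]), float(line[38:46]), float(line[46:54])]
        except ValueError:
            problems.append(f"line {lineno}: unparseable coordinates")
            continue
        if not all(math.isfinite(v) for v in xyz):
            problems.append(f"line {lineno}: non-finite coordinate")
    if expected_atoms is not None and n_atoms != expected_atoms:
        problems.append(f"atom count {n_atoms} != expected {expected_atoms}")
    return problems


def atom_line(serial, x, y, z):
    """Build a minimal fixed-width ATOM record (hypothetical ALA CA atoms)."""
    return f"ATOM  {serial:5d}  CA  ALA A{serial:4d}    {x:8.3f}{y:8.3f}{z:8.3f}"


good_pdb = "\n".join([atom_line(1, 1.0, 2.0, 3.0), atom_line(2, 4.5, 5.5, 6.5)])
bad_pdb = "\n".join(
    [atom_line(1, 1.0, 2.0, 3.0), atom_line(2, float("nan"), 0.0, 0.0), "ATOM      3  CA"]
)
good_report = check_pdb_text(good_pdb, expected_atoms=2)
bad_report = check_pdb_text(bad_pdb, expected_atoms=3)
```

An empty report means the file passed. The bad example flags one NaN coordinate and one truncated record, both of which would otherwise surface later as an MD crash.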

Core Entities

Models

RFDiffusion

BindCraft

AlphaFold 3

AlphaFold-Multimer

Chai-1

Boltz-2x

BioMNI

GPT-OSS-120B

ESM-2 650M

ProteinMPNN

Metrics

Binding free energy (MM-PBSA, kcal/mol)

RMSD / RMSF (stability)

Aggregate MD sampling (µs/hour)

Agent parallel efficiency (%)

Design throughput (peptides/hour)

Datasets

Custom HiPerRAG vector store (≈1,520 NMNAT-2 + ≈38 Der f 21 papers)

PDB

DisProt (mentioned)

Benchmarks

Der f 21

NMNAT-2

Context Entities

Models

OpenFold

xTrimo-PGLM

GenSLM

Metrics

MM-PBSA std. dev. (kcal/mol)

I/O parallel efficiency

Datasets

Europe PMC

OpenAlex

Crossref

Unpaywall

Benchmarks

DisProt
