A multi-agent system uses LLM planning, retrieval, and large-scale simulation to design peptide/protein binders for disordered proteins on a

December 17, 2025 · 10 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Matthew Sinclair, Moeen Meigooni, Archit Vasan, Ozan Gokdemir, Xinran Lian, Heng Ma, Yadu Babuji, Alexander Brace, Khalid Hossain, Carlo Siebenschuh, Thomas Brettin, Kyle Chard, Christopher Henry, Venkatram Vishwanath, Rick L. Stevens, Ian T. Foster, Arvind Ramanathan

Why It Matters For Business

Automating the end-to-end design loop lets teams generate and triage thousands of candidate biologics quickly. This cuts the early discovery cycle time and lets experimental teams focus on a smaller, higher-quality set for wet-lab testing. The system also shows how to map compute cost vs. value by filtering cheaply and

Summary TLDR

StructBioReasoner is a multi-agent pipeline that combines retrieval-augmented LLM planning, structure prediction, molecular dynamics, and iterative binder design to target intrinsically disordered proteins (IDPs). In two case studies it produced large pools of in silico-validated binders: for Der f 21, 787 validated designs, of which 50.98% outperformed a literature reference by MM-PBSA; for NMNAT-2, 97,066 validated binders spanning three binding modes, including an NMNAT-2:p53 interface. The system runs at scale on the Aurora supercomputer, with MD sampling at ≈26.6 µs/hour, MM-PBSA throughput above 4,000 calculations/hour, and design generation at ~15k peptides/hour. Key I/

Problem Statement

Intrinsically disordered proteins (IDPs) lack a single stable 3D structure, so conventional design methods fail. Practitioners need an autonomous, scalable way to choose tools, reason across ensembles, and run expensive simulations to produce candidate biologics at scale.

Main Contribution

A tournament-style multi-agent architecture (StructBioReasoner) that lets specialized agents compete and refine binder hypotheses in parallel.
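The competition step of a tournament can be sketched as single elimination over scored hypotheses. This is a minimal illustration only, assuming each hypothesis carries one scalar score where lower is better (as with MM-PBSA energies). `Hypothesis`, `play_round`, `run_tournament`, and the scores are hypothetical names and values, not the StructBioReasoner API, and the refinement step between rounds is left out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    name: str
    score: float  # hypothetical energy in kcal/mol; lower is better


def play_round(pool, rng):
    """Pair hypotheses at random; the lower-scoring one in each pair advances."""
    shuffled = pool[:]
    rng.shuffle(shuffled)
    winners = []
    for i in range(0, len(shuffled) - 1, 2):
        a, b = shuffled[i], shuffled[i + 1]
        winners.append(a if a.score <= b.score else b)
    if len(shuffled) % 2 == 1:
        winners.append(shuffled[-1])  # the unpaired hypothesis gets a bye
    return winners


def run_tournament(pool, rng):
    """Repeat elimination rounds until a single hypothesis remains."""
    while len(pool) > 1:
        pool = play_round(pool, rng)
    return pool[0]


rng = random.Random(0)
scores = [-120.0, -150.5, -98.2, -141.0, -133.7]  # made-up values
pool = [Hypothesis(f"h{i}", s) for i, s in enumerate(scores)]
champion = run_tournament(pool, rng)
print(champion)
```

With a deterministic comparison like this one, the best-scoring hypothesis always wins; a real system would presumably compare hypotheses with noisier, multi-criteria judgments.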

An integrated stack combining retrieval-augmented literature grounding (HiPerRAG), LLM-driven planning, structure prediction, molecular dynamics, MM-PBSA scoring, and iterative binder design.

Scaling demonstration and empirical case studies: Der f 21 (787 validated designs; >50% beat reference in silico) and NMNAT-2 (97,066 validated binders; three binding modes discovered), run on Aurora with measured agent throughput and I/O bottleneck analysis.

Key Findings

Der f 21: 50.98% of 787 in-silico validated designs had more favorable MM-PBSA binding free energy than the literature reference.

Numbers: 50.98% of 787 designs; 'more favorable' = mean ≤ -145.25 kcal/mol
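The -145.25 kcal/mol cutoff equals the reference energy reported under Results (-135.00 ± 10.25 kcal/mol) minus its stated uncertainty. Whether the authors derived the cutoff this way is an assumption; the sketch below only checks the arithmetic and shows how a design would be classified. `outperforms` is a hypothetical helper.

```python
reference_mean = -135.00  # kcal/mol, from the Results section
reference_std = 10.25     # kcal/mol, the reported ± value

# Assumed derivation: cutoff = reference mean minus one reported uncertainty.
threshold = reference_mean - reference_std


def outperforms(design_mean_energy):
    """True if a design's mean MM-PBSA energy is at or below the cutoff."""
    return design_mean_energy <= threshold


print(threshold)
print(outperforms(-150.0), outperforms(-140.0))
```

Under this reading, a design at -140 kcal/mol is lower than the reference mean but does not count as outperforming it.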

NMNAT-2: 97,066 binders passed sequence and structural QC; analysis revealed three major binding modes, including an NMNAT-2:p53 interface.

Numbers: 97,066 validated binders (out of 266,606 generated); three binding modes

Scaling: MD agent achieved ~26.6 microseconds aggregate sampling per hour and 80.4% parallel efficiency at 256 nodes; MM-PBSA processed >4,000 calculations/hour at 64 nodes; design agent produced ~15,000 peptides/hour with 84.4% efficiency at 256 nodes.

Numbers: MD: 26.6 µs/hour, 80.4% eff @256 nodes; MM-PBSA: >4,000/hr @64 nodes; Design: ~15k peptides/hr, 84.4% eff @256 nodes
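Parallel efficiency figures like those above are commonly read as achieved speedup divided by ideal speedup relative to a baseline node count. The sketch below assumes that definition, since this summary does not give the paper's exact formula. The throughput values are made up for illustration and are not measurements from the paper.

```python
def parallel_efficiency(base_nodes, base_throughput, nodes, throughput):
    """Achieved speedup over ideal (linear) speedup, relative to a baseline run."""
    speedup = throughput / base_throughput
    ideal_speedup = nodes / base_nodes
    return speedup / ideal_speedup


# Hypothetical numbers: 4x the nodes delivers 3.2x the throughput.
eff = parallel_efficiency(base_nodes=64, base_throughput=100.0,
                          nodes=256, throughput=320.0)
print(f"{eff:.1%}")
```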

HiPerRAG vector store: literature corpus built from ~1,520 NMNAT-2 papers and ~38 Der f 21 papers to ground LLM reasoning and construct a shared knowledge graph.

Numbers: Vector store: ~1,520 NMNAT-2 and ~38 Der f 21 full-text papers

Results

Der f 21 reference binding energy (MM-PBSA)

Value: -135.00 ± 10.25 kcal/mol

Validated designs for Der f 21

Value: 787 designs passed QC and structural checks

Baseline: 842 total designed

Fraction outperforming reference (Der f 21)

Value: 50.98% of 787

Baseline: reference mean -135.00 kcal/mol ('more favorable' cutoff: mean ≤ -145.25 kcal/mol)

NMNAT-2 validated binders

Value: 97,066 binders passed sequence and structural validation

Baseline: 266,606 binders generated

MD throughput (aggregate sampling)

Value: ≈26.6 µs/hour

Baseline: 64 nodes

MM-PBSA throughput

Value: >4,000 calculations/hour

Baseline: 64 nodes

Binder design throughput

Value: ≈15,000 peptides/hour (filtering pipeline)

Baseline: design agent at 64 nodes

Who Should Care

What To Try In 7 Days

Build a small domain vector store (10–100 papers) and attach a RAG layer to an LLM to ground planning for a single target.
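A small retrieval layer can be prototyped before committing to an embedding model. The sketch below is a toy, assuming word-count vectors and cosine similarity as a stand-in for learned embeddings; it is not how HiPerRAG works. `TinyVectorStore`, `build_prompt`, and the document texts are hypothetical placeholders.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase token counts (stand-in for a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    def __init__(self):
        self.docs = []  # list of (doc_id, text, vector)

    def add(self, doc_id, text):
        self.docs.append((doc_id, text, embed(text)))

    def search(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.docs, key=lambda d: cosine(q, d[2]), reverse=True)
        return [(doc_id, text) for doc_id, text, _ in ranked[:k]]


def build_prompt(question, store, k=2):
    """Prepend the top-k retrieved passages to the question for the LLM."""
    hits = store.search(question, k)
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in hits)
    return f"Context:\n{context}\n\nQuestion: {question}"


store = TinyVectorStore()
# Placeholder texts, not real abstracts.
store.add("doc1", "placeholder abstract: Der f 21 allergen structure")
store.add("doc2", "placeholder abstract: NMNAT-2 disordered regions")
store.add("doc3", "placeholder abstract: molecular dynamics force fields")

prompt = build_prompt("NMNAT-2 disordered regions", store, k=1)
print(prompt)
```

Swapping `embed` for a real embedding model and `self.docs` for a vector database would keep the same interface.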

Prototype a short agent loop: design 100 binders with an existing design tool, run quick MD (10 ns) and compute cheap interaction energies to triage top 10.
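The design-then-triage loop can be stubbed out end to end before any real tools are wired in. In this minimal sketch, `design_binders` and `cheap_interaction_energy` are hypothetical stand-ins that return random sequences and random scores; in practice they would call a design tool and a short MD run with an energy estimate.

```python
import heapq
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def design_binders(n, length, rng):
    """Stand-in for a design tool: returns n random peptide sequences."""
    return ["".join(rng.choice(AMINO_ACIDS) for _ in range(length))
            for _ in range(n)]


def cheap_interaction_energy(seq, rng):
    """Stand-in for quick MD plus a cheap energy estimate (fake score)."""
    return rng.uniform(-150.0, -50.0)


def triage(candidates, score_fn, keep):
    """Score every candidate and keep the lowest-energy ones, best first."""
    scored = [(score_fn(seq), seq) for seq in candidates]
    return heapq.nsmallest(keep, scored)


rng = random.Random(42)
candidates = design_binders(100, 12, rng)
top10 = triage(candidates, lambda s: cheap_interaction_energy(s, rng), keep=10)
for energy, seq in top10:
    print(f"{seq}  {energy:8.2f}")
```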

Benchmark a key analysis stage (MM-PBSA or surrogate) on available hardware to find I/O vs compute limits before scaling.
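A first-pass benchmark only needs wall-clock timings for a compute-heavy step and a file-heavy step. The sketch below assumes a toy arithmetic loop and small synced file writes as placeholders for the real analysis and trajectory I/O. `compute_stage`, `io_stage`, and the sizes are hypothetical.

```python
import os
import tempfile
import time


def timed(fn):
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def compute_stage(n):
    """Placeholder for the numerical part of an analysis step."""
    return sum(i * i for i in range(n))


def io_stage(directory, n_files, size_bytes):
    """Placeholder for per-frame output: write and sync many small files."""
    payload = b"x" * size_bytes
    for i in range(n_files):
        path = os.path.join(directory, f"frame_{i}.bin")
        with open(path, "wb") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())
    return n_files * size_bytes


with tempfile.TemporaryDirectory() as tmp:
    _, t_compute = timed(lambda: compute_stage(200_000))
    bytes_written, t_io = timed(lambda: io_stage(tmp, 20, 64 * 1024))

io_fraction = t_io / (t_io + t_compute)
print(f"compute {t_compute:.4f}s, I/O {t_io:.4f}s, I/O share {io_fraction:.0%}")
```

Running the same measurement on the target parallel filesystem, rather than local temporary storage, is what would reveal contention.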

Agent Features

Memory

  • short-term memory trimming for context management
  • long-term memory for persistent findings
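The two memory tiers can be illustrated with a trimmed message queue plus an append-only list. This is a minimal sketch, assuming a whitespace word count as a crude token budget. `AgentMemory` and its methods are hypothetical and do not reflect the system's actual implementation.

```python
from collections import deque


class AgentMemory:
    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.short_term = deque()  # recent messages, trimmed to fit the budget
        self.long_term = []        # persistent findings, never trimmed here

    @staticmethod
    def _tokens(message):
        return len(message.split())  # crude whitespace token count

    def observe(self, message):
        """Add a message, then drop the oldest ones until the budget fits."""
        self.short_term.append(message)
        while (sum(self._tokens(m) for m in self.short_term) > self.max_tokens
               and len(self.short_term) > 1):
            self.short_term.popleft()

    def persist(self, finding):
        self.long_term.append(finding)

    def context(self):
        """Text handed to the LLM: persistent findings first, then recent messages."""
        return "\n".join(self.long_term + list(self.short_term))


memory = AgentMemory(max_tokens=6)
memory.observe("ran MD for design 1")
memory.observe("design 1 unstable")
memory.persist("example finding: design 1 discarded")
memory.observe("trying design 2")
print(memory.context())
```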

Planning

  • LLM planning with structured state summaries
  • cross-hypothesis learning to form soft constraints

Tool Use

  • on-demand tool invocation (structure prediction, MD, MM-PBSA, design)
  • human-in-the-loop checkpoints

Frameworks

  • Academy (execution layer)
  • HiPerRAG (RAG layer)

Is Agentic

true

Architectures

  • tournament-based multi-agent
  • LLM-driven planner-reasoner

Collaboration

  • agents compete in tournaments and share a knowledge graph

Optimization Features

Infra Optimization

  • Parsl and Globus Compute for federated execution; node scaling tuned to avoid I/O saturation

System Optimization

  • node-level parallelism; data-staging considerations are called out

Training Optimization

  • direct preference optimization (DPO) for fine-tuning generative policy
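For reference, the standard DPO objective for one preference pair is the negative log-sigmoid of a scaled log-probability margin between the policy and a frozen reference model. The sketch below computes that loss from four log-probabilities. How the paper builds preference pairs and which `beta` it uses are not stated in this summary, so the inputs are illustrative assumptions.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) pair."""
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)); branch to avoid overflow.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: loss is log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Policy raises the chosen sequence and lowers the rejected one: loss drops.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
print(neutral, improved)
```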

Inference Optimization

  • multi-stage filtering to avoid expensive MM-PBSA on all candidates
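The payoff of staged filtering can be estimated with simple arithmetic. The sketch below takes the NMNAT-2 pass rate from this summary (97,066 of 266,606) and assumes made-up relative costs for the cheap and expensive stages. It also assumes every validated binder proceeds to the expensive stage, which the summary does not confirm.

```python
generated = 266_606  # NMNAT-2 binders generated (from the summary)
validated = 97_066   # binders passing sequence and structural validation
pass_rate = validated / generated


def funnel_cost(n, cheap_cost, expensive_cost, pass_rate):
    """Total cost when all n get the cheap check and only survivors get the expensive one."""
    return n * cheap_cost + n * pass_rate * expensive_cost


cheap_cost = 1.0        # hypothetical cost units per candidate
expensive_cost = 100.0  # hypothetical cost units per candidate

staged = funnel_cost(generated, cheap_cost, expensive_cost, pass_rate)
unfiltered = generated * expensive_cost
savings = 1 - staged / unfiltered
print(f"pass rate {pass_rate:.1%}, savings {savings:.1%}")
```

The savings shrink as the cheap stage gets more expensive or less selective, so the cost ratio is worth measuring rather than assuming.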

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • All biological efficacy claims are in silico (MM-PBSA approximations). Experimental validation required before therapeutic claims.
  • MM-PBSA is approximate and sensitive to simulation length and forcefield choices.
  • I/O bottlenecks limit scaling of MM-PBSA beyond ~64 nodes; file-system contention affects throughput.
  • LLM reasoning remains dependent on curated literature; residual hallucinations possible without careful RAG curation.

When Not To Use

  • When you need immediate wet-lab validated candidates without further experiments.
  • If you lack access to large HPC resources or cannot tolerate high I/O demands.
  • For targets where single-structure methods suffice (ordered proteins) and simpler pipelines are cheaper.

Failure Modes

  • MD crashes due to bad inputs (NaN coordinates, segmentation faults) requiring diagnostic agents.
  • Silent file-format corruption during CIF→PDB conversion causing downstream failures.
  • I/O saturation on parallel filesystem that negates compute scaling.
  • LLM hallucinations if the RAG corpus is incomplete or noisy.
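Bad coordinates can be caught with a cheap pre-flight check before MD is launched. The sketch below scans ATOM/HETATM records using the fixed PDB columns for x, y, and z (columns 31-54) and flags non-numeric or non-finite values. `bad_coordinate_lines` and `atom_line` are hypothetical helpers, and the check does not cover other forms of file corruption.

```python
import math


def bad_coordinate_lines(pdb_text):
    """Return 1-based line numbers of ATOM/HETATM records with unusable coordinates."""
    bad = []
    for lineno, line in enumerate(pdb_text.splitlines(), start=1):
        if not line.startswith(("ATOM", "HETATM")):
            continue
        try:
            xyz = [float(line[30:38]), float(line[38:46]), float(line[46:54])]
        except ValueError:
            bad.append(lineno)  # missing or non-numeric coordinate field
            continue
        if not all(math.isfinite(v) for v in xyz):
            bad.append(lineno)  # NaN or infinite coordinate
    return bad


def atom_line(serial, x, y, z):
    """Build a column-aligned ATOM record for testing (illustrative values only)."""
    return (f"ATOM  {serial:5d}  CA  ALA A{serial:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


pdb_text = "\n".join([
    "REMARK example structure",
    atom_line(1, 11.104, 6.134, -6.504),
    atom_line(2, float("nan"), 0.0, 0.0),
    atom_line(3, 12.560, 7.001, -5.250),
])
print(bad_coordinate_lines(pdb_text))
```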

Core Entities

Models

  • RFDiffusion
  • BindCraft
  • AlphaFold 3
  • AlphaFold-Multimer
  • Chai-1
  • Boltz-2x
  • BioMNI
  • GPT-OSS-120B
  • ESM-2 650M
  • ProteinMPNN

Metrics

  • Binding free energy (MM-PBSA, kcal/mol)
  • RMSD / RMSF (stability)
  • Aggregate MD sampling (µs/hour)
  • Agent parallel efficiency (%)
  • Design throughput (peptides/hour)

Datasets

  • Custom HiPerRAG vector store (≈1,520 NMNAT-2 + ≈38 Der f 21 papers)
  • PDB
  • DisProt (mentioned)

Benchmarks

  • Der f 21
  • NMNAT-2

Context Entities

Models

  • OpenFold
  • xTrimo-PGLM
  • GenSLM

Metrics

  • MM-PBSA std. dev. (kcal/mol)
  • I/O parallel efficiency

Datasets

  • Europe PMC
  • OpenAlex
  • Crossref
  • Unpaywall

Benchmarks

  • DisProt