Auto-generate simulator-validated PFDs and PIDs to move AI-discovered chemicals to production

May 30, 2025 · 10 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Sakhinana Sagar Srinivas, Shivam Gupta, Venkataramana Runkana

Why It Matters For Business

Auto-generating simulator-validated PFDs/PIDs moves molecule discoveries toward manufacturability earlier, cutting manual engineering time, reducing late-stage rework, and accelerating commercialization decisions.

Summary TLDR

This paper presents a practical system that auto-generates industrial Process Flow Diagrams (PFDs) and Piping & Instrumentation Diagrams (PIDs) and validates them with a chemical process simulator (DWSIM). Key pieces: a knowledge graph of 1,020+ chemicals (ChemAtlas), 20K synthetic QA pairs for tuning small LMs (QLoRA), a Graph-RAG retrieval layer, an agentic web-navigation pipeline, multi-stage fine-tuning (SFT, DPO, RAIT, optional GRPO RL), and simulator-in-the-loop checks. Llama-3.2-1B variants approach GPT-4o on reward-model metrics, and retrieval plus feedback consistently improve zero-shot performance. The system includes practical inference optimizations (FlashAttention, PagedAttention + KV-4b

Problem Statement

AI-discovered molecules often lack manufacturable process designs. Generating accurate Process Flow Diagrams (PFDs) and Piping & Instrumentation Diagrams (PIDs) is a manual bottleneck. Existing LLM work rarely produces industrially executable schematics or uses physics simulators to verify mass/energy balances and control logic, leaving scale-up risks unaddressed and slowing commercialization.

Main Contribution

A closed-loop, physics-aware pipeline to auto-generate PFDs and PIDs and validate them in DWSIM (simulator-in-the-loop).

A hierarchical chemical knowledge graph (ChemAtlas) covering ~1,020 chemicals plus a held-out ChemEval set (100 chemicals) for zero-shot testing.

A teacher-student synthetic data stack (≈20K QA pairs + 1.5K OOD benchmark) used to fine-tune small LMs via SFT, DPO, RAIT and optional GRPO reinforcement learning.

Graph-RAG retrieval over community-summarized graph partitions to ground generation and reduce hallucination.

Engineering-focused inference optimizations (FlashAttention, PagedAttention + KV quantization, Lookahead Decoding, test-time inference scaling) and structured pruning for deployment efficiency.

Key Findings

A Graph RAG + feedback setup substantially improves zero-shot helpfulness and correctness.

Numbers: Zero-shot reward-model score 'approaching 3.0' (0–4 scale) with GraphRAG + feedback on the 1.5K benchmark.

A compact, fine-tuned SLM (Llama-3.2-1B) can approach a stronger baseline (GPT-4o) on reward-model metrics after domain tuning and retrieval.

Numbers: Ranking on ChemEval reward scores: GPT-4o (highest) > Llama-3.2-1B (second) > SmolLM2-135M.

The system produces executable flowsheets and control loops validated in an open-source simulator.

Numbers: Two example flowsheets (nitric and sulfuric acid) were converted to DWSIM and simulated for steady-state and dynamic PID

Inference and memory optimizations materially increase serving throughput and reduce latency.

Numbers: PagedAttention + KV quantization: max batch 16 vs 8 and throughput ~100 vs 55 tokens/sec; Lookahead cut latency 40.5s→21

Synthetic dataset generation has measurable compute and carbon cost trade-offs.

Numbers: SynDIP generation time ~2179.6 min and CO2 ≈ 1.25 kg; smaller datasets cost 0.15–0.4 kg CO2.

Results

Reward-model ranking (ChemEval, 0–4)

Value: GPT-4o > Llama-3.2-1B > SmolLM2-135M

Baseline: GPT-4o

Zero-shot reward score (GraphRAG + feedback)

Value: approaching 3.0 (0–4 scale)

Baseline: Llama without GraphRAG/feedback

PagedAttention + KV quantization (runtime)

Value: max batch 16 vs 8; throughput ~100 vs 55 tokens/sec

Baseline: standard contiguous KV cache

Lookahead Decoding (latency)

Value: 2048-token latency 40.5 s → 21.3 s

Baseline: greedy decoding

Synthetic dataset cost (SynDIP)

Value: generation time ≈2179.6 min; CO2 ≈1.25 kg

Baseline: simpler datasets (Factual QA)

Pruning computational time (width)

Value: evaluation time 1350.2 min (unpruned) → 1066.2 min at 20% width pruning

Baseline: 0% pruning
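
The runtime figures above are easier to compare as ratios. The short sketch below only does arithmetic on the numbers reported in this section; the helper names are illustrative and not from the paper.

```python
# Figures copied from the Results entries above.
paged_throughput, baseline_throughput = 100.0, 55.0   # tokens/sec
greedy_latency, lookahead_latency = 40.5, 21.3        # seconds for 2048 tokens
eval_unpruned, eval_pruned = 1350.2, 1066.2           # minutes at 0% vs 20% width pruning


def reduction_pct(old, new):
    """Percentage by which `new` is lower than `old`."""
    return 100.0 * (old - new) / old


throughput_gain = paged_throughput / baseline_throughput
latency_cut = reduction_pct(greedy_latency, lookahead_latency)
pruning_cut = reduction_pct(eval_unpruned, eval_pruned)

print(f"throughput gain: {throughput_gain:.2f}x")   # 1.82x
print(f"latency reduction: {latency_cut:.1f}%")     # 47.4%
print(f"eval time reduction: {pruning_cut:.1f}%")   # 21.0%
```

These ratios line up with the "≈1.8×" throughput and "nearly halved" latency claims made elsewhere on this page.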

Who Should Care

What To Try In 7 Days

Run a small LM (Llama-variant) to produce a PFD/PID for a familiar chemical and convert the text into a DWSIM flowsheet to see simulation gaps.

Build a minimal graph of 50 domain documents and test retrieval-conditioning (GraphRAG) to reduce hallucinations.

Fine-tune a compact model with a few dozen synthetic QA pairs (QLoRA) to evaluate improvement in technical answer quality offline.

Agent Features

Memory

  • Hierarchical knowledge graph (Neo4j) with community summaries
  • Memory DB for conversational history
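
As a rough illustration of retrieval over community summaries, the sketch below scores each community by keyword overlap with the query and returns the member nodes of the best match. It is a toy under stated assumptions: the community names, summaries, and the overlap scoring are all invented for illustration, and a real Graph-RAG layer would more likely use embeddings and a graph database query.

```python
import re

# Toy communities (illustrative data, not from ChemAtlas).
communities = {
    "mineral_acids": {
        "summary": "Nitric acid and sulfuric acid production via oxidation and absorption",
        "members": ["nitric acid", "sulfuric acid"],
    },
    "alcohols": {
        "summary": "Methanol and ethanol synthesis, distillation and dehydration",
        "members": ["methanol", "ethanol"],
    },
}


def tokenize(text):
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, communities, top_k=1):
    """Return (community name, members) for the top_k summaries by token overlap."""
    query_tokens = tokenize(query)
    scored = sorted(
        communities.items(),
        key=lambda item: len(query_tokens & tokenize(item[1]["summary"])),
        reverse=True,
    )
    return [(name, data["members"]) for name, data in scored[:top_k]]


hits = retrieve("How is nitric acid absorbed after ammonia oxidation?", communities)
print(hits)
```

The retrieved summary and member list would then be placed in the prompt to ground generation.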

Planning

  • DAG-based query decomposition
  • Agentic web navigation for multi-step retrieval
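
One way to picture DAG-based query decomposition is as a dependency graph of sub-questions that is executed in topological order. The sketch below uses Python's standard `graphlib`; the sub-question names and their dependencies are hypothetical and only illustrate the idea, not the paper's actual planner.

```python
from graphlib import TopologicalSorter

# Hypothetical sub-questions: each key maps to the set of sub-questions it depends on.
subqueries = {
    "feedstocks": set(),
    "reaction_route": {"feedstocks"},
    "unit_operations": {"reaction_route"},
    "control_loops": {"unit_operations"},
    "pfd_draft": {"reaction_route", "unit_operations"},
    "pid_draft": {"pfd_draft", "control_loops"},
}


def execution_batches(graph):
    """Group sub-questions into batches whose dependencies are already resolved."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = execution_batches(subqueries)
for step, batch in enumerate(batches, start=1):
    print(step, batch)
```

Sub-questions in the same batch have no mutual dependencies, so an orchestrator could dispatch them to agents in parallel.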

Tool Use

  • DWSIM (simulator)
  • GraphRAG
  • PagedAttention
  • FlashAttention
  • Lookahead Decoding

Frameworks

  • Agentic web navigation
  • Graph RAG
  • Test-time inference scaling

Is Agentic

true

Architectures

  • Meta-Agent orchestrator
  • Specialized SLMs (Llama-3.2-1B, SmolLM2-135M)
  • Critique-Agent

Collaboration

  • Meta-Agent coordinates expert agents (Visual Miner, Research, Patent, Wiki)
  • Critique-Agent provides iterative feedback and refinement

Optimization Features

Token Efficiency

  • Lookahead Decoding nearly halved latency for 2048-token sequences
  • PagedAttention increased throughput ≈1.8× in tests

Infra Optimization

  • Mixed precision training (BF16/FP16) on NVIDIA V100; H100 gives additional gains with FlashAttention

Model Optimization

  • Width and depth structured pruning with importance heuristics
  • LoRA
  • KV-cache group-wise quantization (INT4/8) with Hessian-aware scaling
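
To make group-wise quantization concrete, the sketch below quantizes a flat list of values in fixed-size groups, storing one scale per group. Note the simplification: it uses plain symmetric absmax scaling, not the Hessian-aware scaling mentioned above, and the group size and function names are assumptions chosen for illustration.

```python
def quantize_groupwise(values, group_size=4, bits=4):
    """Symmetric absmax quantization with one scale per group of values."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4
    codes, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        # Fall back to a scale of 1.0 for an all-zero group to avoid dividing by zero.
        scale = max(abs(v) for v in group) / qmax or 1.0
        scales.append(scale)
        codes.extend(max(-qmax, min(qmax, round(v / scale))) for v in group)
    return codes, scales


def dequantize_groupwise(codes, scales, group_size=4):
    """Map integer codes back to floats using each group's scale."""
    return [code * scales[i // group_size] for i, code in enumerate(codes)]


values = [0.1, -0.7, 0.35, 0.0, 10.0, -5.0, 2.5, 0.0]
codes, scales = quantize_groupwise(values)
restored = dequantize_groupwise(codes, scales)
print(codes)
print([round(r, 3) for r in restored])
```

Per-group scales keep the large values in the second group from swamping the resolution available to the small values in the first, which is the usual motivation for quantizing in groups rather than per tensor.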

System Optimization

  • KV cache quantization to shrink memory footprint
  • Block-wise KV paging to reduce fragmentation
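
The bookkeeping behind block-wise KV paging can be sketched as a block table per sequence: memory is handed out in fixed-size blocks on demand rather than as one contiguous region per request. The class below is a minimal illustration with hypothetical names; it tracks only block ownership and stores no tensors.

```python
class PagedKVCache:
    """Toy block allocator: maps each sequence to a list of physical block ids."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (block id, offset in block)."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
slots = [cache.append_token("request-a") for _ in range(3)]
print(slots, len(cache.free_blocks))
cache.release("request-a")
print(len(cache.free_blocks))
```

Because a sequence wastes at most one partially filled block, more requests fit in the same memory, which is consistent with the larger maximum batch size reported above.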

Training Optimization

  • GRPO
  • Teacher-student synthetic data generation and reward-model filtering
  • Composite reward function combining ROUGE-L, length penalty, and LLM judge

Inference Optimization

  • PagedAttention (paged KV cache) for memory efficiency
  • Lookahead Decoding for parallel multi-token decoding without a separate draft model
  • FlashAttention for I/O-aware attention compute
  • Test-time inference scaling (self-consistency, confidence entropy, self-reflection)
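
Self-consistency at test time amounts to sampling several answers and keeping the most common one. The sketch below also reports the Shannon entropy of the vote distribution as a confidence signal; treating "confidence entropy" this way is an assumption, since this page does not define the term, and the sampled answers are made up.

```python
import math
from collections import Counter


def self_consistency(answers):
    """Majority vote over sampled answers, plus vote share and vote entropy in bits."""
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    total = len(answers)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return winner, votes / total, entropy


# Hypothetical answers sampled from the same prompt at non-zero temperature.
samples = ["absorber", "absorber", "stripper", "absorber"]
winner, share, entropy = self_consistency(samples)
print(winner, share, round(entropy, 3))
```

Low entropy means the samples agree; a high value could be used to trigger a self-reflection pass or a retrieval retry.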

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Ground-truth and many labels are synthetic and teacher-LLM generated, which may bias evaluations toward teacher models.
  • Evaluation relies heavily on a reward model (Nemotron-4-340B) and LLM judges; those judgments can be inconsistent with human engineering priorities.
  • ChemAtlas coverage (≈1,020 chemicals) is broad but not exhaustive; unusual chemistries may lack retrieval neighbors.
  • No public code or dataset release is described, limiting reproducibility and independent auditing.

When Not To Use

  • Do not use outputs as final engineering documents without domain expert review and safety sign-off.
  • Avoid relying on the system alone for highly novel chemistries not represented in the knowledge graph.
  • Not suitable for immediate regulatory submissions or safety-critical deployments without formal validation.

Failure Modes

  • Hallucinated equipment/specs that pass text-based checks but fail simulator conversion.
  • Missing numeric parameters (flows, pressures) causing DWSIM model convergence failure.
  • Pruning or quantization degrading factual correctness on complex multi-step reasoning.
  • Reward-model optimization that favors verbosity or surface overlap over engineering correctness.
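
Some of these failures can be caught with cheap checks before a flowsheet ever reaches the simulator. The sketch below flags streams with missing numeric parameters and computes a steady-state mass imbalance for one unit. The dictionary layout and field names are hypothetical and are not DWSIM's data model.

```python
# Hypothetical field names for a generated stream specification.
REQUIRED_FIELDS = ("flow_kg_h", "pressure_kpa", "temperature_c")


def missing_parameters(streams):
    """Map each stream name to the required numeric fields it lacks."""
    problems = {}
    for name, spec in streams.items():
        gaps = [f for f in REQUIRED_FIELDS if not isinstance(spec.get(f), (int, float))]
        if gaps:
            problems[name] = gaps
    return problems


def mass_imbalance(unit, streams):
    """Inlet minus outlet mass flow for one unit in kg/h (steady state, no accumulation)."""
    inflow = sum(streams[s]["flow_kg_h"] for s in unit["inlets"])
    outflow = sum(streams[s]["flow_kg_h"] for s in unit["outlets"])
    return inflow - outflow


streams = {
    "feed": {"flow_kg_h": 1000.0, "pressure_kpa": 300.0, "temperature_c": 25.0},
    "overhead": {"flow_kg_h": 400.0, "pressure_kpa": 101.3, "temperature_c": 60.0},
    "bottoms": {"flow_kg_h": 600.0, "temperature_c": 95.0},  # pressure left out
}
column = {"inlets": ["feed"], "outlets": ["overhead", "bottoms"]}

print(missing_parameters(streams))      # {'bottoms': ['pressure_kpa']}
print(mass_imbalance(column, streams))  # 0.0
```

Checks like these do not replace simulation, but they turn a convergence failure into a specific, fixable message for the critique loop.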

Core Entities

Models

  • Llama-3.2-1B
  • SmolLM2-135M
  • GPT-4o
  • Claude-3-Haiku
  • Nemotron-4-340B (reward model)

Metrics

  • Nemotron-4-340B reward (0–4 helpfulness/correctness/coherence/complexity/verbosity)
  • BLEU, ROUGE (1/2/L), METEOR, SacreBLEU, BERTScore, Sentence-BERT cosine similarity
  • Inference throughput (tokens/sec), latency (s), peak GPU memory (GB)

Datasets

  • ChemAtlas (≈1,020 chemicals)
  • ChemEval (100 held-out chemicals)
  • 20K synthetic QA dataset (FactualQA, SynDIP, LogiCore, DPO, Local/Global RAIT)
  • 1.5K OOD generalization benchmark

Benchmarks

  • ChemEval PFD/PID generation
  • 1.5K QA-pair generalization benchmark