CRAG‑MoW: a multi-agent, self‑corrective RAG system that benchmarks open LLMs on chemical search

February 26, 2025 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

1

Authors

Tiffany J. Callahan, Nathaniel H. Park, Sara Capponi

Why It Matters For Business

You can build a domain‑focused, multi‑agent retrieval+LLM pipeline from open models whose judged output quality comes close to closed SOTA (GPT‑4o: 7.12 vs 7.59 on the LLM‑Judge) and whose best aggregator is preferred more often in pairwise comparisons (8.77% vs 5.89% win rate). This reduces vendor lock‑in and enables on‑prem deployment with interpretable retrieval and model selection.

Summary TLDR

This paper introduces CRAG‑MoW (Mixture‑of‑Workflows), a multi‑agent system that runs multiple self‑corrective retrieval‑augmented generation (RAG) workflows and an orchestration agent to synthesize results. The authors index chemistry data (SMILES and NMR) with MoLFormer/OpenCLIP embeddings in Milvus, run 19 agentic workflows built from nine open LLMs plus a GPT‑4o baseline, and evaluate with an LLM‑Judge (GPT‑4o‑mini). CRAG‑MoWs scored near GPT‑4o on the LLM‑Judge (7.12 vs. 7.59) and won more pairwise comparisons (top aggregator win rate 8.77% vs GPT‑4o 5.89%). Chemical reactions were the hardest retrieval task (many incomplete runs). Code/data to be released on publication.

Problem Statement

Materials and chemistry tasks need up‑to‑date, structured retrieval plus multi‑step reasoning. Single LLMs often lack reliable retrieval, hallucination checks, and domain benchmarks. There is no standard framework to 1) run multiple LLMs with structured RAG, 2) self‑correct retrieval and generation, and 3) compare open models to closed SOTA on the same tasks.

Main Contribution

CRAG‑MoW: a Mixture‑of‑Workflows architecture combining multiple self‑corrective RAG (CRAG) Generators with an Aggregator orchestration agent and reciprocal rank fusion (RRF) re‑ranking.

Applied CRAG‑MoW to chemistry: structure search over 250k small molecules, 250k polymers, 250k reactions, plus 2,259 experimental NMR spectra and NMRShiftDB2; multi‑modal retrieval (SMILES + NMR images).

Systematic evaluation using an LLM‑Judge (GPT‑4o‑mini) on per‑workflow 1–10 scoring and exhaustive pairwise comparisons; reported both average scores and pairwise win rates.

Empirical result: properly orchestrated multi‑workflow systems built from open LLMs can match or be preferred to GPT‑4o on judged outputs, highlighting model‑specific strengths across chemical modalities.

Key Findings

CRAG‑MoW workflows reach LLM‑Judge scores close to GPT‑4o on evaluated tasks.

Numbers: CRAG‑MoWs 7.12 vs GPT‑4o 7.59 (LLM‑Judge average, 1–10)

Aggregated multi‑workflow responses were preferred more often in pairwise comparisons than GPT‑4o.

Numbers: Top aggregator win rate 8.77% vs GPT‑4o 5.89% (pairwise win rates)

Chemical reaction queries caused the most incomplete CRAG runs and highest retrieval/rewrite counts.

Numbers: Reaction collection completion: 73.68% and 78.95% (questions 1 & 2); average retrievals 5.50 per run
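
The two completion rates are consistent with 14 and 15 completed runs out of the 19 workflows mentioned in the summary. That reading is an inference from the percentages, not a count the summary states, and it assumes one run per workflow per question. A minimal Python check of the arithmetic:

```python
# Assumption: 19 workflows, one run per workflow per question.
TOTAL_WORKFLOWS = 19


def completion_rate(completed: int, total: int = TOTAL_WORKFLOWS) -> float:
    """Return the completion rate as a percentage rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * completed / total, 2)


q1_rate = completion_rate(14)  # inferred count for question 1
q2_rate = completion_rate(15)  # inferred count for question 2
print(q1_rate, q2_rate)
```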

Results

LLM‑Judge average score (1–10)

Value: CRAG‑MoWs 7.12; GPT‑4o 7.59

Baseline: GPT‑4o 7.59

Pairwise win rate (average)

Value: Top Aggregator (mistral‑nemo) 8.77%; GPT‑4o 5.89%

Baseline: GPT‑4o 5.89%

Workflow completion rate (chemical reaction collection)

Value: Q1 73.68%; Q2 78.95%

Who Should Care

What To Try In 7 Days

Run a simple CRAG loop: index a small, domain‑specific vector store (MoLFormer embeddings + Milvus), run 2 generator LLMs, and implement an aggregator that ranks the retrieved documents with reciprocal rank fusion (RRF).

Add a completeness and hallucination checker to the Generator loop to re‑query or regenerate when checks fail; log iteration counts to measure cost.

Run quick A/B pairwise comparisons with an LLM judge (or a small group of domain experts) to benchmark the aggregator vs a single LLM baseline.
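
The first two steps above can be sketched as a bounded self-corrective loop. This is a minimal illustration in plain Python, not the authors' implementation: the retriever, relevance check, query rewriter, generator, and grounding check are hypothetical callables you would back with Milvus and your LLMs, and `max_retrievals` is an assumed limit (the summary only gives a revision cap of 10 and a recursion cap of 25). The result records iteration counts so cost can be logged.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CragResult:
    answer: Optional[str]  # None means the run ended incomplete
    retrievals: int        # number of retrieval calls made
    revisions: int         # number of rejected generations


def run_crag(
    question: str,
    retrieve: Callable[[str], List[str]],
    is_relevant: Callable[[str, str], bool],
    rewrite: Callable[[str], str],
    generate: Callable[[str, List[str]], str],
    is_grounded: Callable[[str, List[str]], bool],
    max_retrievals: int = 5,   # assumed limit, not from the summary
    max_revisions: int = 10,   # revision cap reported in the summary
) -> CragResult:
    query, retrievals, revisions = question, 0, 0
    while retrievals < max_retrievals:
        # Keep only documents that pass the relevance check.
        docs = [d for d in retrieve(query) if is_relevant(question, d)]
        retrievals += 1
        if not docs:
            query = rewrite(query)  # nothing relevant: rewrite and re-retrieve
            continue
        while revisions < max_revisions:
            answer = generate(question, docs)
            if is_grounded(answer, docs):  # hallucination check passed
                return CragResult(answer, retrievals, revisions)
            revisions += 1  # regenerate from the same evidence
        break  # revision budget spent: stop instead of returning unchecked text
    return CragResult(None, retrievals, revisions)


# Toy stand-ins for the retriever and the LLM-backed checks.
corpus = {"aspirin smiles": ["CC(=O)OC1=CC=CC=C1C(=O)O is the SMILES for aspirin"]}
result = run_crag(
    "aspirin",
    retrieve=lambda q: corpus.get(q, []),
    is_relevant=lambda q, d: q in d,
    rewrite=lambda q: q + " smiles",
    generate=lambda q, docs: docs[0],
    is_grounded=lambda a, docs: a in docs,
)
print(result)
```

Returning `None` when a budget is exhausted keeps unchecked answers out of the aggregator, at the price of more incomplete runs.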

Agent Features

Memory

  • vectorstore retrieval (Milvus) for short‑term retrieval memory

Planning

  • iterative retrieval and response generation
  • query rewriting / re‑retrieval when relevance checks fail

Tool Use

  • document relevance checking
  • hallucination detection
  • document fusion and reciprocal rank fusion (RRF)

Frameworks

  • LangChain
  • LangGraph
  • Milvus

Is Agentic

true

Architectures

  • Mixture-of-Workflows
  • Self-Corrective RAG (CRAG)
  • Aggregator-Generator orchestration

Collaboration

  • aggregator synthesizes outputs from multiple generators
  • reciprocal rank fusion integrates cross‑workflow evidence
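
Reciprocal rank fusion can be sketched in a few lines. The version below is the standard formulation, where each document scores the sum of 1 / (k + rank) over the rankings that contain it. The constant `k = 60` is a commonly used default and an assumption here, since the summary does not say which value the authors chose, and the document IDs are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Hashable]:
    """Fuse several best-first rankings into one list, best first."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1; each ranking is assumed to list a document once.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Three hypothetical workflows, each returning its own ranked document IDs.
fused = reciprocal_rank_fusion(
    [["A", "B", "C"], ["B", "A", "D"], ["B", "C", "A"]]
)
print(fused)
```

Because only ranks are used, the fusion needs no score calibration across workflows that rely on different models or distance metrics.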

Optimization Features

Infra Optimization

  • Milvus IVF_FLAT index (L2 and inner product hybrid search)
  • LangChain orchestration to reuse prompts and tools
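
As an illustration of the index setting above, the parameters can be written as plain dictionaries in the shape the Milvus Python client documents for index creation and search. The `nlist` and `nprobe` values are hypothetical, since the summary gives no tuning values, and the helper names are invented for this sketch.

```python
SUPPORTED_METRICS = {"L2", "IP"}  # Euclidean distance and inner product


def ivf_flat_index_params(metric_type: str, nlist: int = 128) -> dict:
    """Index parameters for an IVF_FLAT index (nlist is a hypothetical value)."""
    if metric_type not in SUPPORTED_METRICS:
        raise ValueError(f"unsupported metric: {metric_type}")
    return {
        "index_type": "IVF_FLAT",
        "metric_type": metric_type,
        "params": {"nlist": nlist},  # number of coarse clusters
    }


def ivf_flat_search_params(metric_type: str, nprobe: int = 16) -> dict:
    """Search parameters matching the index (nprobe is a hypothetical value)."""
    if metric_type not in SUPPORTED_METRICS:
        raise ValueError(f"unsupported metric: {metric_type}")
    return {"metric_type": metric_type, "params": {"nprobe": nprobe}}


l2_index = ivf_flat_index_params("L2")
ip_search = ivf_flat_search_params("IP")
print(l2_index, ip_search)
```

Larger `nprobe` values search more clusters per query, trading latency for recall.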

System Optimization

  • Separate vector collections per modality (small molecules, polymers, reactions)
  • Use of MoLFormer chemistry embeddings and OpenCLIP image embeddings projected to 768 dims

Inference Optimization

  • Configurable recursion and revision limits to bound compute (recursion max 25, revisions max 10)
  • Hybrid search with equal modality weighting and RRF re‑ranking to reduce downstream cost

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation relies on an LLM‑Judge (GPT‑4o‑mini); LLM-as‑judge bias and lack of extensive human labeling are noted.
  • Incomplete workflows concentrated in the chemical reaction collection indicate retrieval coverage gaps.
  • Results use 10 benchmark questions per collection; the small question set limits statistical generality.
  • Some open LLMs required fallback tool models for tool‑use tasks, adding heterogeneity to comparisons.

When Not To Use

  • Do not rely on CRAG‑MoW outputs for safety‑critical lab protocols without human expert review.
  • Avoid deployment as an unsupervised decision maker for high‑risk chemical synthesis or regulatory claims.
  • Not ideal when low latency is the top priority (multi‑agent orchestration adds overhead).

Failure Modes

  • Missing or low‑quality retrieval documents leading to 'NO RELEVANT RAG DOCUMENTS FOUND' or aborted outputs.
  • Over‑generation or hallucinations when revision limits are hit and models return unchecked answers.
  • Aggregator synthesizes conflicting or low‑quality generator outputs if multiple generators fail similarly.

Core Entities

Models

  • GPT-4o
  • mistral-nemo:12b-instruct
  • wizardlm2:7b
  • qwen2.5:7b-instruct
  • mistral:7b-instruct
  • gemma2:9b-instruct
  • mixtral:8x7b-instruct
  • phi3.5:3.8b-mini-instruct
  • llama3.1:8b-instruct

Metrics

  • LLM‑Judge average score (1–10)
  • Pairwise win rate (%)
  • Workflow completion rate (%)
  • CRAG iterations per step (counts)
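
A pairwise win rate can be computed by running a judge over every pair of responses. The sketch below uses one common definition, wins divided by the comparisons a candidate took part in. The summary does not state how the paper normalizes its win rates, so treat the formula as an assumption. The `judge` callable is a hypothetical stand-in for an LLM judge, and the toy judge here simply prefers the longer response.

```python
from itertools import combinations
from typing import Callable, Dict, Optional


def pairwise_win_rates(
    responses: Dict[str, str],
    judge: Callable[[str, str], Optional[str]],
) -> Dict[str, float]:
    """Return each candidate's win rate (%) over all pairwise comparisons.

    judge(first, second) returns "a" if the first response wins,
    "b" if the second wins, or None for a tie.
    """
    wins = {name: 0 for name in responses}
    played = {name: 0 for name in responses}
    for a, b in combinations(responses, 2):
        verdict = judge(responses[a], responses[b])
        played[a] += 1
        played[b] += 1
        if verdict == "a":
            wins[a] += 1
        elif verdict == "b":
            wins[b] += 1
    return {
        name: 100.0 * wins[name] / played[name] if played[name] else 0.0
        for name in responses
    }


def longer_wins(first: str, second: str) -> Optional[str]:
    """Toy judge for illustration only: prefer the longer response."""
    if len(first) == len(second):
        return None
    return "a" if len(first) > len(second) else "b"


rates = pairwise_win_rates(
    {"aggregator": "long detailed answer", "gpt-4o": "short answer", "solo": "ok"},
    longer_wins,
)
print(rates)
```

With a real LLM judge it is worth presenting each pair in both orders, since judges can favor whichever response appears first.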

Datasets

  • small molecules (250k)
  • polymers (250k)
  • chemical reactions (250k)
  • experimental NMR spectra (2,259)
  • NMRShiftDB2

Benchmarks

  • LLM-Judge (1–10 scoring)
  • Pairwise win‑rate comparisons