CRAG‑MoW: a multi-agent, self‑corrective RAG system that benchmarks open LLMs on chemical search

February 26, 2025 · 9 min

Overview

Decision Snapshot: Ready For Pilot

CRAG‑MoW is a practical, modular system for domain retrieval+generation; evidence includes judged evaluations and pairwise wins, but results rely on an LLM judge and limited question sets, so human expert validation is still needed before full production.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 2/3

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Tiffany J. Callahan, Nathaniel H. Park, Sara Capponi

Links

Abstract / PDF

Why It Matters For Business

You can build a domain‑focused, multi‑agent retrieval+LLM pipeline from open models and reach LLM‑judged output quality close to a closed SOTA model (GPT‑4o) while winning more pairwise comparisons. This reduces vendor lock‑in and enables on‑premises deployment with interpretable retrieval and model selection.

Who Should Care

Summary TLDR

This paper introduces CRAG‑MoW (Mixture‑of‑Workflows), a multi‑agent system that runs multiple self‑corrective retrieval‑augmented generation (RAG) workflows and an orchestration agent to synthesize results. The authors index chemistry data (SMILES and NMR) with MoLFormer/OpenCLIP embeddings in Milvus, run 19 agentic workflows built from nine open LLMs plus a GPT‑4o baseline, and evaluate with an LLM‑Judge (GPT‑4o‑mini). CRAG‑MoWs scored near GPT‑4o on the LLM‑Judge (7.12 vs. 7.59) and won more pairwise comparisons (top aggregator win rate 8.77% vs GPT‑4o 5.89%). Chemical reactions were the hardest retrieval task (many incomplete runs). Code/data to be released on publication.

Problem Statement

Materials and chemistry tasks need up‑to‑date, structured retrieval plus multi‑step reasoning. Single LLMs often lack reliable retrieval, hallucination checks, and domain benchmarks. There is no standard framework to 1) run multiple LLMs with structured RAG, 2) self‑correct retrieval and generation, and 3) compare open models to closed SOTA on the same tasks.

Main Contribution

CRAG‑MoW: a Mixture‑of‑Workflows architecture combining multiple self‑corrective RAG (CRAG) Generators with an Aggregator orchestration agent and reciprocal rank fusion (RRF) re‑ranking.

Applied CRAG‑MoW to chemistry: structure search over 250k small molecules, 250k polymers, 250k reactions, plus 2,259 experimental NMR spectra and NMRShiftDB2; multi‑modal retrieval (SMILES + NMR images).
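As an illustration of the RRF re‑ranking step named above, here is a minimal sketch of reciprocal rank fusion over ranked lists from several workflows. The function name, the document ids, and the constant `k=60` (a commonly used default) are assumptions for illustration, not details taken from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a paper value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical retrieval results from two generator workflows.
workflow_a = ["mol_17", "mol_03", "mol_42"]
workflow_b = ["mol_03", "mol_99", "mol_17"]
fused = reciprocal_rank_fusion([workflow_a, workflow_b])
print(fused)
```

Documents that rank well in more than one workflow rise to the top, which is why RRF suits combining evidence across generators without needing comparable raw similarity scores.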

Key Findings

CRAG‑MoW workflows reach LLM‑Judge scores close to GPT‑4o on evaluated tasks.

Numbers: CRAG‑MoWs 7.12 vs GPT‑4o 7.59 (LLM‑Judge average, 1–10 scale)

Practical Use: You can assemble open LLMs in a self‑corrective retrieval pipeline to get near‑SOTA judged quality without relying solely on closed APIs.

Evidence Ref: Abstract; Results (Individual Workflow Evaluation, All Collections)

Aggregated multi‑workflow responses were preferred more often in pairwise comparisons than GPT‑4o.

Numbers: Top aggregator win rate 8.77% vs GPT‑4o 5.89% (pairwise win rates)

Practical Use: Using multiple specialized workflows plus an orchestration agent can produce outputs that an LLM judge prefers over a single generic SOTA model on domain queries; whether human experts share that preference has not been tested.

Evidence Ref: Pairwise Model Evaluation (All Collections), Figure 4 and text

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
LLM‑Judge average score (1–10) | CRAG‑MoWs 7.12; GPT‑4o 7.59 | GPT‑4o 7.59 | −0.47 (CRAG‑MoW vs GPT‑4o) | All Collections (small molecules, polymers, reactions, NMR) | Abstract; Results: Individual Workflow Evaluation (All Collections) | Results: Individual Workflow Evaluation; Figure 3
Pairwise win rate (average) | Top Aggregator (mistral‑nemo) 8.77%; GPT‑4o 5.89% | GPT‑4o 5.89% | +2.88 percentage points (best aggregator vs GPT‑4o) | All Collections (pairwise comparisons across 10 questions) | Pairwise Model Evaluation (All Collections); Figure 4 | Results: Pairwise Model Evaluation; Supplemental Table 5

What To Try In 7 Days

Run a simple CRAG loop: index a small domain vector store (MoLFormer embeddings + Milvus), run 2 generator LLMs, and implement an aggregator that re‑ranks the retrieved documents with reciprocal rank fusion (RRF).

Add a completeness and hallucination checker to the Generator loop to re‑query or regenerate when checks fail; log iteration counts to measure cost.

Run quick A/B pairwise comparisons with an LLM judge (or a small group of domain experts) to benchmark the aggregator vs a single LLM baseline.
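The checker loop described in the steps above can be sketched as follows. This is a minimal outline under stated assumptions, not the authors' implementation: the callables `retrieve`, `generate`, `is_relevant`, `is_grounded`, and `rewrite` are hypothetical stand‑ins for a vector store lookup, an LLM call, and the relevance and hallucination checks, and the stub corpus exists only to make the example runnable.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CragResult:
    answer: str
    attempts: int
    checked: bool  # False when the revision limit was hit before checks passed
    log: List[str] = field(default_factory=list)


def crag_loop(query, retrieve, generate, is_relevant, is_grounded, rewrite,
              max_revisions=10):
    """Retrieve, check, generate, and retry until checks pass or the limit is hit."""
    log, answer = [], ""
    for attempt in range(1, max_revisions + 1):
        docs = [d for d in retrieve(query) if is_relevant(query, d)]
        if not docs:
            log.append(f"attempt {attempt}: no relevant documents, rewriting query")
            query = rewrite(query)
            continue
        answer = generate(query, docs)
        if is_grounded(answer, docs):
            log.append(f"attempt {attempt}: checks passed")
            return CragResult(answer, attempt, True, log)
        log.append(f"attempt {attempt}: grounding check failed, regenerating")
    # Limit reached: return the last answer but flag it as unchecked.
    return CragResult(answer, max_revisions, False, log)


# Stub components (all hypothetical) so the loop can be exercised offline.
corpus = {"aspirin smiles": ["CC(=O)OC1=CC=CC=C1C(=O)O"]}

def retrieve(q): return corpus.get(q, [])
def is_relevant(q, doc): return True
def generate(q, docs): return f"SMILES: {docs[0]}"
def is_grounded(answer, docs): return any(d in answer for d in docs)
def rewrite(q): return "aspirin smiles"

result = crag_loop("aspirin structure", retrieve, generate,
                   is_relevant, is_grounded, rewrite)
print(result.attempts, result.checked, result.log)
```

Returning `attempts` and a `checked` flag gives a direct way to log iteration counts for cost tracking and to route answers that hit the limit to human review instead of treating them as verified.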

Agent Features

Memory
vectorstore retrieval (Milvus) for short‑term retrieval memory
Planning
iterative retrieval and response generation
query rewriting / re‑retrieval when relevance checks fail
Tool Use
document relevance checking
hallucination detection
document fusion and reciprocal rank fusion (RRF)
Frameworks
LangChain
LangGraph
Milvus
Is Agentic

Yes

Architectures
Mixture-of-Workflows
Self-Corrective RAG (CRAG)
Aggregator-Generator orchestration
Collaboration
aggregator synthesizes outputs from multiple generators
reciprocal rank fusion integrates cross‑workflow evidence

Optimization Features

Infra Optimization
Milvus IVF_FLAT index (L2 and inner product hybrid search)
LangChain orchestration to reuse prompts and tools
System Optimization
Separate vector collections per modality (small molecules, polymers, reactions)
Use of MoLFormer chemistry embeddings and OpenCLIP image embeddings projected to 768 dims
Inference Optimization
Configurable recursion and revision limits to bound compute (recursion max 25, revisions max 10)
Hybrid search with equal modality weighting and RRF re‑ranking to reduce downstream cost
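The settings listed above could be gathered into a small configuration module such as the sketch below. Only the 768 dimensions, the IVF_FLAT index type, the L2 and inner product metrics, the equal weighting, and the two limits come from this summary. The `nlist` value, the dictionary keys, and the `validate_config` helper are assumptions added for illustration; the index dictionaries follow the parameter shape that pymilvus accepts for index creation.

```python
# Values marked "from the text" come from this summary; the rest are assumed.
EMBEDDING_DIM = 768   # from the text: embeddings projected to 768 dims
MAX_RECURSION = 25    # from the text
MAX_REVISIONS = 10    # from the text

# IVF_FLAT index parameters for both metrics named in the text.
# nlist=128 is a placeholder assumption, not a reported setting.
index_params = {
    "l2": {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}},
    "ip": {"index_type": "IVF_FLAT", "metric_type": "IP", "params": {"nlist": 128}},
}

# Equal weighting across modalities; the key names are hypothetical.
modality_weights = {"smiles": 0.5, "nmr_image": 0.5}


def validate_config():
    """Basic sanity checks before wiring the settings into a pipeline."""
    assert EMBEDDING_DIM > 0
    assert 0 < MAX_REVISIONS <= MAX_RECURSION
    assert abs(sum(modality_weights.values()) - 1.0) < 1e-9
    assert all(p["index_type"] == "IVF_FLAT" for p in index_params.values())
    return True


print(validate_config())
```

Keeping the limits next to the index settings makes the compute bound explicit and easy to tighten when a pilot shows that most queries finish in far fewer revisions.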

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluation relies on an LLM‑Judge (GPT‑4o‑mini); LLM-as‑judge bias and lack of extensive human labeling are noted.

Incomplete workflows concentrated in the chemical reaction collection indicate retrieval coverage gaps.

When Not To Use

Do not rely on CRAG‑MoW outputs for safety‑critical lab protocols without human expert review.

Avoid deployment as an unsupervised decision maker for high‑risk chemical synthesis or regulatory claims.

Failure Modes

Missing or low‑quality retrieval documents leading to 'NO RELEVANT RAG DOCUMENTS FOUND' or aborted outputs.

Over‑generation or hallucinations when revision limits are hit and models return unchecked answers.

Core Entities

Models

GPT-4o
mistral-nemo:12b-instruct
wizardlm2:7b
qwen2.5:7b-instruct
mistral:7b-instruct
gemma2:9b-instruct
mixtral:8x7b-instruct
phi3.5:3.8b-mini-instruct
llama3.1:8b-instruct

Metrics

LLM‑Judge average score (1–10)
Pairwise win rate (%)
Workflow completion rate (%)
CRAG iterations per step (counts)

Datasets

small molecules (250k)
polymers (250k)
chemical reactions (250k)
experimental NMR spectra (2,259)
NMRShiftDB2

Benchmarks

LLM-Judge (1–10 scoring)
Pairwise win‑rate comparisons
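For readers who want to reproduce a pairwise win‑rate benchmark on their own questions, the sketch below shows one way to tally it. The definition used here (wins as a share of all comparisons run, with ties counting for no one) is an assumption; this summary does not state how the paper normalizes its win rates. The `judge` callable is a hypothetical stand‑in for an LLM judge or a panel of experts.

```python
from collections import Counter
from itertools import combinations


def win_rates(models, questions, judge):
    """Return each model's wins as a percentage of all pairwise comparisons.

    judge(question, a, b) must return the preferred model name, or None for a tie.
    """
    wins, total = Counter(), 0
    for question in questions:
        for a, b in combinations(models, 2):
            total += 1
            winner = judge(question, a, b)
            if winner is not None:
                wins[winner] += 1
    if total == 0:
        return {m: 0.0 for m in models}
    return {m: 100.0 * wins[m] / total for m in models}


# Toy judge with fixed, made-up preferences so the example runs offline.
toy_scores = {"aggregator": 3, "baseline": 2, "generator": 2}

def toy_judge(question, a, b):
    if toy_scores[a] == toy_scores[b]:
        return None
    return a if toy_scores[a] > toy_scores[b] else b

rates = win_rates(["aggregator", "baseline", "generator"], ["q1", "q2"], toy_judge)
print(rates)
```

With many models in the comparison pool, each model's share of all comparisons is necessarily small, so rates should be compared against each other and not read as absolute preference levels.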