CRAG‑MoW: a multi-agent, self‑corrective RAG system that benchmarks open LLMs on chemical search

Overview

Decision Snapshot: Ready For Pilot

CRAG‑MoW is a practical, modular system for domain retrieval+generation; evidence includes judged evaluations and pairwise wins, but results rely on an LLM judge and limited question sets, so human expert validation is still needed before full production.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 2/3

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Tiffany J. Callahan, Nathaniel H. Park, Sara Capponi

Why It Matters For Business

You can build a domain‑focused, multi‑agent retrieval+LLM pipeline from open models and get LLM‑judged output quality close to closed SOTA (7.12 vs 7.59 for GPT‑4o) while winning more pairwise comparisons (8.77% vs 5.89%); this reduces vendor lock‑in and enables on‑prem, interpretable retrieval and model selection.

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist, Engineering Lead

Summary TLDR

This paper introduces CRAG‑MoW (Mixture‑of‑Workflows), a multi‑agent system that runs multiple self‑corrective retrieval‑augmented generation (RAG) workflows and uses an orchestration agent to synthesize their results. The authors index chemistry data (SMILES and NMR) with MoLFormer/OpenCLIP embeddings in Milvus, run 19 agentic workflows built from nine open LLMs plus a GPT‑4o baseline, and evaluate with an LLM‑Judge (GPT‑4o‑mini). CRAG‑MoWs scored near GPT‑4o on the LLM‑Judge (7.12 vs 7.59) and won more pairwise comparisons (top aggregator win rate 8.77% vs GPT‑4o 5.89%). Chemical reactions were the hardest retrieval task, with many incomplete runs. The authors state that code and data will be released on publication.

Problem Statement

Materials and chemistry tasks need up‑to‑date, structured retrieval plus multi‑step reasoning. Single LLMs often lack reliable retrieval, hallucination checks, and domain benchmarks. There is no standard framework to 1) run multiple LLMs with structured RAG, 2) self‑correct retrieval and generation, and 3) compare open models to closed SOTA on the same tasks.

Main Contribution

CRAG‑MoW: a Mixture‑of‑Workflows architecture combining multiple self‑corrective RAG (CRAG) Generators with an Aggregator orchestration agent and reciprocal rank fusion (RRF) re‑ranking.

Applied CRAG‑MoW to chemistry: structure search over 250k small molecules, 250k polymers, 250k reactions, plus 2,259 experimental NMR spectra and NMRShiftDB2; multi‑modal retrieval (SMILES + NMR images).

Key Findings

CRAG‑MoW workflows reach LLM‑Judge scores close to GPT‑4o on evaluated tasks.

Numbers: CRAG‑MoWs 7.12 vs GPT‑4o 7.59 (LLM‑Judge average, 1–10)

Practical Use: You can assemble open LLMs in a self‑corrective retrieval pipeline to get near‑SOTA judged quality without relying solely on closed APIs.

Evidence Ref: Abstract; Results (Individual Workflow Evaluation, All Collections)

Aggregated multi‑workflow responses were preferred more often in pairwise comparisons than GPT‑4o.

Numbers: Top aggregator win rate 8.77% vs GPT‑4o 5.89% (pairwise win rates)

Practical Use: Using multiple specialized workflows plus an orchestration agent can produce outputs that an LLM judge prefers over a single generic SOTA model on domain queries; preference by human experts has not yet been validated.

Evidence Ref: Pairwise Model Evaluation (All Collections), Figure 4 and text

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
LLM‑Judge average score (1–10)	CRAG‑MoWs 7.12; GPT‑4o 7.59	GPT‑4o 7.59	−0.47 (CRAG‑MoW vs GPT‑4o)	All Collections (small molecules, polymers, reactions, NMR)	Abstract; Results: Individual Workflow Evaluation — All Collections	Results — Individual Workflow Evaluation; Figure 3
Pairwise win rate (average)	Top Aggregator (mistral‑nemo) 8.77%; GPT‑4o 5.89%	GPT‑4o 5.89%	+2.88 percentage points (best aggregator vs GPT‑4o)	All Collections (pairwise comparisons across 10 questions)	Pairwise Model Evaluation — All Collections; Figure 4	Results — Pairwise Model Evaluation; Supplemental Table 5
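
The Delta column can be reproduced directly from the reported values. The short check below uses only the numbers in the table; the relative gap at the end is a derived figure for context, not a number reported by the paper.

```python
# Values copied from the Results table above.
crag_mow_judge = 7.12  # LLM-Judge average (1-10), CRAG-MoW workflows
gpt4o_judge = 7.59     # LLM-Judge average (1-10), GPT-4o baseline
top_agg_win = 8.77     # pairwise win rate (%), top aggregator
gpt4o_win = 5.89       # pairwise win rate (%), GPT-4o

# Absolute deltas, as listed in the Delta column.
judge_delta = round(crag_mow_judge - gpt4o_judge, 2)
win_delta_pp = round(top_agg_win - gpt4o_win, 2)

# Derived for context only: judge-score gap relative to the baseline.
judge_gap_pct = round(100 * (gpt4o_judge - crag_mow_judge) / gpt4o_judge, 1)

print(judge_delta)    # -0.47
print(win_delta_pp)   # 2.88 percentage points
print(judge_gap_pct)  # 6.2 (% below the GPT-4o score)
```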

What To Try In 7 Days

Run a simple CRAG loop: index a small domain vector store (MoLFormer embeddings + Milvus), run 2 generator LLMs, and implement an aggregator that ranks the retrieved documents with reciprocal rank fusion (RRF).

Add a completeness and hallucination checker to the Generator loop to re‑query or regenerate when checks fail; log iteration counts to measure cost.

Run quick A/B pairwise comparisons with an LLM judge (or a small group of domain experts) to benchmark the aggregator vs a single LLM baseline.
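
The pairwise comparison in the last step can be tallied with a few lines of code. This is a minimal sketch, assuming each judgment is recorded as a `(model_a, model_b, winner)` tuple and that a model's win rate is its wins divided by the comparisons it took part in. That definition, the function name `pairwise_win_rates`, and the sample judgments are all illustrative assumptions; the paper's own normalization may differ.

```python
from collections import Counter


def pairwise_win_rates(judgments):
    """Return {model: win rate in %} from (model_a, model_b, winner) records.

    winner is model_a, model_b, or None for a tie. The win rate is wins
    divided by comparisons played (an assumed definition, see lead-in).
    """
    wins, played = Counter(), Counter()
    for model_a, model_b, winner in judgments:
        played[model_a] += 1
        played[model_b] += 1
        if winner is not None:
            wins[winner] += 1
    return {model: 100.0 * wins[model] / played[model] for model in played}


# Made-up judgments for illustration only; not data from the paper.
judgments = [
    ("aggregator", "baseline", "aggregator"),
    ("aggregator", "baseline", "baseline"),
    ("aggregator", "baseline", "aggregator"),
    ("aggregator", "baseline", None),  # tie: counts as played, no win
]
rates = pairwise_win_rates(judgments)
print(rates)  # {'aggregator': 50.0, 'baseline': 25.0}
```

Recording ties explicitly keeps the denominator honest: a judge that often declines to pick a winner lowers every model's win rate rather than silently shrinking the sample.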

Agent Features

Memory

vectorstore retrieval (Milvus) for short‑term retrieval memory

Planning

iterative retrieval and response generation

query rewriting / re‑retrieval when relevance checks fail

Tool Use

document relevance checking

hallucination detection

document fusion and reciprocal rank fusion (RRF)
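
The planning and tool-use features above combine into a bounded self-corrective loop. The sketch below is one possible shape for it, not the authors' implementation: `run_crag`, `CragResult`, and every callable argument are hypothetical names, and in a real system the callables would wrap the retriever and LLM-backed graders. The default of 10 revisions mirrors the revision limit reported in this summary.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CragResult:
    answer: Optional[str]  # None when no checked answer was produced
    status: str            # "ok", "no_relevant_documents", or "revision_limit"
    iterations: int        # loop passes used, logged to estimate cost


def run_crag(
    question: str,
    retrieve: Callable[[str], List[str]],
    is_relevant: Callable[[str, str], bool],
    rewrite_query: Callable[[str], str],
    generate: Callable[[str, List[str]], str],
    is_grounded: Callable[[str, List[str]], bool],
    max_revisions: int = 10,
) -> CragResult:
    query, found_docs, iterations = question, False, 0
    for _ in range(max_revisions):
        iterations += 1
        # Keep only documents the relevance grader accepts.
        docs = [d for d in retrieve(query) if is_relevant(question, d)]
        if not docs:
            query = rewrite_query(query)  # re-retrieve with a new query
            continue
        found_docs = True
        answer = generate(question, docs)
        if is_grounded(answer, docs):     # hallucination check passed
            return CragResult(answer, "ok", iterations)
    # Limit reached: report the failure instead of an unchecked answer.
    status = "revision_limit" if found_docs else "no_relevant_documents"
    return CragResult(None, status, iterations)


# Toy stand-ins for the retriever, graders, and generator.
corpus = {"aspirin smiles": ["CC(=O)OC1=CC=CC=C1C(=O)O is the SMILES for aspirin"]}
result = run_crag(
    "aspirin",
    retrieve=lambda q: corpus.get(q, []),
    is_relevant=lambda q, d: q in d,
    rewrite_query=lambda q: q + " smiles",
    generate=lambda q, docs: docs[0],
    is_grounded=lambda a, docs: a in docs,
)
print(result.status, result.iterations)  # ok 2
```

Returning an explicit status when the limit is hit, rather than the last draft, addresses the failure mode in which models return unchecked answers once revisions run out.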

Frameworks

LangChain

LangGraph

Milvus

Is Agentic

Yes

Architectures

Mixture-of-Workflows

Self-Corrective RAG (CRAG)

Aggregator-Generator orchestration

Collaboration

aggregator synthesizes outputs from multiple generators

reciprocal rank fusion integrates cross‑workflow evidence
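
Reciprocal rank fusion scores each document as the sum of 1 / (k + rank) over every ranked list that contains it, so documents ranked well by several workflows rise to the top. The sketch below shows the standard formulation; the constant k = 60 is the conventional default and an assumption here, since this summary does not state the value the authors used. The document IDs are placeholders.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs into one list, best first.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document ID.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Placeholder retrieval results from two workflows.
workflow_a = ["doc1", "doc2", "doc3"]
workflow_b = ["doc2", "doc4", "doc1"]
fused = reciprocal_rank_fusion([workflow_a, workflow_b])
print(fused)  # ['doc2', 'doc1', 'doc4', 'doc3']
```

Because RRF uses only ranks, it can merge results from collections whose raw similarity scores are not comparable, such as L2 distances and inner products.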

Optimization Features

Infra Optimization

Milvus IVF_FLAT index (L2 and inner product hybrid search)

LangChain orchestration to reuse prompts and tools
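
For orientation, an IVF_FLAT index in Milvus is described by a small parameter dictionary. The fragment below shows that shape for the two metrics named above, written as plain Python dicts so it runs without a Milvus server. The `nlist` and `nprobe` values are illustrative assumptions, not settings reported for CRAG‑MoW.

```python
# Dict shape accepted by pymilvus Collection.create_index(field_name, index_params).
# nlist = number of clusters the vectors are partitioned into (assumed value).
ivf_flat_l2 = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",  # Euclidean distance
    "params": {"nlist": 128},
}
ivf_flat_ip = {
    "index_type": "IVF_FLAT",
    "metric_type": "IP",  # inner product
    "params": {"nlist": 128},
}

# Search-time parameters: nprobe = number of clusters scanned per query
# (assumed value; higher means better recall and slower search).
search_l2 = {"metric_type": "L2", "params": {"nprobe": 16}}
search_ip = {"metric_type": "IP", "params": {"nprobe": 16}}

for name, cfg in [("L2", ivf_flat_l2), ("IP", ivf_flat_ip)]:
    print(name, cfg["index_type"], cfg["params"])
```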

System Optimization

Separate vector collections per modality (small molecules, polymers, reactions)

Use of MoLFormer chemistry embeddings and OpenCLIP image embeddings projected to 768 dims

Inference Optimization

Configurable recursion and revision limits to bound compute (recursion max 25, revisions max 10)

Hybrid search with equal modality weighting and RRF re‑ranking to reduce downstream cost

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Evaluation relies on an LLM‑Judge (GPT‑4o‑mini); LLM-as‑judge bias and lack of extensive human labeling are noted.

Incomplete workflows concentrated in the chemical reaction collection indicate retrieval coverage gaps.

When Not To Use

Do not rely on CRAG‑MoW outputs for safety‑critical lab protocols without human expert review.

Avoid deployment as an unsupervised decision maker for high‑risk chemical synthesis or regulatory claims.

Failure Modes

Missing or low‑quality retrieval documents leading to 'NO RELEVANT RAG DOCUMENTS FOUND' or aborted outputs.

Over‑generation or hallucinations when revision limits are hit and models return unchecked answers.

Core Entities

Models

GPT-4o

mistral-nemo:12b-instruct

wizardlm2:7b

qwen2.5:7b-instruct

mistral:7b-instruct

gemma2:9b-instruct

mixtral:8x7b-instruct

phi3.5:3.8b-mini-instruct

llama3.1:8b-instruct

Metrics

LLM‑Judge average score (1–10)

Pairwise win rate (%)

Workflow completion rate (%)

CRAG iterations per step (counts)

Datasets

small molecules (250k)

polymers (250k)

chemical reactions (250k)

experimental NMR spectra (2,259)

NMRShiftDB2

Benchmarks

LLM-Judge (1–10 scoring)

Pairwise win‑rate comparisons
