Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
1
Why It Matters For Business
You can build a domain‑focused, multi‑agent retrieval + LLM pipeline from open models whose judged output quality comes close to the closed SOTA (GPT‑4o) and whose aggregated answers win more pairwise comparisons. This reduces vendor lock‑in and enables on‑prem deployment with interpretable retrieval and model selection.
Summary TLDR
This paper introduces CRAG‑MoW (Mixture‑of‑Workflows), a multi‑agent system that runs multiple self‑corrective retrieval‑augmented generation (RAG) workflows and an orchestration agent to synthesize results. The authors index chemistry data (SMILES and NMR) with MoLFormer/OpenCLIP embeddings in Milvus, run 19 agentic workflows built from nine open LLMs plus a GPT‑4o baseline, and evaluate with an LLM‑Judge (GPT‑4o‑mini). CRAG‑MoWs scored near GPT‑4o on the LLM‑Judge (7.12 vs. 7.59) and won more pairwise comparisons (top aggregator win rate 8.77% vs GPT‑4o 5.89%). Chemical reactions were the hardest retrieval task (many incomplete runs). Code/data to be released on publication.
Problem Statement
Materials and chemistry tasks need up‑to‑date, structured retrieval plus multi‑step reasoning. Single LLMs often lack reliable retrieval, hallucination checks, and domain benchmarks. There is no standard framework to 1) run multiple LLMs with structured RAG, 2) self‑correct retrieval and generation, and 3) compare open models to closed SOTA on the same tasks.
Main Contribution
- CRAG‑MoW: a Mixture‑of‑Workflows architecture combining multiple self‑corrective RAG (CRAG) Generators with an Aggregator orchestration agent and reciprocal rank fusion (RRF) re‑ranking.
- Applied CRAG‑MoW to chemistry: structure search over 250k small molecules, 250k polymers, 250k reactions, plus 2,259 experimental NMR spectra and NMRShiftDB2; multi‑modal retrieval (SMILES + NMR images).
- Systematic evaluation using an LLM‑Judge (GPT‑4o‑mini) on per‑workflow 1–10 scoring and exhaustive pairwise comparisons; reported both average scores and pairwise win rates.
- Empirical result: properly orchestrated multi‑workflow systems built from open LLMs can match or be preferred to GPT‑4o on judged outputs, highlighting model‑specific strengths across chemical modalities.
Key Findings
- CRAG‑MoW workflows reach LLM‑Judge scores close to GPT‑4o on evaluated tasks.
- Aggregated multi‑workflow responses were preferred more often in pairwise comparisons than GPT‑4o.
- Chemical reaction queries caused the most incomplete CRAG runs and highest retrieval/rewrite counts.
Results
LLM‑Judge average score (1–10)
Pairwise win rate (average)
Workflow completion rate (chemical reaction collection)
Who Should Care
What To Try In 7 Days
- Run a simple CRAG loop: index a small, domain vector store (MoLFormer embeddings + Milvus), run 2 generator LLMs, and implement an aggregator that re‑ranks the retrieved documents with reciprocal rank fusion (RRF).
- Add a completeness and hallucination checker to the Generator loop to re‑query or regenerate when checks fail; log iteration counts to measure cost.
- Run quick A/B pairwise comparisons with an LLM judge (or a small group of domain experts) to benchmark the aggregator vs a single LLM baseline.
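The self-corrective loop described in the steps above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the `run_crag` function, the `CragResult` record, the callable names, and the `max_retrievals` default are all hypothetical. Only the revision cap of 10 and the "NO RELEVANT RAG DOCUMENTS FOUND" message come from this summary. In practice the callables would wrap a vector store and LLM calls; here they are stubs so the control flow can be run as is.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CragResult:
    answer: str
    completed: bool      # True only if an answer passed the grounding check
    retrievals: int      # retrieval attempts (logged to measure cost)
    generations: int     # generation attempts (logged to measure cost)


def run_crag(
    question: str,
    retrieve: Callable[[str], List[str]],
    is_relevant: Callable[[str, str], bool],
    rewrite: Callable[[str], str],
    generate: Callable[[str, List[str]], str],
    is_grounded: Callable[[str, List[str]], bool],
    max_retrievals: int = 3,   # assumed cap, not taken from the paper
    max_revisions: int = 10,   # revision cap listed in this summary
) -> CragResult:
    # Phase 1: retrieve, keep only relevant documents, rewrite the query on failure.
    query, docs, retrievals = question, [], 0
    while retrievals < max_retrievals and not docs:
        retrievals += 1
        docs = [d for d in retrieve(query) if is_relevant(question, d)]
        if not docs:
            query = rewrite(query)
    if not docs:
        return CragResult("NO RELEVANT RAG DOCUMENTS FOUND", False, retrievals, 0)

    # Phase 2: generate, check against the documents, regenerate on failure.
    answer, generations = "", 0
    while generations < max_revisions:
        generations += 1
        answer = generate(question, docs)
        if is_grounded(answer, docs):
            return CragResult(answer, True, retrievals, generations)
    # Limit hit: return the last answer but flag the run as incomplete.
    return CragResult(answer, False, retrievals, generations)


# Stub components standing in for a vector store and LLM calls.
CORPUS = {"aspirin smiles": ["CC(=O)OC1=CC=CC=C1C(=O)O"]}

def retrieve(query: str) -> List[str]:
    return CORPUS.get(query, [])

def is_relevant(question: str, doc: str) -> bool:
    return True

def rewrite(query: str) -> str:
    return "aspirin smiles"

def generate(question: str, docs: List[str]) -> str:
    return f"Aspirin: {docs[0]}"

def is_grounded(answer: str, docs: List[str]) -> bool:
    return any(d in answer for d in docs)

result = run_crag("What is the SMILES of aspirin?",
                  retrieve, is_relevant, rewrite, generate, is_grounded)
empty = run_crag("unknown", lambda q: [], is_relevant, lambda q: q,
                 generate, is_grounded)
```

Returning the counters alongside the answer makes it straightforward to log iteration counts per question and to report a completion rate across a benchmark set.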
Agent Features
Memory
- vectorstore retrieval (Milvus) for short‑term retrieval memory
Planning
- iterative retrieval and response generation
- query rewriting / re‑retrieval when relevance checks fail
Tool Use
- document relevance checking
- hallucination detection
- document fusion and reciprocal rank fusion (RRF)
Frameworks
- LangChain
- LangGraph
- Milvus
Is Agentic
true
Architectures
- Mixture-of-Workflows
- Self-Corrective RAG (CRAG)
- Aggregator-Generator orchestration
Collaboration
- aggregator synthesizes outputs from multiple generators
- reciprocal rank fusion integrates cross‑workflow evidence
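Reciprocal rank fusion itself is simple enough to show in a few lines. The sketch below assumes each workflow returns a ranked list of document identifiers; the function name, the example identifiers, and the smoothing constant `k=60` (a conventional default for RRF, not a value reported in this summary) are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    with ranks starting at 1; documents are then sorted by total score.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Three hypothetical workflows, each with its own retrieval ranking.
rankings = [
    ["mol_a", "mol_b", "mol_c"],
    ["mol_b", "mol_a", "mol_d"],
    ["mol_b", "mol_c", "mol_a"],
]
fused = reciprocal_rank_fusion(rankings)
print(fused)  # ['mol_b', 'mol_a', 'mol_c', 'mol_d']
```

Because RRF uses only ranks, it can combine results from workflows whose similarity scores are not directly comparable, such as L2 distances and inner products.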
Optimization Features
Infra Optimization
- Milvus IVF_FLAT index (L2 and inner product hybrid search)
- LangChain orchestration to reuse prompts and tools
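As a rough illustration of the index setup, Milvus index and search parameters are typically passed as plain dictionaries. The fragment below is a sketch under assumptions: the `nlist` and `nprobe` values are placeholders, and which modality uses L2 versus inner product is not stated in this summary, so the variable names are hypothetical.

```python
# Index parameters for an IVF_FLAT index, one per vector field.
# nlist (number of clusters) is an assumed placeholder value.
l2_index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 128},
}
ip_index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "IP",  # inner product
    "params": {"nlist": 128},
}

# Search-time parameters; nprobe (clusters scanned per query) trades
# recall against latency and is also an assumed placeholder.
l2_search_params = {"metric_type": "L2", "params": {"nprobe": 16}}
ip_search_params = {"metric_type": "IP", "params": {"nprobe": 16}}
```

The metric used at search time must match the metric the index was built with.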
System Optimization
- Separate vector collections per modality (small molecules, polymers, reactions)
- Use of MoLFormer chemistry embeddings and OpenCLIP image embeddings projected to 768 dims
Inference Optimization
- Configurable recursion and revision limits to bound compute (recursion max 25, revisions max 10)
- Hybrid search with equal modality weighting and RRF re‑ranking to reduce downstream cost
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation relies on an LLM‑Judge (GPT‑4o‑mini); LLM-as‑judge bias and lack of extensive human labeling are noted.
- Incomplete workflows concentrated in the chemical reaction collection indicate retrieval coverage gaps.
- Results use only 10 benchmark questions per collection; the small question set limits statistical generality.
- Some open LLMs required fallback tool models for tool‑use tasks, adding heterogeneity to comparisons.
When Not To Use
- Do not rely on CRAG‑MoW outputs for safety‑critical lab protocols without human expert review.
- Avoid deployment as an unsupervised decision maker for high‑risk chemical synthesis or regulatory claims.
- Not ideal when low latency is the top priority (multi‑agent orchestration adds overhead).
Failure Modes
- Missing or low‑quality retrieval documents leading to 'NO RELEVANT RAG DOCUMENTS FOUND' or aborted outputs.
- Over‑generation or hallucinations when revision limits are hit and models return unchecked answers.
- The Aggregator may synthesize conflicting or low‑quality answers when multiple Generators fail in the same way.
Core Entities
Models
- GPT-4o
- mistral-nemo:12b-instruct
- wizardlm2:7b
- qwen2.5:7b-instruct
- mistral:7b-instruct
- gemma2:9b-instruct
- mixtral:8x7b-instruct
- phi3.5:3.8b-mini-instruct
- llama3.1:8b-instruct
Metrics
- LLM‑Judge average score (1–10)
- Pairwise win rate (%)
- Workflow completion rate (%)
- CRAG iterations per step (counts)
Datasets
- small molecules (250k)
- polymers (250k)
- chemical reactions (250k)
- experimental NMR spectra (2,259)
- NMRShiftDB2
Benchmarks
- LLM-Judge (1–10 scoring)
- Pairwise win‑rate comparisons
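The pairwise comparison benchmark can be prototyped with a small helper. This is a sketch under assumptions: the `pairwise_win_rates` function, the `judge` callable and its return values, and the normalisation (wins divided by the number of comparisons a system took part in) are hypothetical, and the summary does not specify how the reported win rates were normalised. The toy judge simply prefers the longer answer; a real setup would call an LLM judge or collect expert votes.

```python
from itertools import combinations
from typing import Callable, Dict


def pairwise_win_rates(
    answers: Dict[str, str],
    judge: Callable[[str, str], str],
) -> Dict[str, float]:
    """Compare every pair of systems once and return wins / comparisons.

    `judge(first, second)` must return "first", "second", or "tie".
    """
    wins = {name: 0 for name in answers}
    seen = {name: 0 for name in answers}
    for a, b in combinations(answers, 2):
        verdict = judge(answers[a], answers[b])
        seen[a] += 1
        seen[b] += 1
        if verdict == "first":
            wins[a] += 1
        elif verdict == "second":
            wins[b] += 1
        # A tie counts as a comparison but gives neither side a win.
    return {n: (wins[n] / seen[n] if seen[n] else 0.0) for n in answers}


def toy_judge(first: str, second: str) -> str:
    # Placeholder preference: the longer answer wins.
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


answers = {
    "aggregator": "long detailed answer",
    "baseline": "short",
    "generator_a": "medium text",
}
rates = pairwise_win_rates(answers, toy_judge)
```

With an LLM judge, it is worth running each pair in both orders, since judged preferences can depend on which answer is shown first.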

