Domain-specific AI agents collaborate to find cross-domain knowledge

April 12, 2024 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.5

Cost Impact Score

0.4

Citation Count

7

Authors

Shiva Aryal, Tuyen Do, Bisesh Heyojoo, Sandeep Chataut, Bichar Dip Shrestha Gurung, Venkataramana Gadhamshetty, Etienne Gnimpieba

Why It Matters For Business

Orchestrated domain-specific agents can raise answer accuracy for cross-field queries, trading speed for higher-quality, context-aware results.

Summary TLDR

This paper builds and compares four multi-agent workflows that let domain-specialist AI agents collaborate on interdisciplinary questions. Each agent was seeded with roughly 1,000 papers, and the authors compare a MetaGPT-orchestrated RAG flow, a sequential OpenAI Assistant flow, a MetaGPT+OpenAI Assistant flow, and an unmodified OpenAI GPT baseline. The MetaGPT+OpenAI+RAG flow produced the highest answer quality (ROUGE-1 precision 0.49), while the unmodified OpenAI baseline was fastest (64.23 tokens/sec) but scored low on quality (ROUGE-1 precision 0.06). Results come from a small pilot dataset and expert ratings; the authors expect the trend to become smoother with more training data.

Problem Statement

AI models are strong inside single disciplines but struggle to synthesize knowledge across fields. The paper asks whether multiple domain-specialist AI agents, coordinated in different workflows, can combine their strengths to answer interdisciplinary queries more accurately and efficiently.

Main Contribution

Design and implement a multi-AI agent platform using domain-specific agents (Boron Nitride, Electrochemical, Bandgap, Nanomaterial, AI).

Compare four workflows: MetaGPT+OpenAI+RAG, sequential OpenAI Assistant, MetaGPT+OpenAI Assistant, and an unmodified OpenAI baseline.

Train each agent on roughly 1,000 research papers and run a small pilot evaluation with expert judgments and automatic metrics.

Report trade-offs: MetaGPT+OpenAI+RAG gave best answer quality; baseline OpenAI gave highest token throughput.

Key Findings

Agents were seeded with domain literature to create domain-specific expertise.

Numbers: ≈1,000 papers per agent (Section 2.1)

MetaGPT+OpenAI+RAG workflow produced highest answer quality by automatic and expert measures.

Numbers: ROUGE-1 precision = 0.49 for Flow 1 (Section 3.3)

The unmodified OpenAI baseline (Flow 4) was far faster but much less precise than Flow 1.

Numbers: Flow 4 tokens/sec = 64.23; ROUGE-1 precision = 0.06 (Section 3.3)

Semantic overlap (cosine similarity) favored the MetaGPT+RAG workflow slightly.

Numbers: Cosine similarity: Flow 1 = 0.26 vs Flows 2 and 3 = 0.22, Flow 4 = 0.25 (Section 3.3)

Study used a small pilot dataset; authors qualify results as trends needing more data.

Numbers: Described as a 'small pilot', with the 'trend expected to be more smooth' given more data (Abstract, Conclusions)

Results

Tokens per second

Value: Flow 1: 8.53; Flow 2: 7.63; Flow 3: 8.50; Flow 4: 64.23

ROUGE-1 precision

Value: Flow 1: 0.49; Flow 2: 0.05; Flow 3: 0.05; Flow 4: 0.06

Cosine similarity

Value: Flow 1: 0.26; Flow 2: 0.22; Flow 3: 0.22; Flow 4: 0.25

Expert quality rating

Value: Flow 1 highest by expert judgment; Flow 3 second; Flows 2 and 4 lower
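
To make the speed versus quality trade-off concrete, the reported figures can be turned into simple ratios. The snippet below is only an arithmetic sketch over the numbers listed in the Results section; the `ratio` helper and the dictionary names are illustrative and not taken from the paper.

```python
# Figures copied from the Results section of this summary.
tokens_per_sec = {"Flow1": 8.53, "Flow2": 7.63, "Flow3": 8.50, "Flow4": 64.23}
rouge1_precision = {"Flow1": 0.49, "Flow2": 0.05, "Flow3": 0.05, "Flow4": 0.06}


def ratio(table, a, b):
    """Return how many times larger table[a] is than table[b]."""
    return table[a] / table[b]


# Baseline (Flow 4) throughput relative to the best-quality flow (Flow 1).
speed_gain = ratio(tokens_per_sec, "Flow4", "Flow1")
# Best-quality flow (Flow 1) precision relative to the baseline (Flow 4).
quality_gain = ratio(rouge1_precision, "Flow1", "Flow4")

print(f"Flow 4 is about {speed_gain:.1f}x faster than Flow 1")
print(f"Flow 1 has about {quality_gain:.1f}x the ROUGE-1 precision of Flow 4")
```

On these pilot numbers the baseline is roughly 7.5 times faster, while the MetaGPT+OpenAI+RAG flow is roughly 8.2 times more precise on ROUGE-1. Given the small dataset, these ratios are indicative only.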

Who Should Care

What To Try In 7 Days

Run a small pilot: build one domain agent with ~500–1,000 docs using your internal text.

Compare a MetaGPT-orchestrated RAG flow vs a plain LLM baseline on 10 real queries.

Measure tokens/sec and ROUGE/cosine to see speed vs quality trade-offs.
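
A minimal sketch of the three measurements, using only the Python standard library, might look like the following. It rests on several assumptions: a whitespace tokenizer stands in for a real tokenizer, cosine similarity is computed on bag-of-words counts instead of the embeddings a production setup would likely use, and `fake_generate` is a hypothetical stand-in for a call to your model. It is not the paper's evaluation code.

```python
import math
import time
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for a real tokenizer."""
    return text.lower().split()


def rouge1_precision(candidate, reference):
    """Share of candidate unigrams that also occur in the reference (clipped counts)."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[tok]) for tok, count in cand.items())
    return overlap / total


def cosine_similarity(a, b):
    """Cosine similarity between bag-of-words count vectors (assumption: no embeddings)."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def tokens_per_second(generate, prompt):
    """Time one call to `generate` and return (answer, tokens per second)."""
    start = time.perf_counter()
    answer = generate(prompt)
    elapsed = time.perf_counter() - start
    rate = len(tokenize(answer)) / elapsed if elapsed > 0 else float("inf")
    return answer, rate


def fake_generate(prompt):
    """Hypothetical model call; replace with your own workflow."""
    return "boron nitride has a narrow bandgap"


reference = "boron nitride has a wide bandgap"
answer, rate = tokens_per_second(fake_generate, "Describe boron nitride.")
print("ROUGE-1 precision:", round(rouge1_precision(answer, reference), 3))
print("Cosine similarity:", round(cosine_similarity(answer, reference), 3))
print("Tokens/sec:", round(rate, 1))
```

Running the same three functions over each workflow's answers to the same 10 queries gives a small table comparable in shape to the one in the Results section.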

Agent Features

Memory

  • Short-term context passing between agents
  • Automatic document chunking and indexing (Assistant)

Planning

  • Sequential passing of context (pipeline order)
  • Orchestrated flow managed by MetaGPT
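
The sequential planning style can be illustrated with a short sketch in which each agent receives the question plus everything earlier agents wrote. All names here (`Agent`, `run_sequential`, `make_stub`) are hypothetical, the agents are stubs that make no model calls, and the example question is invented; this does not use or reproduce the MetaGPT or OpenAI Assistant APIs.

```python
from typing import Callable, List, Tuple

# Hypothetical agent type: takes (question, context so far), returns its contribution.
Agent = Callable[[str, str], str]


def run_sequential(question: str, agents: List[Tuple[str, Agent]]) -> str:
    """Call agents in pipeline order, appending each answer to the shared context."""
    context = ""
    for name, agent in agents:
        contribution = agent(question, context)
        context += f"[{name}] {contribution}\n"
    return context


def make_stub(domain: str) -> Agent:
    """Build a stub agent that reports how many earlier notes it received."""
    def agent(question: str, context: str) -> str:
        seen = context.count("\n")  # one line per earlier contribution
        return f"{domain} view on '{question}' (saw {seen} earlier notes)"
    return agent


# Domain names taken from the paper's agent list; the question is illustrative.
pipeline = [(d, make_stub(d)) for d in ("Boron Nitride", "Electrochemical", "Bandgap")]
transcript = run_sequential("Which coatings resist corrosion?", pipeline)
print(transcript)
```

In an orchestrated flow, a coordinator would decide which agent runs next instead of following a fixed list, at the cost of extra coordination overhead.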

Tool Use

  • RAG retriever + generator
  • Embedding models and vector search
  • Elasticsearch (likely)
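
As a rough illustration of the retriever half of a RAG flow, the sketch below ranks text chunks against a query and assembles a prompt for a generator. It is a toy under stated assumptions: `embed` is a bag-of-words count vector and not a real embedding model, the search is a linear scan instead of a vector index or Elasticsearch, and the chunks and prompt wording are invented for the example.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': bag-of-words counts (assumption, not a real embedding model)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query (linear scan)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, chunks, k=2):
    """Assemble a generator prompt from the retrieved chunks (wording is illustrative)."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Invented example chunks standing in for indexed paper passages.
chunks = [
    "hexagonal boron nitride is a wide bandgap insulator",
    "electrochemical impedance measurements help assess corrosion resistance",
    "graphene is a zero bandgap nanomaterial",
]
top = retrieve("bandgap of boron nitride", chunks, k=1)
print(top)
print(build_prompt("bandgap of boron nitride", chunks, k=2))
```

The prompt returned by `build_prompt` would then be passed to whichever language model serves as the generator.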

Frameworks

  • MetaGPT
  • OpenAI Assistant API
  • RAG

Is Agentic

true

Architectures

  • ReAct-style observe-think-act
  • Orchestrated multi-agent (MetaGPT)
  • RAG generator-retriever

Collaboration

  • Sequential agent chains
  • MetaGPT orchestration (context sharing)

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Evaluation is a small pilot; numbers are preliminary and not statistically robust.
  • No public code or dataset is provided for independent verification.
  • Only five domain agents were tested, limiting claims about general cross-domain scaling.
  • Expert ratings are mentioned but details of the human evaluation protocol are limited.

When Not To Use

  • For low-risk or latency-sensitive tasks where speed matters more than domain accuracy.
  • When you lack domain documents to seed agents (the approach needs hundreds to thousands of papers per agent).
  • If you require reproducible public benchmarks; the paper shares no code or data.

Failure Modes

  • Wrong or ambiguous retrievals can lead to incorrect answers; the authors trigger a web search to mitigate this.
  • Coordination overhead may slow systems as agent count or domain breadth grows.
  • Quality depends on domain corpus; noisy or small corpora reduce effectiveness.

Core Entities

Models

  • OpenAI GPT
  • OpenAI Assistant
  • MetaGPT
  • Retriever+Generator RAG
  • Embedding model

Metrics

  • ROUGE-1 precision
  • Cosine similarity
  • Tokens per second

Datasets

  • Small pilot evaluation dataset (not released)
  • ≈1000 research papers per agent (domain corpora)