Overview
The approach is practical for pipelines where the retriever can stay fixed. Expect a modest increase in training cost and consistent F1 gains on multi-hop QA when the modules are jointly optimized.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
Jointly training the modules of a retrieval pipeline toward final-answer quality improves factuality and yields consistent F1 gains at no extra inference cost, which makes it attractive when accuracy matters more than additional training compute.
Summary TLDR
MMOA-RAG models a Retrieval-Augmented Generation (RAG) pipeline as a cooperative multi-agent system. The Query Rewriter (QR), Selector and Generator are treated as LLM agents and jointly fine-tuned with Multi-Agent PPO (MAPPO), using the final-answer F1 as a shared reward. On HotpotQA, 2WikiMultihopQA and AmbigQA (Llama-3-8B-Instruct backbone; Contriever, BGE and E5 retrievers), MMOA-RAG improves F1 by roughly 1.8 to 2.7 points over strong baselines. Warm-start SFT, parameter sharing and simple penalties stabilize training. Training time grows linearly with the number of agents; inference cost is unchanged.
Problem Statement
RAG pipelines consist of several modules that are typically trained separately. Independent supervised fine-tuning leaves each module's objective misaligned with final answer quality. Prior RL approaches focus on two-module pipelines or single-module updates and fail to model the complex, collaborative interactions among multiple modules in modern RAG systems.
Main Contribution
Formulate multi-module RAG as a cooperative multi-agent RL problem in which the Query Rewriter, Selector and Generator are the agents.
Introduce MMOA-RAG: warm-start SFT followed by joint MAPPO training with a shared final-answer F1 reward and simple output penalties.
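The shared-reward idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `agent_rewards`, the agent keys, the penalty value, and the choice to subtract each agent's penalty from the shared F1 are all hypothetical.

```python
from typing import Dict

# Hypothetical agent names; the summary only names the three roles.
AGENTS = ("query_rewriter", "selector", "generator")


def agent_rewards(shared_f1: float, penalties: Dict[str, float]) -> Dict[str, float]:
    """Return one reward per agent: the shared F1 minus that agent's own penalty.

    Assumption: penalties are applied per agent by simple subtraction.
    Agents without an entry in `penalties` receive the shared F1 unchanged.
    """
    return {name: shared_f1 - penalties.get(name, 0.0) for name in AGENTS}


# Example: the final answer scored F1 = 0.5, and only the Selector was penalized.
rewards = agent_rewards(0.5, {"selector": 0.5})
```

Because every agent sees the same F1 term, an improvement by one module only pays off if the final answer improves, which is the cooperative incentive the formulation relies on.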
Key Findings
Joint multi-agent optimization improves final-answer F1 over strong baselines on three QA benchmarks.
MAPPO on top of SFT yields larger gains than SFT alone, especially on multi-hop tasks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| F1 | 48.29 | Best baseline F1 46.49 (RetRobust/BGM) | +1.80 | HotpotQA | Table 1: MMOA-RAG F1 48.29 vs baselines | Table 1 |
| F1 | 46.40 | Best baseline F1 44.51 | +1.89 | 2WikiMultihopQA | Table 1: MMOA-RAG F1 46.40 vs baselines | Table 1 |
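The Delta column can be checked directly from the Value and Baseline columns. The short snippet below only recomputes the two rows shown in the table; the dictionary layout is an illustrative choice.

```python
# (MMOA-RAG F1, best baseline F1) for the two rows listed in the table above.
results = {
    "HotpotQA": (48.29, 46.49),
    "2WikiMultihopQA": (46.40, 44.51),
}

# Round to two decimals to avoid floating-point noise in the subtraction.
deltas = {name: round(ours - base, 2) for name, (ours, base) in results.items()}

for name, delta in deltas.items():
    print(f"{name}: +{delta:.2f} F1")
```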
What To Try In 7 Days
Warm-start QR, Selector and Generator with SFT on your QA pairs.
Freeze your retriever; run parameter-shared MAPPO to jointly train the three modules using answer F1 as reward.
Measure final F1/EM and compare to SFT-only; check convergence curves for stability.
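For the measurement step, a token-level F1 and exact-match (EM) scorer can be sketched as below. This is a simplified version that assumes lowercasing and whitespace tokenization only; common QA evaluation scripts also strip punctuation and articles, so scores from this sketch may differ slightly from published numbers. The function names are illustrative.

```python
from collections import Counter


def _tokens(text: str) -> list:
    """Simplified normalization (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred, ref = _tokens(prediction), _tokens(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(_tokens(prediction) == _tokens(reference))


def mean_score(metric, pairs) -> float:
    """Average a metric over (prediction, reference) pairs."""
    pairs = list(pairs)
    return sum(metric(p, r) for p, r in pairs) / len(pairs)
```

Running `mean_score(token_f1, pairs)` on the same evaluation set for the SFT-only and the jointly trained checkpoints gives the comparison described above.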
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
The retriever is kept fixed: the method does not optimize the first-stage retriever.
Training cost grows linearly with number of agents; costs may be noticeable beyond three agents.
When Not To Use
When you need to jointly train or change the first-stage retriever.
When the training budget is extremely limited or fast training turnaround is required.
Failure Modes
Agents may learn degenerate behaviors, such as generating excessive sub-questions, and incur penalties.
Selector outputs bad ID formats or duplicates, triggering negative rewards and unstable training.
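Both failure modes above can be caught with simple output checks before the reward is assigned. The sketch below is hypothetical: it assumes the Selector emits comma-separated integer document IDs and that the Query Rewriter returns a list of sub-questions; the function names, the penalty value of 0.5 and the limit of 4 sub-questions are illustrative, not values taken from the source.

```python
from typing import List


def selector_penalty(output: str, num_candidates: int, penalty: float = 0.5) -> float:
    """Return `penalty` if the Selector output is malformed, else 0.0.

    Assumed format: comma-separated non-negative integers such as "0,2,5".
    """
    parts = [p.strip() for p in output.split(",")]
    # Bad ID format: anything that is not a plain ASCII integer (also catches empty output).
    if not all(p.isascii() and p.isdigit() for p in parts):
        return penalty
    ids = [int(p) for p in parts]
    # Duplicate IDs.
    if len(set(ids)) != len(ids):
        return penalty
    # IDs that do not refer to any retrieved candidate.
    if any(i >= num_candidates for i in ids):
        return penalty
    return 0.0


def rewriter_penalty(sub_questions: List[str], max_sub_questions: int = 4,
                     penalty: float = 0.5) -> float:
    """Return `penalty` when the rewriter produces more sub-questions than allowed."""
    return penalty if len(sub_questions) > max_sub_questions else 0.0
```

Logging how often each check fires during training is a cheap way to spot the instability described above before it shows up in the F1 curve.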

