Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.5
Citation Count
1
Why It Matters For Business
Jointly training retrieval pipeline modules toward the final answer quality raises factuality and gives consistent F1 gains, with no extra inference cost, making it attractive when accuracy matters more than extra training compute.
Summary TLDR
MMOA-RAG models a Retrieval-Augmented Generation (RAG) pipeline as a cooperative multi-agent system. The Query Rewriter (QR), Selector and Generator are treated as LLM agents and jointly fine-tuned with Multi-Agent PPO (MAPPO), using the final-answer F1 as a shared reward. On HotpotQA, 2WikiMultihopQA and AmbigQA (Llama-3-8B-Instruct backbone; Contriever, BGE and E5 retrievers), MMOA-RAG improves F1 by roughly 1.8 to 2.7 points over strong baselines. Warm-start SFT, parameter sharing and simple penalties stabilize training. Training time grows linearly with the number of agents; inference cost is unchanged.
Problem Statement
RAG pipelines consist of several modules that are usually trained separately. Independent supervised fine-tuning leaves each module's objective misaligned with final answer quality. Prior RL fixes focus on two-module pipelines or single-module updates and fail to model the complex, collaborative interactions among multiple modules in modern RAG systems.
Main Contribution
Formulate multi-module RAG as a cooperative multi-agent RL problem where QR, Selector and Generator are agents.
Introduce MMOA-RAG: warm-start SFT then joint MAPPO training with shared final-answer F1 reward and simple output penalties.
Show empirical gains on three QA datasets, with ablations indicating that joint optimization with MAPPO generalizes across retrievers and pipeline configurations.
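The shared final-answer reward described above can be sketched as follows. This is a minimal illustration, assuming a SQuAD-style token-overlap F1 with simple lowercasing and whitespace tokenization; the function names `token_f1` and `shared_reward` and the agent labels are hypothetical and not taken from the paper.

```python
from collections import Counter

# Hypothetical agent labels for the three trainable modules.
AGENTS = ("query_rewriter", "selector", "generator")


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def shared_reward(prediction: str, gold: str) -> dict[str, float]:
    """Give every agent the same final-answer F1 as its reward."""
    score = token_f1(prediction, gold)
    return {agent: score for agent in AGENTS}
```

Because every agent receives the same scalar, an upstream agent such as the Query Rewriter is only rewarded when its output helps the Generator produce a better final answer.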
Key Findings
Joint multi-agent optimization improves final-answer F1 over strong baselines on three QA benchmarks.
MAPPO on top of SFT yields larger gains than SFT alone, especially on multi-hop tasks.
Ablations show optimizing all three agents together gives the best results and fastest training convergence.
The method generalizes across different retrievers and to out-of-domain test sets.
Results
F1
Who Should Care
What To Try In 7 Days
Warm-start QR, Selector and Generator with SFT on your QA pairs.
Freeze your retriever; run parameter-shared MAPPO to jointly train the three modules using answer F1 as reward.
Measure final F1/EM and compare to SFT-only; check convergence curves for stability.
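For the comparison step, a small evaluation harness along these lines may be enough. It is a sketch under stated assumptions: the normalization (lowercasing, dropping punctuation and English articles) follows common QA evaluation practice rather than anything specified here, and `normalize`, `exact_match`, `mean_score` and `compare` are hypothetical helper names.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def mean_score(predictions, golds, metric=exact_match) -> float:
    """Average a per-example metric over aligned predictions and golds."""
    scores = [metric(p, g) for p, g in zip(predictions, golds)]
    return sum(scores) / len(scores)


def compare(sft_preds, joint_preds, golds, metric=exact_match) -> float:
    """Joint-minus-SFT mean score; positive values favour joint training."""
    return mean_score(joint_preds, golds, metric) - mean_score(sft_preds, golds, metric)
```

Any per-example metric with the same `(prediction, gold)` signature, such as a token-level F1, can be passed as `metric` to report both numbers from the same predictions.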
Agent Features
Memory
- Retrieval memory: external Wikipedia passages
Planning
- Cooperative multi-agent optimization with shared reward
Tool Use
- Fixed dense retriever as environment (Contriever/BGE/E5)
Frameworks
- MAPPO
- PPO
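For readers unfamiliar with PPO, the standard clipped surrogate objective that PPO-family methods maximize can be sketched per sample as below. This shows only the generic PPO term; MAPPO-specific details such as the critic design are not shown, and the default `clip_eps=0.2` is a commonly used value assumed here, not a setting reported in this summary.

```python
import math


def ppo_clip_objective(logp_new: float, logp_old: float,
                       advantage: float, clip_eps: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate (to be maximized).

    logp_new / logp_old are log-probabilities of the same action under
    the current and the rollout policy; advantage is its estimated advantage.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes the incentive to push the ratio
    # outside [1 - clip_eps, 1 + clip_eps] in the direction that
    # would otherwise increase the objective.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage the gain is capped once the ratio exceeds `1 + clip_eps`; with a negative advantage the unclipped, more pessimistic term is kept.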
Is Agentic
true
Architectures
- LLM-based agents (shared LLM with different prompts)
Collaboration
- Reward sharing (global F1)
- Parameter sharing (single policy for all agents)
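The shared-LLM arrangement above, where one policy plays all three roles and only the prompt changes, might look roughly like this. The prompt wording, the `SharedPolicy` class and the `llm` callable are all hypothetical placeholders for illustration; the actual prompts and model interface are not given in this summary.

```python
from typing import Callable

# Hypothetical role prompts; the real templates are not specified here.
ROLE_PROMPTS = {
    "query_rewriter": "Rewrite the question into sub-questions.\nQuestion: {question}",
    "selector": (
        "List the IDs of useful documents.\n"
        "Question: {question}\nDocuments:\n{documents}"
    ),
    "generator": (
        "Answer using the selected documents.\n"
        "Question: {question}\nDocuments:\n{documents}"
    ),
}


class SharedPolicy:
    """One set of weights serving every agent role via role-specific prompts."""

    def __init__(self, llm: Callable[[str], str]):
        # `llm` stands in for any text-in, text-out model call.
        self.llm = llm

    def act(self, role: str, **fields: str) -> str:
        prompt = ROLE_PROMPTS[role].format(**fields)
        return self.llm(prompt)
```

A usage example with a stub model: `SharedPolicy(lambda p: p.upper()).act("query_rewriter", question="Who wrote Dune?")` routes the question through the rewriter prompt using the same callable that would serve the Selector and Generator.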
Optimization Features
Token Efficiency
- Selector restricts candidate IDs to reduce action space
Infra Optimization
- Training time increases linearly with agent count; manageable for three agents
Model Optimization
- Parameter sharing across agents to reduce model size
System Optimization
- Minibatch parallel rollouts to speed training
Training Optimization
- SFT
- MAPPO joint optimization with shared reward
- Penalty terms to constrain outputs (length, duplicates, format)
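The penalty terms listed above could be combined with the shared reward along these lines. This is only a sketch: the comma-separated ID format, the penalty magnitude of 0.5, the word-count length check and all function names are assumptions made for illustration, not values or rules reported for MMOA-RAG.

```python
import re


def selector_penalty(output: str, num_candidates: int, penalty: float = 0.5) -> float:
    """Penalize Selector output that is not a list of unique, in-range IDs.

    Assumes (hypothetically) that IDs are written as comma-separated integers.
    """
    if not re.fullmatch(r"\s*\d+(\s*,\s*\d+)*\s*", output):
        return penalty  # wrong format
    ids = [int(tok) for tok in output.split(",")]
    if len(set(ids)) != len(ids):
        return penalty  # duplicate IDs
    if any(i >= num_candidates for i in ids):
        return penalty  # ID outside the candidate set
    return 0.0


def length_penalty(output: str, max_words: int, penalty: float = 0.5) -> float:
    """Penalize outputs longer than max_words whitespace-separated words."""
    return penalty if len(output.split()) > max_words else 0.0


def agent_reward(shared_f1: float, penalty: float) -> float:
    """Per-agent reward: the shared F1 minus that agent's own penalty."""
    return shared_f1 - penalty
```

Keeping the penalty agent-specific while the F1 term stays shared means a malformed Selector output lowers only the Selector's reward, not that of the other agents.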
Inference Optimization
- No extra inference overhead compared to baseline RAG
Reproducibility
Data Urls
- HotpotQA: https://hotpotqa.github.io/
- 2WikiMultihopQA: https://github.com/xxx (paper references)
- AmbigQA: https://github.com/google-research-datasets/ambigqa
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- The retriever is kept fixed; the method does not optimize the first-stage retriever.
- Training cost grows linearly with number of agents; costs may be noticeable beyond three agents.
- Parameter sharing might not scale or remain stable when many heterogeneous agents are added.
- Reward is based solely on F1; may favor surface-level matches over nuanced faithfulness.
When Not To Use
- When you need to jointly train or change the first-stage retriever.
- When the training budget is extremely limited or fast training turnaround is required.
- When final evaluation depends on metrics other than F1 (e.g., factual grounding, runtime cost), unless the reward is redesigned.
Failure Modes
- Agents learn degenerate behaviors (excessive sub-questions) and incur penalties.
- Selector outputs bad ID formats or duplicates, triggering negative rewards and unstable training.
- Overfitting to F1 metric that may not capture factual correctness or hallucinations.
- Parameter sharing causes conflicts between functions if agent roles diverge strongly.
Core Entities
Models
- MMOA-RAG
- Llama-3-8B-Instruct
- Contriever
- BGE
- E5
- MAPPO
Metrics
- Exact Match (EM)
- Accuracy
- F1 score
Datasets
- HotpotQA
- 2WikiMultihopQA
- AmbigQA
- Wikipedia passages (corpus for retrieval)
Benchmarks
- HotpotQA
- 2WikiMultihopQA
- AmbigQA
Context Entities
Models
- SELF-RAG
- RetRobust
- Rewrite-Retrieve-Read
- BGM
- RAG-DDR

