Treat RAG modules as cooperative RL agents (MAPPO) to raise final-answer F1.

Overview

Decision Snapshot: Needs Validation

The approach is practical for pipelines where the retriever can stay fixed. Expect a modest increase in training cost and consistent F1 gains on multi-hop QA when the modules are jointly optimized.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Yiqun Chen, Lingyong Yan, Weiwei Sun, Xinyu Ma, Yi Zhang, Shuaiqiang Wang, Dawei Yin, Yiming Yang, Jiaxin Mao

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Jointly training the modules of a retrieval pipeline toward final-answer quality raises factuality and gives consistent F1 gains with no extra inference cost. That makes it attractive when accuracy matters more than added training compute.

Who Should Care

ML Engineer, Product Manager, CTO, Engineering Lead

Summary TLDR

MMOA-RAG models a Retrieval-Augmented Generation (RAG) pipeline as a cooperative multi-agent system. The Query Rewriter (QR), Selector, and Generator are treated as LLM agents and jointly fine-tuned with Multi-Agent PPO (MAPPO), using the final-answer F1 as a shared reward. On HotpotQA, 2WikiMultihopQA, and AmbigQA (Llama-3-8B-Instruct backbone; Contriever, BGE, or E5 retrievers), MMOA-RAG improves F1 by about 1.8 to 2.7 points over the best baselines. Warm-start SFT, parameter sharing, and simple penalties stabilize training. Training time grows linearly with the number of agents; inference cost is unchanged.

Problem Statement

RAG pipelines consist of several modules that are usually trained separately. Independent supervised fine-tuning leaves each module's objective misaligned with final-answer quality. Prior RL fixes focus on two-module pipelines or single-module updates and fail to model the complex, collaborative interactions among multiple modules in modern RAG systems.

Main Contribution

Formulate multi-module RAG as a cooperative multi-agent RL problem where QR, Selector and Generator are agents.

Introduce MMOA-RAG: warm-start SFT then joint MAPPO training with shared final-answer F1 reward and simple output penalties.
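The shared reward described above can be sketched as a token-level F1 on the final answer, minus a per-agent penalty. This is a minimal illustration, not the authors' implementation: the SQuAD-style normalization, the function names `token_f1` and `agent_reward`, and the penalty value of 0.5 are all assumptions.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer and a gold answer."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def agent_reward(prediction: str, gold: str, penalty: float = 0.0) -> float:
    """Shared F1 term minus an agent-specific penalty (hypothetical form)."""
    return token_f1(prediction, gold) - penalty


# Every agent sees the same F1; only the penalty differs per agent.
shared = token_f1("The Eiffel Tower", "Eiffel Tower")
rewards = {
    "query_rewriter": agent_reward("The Eiffel Tower", "Eiffel Tower"),
    "selector": agent_reward("The Eiffel Tower", "Eiffel Tower", penalty=0.5),
    "generator": agent_reward("The Eiffel Tower", "Eiffel Tower"),
}
```

Because the F1 term is identical for all three agents, an agent can only raise its reward by improving the final answer or by avoiding its own penalty.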

Key Findings

Joint multi-agent optimization improves final-answer F1 over strong baselines on three QA benchmarks.

Numbers: HotpotQA F1 +1.80; 2Wiki F1 +1.89; AmbigQA F1 +2.67 (Table 1).

Practical Use: If you jointly train QR+Selector+Generator with MAPPO, expect roughly +1 to +3 percentage points of F1 on similar QA tasks.

Evidence Ref: Table 1

MAPPO on top of SFT yields larger gains than SFT alone, especially on multi-hop tasks.

Numbers: MAPPO vs SFT: HotpotQA F1 +3.60; 2Wiki F1 +3.43 (Table 2).

Practical Use: Warm-start each module with SFT, then run MAPPO joint training to get an additional ~3 pp F1 for multi-hop QA.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
F1	48.29	Best baseline F1 46.49 (RetRobust/BGM)	+1.80	HotpotQA	Table 1: MMOA-RAG F1 48.29 vs baselines	Table 1
F1	46.40	Best baseline F1 44.51	+1.89	2WikiMultihopQA	Table 1: MMOA-RAG F1 46.40 vs baselines	Table 1
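The deltas in the table can be rechecked directly from the reported values. The short script below only redoes that arithmetic; the relative-gain figure is derived here for convenience and is not a number quoted in the source.

```python
# F1 values copied from the Results table above.
results = {
    "HotpotQA": {"mmoa_rag_f1": 48.29, "best_baseline_f1": 46.49},
    "2WikiMultihopQA": {"mmoa_rag_f1": 46.40, "best_baseline_f1": 44.51},
}

deltas = {}
for dataset, row in results.items():
    # Absolute gain in F1 points, rounded to the table's precision.
    delta = round(row["mmoa_rag_f1"] - row["best_baseline_f1"], 2)
    # Relative gain over the baseline, in percent (derived, not reported).
    relative = round(100 * delta / row["best_baseline_f1"], 1)
    deltas[dataset] = (delta, relative)
    print(f"{dataset}: +{delta:.2f} F1 ({relative:.1f}% relative)")
```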

What To Try In 7 Days

Warm-start QR, Selector and Generator with SFT on your QA pairs.

Freeze your retriever; run parameter-shared MAPPO to jointly train the three modules using answer F1 as reward.

Measure final F1/EM and compare to SFT-only; check convergence curves for stability.
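For the last step, a small comparison harness is enough to start. The sketch below scores an SFT-only run against an SFT+MAPPO run with Exact Match; the predictions are made-up toy strings, and `compare_runs` is a hypothetical helper name. Any metric with the same `(prediction, gold) -> float` shape, such as token-level F1, can be passed in instead.

```python
from statistics import mean
from typing import Callable, Sequence


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())


def compare_runs(
    baseline: Sequence[str],
    candidate: Sequence[str],
    golds: Sequence[str],
    metric: Callable[[str, str], float],
) -> dict[str, float]:
    """Average a metric over two runs and report the difference."""
    base = mean(metric(p, g) for p, g in zip(baseline, golds))
    cand = mean(metric(p, g) for p, g in zip(candidate, golds))
    return {"baseline": base, "candidate": cand, "delta": cand - base}


# Toy, made-up predictions used only to exercise the function.
golds = ["Paris", "1969", "Marie Curie", "Nile"]
sft_only = ["Paris", "1968", "Marie Curie", "Amazon"]
sft_plus_mappo = ["paris", "1969", "Marie Curie", "Amazon"]

report = compare_runs(sft_only, sft_plus_mappo, golds, exact_match)
print(report)
```

Run the same comparison on a held-out split at several checkpoints to see whether the gain is stable rather than a single lucky evaluation.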

Agent Features

Memory

Retrieval memory: external Wikipedia passages

Planning

Cooperative multi-agent optimization with shared reward

Tool Use

Fixed dense retriever as environment (Contriever/BGE/E5)

Frameworks

MAPPO, PPO

Is Agentic

Yes

Architectures

LLM-based agents (shared LLM with different prompts)

Collaboration

Reward sharing (global F1)

Parameter sharing (single policy for all agents)
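Parameter sharing with role-specific prompts can be pictured as one policy object that is called three times with different instructions. The sketch below is illustrative only: the prompt wording, the `SharedPolicy` class, and the `echo_llm` stand-in are assumptions, and a real setup would wrap the fine-tuned LLM instead.

```python
from typing import Callable

# Hypothetical role prompts; the paper's actual prompts may differ.
ROLE_PROMPTS = {
    "query_rewriter": "Rewrite the question into sub-questions:\n{input}",
    "selector": "Return the IDs of the useful documents:\n{input}",
    "generator": "Answer the question using the documents:\n{input}",
}


class SharedPolicy:
    """One set of weights serves all agents; only the prompt changes."""

    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm

    def act(self, role: str, observation: str) -> str:
        prompt = ROLE_PROMPTS[role].format(input=observation)
        return self.llm(prompt)


def echo_llm(prompt: str) -> str:
    """Stand-in for an LLM call: returns the instruction line of the prompt."""
    return prompt.splitlines()[0]


policy = SharedPolicy(echo_llm)
question = "Who designed the Eiffel Tower?"
outputs = {role: policy.act(role, question) for role in ROLE_PROMPTS}
```

Since every role routes through the same `llm`, a gradient update from any agent's trajectory changes the behavior of all three.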

Optimization Features

Token Efficiency

Selector restricts candidate IDs to reduce action space
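One common way to restrict an action space like this is to mask the logits of every token that is not a valid candidate ID before sampling. The sketch below shows that idea on a toy vocabulary; whether MMOA-RAG implements the restriction exactly this way is an assumption, and the token names and scores are invented.

```python
import math


def mask_logits(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Set the logit of every disallowed token to -inf."""
    return {
        tok: (score if tok in allowed else -math.inf)
        for tok, score in logits.items()
    }


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Numerically stable softmax; -inf logits get probability 0."""
    top = max(logits.values())
    exps = {tok: math.exp(score - top) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


# Toy logits over a tiny vocabulary (invented values).
logits = {"0": 1.0, "1": 2.0, "2": 0.5, "the": 3.0, "Document": 2.5}
allowed_ids = {"0", "1", "2"}  # IDs of the retrieved candidate documents

probs = softmax(mask_logits(logits, allowed_ids))
best = max(probs, key=probs.get)
```

Without the mask the highest-scoring token here would be "the"; with it, all probability mass sits on valid document IDs.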

Infra Optimization

Training time increases linearly with agent count; manageable for three agents

Model Optimization

Parameter sharing across agents to reduce model size

System Optimization

Minibatch parallel rollouts to speed training

Training Optimization

SFT

MAPPO joint optimization with shared reward

Penalty terms to constrain outputs (length, duplicates, format)
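For reference, MAPPO builds on the standard PPO clipped surrogate objective, written below in its usual textbook form. The symbols are the conventional ones ($o_t$ for an agent's observation, $a_t$ for its action, $\hat{A}_t$ for the advantage estimate, $\epsilon$ for the clip range) and are not taken from the summary; the paper's exact loss, including how the penalties enter the reward, may differ.

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid o_t)}{\pi_{\theta_{\text{old}}}(a_t \mid o_t)}, \qquad
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
\operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big) \right]
```

In the multi-agent setting the advantage is typically estimated with a centralized critic, and with parameter sharing the same $\theta$ is updated from all agents' trajectories.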

Inference Optimization

No extra inference overhead compared to baseline RAG

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/chenyiqun/MMOA-RAG

Data URLs

HotpotQA: https://hotpotqa.github.io/

2WikiMultihopQA: https://github.com/xxx (paper references)

AmbigQA: https://github.com/google-research-datasets/ambigqa

Risks & Boundaries

Limitations

The retriever is kept fixed: the method does not optimize the first-stage retriever.

Training cost grows linearly with number of agents; costs may be noticeable beyond three agents.

When Not To Use

When you need to jointly train or change the first-stage retriever.

When the training budget is extremely limited or fast training turnaround is required.

Failure Modes

Agents learn degenerate behaviors (excessive sub-questions) and incur penalties.

Selector outputs bad ID formats or duplicates, triggering negative rewards and unstable training.
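A format check of the kind implied by the Selector failure mode can be sketched as below. The expected output format (comma-separated integer IDs), the function name `selector_penalty`, and the penalty value of 0.5 are all assumptions made for illustration; the paper's actual format and penalty magnitudes are not given in this summary.

```python
import re

# Assumed format: integer IDs separated by commas, e.g. "0,2,5".
ID_PATTERN = re.compile(r"^\d+(,\d+)*$")


def selector_penalty(output: str, num_candidates: int, penalty: float = 0.5) -> float:
    """Return a penalty if the Selector output is malformed, else 0.0."""
    text = output.replace(" ", "")
    if not ID_PATTERN.match(text):
        return penalty  # not a comma-separated list of integers
    ids = [int(part) for part in text.split(",")]
    if len(ids) != len(set(ids)):
        return penalty  # duplicate IDs
    if any(doc_id >= num_candidates for doc_id in ids):
        return penalty  # ID outside the candidate pool
    return 0.0


checks = {
    "valid": selector_penalty("0, 2, 5", num_candidates=10),
    "duplicate": selector_penalty("0,0,2", num_candidates=10),
    "malformed": selector_penalty("doc 1 and 3", num_candidates=10),
    "out_of_range": selector_penalty("0,12", num_candidates=10),
}
```

Logging how often each branch fires during training gives an early signal of the instability described above.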

Core Entities

Models

MMOA-RAG, Llama-3-8B-Instruct, Contriever, BGE, E5, MAPPO

Metrics

Exact Match (EM), Accuracy, F1 score

Datasets

HotpotQA, 2WikiMultihopQA, AmbigQA, Wikipedia passages (corpus for retrieval)

Benchmarks

HotpotQA, 2WikiMultihopQA, AmbigQA

Context Entities

Models

SELF-RAG, RetRobust, Rewrite-Retrieve-Read, BGM, RAG-DDR
