Treat RAG modules as cooperative RL agents (MAPPO) to raise final-answer F1.

January 25, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The approach is practical for pipelines where the retriever can stay fixed. Expect a modest increase in training cost and consistent F1 gains on multi-hop QA when the modules are jointly optimized.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Yiqun Chen, Lingyong Yan, Weiwei Sun, Xinyu Ma, Yi Zhang, Shuaiqiang Wang, Dawei Yin, Yiming Yang, Jiaxin Mao

Why It Matters For Business

Jointly training the modules of a retrieval pipeline toward final-answer quality raises factuality and gives consistent F1 gains with no extra inference cost. That makes it attractive when accuracy matters more than the added training compute.

Summary TLDR

MMOA-RAG models a Retrieval-Augmented Generation (RAG) pipeline as a cooperative multi-agent system. The Query Rewriter (QR), Selector and Generator are treated as LLM agents and jointly fine-tuned with Multi-Agent PPO (MAPPO), using the final-answer F1 as a shared reward. On HotpotQA, 2WikiMultihopQA and AmbigQA (Llama-3-8B-Instruct backbone; Contriever, BGE and E5 retrievers), MMOA-RAG improves F1 by roughly 1.8 to 2.7 points over strong baselines. Warm-start SFT, parameter sharing and simple penalties stabilize training. Training time grows linearly with the number of agents; inference cost is unchanged.
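The flow described above can be sketched as a single rollout in which every agent receives the same final-answer reward. This is a toy illustration under stated assumptions, not the paper's implementation: the function names (`rewrite_query`, `retrieve`, `select`, `generate`, `rollout`), the three-passage corpus, the word-overlap retriever and the exact-match reward are all hypothetical stand-ins. In MMOA-RAG the three agents are LLMs, the retriever is a fixed dense model, and the reward is F1.

```python
from typing import Dict, List

# Hypothetical toy corpus standing in for the Wikipedia passages.
CORPUS: Dict[int, str] = {
    0: "Paris is the capital of France.",
    1: "Berlin is the capital of Germany.",
    2: "The Seine flows through Paris.",
}

def _words(text: str) -> set:
    return set(text.lower().replace("?", "").replace(".", "").split())

def rewrite_query(question: str) -> List[str]:
    """Query Rewriter agent (stand-in): returns the question unchanged."""
    return [question]

def retrieve(sub_questions: List[str], k: int = 2) -> List[int]:
    """Fixed retriever: part of the environment, never updated by RL."""
    query = _words(" ".join(sub_questions))
    ranked = sorted(CORPUS, key=lambda i: len(query & _words(CORPUS[i])), reverse=True)
    return ranked[:k]

def select(question: str, candidate_ids: List[int]) -> List[int]:
    """Selector agent (stand-in): keeps only the top-ranked candidate."""
    return candidate_ids[:1]

def generate(question: str, doc_ids: List[int]) -> str:
    """Generator agent (stand-in): answers with the first word of the kept passage."""
    return CORPUS[doc_ids[0]].split()[0]

def rollout(question: str, gold: str) -> Dict[str, float]:
    sub_questions = rewrite_query(question)
    candidates = retrieve(sub_questions)
    kept = select(question, candidates)
    answer = generate(question, kept)
    # Stand-in for final-answer F1: case-insensitive exact match.
    reward = 1.0 if answer.lower() == gold.lower() else 0.0
    # Shared reward: every agent is credited with the same final-answer score.
    return {agent: reward for agent in ("query_rewriter", "selector", "generator")}

rewards = rollout("What is the capital of France?", "Paris")
print(rewards)
```

The point of the sketch is the last line of `rollout`: no agent has a private objective, so an improvement in any module only counts if it improves the final answer.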

Problem Statement

RAG pipelines consist of several modules that are usually trained separately. Independent supervised fine-tuning leaves each module's objective misaligned with final-answer quality. Prior RL fixes focus on two-module pipelines or single-module updates and fail to model the collaborative interactions among the multiple modules in modern RAG systems.

Main Contribution

Formulate multi-module RAG as a cooperative multi-agent RL problem where QR, Selector and Generator are agents.

Introduce MMOA-RAG: warm-start SFT followed by joint MAPPO training with a shared final-answer F1 reward and simple output penalties.
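For readers unfamiliar with answer-level F1, a minimal sketch of a token-overlap F1 follows. It assumes the common SQuAD-style normalization (lowercasing, stripping punctuation and English articles); the exact normalization used in the paper is not stated here, and the helper names `normalize` and `token_f1` are illustrative.

```python
import re
import string
from collections import Counter
from typing import List

def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between prediction and gold."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("The Eiffel Tower", "Eiffel Tower"))
print(token_f1("Paris France", "Paris"))
```

Because the score is computed on the final answer only, it gives partial credit for partly correct answers, which makes it a denser training signal than exact match.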

Key Findings

Joint multi-agent optimization improves final-answer F1 over strong baselines on three QA benchmarks.

Numbers: HotpotQA F1 +1.80; 2Wiki F1 +1.89; AmbigQA F1 +2.67 (Table 1).

Practical Use: If you jointly train QR+Selector+Generator with MAPPO, expect roughly +1 to +3 percentage points F1 on similar QA tasks.

Evidence Ref: Table 1

MAPPO on top of SFT yields larger gains than SFT alone, especially on multi-hop tasks.

Numbers: MAPPO vs SFT: HotpotQA F1 +3.60; 2Wiki F1 +3.43 (Table 2).

Practical Use: Warm-start each module with SFT, then run MAPPO joint training to get an additional ~3 pp F1 for multi-hop QA.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
F1 | 48.29 | Best baseline F1 46.49 (RetRobust/BGM) | +1.80 | HotpotQA | Table 1: MMOA-RAG F1 48.29 vs baselines | Table 1
F1 | 46.40 | Best baseline F1 44.51 | +1.89 | 2WikiMultihopQA | Table 1: MMOA-RAG F1 46.40 vs baselines | Table 1
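The deltas in the two rows above can be re-derived from the reported values. The short check below also computes the relative gain over the best baseline, which is not reported in the source and is included only as a convenience; the variable names are illustrative.

```python
# (dataset, MMOA-RAG F1, best baseline F1), copied from the results rows above.
rows = [
    ("HotpotQA", 48.29, 46.49),
    ("2WikiMultihopQA", 46.40, 44.51),
]

# Absolute gain in F1 points, rounded to the precision used in the table.
deltas = {name: round(ours - base, 2) for name, ours, base in rows}

# Relative gain over the baseline in percent (derived here, not from the paper).
relative = {name: round(100 * (ours - base) / base, 1) for name, ours, base in rows}

for name, _, _ in rows:
    print(f"{name}: +{deltas[name]:.2f} F1 points ({relative[name]}% relative)")
```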

What To Try In 7 Days

Warm-start QR, Selector and Generator with SFT on your QA pairs.

Freeze your retriever; run parameter-shared MAPPO to jointly train the three modules using answer F1 as reward.

Measure final F1/EM and compare to SFT-only; check convergence curves for stability.
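The comparison in the last step can be sketched as a small evaluation harness. The example below uses exact match (EM) only, and the predictions are invented toy data, not results from the paper; `exact_match` and `mean_score` are hypothetical helper names, and any other per-example metric function can be passed to `mean_score` in place of EM.

```python
import string
from typing import Callable, List

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the strings match after lowercasing and stripping punctuation."""
    strip = str.maketrans("", "", string.punctuation)
    def norm(s: str) -> str:
        return " ".join(s.lower().translate(strip).split())
    return float(norm(prediction) == norm(gold))

def mean_score(metric: Callable[[str, str], float],
               predictions: List[str], golds: List[str]) -> float:
    """Average a per-example metric over a dataset, as a percentage."""
    return 100.0 * sum(metric(p, g) for p, g in zip(predictions, golds)) / len(golds)

# Toy data for illustration only.
golds = ["Paris", "1969", "Marie Curie"]
sft_only = ["paris", "1968", "Pierre Curie"]
sft_plus_mappo = ["Paris.", "1969", "Pierre Curie"]

sft_em = mean_score(exact_match, sft_only, golds)
mappo_em = mean_score(exact_match, sft_plus_mappo, golds)
delta = mappo_em - sft_em
print(f"SFT-only EM {sft_em:.2f}, SFT+MAPPO EM {mappo_em:.2f}, delta {delta:+.2f}")
```

Run both checkpoints on the same held-out questions, so the delta reflects the training method rather than a difference in evaluation data.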

Agent Features

Memory
Retrieval memory: external Wikipedia passages
Planning
Cooperative multi-agent optimization with shared reward
Tool Use
Fixed dense retriever as environment (Contriever/BGE/E5)
Frameworks
MAPPO, PPO
Is Agentic
Yes

Architectures
LLM-based agents (shared LLM with different prompts)
Collaboration
Reward sharing (global F1); parameter sharing (single policy for all agents)
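Parameter sharing with role-specific prompts can be sketched as a single policy object that serves all three agents. The prompt wording, the `SharedPolicy` class and the `fake_llm` stand-in are hypothetical; the source only states that one LLM is shared and that the agents differ by prompt.

```python
from typing import Callable, List

# Hypothetical prompt templates; the paper's actual prompts are not given here.
ROLE_PROMPTS = {
    "query_rewriter": "Rewrite the question into sub-questions:\n{question}",
    "selector": "Question: {question}\nCandidates:\n{candidates}\nUseful document IDs:",
    "generator": "Question: {question}\nDocuments:\n{documents}\nAnswer:",
}

class SharedPolicy:
    """One set of weights; the role is chosen purely by the prompt."""

    def __init__(self, llm: Callable[[str], str]) -> None:
        self.llm = llm

    def act(self, role: str, **fields: str) -> str:
        prompt = ROLE_PROMPTS[role].format(**fields)
        return self.llm(prompt)

# Stand-in for the shared LLM: records each prompt and returns a dummy string.
seen_prompts: List[str] = []

def fake_llm(prompt: str) -> str:
    seen_prompts.append(prompt)
    return f"<output {len(seen_prompts)}>"

policy = SharedPolicy(fake_llm)
question = "Who wrote Dune?"
sub_questions = policy.act("query_rewriter", question=question)
doc_ids = policy.act("selector", question=question, candidates="0: ...\n1: ...")
answer = policy.act("generator", question=question, documents="...")
```

With this layout a gradient update from any agent's trajectory changes the weights used by all three, which is why only one model needs to be held in memory.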

Optimization Features

Token Efficiency
Selector restricts candidate IDs to reduce action space
Infra Optimization
Training time increases linearly with agent count; manageable for three agents
Model Optimization
Parameter sharing across agents to reduce model size
System Optimization
Minibatch parallel rollouts to speed training
Training Optimization
SFT; MAPPO joint optimization with shared reward; penalty terms to constrain outputs (length, duplicates, format)
Inference Optimization
No extra inference overhead compared to baseline RAG
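One way to restrict the Selector to candidate IDs is to mask every other token before sampling. The sketch below shows that idea with a toy four-token vocabulary; it is an assumption about how such a restriction could be implemented, not a description of the paper's code, and `mask_to_allowed` and `softmax` are illustrative helpers.

```python
import math
from typing import Dict, Set

def mask_to_allowed(logits: Dict[str, float], allowed: Set[str]) -> Dict[str, float]:
    """Set the logit of every token outside the allowed set to -inf."""
    return {tok: (score if tok in allowed else -math.inf) for tok, score in logits.items()}

def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable softmax; assumes at least one finite logit."""
    top = max(logits.values())
    exps = {tok: math.exp(score - top) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Toy next-token logits: the unconstrained model prefers the word "the".
logits = {"0": 1.0, "1": 0.5, "the": 3.0, ",": 0.2}
allowed = {"0", "1", ","}  # candidate IDs plus the separator

probs = softmax(mask_to_allowed(logits, allowed))
print(probs)
```

After masking, all probability mass sits on ID and separator tokens, so the policy only has to learn which IDs to pick rather than also learning the output format.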

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Data URLs

HotpotQA: https://hotpotqa.github.io/
2WikiMultihopQA: https://github.com/xxx (paper references)
AmbigQA: https://github.com/google-research-datasets/ambigqa

Risks & Boundaries

Limitations

The retriever is kept fixed: the method does not optimize the first-stage retriever.

Training cost grows linearly with number of agents; costs may be noticeable beyond three agents.

When Not To Use

When you need to jointly train or change the first-stage retriever.

When the training budget is extremely limited or fast training turnaround is required.

Failure Modes

Agents learn degenerate behaviors (excessive sub-questions) and incur penalties.

Selector outputs bad ID formats or duplicates, triggering negative rewards and unstable training.
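The penalties behind these failure modes can be sketched as simple rule checks added to the shared F1 reward. Everything specific below is an assumption made for illustration: the penalty size of -0.5, the limits on sub-questions and answer length, the comma-separated ID format, and the choice to add each agent's own penalty to the shared reward.

```python
import re
from typing import Dict, List

PENALTY = -0.5  # hypothetical magnitude

def rewriter_penalty(sub_questions: List[str], max_sub_questions: int = 4) -> float:
    """Penalize excessive sub-questions."""
    return PENALTY if len(sub_questions) > max_sub_questions else 0.0

def selector_penalty(output: str, num_candidates: int) -> float:
    """Penalize malformed, duplicated or out-of-range document IDs."""
    compact = output.replace(" ", "")
    if not re.fullmatch(r"\d+(,\d+)*", compact):
        return PENALTY  # not a comma-separated list of integers
    ids = [int(part) for part in compact.split(",")]
    if len(set(ids)) != len(ids) or max(ids) >= num_candidates:
        return PENALTY
    return 0.0

def generator_penalty(answer: str, max_words: int = 30) -> float:
    """Penalize overly long answers."""
    return PENALTY if len(answer.split()) > max_words else 0.0

def agent_rewards(f1: float, sub_questions: List[str], selector_output: str,
                  num_candidates: int, answer: str) -> Dict[str, float]:
    """Shared F1 plus each agent's own penalty (assumed combination rule)."""
    return {
        "query_rewriter": f1 + rewriter_penalty(sub_questions),
        "selector": f1 + selector_penalty(selector_output, num_candidates),
        "generator": f1 + generator_penalty(answer),
    }

# The Selector repeats ID 2, so only its reward is reduced.
rewards = agent_rewards(0.8, ["q1", "q2"], "0,2,2", 5, "Paris")
print(rewards)
```

If penalties fire on a large share of rollouts, that is a sign the warm-start SFT stage has not yet taught the output format, and RL training is likely to be unstable.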

Core Entities

Models

MMOA-RAG, Llama-3-8B-Instruct, Contriever, BGE, E5, MAPPO

Metrics

Exact Match (EM), Accuracy, F1 score

Datasets

HotpotQA, 2WikiMultihopQA, AmbigQA, Wikipedia passages (corpus for retrieval)

Benchmarks

HotpotQA, 2WikiMultihopQA, AmbigQA

Context Entities

Models

SELF-RAG, RetRobust, Rewrite-Retrieve-Read, BGM, RAG-DDR