Treat RAG modules as cooperative RL agents (MAPPO) to raise final-answer F1.

January 25, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

1

Authors

Yiqun Chen, Lingyong Yan, Weiwei Sun, Xinyu Ma, Yi Zhang, Shuaiqiang Wang, Dawei Yin, Yiming Yang, Jiaxin Mao

Links

Abstract / PDF

Why It Matters For Business

Jointly training RAG pipeline modules toward final-answer quality raises factuality and gives consistent F1 gains with no extra inference cost, which makes it attractive when accuracy matters more than extra training compute.

Summary TLDR

MMOA-RAG models a Retrieval-Augmented Generation (RAG) pipeline as a cooperative multi-agent system. The Query Rewriter (QR), Selector and Generator are treated as LLM agents and jointly fine-tuned with Multi-Agent PPO (MAPPO), using the final-answer F1 as a shared reward. On HotpotQA, 2WikiMultihopQA and AmbigQA (Llama-3-8B-Instruct backbone, Contriever/BGE/E5 retrievers), MMOA-RAG improves F1 by roughly +1.8 to +2.7 points over strong baselines. Warm-start SFT, parameter sharing and simple penalties stabilize training. Training time grows linearly with the number of agents; inference cost is unchanged.

Problem Statement

RAG pipelines contain several modules that are trained separately. Independent supervised fine-tuning leaves each module's objective misaligned with final answer quality. Prior RL fixes focus on two-module pipelines or single-module updates and fail to model the collaborative interactions among multiple modules in modern RAG systems.

Main Contribution

Formulate multi-module RAG as a cooperative multi-agent RL problem where QR, Selector and Generator are agents.

Introduce MMOA-RAG: warm-start SFT then joint MAPPO training with shared final-answer F1 reward and simple output penalties.

Show empirical gains on three QA datasets, with ablations indicating that joint optimization with MAPPO generalizes across retrievers and pipeline configurations.
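The shared reward named above is the F1 of the final answer. A minimal sketch of a token-overlap F1, assuming the SQuAD-style normalization commonly used for open-domain QA (lowercasing, dropping punctuation and English articles); the summary does not state the exact normalization used, so the helper names `normalize` and `token_f1` and these details are assumptions for illustration.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer and a gold answer."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Because every agent receives this same scalar, a rewriter or selector action is rewarded only to the extent that it helps the generator produce a better final answer.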

Key Findings

Joint multi-agent optimization improves final-answer F1 over strong baselines on three QA benchmarks.

Numbers: HotpotQA F1 +1.80; 2Wiki F1 +1.89; AmbigQA F1 +2.67 (Table 1).

MAPPO on top of SFT yields larger gains than SFT alone, especially on multi-hop tasks.

Numbers: MAPPO vs SFT: HotpotQA F1 +3.60; 2Wiki F1 +3.43 (Table 2).

Ablations show optimizing all three agents together gives the best results and fastest training convergence.

Numbers: Full MMOA-RAG outperforms variants missing QR/S/G; reward curve converges faster (Figure 3).

Method generalizes across different retrievers and out-of-domain testing.

Numbers: Wins or ties over baselines with BGE/E5 retrievers (Table 7); OOD Hotpot→AmbigQA F1: 45.62 vs 44.08 baseline (Table 8).

Results

F1 (HotpotQA)

Value: 48.29

Baseline: best baseline F1 46.49 (RetRobust/BGM)

F1 (2WikiMultihopQA)

Value: 46.40

Baseline: best baseline F1 44.51

F1 (AmbigQA)

Value: 48.59

Baseline: best baseline F1 45.92
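As a quick arithmetic check, the three value/baseline pairs above reproduce the gains quoted under Key Findings. The mapping of rows to datasets in this sketch is an assumption, made by matching each difference to the quoted per-dataset gain.

```python
# (MMOA-RAG F1, best baseline F1) as listed in the Results section.
# Dataset labels are assumed from the order of the quoted gains.
results = {
    "HotpotQA": (48.29, 46.49),
    "2WikiMultihopQA": (46.40, 44.51),
    "AmbigQA": (48.59, 45.92),
}

# Absolute F1 gain per dataset, rounded to two decimals to absorb
# floating-point noise in the subtraction.
gains = {name: round(ours - base, 2) for name, (ours, base) in results.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} F1")
```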

Who Should Care

What To Try In 7 Days

Warm-start QR, Selector and Generator with SFT on your QA pairs.

Freeze your retriever; run parameter-shared MAPPO to jointly train the three modules using answer F1 as reward.

Measure final F1/EM and compare to SFT-only; check convergence curves for stability.
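The three steps above revolve around one rollout loop: rewrite, retrieve with the frozen retriever, select, generate, then score the answer once and hand the same reward to every agent. A minimal sketch of that loop, assuming each module is exposed as a plain callable; `Rollout`, `run_episode` and all parameter names are hypothetical and not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rollout:
    sub_questions: list[str]
    selected_ids: list[int]
    answer: str
    rewards: dict[str, float]  # one entry per agent, all equal to the shared reward


def run_episode(
    question: str,
    gold: str,
    rewrite: Callable[[str], list[str]],           # Query Rewriter agent
    retrieve: Callable[[str], list[str]],          # frozen retriever (environment)
    select: Callable[[str, list[str]], list[int]], # Selector agent
    generate: Callable[[str, list[str]], str],     # Generator agent
    score: Callable[[str, str], float],            # e.g. answer F1
) -> Rollout:
    sub_questions = rewrite(question)

    # Pool candidates from every sub-question, de-duplicated in retrieval order.
    candidates: list[str] = []
    for sub_question in sub_questions:
        for doc in retrieve(sub_question):
            if doc not in candidates:
                candidates.append(doc)

    # Keep only IDs that point at a real candidate.
    selected_ids = [i for i in select(question, candidates) if 0 <= i < len(candidates)]
    context = [candidates[i] for i in selected_ids]
    answer = generate(question, context)

    # One scalar for the whole pipeline, shared by all three agents.
    shared = score(answer, gold)
    rewards = {agent: shared for agent in ("query_rewriter", "selector", "generator")}
    return Rollout(sub_questions, selected_ids, answer, rewards)
```

Running the same loop with SFT-only checkpoints and with MAPPO-trained checkpoints gives the side-by-side F1/EM comparison suggested above.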

Agent Features

Memory

  • Retrieval memory: external Wikipedia passages

Planning

  • Cooperative multi-agent optimization with shared reward

Tool Use

  • Fixed dense retriever as environment (Contriever/BGE/E5)

Frameworks

  • MAPPO
  • PPO
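For readers unfamiliar with the update rule, PPO maximizes a clipped surrogate objective, and MAPPO applies the same update to each agent with advantages estimated by a shared critic. The sketch below shows only the standard clipped surrogate in plain Python; it is not the paper's training code, and `ppo_clip_objective` and the default `clip_eps=0.2` are illustrative assumptions.

```python
import math
from typing import Sequence


def ppo_clip_objective(
    new_logprobs: Sequence[float],
    old_logprobs: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,
) -> float:
    """Mean clipped surrogate objective over a batch of actions (to be maximized)."""
    if not (len(new_logprobs) == len(old_logprobs) == len(advantages)):
        raise ValueError("inputs must have the same length")
    if not advantages:
        raise ValueError("batch must not be empty")

    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        # Probability ratio between the updated and the rollout policy.
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum removes any incentive to move the ratio
        # outside the [1 - eps, 1 + eps] trust region.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

With positive advantage, the gain from raising an action's probability is capped at `1 + clip_eps`; with negative advantage, the unclipped term is kept so the penalty is never hidden.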

Is Agentic

true

Architectures

  • LLM-based agents (shared LLM with different prompts)

Collaboration

  • Reward sharing (global F1)
  • Parameter sharing (single policy for all agents)
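Parameter sharing here means one LLM plays all three roles and only the prompt changes. A minimal sketch of that dispatch, assuming the model is available as a text-in, text-out callable; the prompt wording, `ROLE_PROMPTS` and `act` are hypothetical and only illustrate the idea.

```python
from typing import Callable, Sequence

# Hypothetical role prompts; the wording is illustrative only.
ROLE_PROMPTS: dict[str, str] = {
    "query_rewriter": (
        "Rewrite the question into simpler sub-questions.\n"
        "Question: {question}"
    ),
    "selector": (
        "List the IDs of the documents that help answer the question.\n"
        "Question: {question}\nDocuments:\n{documents}"
    ),
    "generator": (
        "Answer the question using the documents.\n"
        "Question: {question}\nDocuments:\n{documents}"
    ),
}


def act(
    llm: Callable[[str], str],
    role: str,
    question: str,
    documents: Sequence[str] = (),
) -> str:
    """Run the single shared policy `llm` in the given role."""
    if role not in ROLE_PROMPTS:
        raise ValueError(f"unknown role: {role}")
    # Number the documents so the selector can refer to them by ID.
    numbered = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents))
    prompt = ROLE_PROMPTS[role].format(question=question, documents=numbered)
    return llm(prompt)
```

Since all roles update the same weights, gradients from the three agents can pull in different directions, which is the conflict noted under Failure Modes.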

Optimization Features

Token Efficiency

  • Selector restricts candidate IDs to reduce action space

Infra Optimization

  • Training time increases linearly with agent count; manageable for three agents

Model Optimization

  • Parameter sharing across agents to reduce model size

System Optimization

  • Minibatch parallel rollouts to speed training

Training Optimization

  • SFT
  • MAPPO joint optimization with shared reward
  • Penalty terms to constrain outputs (length, duplicates, format)
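The penalty terms can be pictured as small negative adjustments added to the shared F1 for each agent. A minimal sketch under stated assumptions: the selector is assumed to emit comma-separated integer IDs, and the penalty value `-0.5`, the limits, and all function names are hypothetical, not values from the paper.

```python
from typing import Sequence


def selector_penalty(raw: str, num_candidates: int, penalty: float = -0.5) -> float:
    """Penalize malformed, duplicate, or out-of-range document IDs."""
    parts = [p.strip() for p in raw.split(",")]
    if any(not (p.isascii() and p.isdigit()) for p in parts):
        return penalty  # format violation (also covers empty output)
    ids = [int(p) for p in parts]
    if len(set(ids)) != len(ids):
        return penalty  # duplicate IDs
    if any(i >= num_candidates for i in ids):
        return penalty  # ID does not refer to a candidate
    return 0.0


def rewriter_penalty(
    sub_questions: Sequence[str], max_sub_questions: int = 4, penalty: float = -0.5
) -> float:
    """Penalize an excessive number of sub-questions."""
    return penalty if len(sub_questions) > max_sub_questions else 0.0


def generator_penalty(answer: str, max_words: int = 30, penalty: float = -0.5) -> float:
    """Penalize overly long answers (length measured in whitespace-separated words)."""
    return penalty if len(answer.split()) > max_words else 0.0


def agent_reward(shared_f1: float, penalty: float) -> float:
    """Per-agent reward: the shared answer F1 plus that agent's own penalty."""
    return shared_f1 + penalty
```

Keeping the penalty per agent while the F1 stays shared lets a misbehaving agent be discouraged without lowering the reward of the others.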

Inference Optimization

  • No extra inference overhead compared to baseline RAG

Reproducibility

Data Urls

  • HotpotQA: https://hotpotqa.github.io/
  • 2WikiMultihopQA: https://github.com/xxx (paper references)
  • AmbigQA: https://github.com/google-research-datasets/ambigqa

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Retriever kept fixed — method does not optimize first-stage retriever.
  • Training cost grows linearly with number of agents; costs may be noticeable beyond three agents.
  • Parameter sharing might not scale or remain stable when many heterogeneous agents are added.
  • Reward is based solely on F1; may favor surface-level matches over nuanced faithfulness.

When Not To Use

  • When you need to jointly train or change the first-stage retriever.
  • When training budget is extremely limited or low-latency training is required.
  • When final evaluation depends on metrics other than F1 (e.g., factual grounding, runtime cost) and the reward cannot be redesigned.

Failure Modes

  • Agents learn degenerate behaviors (excessive sub-questions) and incur penalties.
  • Selector outputs bad ID formats or duplicates, triggering negative rewards and unstable training.
  • Overfitting to F1 metric that may not capture factual correctness or hallucinations.
  • Parameter sharing causes conflicts between functions if agent roles diverge strongly.

Core Entities

Models

  • MMOA-RAG
  • Llama-3-8B-Instruct
  • Contriever
  • BGE
  • E5
  • MAPPO

Metrics

  • Exact Match (EM)
  • Accuracy
  • F1 score

Datasets

  • HotpotQA
  • 2WikiMultihopQA
  • AmbigQA
  • Wikipedia passages (corpus for retrieval)

Benchmarks

  • HotpotQA
  • 2WikiMultihopQA
  • AmbigQA

Context Entities

Models

  • SELF-RAG
  • RetRobust
  • Rewrite-Retrieve-Read
  • BGM
  • RAG-DDR