Learned router that decides when and where to fetch facts from multiple KBs during stepwise multimodal reasoning

May 28, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Chunyi Peng, Zhipeng Xu, Zhenghao Liu, Yishan Li, Yukun Yan, Shuo Wang, Zhiyuan Liu, Yu Gu, Minghe Yu, Ge Yu, Maosong Sun

Links

Abstract / PDF

Why It Matters For Business

Adaptive routing reduces unnecessary retrievals and raises answer quality for mixed text/image/table queries, cutting retrieval cost and improving accuracy for knowledge-heavy apps.

Summary TLDR

R1-Router trains a multimodal LLM to generate intermediate queries, choose which knowledge base to consult (text, image-text, table), and integrate retrieved evidence step by step. Training uses a new RL objective, Step-GRPO, that assigns rewards per reasoning step to improve query quality and routing. On mixed text/visual/table QA sets, R1-Router improves average F1-Recall over strong baselines (about +7 percentage points on the evaluated benchmarks) and cuts unnecessary retrievals for VQA and Table QA. The authors release their code.

Problem Statement

Existing multimodal retrieval-augmented methods fetch from many knowledge bases in a fixed way and do not let the model decide dynamically which KB to query during stepwise reasoning. That rigidity wastes retrievals and limits accuracy on multi-step, multi-modality QA.

Main Contribution

R1-Router: a framework that lets an MLLM generate intermediate queries and route them to specific KBs (text, text-image, table) during iterative reasoning.

Step-GRPO: a step-wise RL objective that gives per-step rewards for query relevance, routing accuracy, and intermediate answer quality.

Empirical gains: consistent improvements across Text QA, Visual QA, and Table QA (average F1-Recall up ~7 pts), plus fewer retrieval steps in VQA and Table QA. Code released.
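The per-step reward idea behind Step-GRPO can be illustrated with a minimal sketch. It assumes each reasoning step is scored on the three components named above and that, as in standard GRPO, each sampled candidate's reward is normalized against the other candidates in its group. The function names, the equal weights, and the use of a population standard deviation are illustrative assumptions, not the paper's exact formulation.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def step_reward(
    query_relevance: float,
    routing_correct: bool,
    answer_quality: float,
    weights: Sequence[float] = (1.0, 1.0, 1.0),  # assumed equal weighting
) -> float:
    """Combine the three per-step signals into one scalar reward."""
    w_query, w_route, w_answer = weights
    return (
        w_query * query_relevance
        + w_route * float(routing_correct)
        + w_answer * answer_quality
    )


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: center and scale rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of three candidates sampled for the same reasoning step.
group = [
    step_reward(0.9, True, 0.8),
    step_reward(0.4, False, 0.3),
    step_reward(0.7, True, 0.5),
]
advantages = group_relative_advantages(group)
```

Candidates that score above the group mean receive a positive advantage and are reinforced; those below the mean are discouraged. Computing this at each step, rather than once per full trajectory, is what gives the policy a learning signal for individual queries and routing decisions.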

Key Findings

R1-Router raises average F1-Recall across evaluated QA benchmarks.

Numbers: Avg F1-Recall 55.93 vs 48.29 (IterRetGen), +7.64 pts

Step-GRPO training beats supervised fine-tuning and prompt-only methods.

Numbers: Self-routing avg: Step-GRPO 55.93 vs SFT 42.70, +13.23 pts

R1-Router reduces the number of retrieval steps needed for correct answers in VQA and Table QA.

Step-GRPO shifts retrieval preference from image-based retrievers toward text retrievers when appropriate.

Results

Average F1-Recall (mixed QA evaluation)

Value: 55.93

Baseline: 48.29 (IterRetGen)

Self-routing ablation (Avg F1-Recall)

Value: 55.93 (Step-GRPO)

Baseline: 42.70 (SFT)

Example per-dataset: WebQA F1-Recall

Value: 90.92 (R1-Router)

Baseline: 84.19 (IterRetGen)
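The reported gains can be checked with simple arithmetic. The sketch below recomputes the absolute difference in points and the relative improvement from the value/baseline pairs listed in this section. The dictionary labels and the `gains` helper are illustrative names, and the relative percentages are derived here rather than reported by the paper.

```python
# (value, baseline) pairs copied from the Results entries above.
results = {
    "Avg F1-Recall, R1-Router vs IterRetGen": (55.93, 48.29),
    "Avg F1-Recall, Step-GRPO vs SFT": (55.93, 42.70),
    "WebQA F1-Recall, R1-Router vs IterRetGen": (90.92, 84.19),
}


def gains(value: float, baseline: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = round(value - baseline, 2)
    relative = round(100.0 * (value - baseline) / baseline, 1)
    return absolute, relative


for label, (value, baseline) in results.items():
    absolute, relative = gains(value, baseline)
    print(f"{label}: +{absolute} pts ({relative}% relative)")
```

The absolute gains come out to 7.64 and 13.23 points for the two averages, matching the figures under Key Findings, and 6.73 points for WebQA.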

Who Should Care

What To Try In 7 Days

Run R1-Router code on a small multimodal QA slice to compare F1-Recall vs your current RAG pipeline.

Replace fixed multi-KB retrieval with per-step retriever selection and measure average retrieval calls per query.

Train with Step-GRPO on a handful of high-quality reasoning trajectories and compare routing accuracy against an SFT-only model.
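For the retrieval-call measurement suggested above, a minimal sketch follows. It assumes your pipeline can log, for each query, the list of KB names it actually called. The trace format and the `retrieval_stats` helper are hypothetical and not part of the released code.

```python
from collections import Counter
from typing import Dict, List


def retrieval_stats(traces: List[List[str]]) -> Dict[str, object]:
    """Summarize retrieval usage.

    traces: one inner list per query, with one KB name per retrieval call.
    An empty inner list means the query was answered without retrieval.
    """
    total_calls = sum(len(trace) for trace in traces)
    avg_calls = total_calls / len(traces) if traces else 0.0
    calls_per_kb = Counter(kb for trace in traces for kb in trace)
    return {
        "avg_calls_per_query": avg_calls,
        "calls_per_kb": dict(calls_per_kb),
    }


# Hypothetical logs: a fixed pipeline that always queries all three KBs,
# and an adaptive router that picks a KB per step or skips retrieval.
fixed_traces = [["text", "text-image", "table"]] * 3
routed_traces = [["text", "table"], [], ["text-image"]]

fixed = retrieval_stats(fixed_traces)
routed = retrieval_stats(routed_traces)
```

Comparing `avg_calls_per_query` between the two runs gives the retrieval-cost side of the trade-off. It should be read alongside answer quality on the same queries, since fewer calls are only a win if accuracy holds.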

Agent Features

Memory

  • Hybrid knowledge bases (text, image-text, tables)

Planning

  • Step-wise retrieval planning
  • Intermediate query generation

Tool Use

  • Text Retriever
  • Text-Image Retriever
  • Table Retriever

Frameworks

  • GRPO

Is Agentic

true

Architectures

  • Multimodal LLM backbone + modular retrievers

Optimization Features

Token Efficiency

  • Fewer retrieval results concatenated when router stops early

System Optimization

  • Adaptive per-step KB selection to avoid broad multi-KB retrieval

Training Optimization

  • GRPO
  • SFT

Inference Optimization

  • Caps reasoning at a maximum of 3 retrieval steps
  • Reduces unnecessary retrievals in VQA/Table QA
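The step cap and early stopping listed above can be sketched as a small control loop. This is an illustration under stated assumptions, not the released implementation: the `Policy` signature, the `Step` record, and the toy policy and retrievers are all hypothetical stand-ins for the trained MLLM and the real text, text-image, and table retrievers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

MAX_STEPS = 3  # step cap reported in this summary


@dataclass
class Step:
    query: str           # intermediate query produced by the policy
    kb: Optional[str]    # KB the query was routed to; None means no retrieval
    evidence: List[str]  # passages returned by that KB's retriever


# Hypothetical interfaces: policy(question, history) -> (sub_query, kb, done)
Policy = Callable[[str, List[Step]], Tuple[str, Optional[str], bool]]
Retriever = Callable[[str], List[str]]


def route_and_retrieve(
    question: str,
    policy: Policy,
    retrievers: Dict[str, Retriever],
    max_steps: int = MAX_STEPS,
) -> List[Step]:
    """Run up to max_steps rounds of query generation, routing, and retrieval."""
    history: List[Step] = []
    for _ in range(max_steps):
        sub_query, kb, done = policy(question, history)
        if done:  # the policy decides it has enough evidence
            break
        # Only the selected KB is queried; an unknown or None KB skips retrieval.
        evidence = retrievers[kb](sub_query) if kb in retrievers else []
        history.append(Step(sub_query, kb, evidence))
    return history


def toy_policy(question: str, history: List[Step]) -> Tuple[str, Optional[str], bool]:
    """Stand-in for the trained model: text first, then table, then stop."""
    if len(history) == 0:
        return ("who directed the film?", "text", False)
    if len(history) == 1:
        return ("box office figures", "table", False)
    return ("", None, True)


toy_retrievers: Dict[str, Retriever] = {
    "text": lambda q: [f"text passage for: {q}"],
    "text-image": lambda q: [f"image caption for: {q}"],
    "table": lambda q: [f"table row for: {q}"],
}

steps = route_and_retrieve("example question", toy_policy, toy_retrievers)
```

In this toy run the loop stops after two retrievals because the policy signals completion. A policy that never signals completion is still cut off at `max_steps`, which bounds the wasted-step failure mode but does not remove it.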

Reproducibility

Data Urls

  • 2WikiMultihopQA
  • InfoSeek
  • Dyn-VQA
  • WebQA
  • Open-WikiTable
  • TabFact
  • M-BEIR
  • Wikipedia dumps

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Adds inference overhead from multi-step reasoning and extra retriever calls.
  • Step-GRPO is trained on a small synthetic set of golden trajectories that may contain errors.
  • Evaluation focuses on curated QA datasets; real-world KB distributions may differ.

When Not To Use

  • When strict low latency is essential and extra reasoning steps are unacceptable.
  • For simple single-hop queries where single-shot retrieval is sufficient.
  • When there is no access to modality-specific retrievers or heterogeneous KBs.

Failure Modes

  • Incorrect retriever selection leads to irrelevant evidence and wrong answers.
  • Poorly filtered training trajectories can teach suboptimal reasoning policies.
  • Repeated or wasted steps if the model does not decide to stop before reaching the maximum step count.

Core Entities

Models

  • Qwen2.5-VL-7B
  • R1-Distill-Qwen-32B
  • BGE-M3
  • Qwen2.5-VL-7B (used as backbone in baselines)

Metrics

  • F1-Recall
  • Accuracy

Datasets

  • 2WikiMultihopQA
  • InfoSeek
  • Dyn-VQA
  • WebQA
  • Open-WikiTable
  • TabFact
  • M-BEIR
  • Wikipedia dump (20241020)

Benchmarks

  • Text QA (2WikiMultihopQA)
  • Visual QA (InfoSeek, Dyn-VQA, WebQA)
  • Table QA (Open-WikiTable, TabFact)

Context Entities

Models

  • IterRetGen
  • IRCoT
  • CogPlanner
  • OmniSearch
  • Search-O1