Learned router that decides when and where to fetch facts from multiple KBs during stepwise multimodal reasoning

May 28, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The paper shows consistent benchmark gains and step-level RL benefits, but training relies on moderate-size synthetic trajectories and a 3-step inference cap, so expect further engineering before large-scale deployment.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 2/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Chunyi Peng, Zhipeng Xu, Zhenghao Liu, Yishan Li, Yukun Yan, Shuo Wang, Zhiyuan Liu, Yu Gu, Minghe Yu, Ge Yu, Maosong Sun

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Adaptive routing reduces unnecessary retrievals and raises answer quality for mixed text/image/table queries, cutting retrieval cost and improving accuracy for knowledge-heavy apps.

Who Should Care

Summary TLDR

R1-Router trains a multimodal LLM to generate intermediate queries, choose which knowledge base to consult (text, image-text, table), and integrate retrieved evidence step by step. Training uses a new RL objective, Step-GRPO, that assigns rewards per reasoning step to improve query quality and routing. On mixed text/visual/table QA sets, R1-Router improves average F1-Recall over strong baselines (55.93 vs 48.29 for IterRetGen, +7.64 points) and cuts unnecessary retrievals for VQA and Table QA. Code is released.
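The step-by-step loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Step` record, the `policy` callable (a stand-in for the trained multimodal LLM), the KB names, and the toy retrievers are all hypothetical. Only the three KB types and the 3-step cap come from the summary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

MAX_STEPS = 3  # mirrors the 3-step inference cap mentioned in the summary


@dataclass
class Step:
    query: str           # intermediate query generated at this step
    kb: Optional[str]    # chosen knowledge base, or None for "no retrieval"
    evidence: List[str]  # whatever the chosen retriever returned


# Hypothetical interfaces: the policy returns (sub-query, KB name or None, done flag).
Policy = Callable[[str, List[Step]], Tuple[str, Optional[str], bool]]
Retriever = Callable[[str], List[str]]


def route_and_retrieve(
    question: str,
    policy: Policy,
    retrievers: Dict[str, Retriever],
    max_steps: int = MAX_STEPS,
) -> List[Step]:
    """Run the query -> route -> retrieve loop until the policy stops or the cap is hit."""
    history: List[Step] = []
    for _ in range(max_steps):
        query, kb, done = policy(question, history)
        if done:
            break
        # Only the selected KB is queried; skipping retrieval costs nothing.
        evidence = retrievers[kb](query) if kb is not None else []
        history.append(Step(query, kb, evidence))
    return history


def toy_policy(question: str, history: List[Step]) -> Tuple[str, Optional[str], bool]:
    # Stand-in for the MLLM: one table lookup, then stop.
    if not history:
        return ("population of Oslo", "table", False)
    return ("", None, True)


toy_retrievers: Dict[str, Retriever] = {
    "text": lambda q: [f"text passage for: {q}"],
    "image_text": lambda q: [f"captioned image for: {q}"],
    "table": lambda q: [f"table row for: {q}"],
}

trace = route_and_retrieve("How many people live in Oslo?", toy_policy, toy_retrievers)
print([(s.kb, s.evidence) for s in trace])
```

In this sketch the final answer would be generated from `question` plus the accumulated `history`; that step is omitted because the summary does not specify how evidence is integrated.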

Problem Statement

Existing multimodal retrieval-augmented methods fetch from many knowledge bases in a fixed way and do not let the model decide dynamically which KB to query during stepwise reasoning. That rigidity wastes retrievals and limits accuracy on multi-step, multi-modality QA.

Main Contribution

R1-Router: a framework that lets an MLLM generate intermediate queries and route them to specific KBs (text, text-image, table) during iterative reasoning.

Step-GRPO: a step-wise RL objective that gives per-step rewards for query relevance, routing accuracy, and intermediate answer quality.
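To make the per-step reward idea concrete, here is a hedged sketch of how a step-level, group-relative advantage might be computed. The three reward components are taken from the description above; the equal weighting, the function names, and the use of a population standard deviation are assumptions made for illustration and may differ from the paper's actual objective.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def step_reward(
    query_relevance: float,
    routing_correct: bool,
    answer_quality: float,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),  # assumed equal weights
) -> float:
    """Combine the three per-step signals named in the summary into one scalar."""
    w_q, w_r, w_a = weights
    return w_q * query_relevance + w_r * float(routing_correct) + w_a * answer_quality


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style normalization: score each sample relative to its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of candidate outputs sampled for the same reasoning step (toy values).
group = [
    step_reward(0.9, True, 0.8),   # good query, right KB, good intermediate answer
    step_reward(0.6, False, 0.4),  # wrong KB
    step_reward(0.2, False, 0.1),  # poor on all three signals
]
advantages = group_relative_advantages(group)
print(advantages)
```

Normalizing within the group means a step is rewarded for being better than its sibling samples at the same step, rather than only for the quality of the final answer.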

Key Findings

R1-Router raises average F1-Recall across evaluated QA benchmarks.

Numbers: Avg F1-Recall 55.93 vs 48.29 (IterRetGen), +7.64 pts

Practical Use: Swap in R1-Router to boost multimodal QA accuracy on mixed-domain testbeds; expect single-digit absolute F1-Recall gains on comparable benchmarks (the reported average gain is +7.64 points).

Evidence Ref: Table 1

Step-GRPO training beats supervised fine-tuning and prompt-only methods.

Numbers: Self-routing avg: Step-GRPO 55.93 vs SFT 42.70, +13.23 pts

Practical Use: Use step-wise RL (Step-GRPO) rather than only SFT to improve both query generation and KB routing.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Average F1-Recall (mixed QA evaluation) | 55.93 | 48.29 (IterRetGen) | +7.64 pts | Avg over Open-WikiTable, 2WikiMQA, InfoSeek, Dyn-VQA, TabFact, WebQA | Table 1 reports R1-Router avg 55.93 vs IterRetGen 48.29 | Table 1
Self-routing ablation (Avg F1-Recall) | 55.93 (Step-GRPO) | 42.70 (SFT) | +13.23 pts | Self-Routing setting, average across tasks | Table 2 shows Step-GRPO (Self-Routing) 55.93 vs SFT 42.70 | Table 2
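The deltas in these results can be checked with a few lines of arithmetic. The absolute differences match the reported values; the relative percentages are derived here for context and are not figures reported in the summary.

```python
from typing import Dict, Tuple

# (value, baseline) pairs copied from the results above.
results: Dict[str, Tuple[float, float]] = {
    "R1-Router vs IterRetGen (Table 1)": (55.93, 48.29),
    "Step-GRPO vs SFT (Table 2)": (55.93, 42.70),
}


def deltas(value: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = round(value - baseline, 2)
    relative_pct = round(100 * (value - baseline) / baseline, 1)
    return absolute, relative_pct


for name, (value, baseline) in results.items():
    absolute, relative_pct = deltas(value, baseline)
    print(f"{name}: +{absolute} pts ({relative_pct}% relative)")
```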

What To Try In 7 Days

Run R1-Router code on a small multimodal QA slice to compare F1-Recall vs your current RAG pipeline.

Replace fixed multi-KB retrieval with per-step retriever selection and measure average retrieval calls per query.

Train Step-GRPO on a handful of high-quality reasoning trajectories and compare SFT vs Step-GRPO routing accuracy.
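For the retrieval-call comparison suggested above, a small counting harness might look like the sketch below. The trace format (one list of KB names per query, with `None` meaning "no retrieval at this step") and the fixed-baseline helper are assumptions made for illustration, not part of the released code.

```python
from typing import List, Optional, Sequence

Trace = List[Optional[str]]  # KB chosen at each step; None means no retrieval


def average_retrieval_calls(traces: Sequence[Trace]) -> float:
    """Mean number of retriever calls per query across a set of traces."""
    if not traces:
        return 0.0
    total_calls = sum(sum(1 for kb in trace if kb is not None) for trace in traces)
    return total_calls / len(traces)


def fixed_multi_kb_trace(num_steps: int, kbs: Sequence[str]) -> Trace:
    """Baseline that queries every KB at every step, regardless of the question."""
    return [kb for _ in range(num_steps) for kb in kbs]


KBS = ["text", "image_text", "table"]

# Toy traces: an adaptive router picks at most one KB per step and may stop early.
adaptive = [["table"], ["text", "image_text"], [None]]
fixed = [fixed_multi_kb_trace(3, KBS) for _ in adaptive]

print("adaptive:", average_retrieval_calls(adaptive))
print("fixed:   ", average_retrieval_calls(fixed))
```

Logging the chosen KB per step in your own pipeline produces traces in this shape, so the same function can report calls per query before and after switching to per-step routing.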

Agent Features

Memory
Hybrid knowledge bases (text, image-text, tables)
Planning
Step-wise retrieval planning, Intermediate query generation
Tool Use
Text Retriever, Text-Image Retriever, Table Retriever
Frameworks
GRPO
Is Agentic

Yes

Architectures
Multimodal LLM backbone + modular retrievers

Optimization Features

Token Efficiency
Fewer retrieval results concatenated when router stops early
System Optimization
Adaptive per-step KB selection to avoid broad multi-KB retrieval
Training Optimization
GRPO, SFT
Inference Optimization
Limits to max 3 retrieval steps, Reduces unnecessary retrievals in VQA/Table QA

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Data URLs

2WikiMultihopQA, InfoSeek, Dyn-VQA, WebQA, Open-WikiTable, TabFact, M-BEIR, Wikipedia dumps

Risks & Boundaries

Limitations

Adds inference overhead from multi-step reasoning and extra retriever calls.

Step-GRPO is trained on a small synthetic set of golden trajectories that may contain errors.

When Not To Use

When strict low latency is essential and extra reasoning steps are unacceptable.

For simple single-hop queries where single-shot retrieval is sufficient.

Failure Modes

Incorrect retriever selection leads to irrelevant evidence and wrong answers.

Poorly filtered training trajectories can teach suboptimal reasoning policies.

Core Entities

Models

Qwen2.5-VL-7B, R1-Distill-Qwen-32B, BGE-M3, Qwen2.5-VL-7B (used as backbone in baselines)

Metrics

F1-Recall, Accuracy

Datasets

2WikiMultihopQA, InfoSeek, Dyn-VQA, WebQA, Open-WikiTable, TabFact, M-BEIR, Wikipedia dump (20241020)

Benchmarks

Text QA (2WikiMultihopQA), Visual QA (InfoSeek, Dyn-VQA, WebQA), Table QA (Open-WikiTable, TabFact)

Context Entities

Models

IterRetGen, IRCoT, CogPlanner, OmniSearch, Search-O1