Learned router that decides when and where to fetch facts from multiple KBs during stepwise multimodal reasoning

Overview

Decision Snapshot: Needs Validation

The paper shows consistent benchmark gains and step-level RL benefits, but training relies on moderate-size synthetic trajectories and inference is capped at 3 retrieval steps, so expect further engineering before large-scale deployment.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 2/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Chunyi Peng, Zhipeng Xu, Zhenghao Liu, Yishan Li, Yukun Yan, Shuo Wang, Zhiyuan Liu, Yu Gu, Minghe Yu, Ge Yu, Maosong Sun

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Adaptive routing reduces unnecessary retrievals and raises answer quality for mixed text/image/table queries, cutting retrieval cost and improving accuracy for knowledge-heavy apps.

Who Should Care

ML Engineer, Product Manager, Data Scientist, CTO

Summary TLDR

R1-Router trains a multimodal LLM to generate intermediate queries, choose which knowledge base to consult (text, image-text, table), and integrate retrieved evidence step by step. Training uses a new RL objective, Step-GRPO, that assigns rewards per reasoning step for better query quality and routing. On mixed text/visual/table QA sets, R1-Router improves average F1-Recall over strong baselines (about +7 percentage points on the evaluated benchmarks) and cuts unnecessary retrievals for VQA and Table QA. The code is released.

Problem Statement

Existing multimodal retrieval-augmented methods fetch from many knowledge bases in a fixed way and do not let the model decide dynamically which KB to query during stepwise reasoning. That rigidity wastes retrievals and limits accuracy on multi-step, multi-modality QA.

Main Contribution

R1-Router: a framework that lets an MLLM generate intermediate queries and route them to specific KBs (text, text-image, table) during iterative reasoning.

Step-GRPO: a step-wise RL objective that gives per-step rewards for query relevance, routing accuracy, and intermediate answer quality.
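The routing loop described above can be sketched as follows. This is a minimal illustration, not the released implementation: the `Policy` interface, the KB names, and the toy policy and retrievers are all hypothetical stand-ins, and only the 3-step cap comes from this summary.

```python
from typing import Callable, Dict, List, Optional, Tuple

MAX_STEPS = 3  # this summary reports a cap of 3 retrieval steps at inference

# A retriever maps a query string to a list of evidence strings.
Retriever = Callable[[str], List[str]]

# Hypothetical policy interface: given the question and the evidence gathered
# so far, return (sub_query, kb_name); kb_name=None means "stop retrieving".
Policy = Callable[[str, List[str]], Tuple[str, Optional[str]]]


def route_and_retrieve(question: str, policy: Policy,
                       retrievers: Dict[str, Retriever]) -> Tuple[List[str], List[str]]:
    """Run up to MAX_STEPS rounds of query generation, routing, and retrieval."""
    evidence: List[str] = []
    route_log: List[str] = []
    for _ in range(MAX_STEPS):
        sub_query, kb = policy(question, evidence)
        if kb is None:  # the policy decides no further retrieval is needed
            break
        evidence.extend(retrievers[kb](sub_query))
        route_log.append(kb)
    return evidence, route_log


# Toy stand-ins so the sketch runs without any model or index.
def toy_policy(question: str, evidence: List[str]) -> Tuple[str, Optional[str]]:
    if not evidence:
        return "population table for " + question, "table"
    if len(evidence) == 1:
        return "background on " + question, "text"
    return "", None


toy_retrievers: Dict[str, Retriever] = {
    "text": lambda q: ["text hit: " + q],
    "image_text": lambda q: ["image-text hit: " + q],
    "table": lambda q: ["table hit: " + q],
}

evidence, route_log = route_and_retrieve("Berlin", toy_policy, toy_retrievers)
print(route_log)  # ['table', 'text']
```

In a real system the policy would be the trained MLLM and each retriever would wrap an index over one KB; the early `break` is what lets the router skip retrievals it judges unnecessary.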

Key Findings

R1-Router raises average F1-Recall across evaluated QA benchmarks.

Numbers: Avg F1-Recall 55.93 vs 48.29 (IterRetGen), +7.64 pts

Practical Use: Swap in R1-Router to boost multimodal QA accuracy on mixed-domain testbeds; the reported gain over IterRetGen is a single-digit absolute improvement in F1-Recall (+7.64 pts).

Evidence Ref: Table 1

Step-GRPO training beats supervised fine-tuning and prompt-only methods.

Numbers: Self-routing avg: Step-GRPO 55.93 vs SFT 42.70, +13.23 pts

Practical Use: Use step-wise RL (Step-GRPO) rather than only SFT to improve both query generation and KB routing.

Evidence Ref: Table 2
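To make the step-wise RL idea concrete, here is a rough sketch of a group-relative advantage computed per reasoning step. It assumes the standard GRPO recipe of standardizing rewards within a group of sampled candidates; the unweighted sum of the three reward signals, the use of population standard deviation, and all reward values are illustrative assumptions, not details confirmed by this summary.

```python
from statistics import mean, pstdev
from typing import List


def step_reward(query_relevance: float, routing_correct: float,
                answer_quality: float) -> float:
    """Combine the three per-step signals; the unweighted sum is an assumption."""
    return query_relevance + routing_correct + answer_quality


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of candidates sampled for the same step."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std would also be a valid choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical candidates sampled for the same reasoning step.
candidates = [
    step_reward(0.9, 1.0, 0.8),
    step_reward(0.4, 0.0, 0.2),
    step_reward(0.7, 1.0, 0.5),
    step_reward(0.2, 0.0, 0.1),
]
advantages = group_relative_advantages(candidates)
for reward, advantage in zip(candidates, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.2f}")
```

Candidates that score above the group mean receive a positive advantage and are reinforced; those below it are discouraged. Applying this at each step, rather than once per full trajectory, is the step-level credit assignment the finding refers to.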

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average F1-Recall (mixed QA evaluation)	55.93	48.29 (IterRetGen)	+7.64 pts	Avg over Open-WikiTable, 2WikiMQA, InfoSeek, Dyn-VQA, TabFact, WebQA	Table 1 reports R1-Router avg 55.93 vs IterRetGen 48.29	Table 1
Self-routing ablation (Avg F1-Recall)	55.93 (Step-GRPO)	42.70 (SFT)	+13.23 pts	Self-Routing setting, average across tasks	Table 2 shows Step-GRPO (Self-Routing) 55.93 vs SFT 42.70	Table 2
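The deltas in the table follow directly from the reported averages. A quick check, using only the numbers above:

```python
# (value, baseline) pairs copied from the Results table.
results = {
    "R1-Router vs IterRetGen (Table 1)": (55.93, 48.29),
    "Step-GRPO vs SFT (Table 2)": (55.93, 42.70),
}

# Absolute difference in F1-Recall points, rounded to two decimals.
deltas = {name: round(value - baseline, 2)
          for name, (value, baseline) in results.items()}

for name, delta in deltas.items():
    print(f"{name}: +{delta:.2f} pts")
```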

What To Try In 7 Days

Run R1-Router code on a small multimodal QA slice to compare F1-Recall vs your current RAG pipeline.

Replace fixed multi-KB retrieval with per-step retriever selection and measure average retrieval calls per query.

Train Step-GRPO on a handful of high-quality reasoning trajectories and compare SFT vs Step-GRPO routing accuracy.
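For the retrieval-call comparison, one possible way to instrument a pipeline is to wrap each retriever in a counter. The sketch below is an assumption-laden illustration: `CountingRetriever`, `average_calls`, the KB names, and the two toy queries are hypothetical, and the routed choice is hard-coded rather than predicted by a model.

```python
from collections import Counter
from typing import Callable, List


class CountingRetriever:
    """Wrap any retriever callable and count how often it is invoked."""

    def __init__(self, name: str, retrieve: Callable[[str], List[str]],
                 counter: Counter):
        self.name = name
        self._retrieve = retrieve
        self._counter = counter

    def __call__(self, query: str) -> List[str]:
        self._counter[self.name] += 1
        return self._retrieve(query)


def average_calls(counter: Counter, num_queries: int) -> float:
    """Total retriever invocations divided by the number of queries."""
    return sum(counter.values()) / num_queries


def dummy_retrieve(query: str) -> List[str]:
    return [query]  # placeholder for a real index lookup


kbs = ["text", "image_text", "table"]
fixed_counts: Counter = Counter()
adaptive_counts: Counter = Counter()
fixed = {kb: CountingRetriever(kb, dummy_retrieve, fixed_counts) for kb in kbs}
adaptive = {kb: CountingRetriever(kb, dummy_retrieve, adaptive_counts) for kb in kbs}

# Each toy query is paired with the KB a router might select for it.
queries = [("who directed the film?", "text"), ("2019 revenue by region", "table")]
for query, chosen_kb in queries:
    for kb in kbs:              # fixed pipeline: query every KB
        fixed[kb](query)
    adaptive[chosen_kb](query)  # routed pipeline: query only the selected KB

print(average_calls(fixed_counts, len(queries)))     # 3.0
print(average_calls(adaptive_counts, len(queries)))  # 1.0
```

Reporting the per-KB breakdown in each `Counter` alongside the average also shows which retrievers the router avoids most often.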

Agent Features

Memory

Hybrid knowledge bases (text, image-text, tables)

Planning

Step-wise retrieval planning

Intermediate query generation

Tool Use

Text Retriever

Text-Image Retriever

Table Retriever

Frameworks

GRPO

Is Agentic

Yes

Architectures

Multimodal LLM backbone + modular retrievers

Optimization Features

Token Efficiency

Fewer retrieval results concatenated when router stops early

System Optimization

Adaptive per-step KB selection to avoid broad multi-KB retrieval

Training Optimization

GRPO, SFT

Inference Optimization

Limits inference to a maximum of 3 retrieval steps

Reduces unnecessary retrievals in VQA/Table QA

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/OpenBMB/R1-Router

Data URLs

2WikiMultihopQA, InfoSeek, Dyn-VQA, WebQA, Open-WikiTable, TabFact, M-BEIR, Wikipedia dumps

Risks & Boundaries

Limitations

Adds inference overhead from multi-step reasoning and extra retriever calls.

Step-GRPO is trained on a small synthetic set of golden trajectories that may contain errors.

When Not To Use

When strict low-latency is essential and extra reasoning steps are unacceptable.

For simple single-hop queries where single-shot retrieval is sufficient.

Failure Modes

Incorrect retriever selection leads to irrelevant evidence and wrong answers.

Poorly filtered training trajectories can teach suboptimal reasoning policies.

Core Entities

Models

Qwen2.5-VL-7B, R1-Distill-Qwen-32B, BGE-M3, Qwen2.5-VL-7B (used as backbone in baselines)

Metrics

F1-Recall, Accuracy

Datasets

2WikiMultihopQA, InfoSeek, Dyn-VQA, WebQA, Open-WikiTable, TabFact, M-BEIR, Wikipedia dump (20241020)

Benchmarks

Text QA (2WikiMultihopQA); Visual QA (InfoSeek, Dyn-VQA, WebQA); Table QA (Open-WikiTable, TabFact)

Context Entities

Models

IterRetGen, IRCoT, CogPlanner, OmniSearch, Search-O1
