Overview
The paper reports consistent benchmark gains and benefits from step-level RL, but training relies on a moderate-size set of synthetic trajectories and inference is capped at 3 steps, so expect further engineering before large-scale deployment.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.78
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 2/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
Adaptive routing reduces unnecessary retrievals and raises answer quality for mixed text/image/table queries, which lowers retrieval cost for knowledge-heavy apps.
Summary TLDR
R1-Router trains a multimodal LLM to generate intermediate queries, choose which knowledge base to consult (text, text-image, or table), and integrate retrieved evidence step by step. Training uses a new RL objective, Step-GRPO, that assigns a reward to each reasoning step to improve query quality and routing. On mixed text/visual/table QA sets, R1-Router improves average F1-Recall over strong baselines (+7.64 points over IterRetGen, averaged across the evaluated benchmarks) and cuts unnecessary retrievals for VQA and Table QA. The authors release their code.
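The step-by-step loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `run_router`, `Step`, and the toy policy, retrievers, and reader are hypothetical stand-ins for the trained model and the real knowledge bases, and the step cap of 3 is taken from the overview.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

MAX_STEPS = 3  # inference cap mentioned in the overview


@dataclass
class Step:
    query: str                 # intermediate query generated at this step
    retriever: Optional[str]   # chosen KB name, or None for "no retrieval"
    evidence: List[str]        # retrieved items (empty if no retrieval)
    intermediate_answer: str   # answer to the intermediate query


# Hypothetical interfaces: in the real system these would all be model calls.
Policy = Callable[[str, List[Step]], Tuple[str, Optional[str]]]
Retriever = Callable[[str], List[str]]
Reader = Callable[[str, List[str], List[Step]], str]


def run_router(question: str, policy: Policy, retrievers: Dict[str, Retriever],
               reader: Reader, max_steps: int = MAX_STEPS) -> List[Step]:
    """Iteratively generate a query, route it to one KB, and record the result."""
    history: List[Step] = []
    for _ in range(max_steps):
        query, name = policy(question, history)
        if not query:  # an empty query signals that no further step is needed
            break
        evidence = retrievers[name](query) if name is not None else []
        history.append(Step(query, name, evidence, reader(query, evidence, history)))
    return history


# Toy stand-ins so the sketch runs end to end.
def toy_policy(question: str, history: List[Step]) -> Tuple[str, Optional[str]]:
    if len(history) == 0:
        return "Which table lists the medal counts?", "table"
    if len(history) == 1:
        return "Who is the top-ranked athlete in that table?", "text"
    return "", None


toy_retrievers: Dict[str, Retriever] = {
    "text": lambda q: [f"text passage for: {q}"],
    "text-image": lambda q: [f"captioned image for: {q}"],
    "table": lambda q: [f"table rows for: {q}"],
}


def toy_reader(query: str, evidence: List[str], history: List[Step]) -> str:
    return f"answer from {len(evidence)} evidence item(s)"


steps = run_router("Who won the most medals?", toy_policy, toy_retrievers, toy_reader)
for s in steps:
    print(s.retriever, "|", s.query)
```

Each step issues at most one retriever call, which is what distinguishes this loop from querying every KB on every step.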
Problem Statement
Existing multimodal retrieval-augmented methods retrieve from multiple knowledge bases (KBs) in a fixed pattern and do not let the model decide dynamically which KB to query during stepwise reasoning. That rigidity wastes retrievals and limits accuracy on multi-step, multimodal QA.
Main Contribution
R1-Router: a framework that lets a multimodal LLM generate intermediate queries and route them to specific KBs (text, text-image, table) during iterative reasoning.
Step-GRPO: a step-wise RL objective that gives per-step rewards for query relevance, routing accuracy, and intermediate answer quality.
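To make the per-step reward idea concrete, here is a minimal sketch of a group-relative advantage computed at a single reasoning step, in the style of GRPO. The equal-weight sum of the three reward components, the field names, and the example scores are assumptions for illustration; the summary does not specify how Step-GRPO combines or scales its rewards.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def step_reward(candidate: Dict[str, float],
                weights: Sequence[float] = (1.0, 1.0, 1.0)) -> float:
    """Combine the three per-step signals into one scalar (weights are assumed)."""
    w_query, w_route, w_answer = weights
    return (w_query * candidate["query_relevance"]
            + w_route * candidate["routing_correct"]
            + w_answer * candidate["answer_quality"])


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within a group of candidates sampled for the same step."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical candidates sampled for the same reasoning step.
group = [
    {"query_relevance": 1.0, "routing_correct": 1.0, "answer_quality": 0.8},
    {"query_relevance": 0.5, "routing_correct": 0.0, "answer_quality": 0.2},
    {"query_relevance": 0.8, "routing_correct": 1.0, "answer_quality": 0.5},
]
rewards = [step_reward(c) for c in group]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Candidates that score above the group mean at a given step get a positive advantage and the others a negative one, so credit is assigned per step rather than only from the final answer.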
Key Findings
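R1-Router raises average F1-Recall across evaluated QA benchmarks (55.93 vs 48.29 for IterRetGen, Table 1).
Step-GRPO training beats supervised fine-tuning (55.93 vs 42.70 average F1-Recall in the Self-Routing setting, Table 2) and prompt-only methods.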
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Average F1-Recall (mixed QA evaluation) | 55.93 | 48.29 (IterRetGen) | +7.64 pts | Avg over Open-WikiTable, 2WikiMQA, InfoSeek, Dyn-VQA, TabFact, WebQA | Table 1 reports R1-Router avg 55.93 vs IterRetGen 48.29 | Table 1 |
| Self-routing ablation (Avg F1-Recall) | 55.93 (Step-GRPO) | 42.70 (SFT) | +13.23 pts | Self-Routing setting, average across tasks | Table 2 shows Step-GRPO (Self-Routing) 55.93 vs SFT 42.70 | Table 2 |
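The deltas in the table can be rechecked from the reported values. The snippet below is only an arithmetic check on the numbers shown in the table; the relative-gain figure is a derived quantity that the source does not report.

```python
def delta_points(value: float, baseline: float) -> float:
    """Absolute gain in F1-Recall points, rounded to two decimals."""
    return round(value - baseline, 2)


def relative_gain(value: float, baseline: float) -> float:
    """Gain as a percentage of the baseline score."""
    return 100.0 * (value - baseline) / baseline


results = {
    "R1-Router vs IterRetGen (Table 1)": (55.93, 48.29),
    "Step-GRPO vs SFT, Self-Routing (Table 2)": (55.93, 42.70),
}
for name, (value, baseline) in results.items():
    print(f"{name}: +{delta_points(value, baseline)} pts, "
          f"{relative_gain(value, baseline):.1f}% relative")
```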
What To Try In 7 Days
Run the released R1-Router code on a small multimodal QA slice to compare F1-Recall against your current RAG pipeline.
Replace fixed multi-KB retrieval with per-step retriever selection and measure average retrieval calls per query.
Train with Step-GRPO on a handful of high-quality reasoning trajectories and compare routing accuracy against an SFT baseline.
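For the two measurements suggested above, a small helper along these lines may be enough. The log format (one list of chosen retriever names per query, with `None` meaning no retrieval) and the function names are assumptions for illustration, not part of the released code.

```python
from collections import Counter
from typing import List, Optional, Tuple

RouteLog = List[List[Optional[str]]]  # per query: retriever chosen at each step


def retrieval_stats(logs: RouteLog) -> Tuple[float, Counter]:
    """Average retriever calls per query and call counts per retriever."""
    calls_per_query = [sum(1 for r in query if r is not None) for query in logs]
    avg_calls = sum(calls_per_query) / len(calls_per_query)
    by_retriever = Counter(r for query in logs for r in query if r is not None)
    return avg_calls, by_retriever


def routing_accuracy(predicted: RouteLog, gold: RouteLog) -> float:
    """Fraction of steps where the chosen retriever matches the reference choice."""
    pairs = [(p, g) for pq, gq in zip(predicted, gold) for p, g in zip(pq, gq)]
    return sum(p == g for p, g in pairs) / len(pairs)


# Hypothetical logs for three queries.
logs: RouteLog = [["text", "table"], [None], ["text-image"]]
avg_calls, by_retriever = retrieval_stats(logs)
print(avg_calls, dict(by_retriever))

# Hypothetical predicted vs reference routing for two queries.
pred: RouteLog = [["text", "table"], ["text-image"]]
gold: RouteLog = [["text", "text-image"], ["text-image"]]
print(routing_accuracy(pred, gold))
```

A pipeline that always queries all three KBs makes three calls per step, so a lower `avg_calls` at equal or better answer quality is the signal to look for.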
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Adds inference overhead from multi-step reasoning and extra retriever calls.
Step-GRPO is trained on a small synthetic set of golden trajectories that may contain errors.
When Not To Use
When strict low latency is essential and extra reasoning steps are unacceptable.
For simple single-hop queries where single-shot retrieval is sufficient.
Failure Modes
Incorrect retriever selection leads to irrelevant evidence and wrong answers.
Poorly filtered training trajectories can teach suboptimal reasoning policies.

