Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Adaptive routing reduces unnecessary retrievals and raises answer quality for mixed text/image/table queries, cutting retrieval cost and improving accuracy for knowledge-heavy apps.
Summary TLDR
R1-Router trains a multimodal LLM to generate intermediate queries, choose which knowledge base to consult (text, image-text, table), and integrate retrieved evidence step by step. Training uses a new RL objective, Step-GRPO, that assigns a reward to each reasoning step to improve query quality and routing. On mixed text, visual, and table QA sets, R1-Router improves average F1-Recall over strong baselines by about 7 percentage points on the evaluated benchmarks and cuts unnecessary retrievals for VQA and Table QA. The code is released.
Problem Statement
Existing multimodal retrieval-augmented methods query a fixed set of knowledge bases (KBs) and do not let the model decide which KB to consult at each step of its reasoning. That rigidity wastes retrievals and limits accuracy on multi-step, multi-modality QA.
Main Contribution
R1-Router: a framework that lets an MLLM generate intermediate queries and route them to specific KBs (text, image-text, table) during iterative reasoning.
Step-GRPO: a step-wise RL objective that gives per-step rewards for query relevance, routing accuracy, and intermediate answer quality.
Empirical gains: consistent improvements across Text QA, Visual QA, and Table QA (average F1-Recall up by about 7 percentage points), plus fewer retrieval steps in VQA and Table QA. Code released.
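The routing loop described in these contributions (generate an intermediate query, pick a KB, retrieve, integrate, repeat) can be sketched roughly as below. This is a minimal illustration, not the released implementation: `route_and_answer`, `Step`, the policy and answerer callables, and the toy stubs are all hypothetical names, and the only detail taken from this summary is the cap of 3 retrieval steps.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

MAX_STEPS = 3  # cap on retrieval steps, as listed in this summary


@dataclass
class Step:
    query: str
    kb: Optional[str]      # None means no retrieval was requested at this step
    evidence: List[str]
    answer: str            # intermediate answer for this step


# Hypothetical interfaces (assumptions, not the paper's API).
Retriever = Callable[[str], List[str]]
Policy = Callable[[str, List[Step]], Tuple[str, Optional[str], bool]]
Answerer = Callable[[str, List[str]], str]


def route_and_answer(question: str, policy: Policy,
                     retrievers: Dict[str, Retriever], answerer: Answerer,
                     max_steps: int = MAX_STEPS) -> Tuple[str, List[Step]]:
    """Iteratively pick a KB per step, then answer from everything gathered."""
    history: List[Step] = []
    for _ in range(max_steps):
        query, kb, done = policy(question, history)
        if done:  # the policy decided it has enough evidence
            break
        evidence = retrievers[kb](query) if kb is not None else []
        history.append(Step(query, kb, evidence, answerer(query, evidence)))
    # Integrate retrieved evidence and intermediate answers for the final answer.
    context = [e for s in history for e in s.evidence] + [s.answer for s in history]
    return answerer(question, context), history


# Toy stand-ins so the sketch runs end to end.
def toy_policy(question: str, history: List[Step]):
    plan = [("population table for X", "table"), ("capital of X", "text")]
    if len(history) >= len(plan):
        return "", None, True
    query, kb = plan[len(history)]
    return query, kb, False


toy_retrievers: Dict[str, Retriever] = {
    "text": lambda q: [f"text passage for: {q}"],
    "image-text": lambda q: [f"caption for: {q}"],
    "table": lambda q: [f"table row for: {q}"],
}


def toy_answerer(query: str, evidence: List[str]) -> str:
    return f"answer({query}) from {len(evidence)} item(s)"
```

Calling `route_and_answer("q", toy_policy, toy_retrievers, toy_answerer)` with these stubs consults the table KB, then the text KB, and stops before the step cap. In a real system the policy and answerer would both be the trained MLLM.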
Key Findings
R1-Router raises average F1-Recall across evaluated QA benchmarks.
Step-GRPO training beats supervised fine-tuning and prompt-only methods.
R1-Router reduces the number of retrieval steps needed for correct answers in VQA and Table QA.
Step-GRPO shifts retrieval preference from image-based retrievers toward text retrievers when appropriate.
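To make the per-step reward idea concrete, the sketch below shows a generic group-relative advantage computation of the kind GRPO-style objectives use, applied to a group of candidate actions sampled for a single reasoning step. It is an assumption-laden illustration, not the paper's exact objective: the equal weighting of the three reward components, the use of the population standard deviation, and the function names `step_reward` and `group_relative_advantages` are all choices made here.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional


def step_reward(components: Dict[str, float],
                weights: Optional[Dict[str, float]] = None) -> float:
    """Combine per-step reward components into one scalar.

    The component names follow this summary (query relevance, routing
    accuracy, intermediate answer quality); equal weights are an assumption.
    """
    if weights is None:
        weights = {name: 1.0 for name in components}
    return sum(weights[name] * value for name, value in components.items())


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical candidates sampled for the same reasoning step.
candidates = [
    {"query_relevance": 1.0, "routing_accuracy": 1.0, "answer_quality": 1.0},
    {"query_relevance": 1.0, "routing_accuracy": 0.0, "answer_quality": 0.0},
    {"query_relevance": 1.0, "routing_accuracy": 1.0, "answer_quality": 0.0},
    {"query_relevance": 0.0, "routing_accuracy": 1.0, "answer_quality": 1.0},
]
rewards = [step_reward(c) for c in candidates]
advantages = group_relative_advantages(rewards)
```

Candidates that score above the group mean at a given step get a positive advantage and those below get a negative one, so the learning signal attaches to the individual step rather than only to the final answer.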
Results
Average F1-Recall (mixed QA evaluation)
Self-routing ablation (Avg F1-Recall)
Example per-dataset: WebQA F1-Recall
Who Should Care
What To Try In 7 Days
Run R1-Router code on a small multimodal QA slice to compare F1-Recall vs your current RAG pipeline.
Replace fixed multi-KB retrieval with per-step retriever selection and measure average retrieval calls per query.
Train with Step-GRPO on a handful of high-quality reasoning trajectories and compare its routing accuracy against SFT.
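For the first two experiments, a small harness along these lines could track answer quality and retrieval cost side by side. It is only a sketch: this summary does not define F1-Recall, so `token_recall` below is a simple token-overlap recall used as a stand-in, and the record fields (`prediction`, `reference`, `retrieval_calls`) are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def token_recall(prediction: str, reference: str) -> float:
    """Fraction of reference tokens that also appear in the prediction.

    A stand-in metric (assumption); swap in the benchmark's own scorer
    when reproducing reported numbers.
    """
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    overlap = sum((pred & ref).values())  # multiset intersection
    return overlap / sum(ref.values())


def summarize(runs: List[Dict]) -> Dict[str, float]:
    """Average recall and average retrieval calls per query over a run log."""
    n = len(runs)
    return {
        "avg_recall": sum(token_recall(r["prediction"], r["reference"]) for r in runs) / n,
        "avg_retrieval_calls": sum(r["retrieval_calls"] for r in runs) / n,
    }


# Hypothetical logs from one pipeline on a two-question slice.
runs = [
    {"prediction": "paris france", "reference": "paris", "retrieval_calls": 1},
    {"prediction": "berlin", "reference": "rome italy", "retrieval_calls": 3},
]
report = summarize(runs)
```

Running `summarize` once on logs from the existing RAG pipeline and once on logs from the per-step router gives directly comparable quality and cost figures for the same question slice.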
Agent Features
Memory
- Hybrid knowledge bases (text, image-text, tables)
Planning
- Step-wise retrieval planning
- Intermediate query generation
Tool Use
- Text Retriever
- Text-Image Retriever
- Table Retriever
Frameworks
- GRPO
Is Agentic
true
Architectures
- Multimodal LLM backbone + modular retrievers
Optimization Features
Token Efficiency
- Fewer retrieval results concatenated when router stops early
System Optimization
- Adaptive per-step KB selection to avoid broad multi-KB retrieval
Training Optimization
- GRPO
- SFT
Inference Optimization
- Caps retrieval at a maximum of 3 steps
- Reduces unnecessary retrievals in VQA/Table QA
Reproducibility
Code Urls
Data Urls
- 2WikiMultihopQA
- InfoSeek
- Dyn-VQA
- WebQA
- Open-WikiTable
- TabFact
- M-BEIR
- Wikipedia dumps
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Adds inference overhead from multi-step reasoning and extra retriever calls.
- Step-GRPO is trained on a small synthetic set of golden trajectories that may contain errors.
- Evaluation focuses on curated QA datasets; real-world KB distributions may differ.
When Not To Use
- When strict low latency is essential and extra reasoning steps are unacceptable.
- For simple single-hop queries where single-shot retrieval is sufficient.
- When there is no access to modality-specific retrievers or heterogeneous KBs.
Failure Modes
- Incorrect retriever selection leads to irrelevant evidence and wrong answers.
- Poorly filtered training trajectories can teach suboptimal reasoning policies.
- Possible loops or wasted steps if the model does not decide to stop before reaching the step limit.
Core Entities
Models
- Qwen2.5-VL-7B
- R1-Distill-Qwen-32B
- BGE-M3
- Qwen2.5-VL-7B (used as backbone in baselines)
Metrics
- F1-Recall
- Accuracy
Datasets
- 2WikiMultihopQA
- InfoSeek
- Dyn-VQA
- WebQA
- Open-WikiTable
- TabFact
- M-BEIR
- Wikipedia dump (20241020)
Benchmarks
- Text QA (2WikiMultihopQA)
- Visual QA (InfoSeek, Dyn-VQA, WebQA)
- Table QA (Open-WikiTable, TabFact)
Context Entities
Models
- IterRetGen
- IRCoT
- CogPlanner
- OmniSearch
- Search-O1

