Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.45
Citation Count
96
Why It Matters For Business
You can get near-state-of-the-art multimodal reasoning from lightweight models by fine-tuning in two stages and fusing image features. This reduces hallucination and lowers compute cost compared with running large multimodal LLMs.
Summary TLDR
The paper introduces Multimodal-CoT: a two-stage fine-tuning method that first generates a rationale (explanation) from image+text and then predicts the answer using that rationale plus the image. For T5-based models under 1B parameters, fusing ViT image features into this two-stage pipeline raises ScienceQA accuracy from ~78.6% (two-stage baseline) to 85.3% (base) and to 90.45% (large), reduces hallucinated rationales, and speeds up convergence. Pseudo-rationales generated by large models can replace human-written explanations with a modest drop in accuracy.
Problem Statement
Small language models (<1B params) struggle to use chain-of-thought (CoT) reasoning: generating intermediate rationales can harm final answers because rationales are often hallucinated without visual context. The paper asks whether fusing vision features and separating rationale generation from answer inference fixes this for multimodal QA.
Main Contribution
Define Multimodal-CoT: a two-stage framework that first generates rationales from image+text and then infers answers using those rationales plus the image.
Show that fusing patch-level vision features (ViT) into T5-based models reduces hallucinated rationales and improves accuracy on ScienceQA and A-OKVQA.
Demonstrate that pseudo-rationales generated by large models (InstructBLIP/ChatGPT) can train Multimodal-CoT when human rationales are absent.
Key Findings
Predicting a rationale before the answer hurts small-model accuracy in one-stage text-only training.
Adding vision features in a two-stage setup improves rationale quality and answer accuracy.
Multimodal-CoT achieves state-of-the-art accuracy on ScienceQA among sub-1B models, and accuracy keeps improving when the backbone grows from base to large.
Training with pseudo-rationales from large models is effective with a modest gap to human rationales.
Results
Accuracy
Hallucination correction rate after adding vision
Who Should Care
What To Try In 7 Days
Fine-tune a T5-base model in two stages: (1) generate rationales from image+text, (2) infer answers using rationale+image.
Fuse patch-level ViT features into the text encoder before decoding, not just image captions.
If you lack human rationales, generate pseudo-rationales with a large model (InstructBLIP/ChatGPT) and fine-tune Multimodal-CoT on them.
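The fusion step in the list above can be pictured as attention followed by a gate. The summary does not spell out the fusion operator, so the following is only a minimal sketch under stated assumptions: the patch features are already projected to the text hidden size, a single attention head lets each text token query the patches, and a per-token scalar gate (a simplification) mixes the two streams. All function names and the gate weights are hypothetical, and plain Python lists are used to keep the example dependency-free.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(text_states, patch_feats):
    """Single-head attention: each text token queries the image patches.

    Assumes patch_feats were already projected to the text hidden size.
    """
    scale = math.sqrt(len(patch_feats[0]))
    attended = []
    for query in text_states:
        weights = softmax([dot(query, patch) / scale for patch in patch_feats])
        attended.append([
            sum(w * patch[d] for w, patch in zip(weights, patch_feats))
            for d in range(len(query))
        ])
    return attended


def gated_fuse(text_states, attended, gate_w_text, gate_w_vision):
    """Mix text and attended-vision states with a per-token sigmoid gate.

    gate_w_text and gate_w_vision are hypothetical learned weight vectors.
    """
    fused = []
    for h_text, h_vis in zip(text_states, attended):
        logit = dot(gate_w_text, h_text) + dot(gate_w_vision, h_vis)
        gate = 1.0 / (1.0 + math.exp(-logit))
        fused.append([(1.0 - gate) * t + gate * v for t, v in zip(h_text, h_vis)])
    return fused
```

In a real model the fused states would be handed to the T5 decoder in place of the text-only encoder output, with the gate weights trained jointly with the language model while the ViT extractor stays frozen.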
Optimization Features
System Optimization
- Keep the ViT vision extractor frozen to save training cost
Training Optimization
- Supervised fine-tuning (SFT)
- Use pseudo-rationales from large models to avoid manual annotation
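One way to prepare supervised fine-tuning data for both stages, falling back to a teacher model when no human rationale exists, is sketched below. The `Sample` fields, the prompt layout, the "The answer is ..." target format, and the `teacher` callable are all illustrative assumptions rather than the paper's exact format.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Sample:
    question: str
    context: str
    options: List[str]
    answer: str                      # e.g. "(B)"
    image_id: str
    rationale: Optional[str] = None  # human rationale, if annotated


def format_question(sample: Sample) -> str:
    """Render question, context, and lettered options as one prompt (hypothetical layout)."""
    opts = " ".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(sample.options))
    return f"Question: {sample.question}\nContext: {sample.context}\nOptions: {opts}"


def build_examples(sample: Sample, teacher: Optional[Callable[[Sample], str]] = None):
    """Return (stage1, stage2) training examples for one sample.

    If the sample has no human rationale, `teacher` (a hypothetical wrapper
    around a large model) is asked for a pseudo-rationale instead.
    """
    rationale = sample.rationale
    if rationale is None:
        if teacher is None:
            raise ValueError("no human rationale and no teacher model provided")
        rationale = teacher(sample)

    prompt = format_question(sample)
    # Stage 1: image + text -> rationale.
    stage1 = {"input": prompt, "target": rationale, "image_id": sample.image_id}
    # Stage 2: image + text + rationale -> answer.
    stage2 = {
        "input": f"{prompt}\nRationale: {rationale}",
        "target": f"The answer is {sample.answer}.",
        "image_id": sample.image_id,
    }
    return stage1, stage2
```

Keeping the image reference in both examples reflects the point that the answer stage also sees the image, not just the rationale.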
Inference Optimization
- Separate rationale generation and answer inference to reuse rationales and reduce noise
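Because the two stages are separate, a rationale can be generated once and reused across repeated answer-inference calls. The sketch below shows one possible arrangement with a simple in-memory cache; the class, the two model callables, and their signatures are hypothetical stand-ins for the fine-tuned stage-1 and stage-2 models.

```python
from typing import Callable, Dict, Tuple

# Hypothetical signatures for the two fine-tuned models.
RationaleFn = Callable[[str, str], str]    # (question, image_id) -> rationale
AnswerFn = Callable[[str, str, str], str]  # (question, rationale, image_id) -> answer


class TwoStagePipeline:
    """Runs rationale generation and answer inference as separate steps."""

    def __init__(self, rationale_model: RationaleFn, answer_model: AnswerFn):
        self.rationale_model = rationale_model
        self.answer_model = answer_model
        self._cache: Dict[Tuple[str, str], str] = {}

    def rationale(self, question: str, image_id: str) -> str:
        """Generate the rationale once per (question, image) pair, then reuse it."""
        key = (question, image_id)
        if key not in self._cache:
            self._cache[key] = self.rationale_model(question, image_id)
        return self._cache[key]

    def answer(self, question: str, image_id: str) -> str:
        """Stage 2: infer the answer from question, cached rationale, and image."""
        return self.answer_model(question, self.rationale(question, image_id), image_id)
```

The cache also makes it easy to inspect or filter rationales before they reach the answer stage, which is one way to act on the "reduce noise" point above.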
Reproducibility
Data Urls
- ScienceQA dataset
- A-OKVQA dataset
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- High failure rate on commonsense-heavy questions (map reading, counting).
- Performance still far below top multimodal models with massive parameters on some out-of-domain tasks.
- Relies on annotated rationales or good pseudo-rationales; generated rationales incur a small performance drop.
When Not To Use
- Tasks that demand strong commonsense or symbolic counting without extra knowledge sources.
- Applications that require state-of-the-art general multimodal understanding from very large models (e.g., GPT-4V/Gemini).
Failure Modes
- Hallucinated rationales that mislead final answers when vision features are absent.
- Wrong final answers that persist even when the rationale is correct or empty.
- Commonsense and map-reading errors even when visual fusion is present.
Core Entities
Models
- T5 (FLAN-Alpaca/FLAN-T5/UnifiedQA)
- ViT (frozen patch extractor)
- InstructBLIP
- LLaVA
- LLaMA-Adapter
- GPT-3.5
- GPT-4
- ChatGPT
Metrics
- Accuracy
- ROUGE-L
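For reference, both metrics can be computed in a few lines. This is a minimal sketch assuming exact-match answer accuracy and an F1-style ROUGE-L over lowercased, whitespace-split tokens; the paper's exact tokenization and ROUGE settings are not given here, so treat those choices as assumptions.

```python
def accuracy(predictions, golds):
    """Fraction of predictions that exactly match the gold answers."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L as the F1 of LCS-based precision and recall (assumed variant)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Accuracy applies to the final answers, while ROUGE-L compares a generated rationale with its reference rationale.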
Datasets
- ScienceQA
- A-OKVQA
- MMMU (evaluated zero-shot)
Benchmarks
- ScienceQA
- A-OKVQA
- MMMU

