You can get near-state-of-the-art multimodal reasoning from lightweight models by fine-tuning in two stages and fusing image features. This reduces hallucination and lowers compute cost compared with running large multimodal LLMs.
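As a rough illustration of the two-stage idea, the sketch below assumes the first stage generates a rationale and the second stage answers with that rationale appended to the input, with image features passed to both. That split is an assumption read from the finding in this summary, not a confirmed description of the method. The names `two_stage_answer`, `toy_rationale_model`, and `toy_answer_model` are hypothetical stand-ins, and the toy models only mimic the data flow.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface: a model maps (text, image features) to generated text.
Model = Callable[[str, Sequence[float]], str]


def two_stage_answer(
    question: str,
    image_features: Sequence[float],
    rationale_model: Model,
    answer_model: Model,
) -> Tuple[str, str]:
    """Run rationale generation, then answer inference (assumed staging)."""
    # Stage 1: produce a rationale from the question and the image features.
    rationale = rationale_model(question, image_features)
    # Stage 2: append the rationale to the question and infer the answer,
    # again conditioning on the same image features.
    answer_input = f"{question}\nRationale: {rationale}"
    answer = answer_model(answer_input, image_features)
    return rationale, answer


# Toy stand-ins so the sketch runs; real models would be fine-tuned networks.
def toy_rationale_model(text: str, feats: Sequence[float]) -> str:
    return "the magnets face opposite poles"


def toy_answer_model(text: str, feats: Sequence[float]) -> str:
    # Answers "attract" only if a rationale mentioning opposite poles is present.
    return "attract" if "opposite poles" in text else "unknown"


rationale, answer = two_stage_answer(
    "Will these magnets attract or repel?",
    [0.2, 0.7],
    toy_rationale_model,
    toy_answer_model,
)
```

The point of the separation is that the answer stage sees a finished rationale as input instead of having to produce rationale and answer in one pass.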
Key finding
Predicting a rationale before the answer hurts small-model accuracy in one-stage text-only training.
Numbers: No-CoT 81.63% vs. Reasoning 69.32% accuracy (↓12.31 pp)
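The gap is stated in percentage points, which is easy to confuse with a relative drop. A small sketch of both calculations from the two accuracies above (the function names are illustrative only):

```python
def point_drop(baseline: float, new: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return round(baseline - new, 2)


def relative_drop(baseline: float, new: float) -> float:
    """Drop expressed as a percentage of the baseline value."""
    return round(100.0 * (baseline - new) / baseline, 2)


no_cot = 81.63      # accuracy (%) without a rationale
reasoning = 69.32   # accuracy (%) when a rationale is predicted first

pp = point_drop(no_cot, reasoning)      # 12.31 percentage points
rel = relative_drop(no_cot, reasoning)  # about 15.08% of the baseline
```

So the 12.31 pp gap corresponds to losing roughly 15% of the No-CoT accuracy in relative terms.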

