Two-stage multimodal chain-of-thought lets sub‑1B models reason with images and text

February 2, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.45

Citation Count

96

Authors

Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola

Why It Matters For Business

You can get near-state-of-the-art multimodal reasoning from lightweight models by fine-tuning in two stages and fusing image features. This reduces hallucination and lowers compute cost compared with running large multimodal LLMs.

Summary TLDR

The paper introduces Multimodal-CoT, a two-stage fine-tuning method that first generates a rationale (explanation) from image+text and then predicts the answer from that rationale plus the image. For T5-based models under 1B parameters, fusing ViT image features into this two-stage pipeline raises ScienceQA accuracy from 78.57% (two-stage baseline) to 85.31% (base) and to 90.45% (large), reduces hallucinated rationales, and speeds up convergence. Pseudo-rationales generated by large models can replace human explanations with a modest accuracy drop (87.76% vs 90.45%).

Problem Statement

Small language models (<1B params) struggle to use chain-of-thought (CoT) reasoning: generating intermediate rationales can harm final answers because rationales are often hallucinated without visual context. The paper asks whether fusing vision features and separating rationale generation from answer inference fixes this for multimodal QA.

Main Contribution

Define Multimodal-CoT: a two-stage framework that first generates rationales from image+text and then infers answers using those rationales plus the image.

Show that fusing patch-level vision features (ViT) into T5-based models reduces hallucinated rationales and improves accuracy on ScienceQA and A-OKVQA.

Demonstrate that pseudo-rationales generated by large models (InstructBLIP/ChatGPT) can train Multimodal-CoT when human rationales are absent.
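The summary does not spell out how the patch-level vision features are fused into the text encoder. As a minimal sketch, assuming one common design (single-head attention from text tokens over image patches, followed by a sigmoid gate that mixes the attended vision signal back into each token), the fusion step could look like the following. The function name `fuse`, the gate weights `w_lang` and `w_vis`, and the requirement that patch features are already projected to the text hidden size are all assumptions made for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse(h_lang, h_vis, w_lang, w_vis):
    """Gated fusion sketch (hypothetical, not the authors' code).

    h_lang: n text-token states, each of size d
    h_vis:  m image-patch states, assumed already projected to size d
    w_lang, w_vis: d x d gate weight matrices
    Returns n fused states of size d.
    """
    d = len(h_lang[0])
    fused = []
    for tok in h_lang:
        # Each text token attends over all image patches.
        scores = [sum(t * p for t, p in zip(tok, patch)) / math.sqrt(d)
                  for patch in h_vis]
        weights = softmax(scores)
        attended = [sum(w * patch[j] for w, patch in zip(weights, h_vis))
                    for j in range(d)]
        # Per-dimension gate decides how much vision signal to mix in.
        gate = [sigmoid(sum(tok[i] * w_lang[i][j] + attended[i] * w_vis[i][j]
                            for i in range(d)))
                for j in range(d)]
        fused.append([(1 - g) * t + g * a
                      for g, t, a in zip(gate, tok, attended)])
    return fused

zeros = [[0.0, 0.0], [0.0, 0.0]]
out = fuse([[1.0, 0.0]], [[0.0, 1.0]], zeros, zeros)
print(out)  # [[0.5, 0.5]]
```

With zero gate weights the gate is 0.5 everywhere, so the output is the midpoint between the token state and the attended patch state. In a trained model the gate would be learned, letting the encoder ignore the image where it is uninformative.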

Key Findings

Predicting a rationale before the answer hurts small-model accuracy in one-stage text-only training.

Numbers: No-CoT 81.63% vs Reasoning 69.32% (↓12.31pp)

Adding vision features in a two-stage setup improves rationale quality and answer accuracy.

Numbers: Rationale RougeL 90.73→93.46; answer 78.57%→85.31% (+6.74pp)

Multimodal-CoT achieves state-of-the-art on ScienceQA for sub‑1B models, and accuracy improves further when moving from the base to the large model.

Numbers: Multimodal-CoT Large 90.45% vs prior published best 86.54% (+3.91pp)

Training with pseudo-rationales from large models is effective with a modest gap to human rationales.

Numbers: Multimodal-CoT w/ Generation 87.76% vs w/ Annotation 90.45% (−2.69pp)
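The percentage-point gaps quoted in these findings can be re-derived from the reported accuracies. The short check below uses only numbers stated above; the labels and the helper name `pp_delta` are illustrative.

```python
# (label, reported accuracy %, comparison accuracy %) taken from the findings above
findings = [
    ("Rationale-before-answer vs No-CoT", 69.32, 81.63),
    ("Two-stage with vision vs two-stage baseline", 85.31, 78.57),
    ("Multimodal-CoT Large vs prior published best", 90.45, 86.54),
    ("Pseudo-rationales vs human annotations", 87.76, 90.45),
]

def pp_delta(result, baseline):
    # Difference in percentage points, rounded to the precision of the inputs.
    return round(result - baseline, 2)

deltas = {label: pp_delta(r, b) for label, r, b in findings}
for label, delta in deltas.items():
    print(f"{label}: {delta:+.2f} pp")
```

The four deltas come out to −12.31, +6.74, +3.91, and −2.69 percentage points, matching the figures listed with each finding.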

Results

Accuracy: 90.45% (baseline: prior published best, 86.54%)

Accuracy: 85.31% (baseline: two-stage baseline, 78.57%)

Accuracy: 69.32% (baseline: No-CoT, 81.63%)

Accuracy: 50.57% (baseline: language-only baseline, 47.86%)

Hallucination correction rate after adding vision: 60.7% corrected (baseline: hallucination present in 56% of sampled errors)

Who Should Care

What To Try In 7 Days

Fine-tune a T5-base model in two stages: (1) generate rationales from image+text, (2) infer answers using rationale+image.

Fuse patch-level ViT features into the text encoder before decoding, not just image captions.

If you lack human rationales, generate pseudo-rationales with a large model (InstructBLIP/ChatGPT) and fine-tune Multimodal-CoT on them.
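To make the two-stage setup concrete, here is a minimal sketch of how one dataset record could be turned into a stage-1 (rationale generation) and a stage-2 (answer inference) training example. The `Sample` fields, the prompt template, and the `image_id` key for looking up precomputed ViT features are hypothetical; the summary does not specify the exact input format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Sample:
    question: str
    context: str
    options: List[str]
    rationale: str  # human-written or pseudo-rationale from a large model
    answer: str     # gold answer label, e.g. "A"
    image_id: str   # hypothetical key for precomputed vision features

def format_input(s: Sample, rationale: Optional[str] = None) -> str:
    # Hypothetical text template; stage 2 appends the rationale.
    letters = "ABCDEFGH"
    opts = " ".join(f"({letters[i]}) {o}" for i, o in enumerate(s.options))
    text = f"Question: {s.question}\nContext: {s.context}\nOptions: {opts}"
    if rationale is not None:
        text += f"\nRationale: {rationale}"
    return text

def stage1_example(s: Sample) -> Tuple[str, str, str]:
    # Stage 1: (text input, target rationale, image key)
    return format_input(s), s.rationale, s.image_id

def stage2_example(s: Sample, rationale: str) -> Tuple[str, str, str]:
    # Stage 2: (text input with rationale, target answer, image key)
    return format_input(s, rationale), s.answer, s.image_id

# Made-up record for illustration only.
sample = Sample(
    question="Which object is attracted to a magnet?",
    context="The image shows an iron nail and a cork.",
    options=["nail", "cork"],
    rationale="Iron is a magnetic material, so the nail is attracted.",
    answer="A",
    image_id="img_001",
)
s1 = stage1_example(sample)
s2 = stage2_example(sample, sample.rationale)
print(s1[0])
print(s2[0])
```

Both stages carry the same image key, so the image is available for rationale generation and again for answer inference. Which rationale to feed into stage 2 during training (annotated or stage-1 output) is left as an explicit argument because the summary does not say.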

Optimization Features

System Optimization

  • Keep vision extractor frozen (ViT) to save training cost

Training Optimization

  • SFT
  • Use pseudo-rationales from large models to avoid manual annotation

Inference Optimization

  • Separate rationale generation and answer inference to reuse rationales and reduce noise
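One way to picture the reuse of rationales at inference time is a thin wrapper that caches the stage-1 output and passes it to stage 2. This is a sketch under stated assumptions: `TwoStageRunner`, the `(text, image_id) -> str` model signature, and the prompt suffix are hypothetical, and the two models are stand-in callables rather than real fine-tuned checkpoints.

```python
from typing import Callable, Dict, Tuple

# Hypothetical model interface: (text input, image key) -> generated text
Generate = Callable[[str, str], str]

class TwoStageRunner:
    def __init__(self, rationale_model: Generate, answer_model: Generate):
        self.rationale_model = rationale_model
        self.answer_model = answer_model
        self._cache: Dict[Tuple[str, str], str] = {}

    def rationale(self, question: str, image_id: str) -> str:
        # Generate the rationale once per (question, image) and reuse it.
        key = (question, image_id)
        if key not in self._cache:
            self._cache[key] = self.rationale_model(question, image_id)
        return self._cache[key]

    def answer(self, question: str, image_id: str) -> str:
        # Stage 2 sees the question, the cached rationale, and the image.
        r = self.rationale(question, image_id)
        return self.answer_model(f"{question}\nRationale: {r}", image_id)

# Stand-in models that only record how often stage 1 is called.
calls = {"stage1": 0}

def fake_rationale_model(text: str, image_id: str) -> str:
    calls["stage1"] += 1
    return "The nail is made of iron."

def fake_answer_model(text: str, image_id: str) -> str:
    return "A" if "iron" in text else "B"

runner = TwoStageRunner(fake_rationale_model, fake_answer_model)
first = runner.answer("Which object is magnetic?", "img_001")
second = runner.answer("Which object is magnetic?", "img_001")
print(first, second, calls["stage1"])  # A A 1
```

Keeping the stages separate means a stored rationale can be inspected, logged, or reused across repeated queries without running the first model again.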

Reproducibility

Data Urls

  • ScienceQA dataset
  • A-OKVQA dataset

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • High failure rate on commonsense-heavy questions (map reading, counting).
  • Performance still far below top multimodal models with massive parameters on some out-of-domain tasks.
  • Relies on annotated rationales or good pseudo-rationales; generated rationales incur a small performance drop.

When Not To Use

  • Tasks that demand strong commonsense or symbolic counting without extra knowledge sources.
  • Applications that require state-of-the-art general multimodal understanding from very large models (e.g., GPT-4V/Gemini).

Failure Modes

  • Hallucinated rationales that mislead final answers when vision features are absent.
  • Correct or empty rationales that nevertheless do not change a wrong final answer.
  • Commonsense and map-reading errors even when visual fusion is present.

Core Entities

Models

  • T5 (FLAN-Alpaca/FLAN-T5/UnifiedQA)
  • ViT (frozen patch extractor)
  • InstructBLIP
  • LLaVA
  • LLaMA-Adapter
  • GPT-3.5
  • GPT-4
  • ChatGPT

Metrics

  • Accuracy
  • RougeL

Datasets

  • ScienceQA
  • A-OKVQA
  • MMMU (evaluated zero-shot)

Benchmarks

  • ScienceQA
  • A-OKVQA
  • MMMU