Two-stage multimodal chain-of-thought lets sub‑1B models reason with images and text

Overview

Decision Snapshot: Ready For Pilot

Two-stage fine-tuning with fused vision features is a practical, medium-cost way to get reliable multimodal rationales on sub‑1B models; results are consistent across ablations and two benchmarks but commonsense/counting limits remain.

Citations: 96

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 60%

Authors

Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can get near-state-of-the-art multimodal reasoning from lightweight models by fine-tuning in two stages and fusing image features. This reduces hallucinated rationales and lowers compute cost compared with running large multimodal LLMs.

Who Should Care

ML Engineer, Product Manager, Engineering Lead, Data Scientist

Summary TLDR

The paper introduces Multimodal-CoT: a two-stage fine-tuning method that first generates a rationale (explanation) from image+text and then predicts the answer using that rationale plus the image. For T5-based models under 1B parameters, fusing ViT image features into this two-stage pipeline raises ScienceQA accuracy from 78.57% (two-stage baseline) to 85.31% (base) and to 90.45% (large), reduces hallucinated rationales, and speeds up convergence. Pseudo-rationales generated by large models can replace human-written explanations at the cost of a modest accuracy drop.

Problem Statement

Small language models (<1B params) struggle to use chain-of-thought (CoT) reasoning: generating intermediate rationales can harm final answers because rationales are often hallucinated without visual context. The paper asks whether fusing vision features and separating rationale generation from answer inference fixes this for multimodal QA.

Main Contribution

Define Multimodal-CoT: a two-stage framework that first generates rationales from image+text and then infers answers using those rationales plus the image.

Show that fusing patch-level vision features (ViT) into T5-based models reduces hallucinated rationales and improves accuracy on ScienceQA and A-OKVQA.
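The data flow of the two stages can be sketched in a few lines. This is a minimal illustration, not the mm-cot code: the `Example` type, the prompt layout in `format_input`, and the idea of a model as a plain callable taking text plus image features are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical stand-ins: a "model" is any callable that maps
# (text, image_features) to generated text. Not the mm-cot API.
ImageFeatures = Sequence[Sequence[float]]
Generator = Callable[[str, ImageFeatures], str]


@dataclass
class Example:
    question: str
    context: str
    options: Sequence[str]
    image_features: ImageFeatures  # e.g. patch-level features from a frozen ViT


def format_input(ex: Example) -> str:
    """Flatten the textual fields into one prompt (layout is an assumption)."""
    opts = " ".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(ex.options))
    return f"Question: {ex.question}\nContext: {ex.context}\nOptions: {opts}"


def two_stage_answer(
    ex: Example, rationale_model: Generator, answer_model: Generator
) -> Tuple[str, str]:
    text = format_input(ex)
    # Stage 1: generate a rationale from text + image features.
    rationale = rationale_model(text, ex.image_features)
    # Stage 2: infer the answer from text + generated rationale + image features.
    answer = answer_model(f"{text}\nRationale: {rationale}", ex.image_features)
    return rationale, answer
```

Because the two models are separate callables, the rationale returned by stage 1 can be stored, inspected, or swapped out before stage 2 runs.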

Key Findings

Predicting a rationale before the answer hurts small-model accuracy in one-stage text-only training.

Numbers: No-CoT 81.63% vs Reasoning 69.32% (↓12.31pp)

Practical Use: Don't force CoT generation on sub‑1B text-only models; it can reduce accuracy unless multimodal signals or a two-stage setup are used.

Evidence Ref: Table 2

Adding vision features in a two-stage setup improves rationale quality and answer accuracy.

Numbers: Rationale RougeL 90.73→93.46; answer 78.57%→85.31% (+6.74pp)

Practical Use: Fuse image features (patch-level ViT) with text when training CoT: you get more faithful rationales and substantially better answers on multimodal QA.

Evidence Ref: Table 3, Figure 3
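For readers unfamiliar with the rationale-quality metric, ROUGE-L scores a generated rationale by its longest common subsequence (LCS) with a reference. The sketch below is a simplified version that assumes lowercased whitespace tokenization and the balanced F1 form; the exact scorer settings behind the reported numbers are not given in this summary, and the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 on whitespace tokens (tokenization is an assumption)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A high ROUGE-L means the generated rationale shares long in-order word sequences with the reference; it does not by itself guarantee that the rationale is factually grounded in the image.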

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	90.45%	Prior published best 86.54%	+3.91pp	ScienceQA test	Table 4 shows Multimodal-CoT Large achieves 90.45% vs prior 86.54%	Table 4
Accuracy	85.31%	Two-stage baseline 78.57%	+6.74pp	ScienceQA test	Table 3 and Table 4 show base model with vision fusion at 85.31%	Table 3; Table 4
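The deltas in the table are absolute differences in percentage points (pp), which are easy to confuse with relative gains. The short sketch below recomputes both from the table values; the helper names are illustrative.

```python
def delta_pp(value, baseline):
    """Absolute change in percentage points."""
    return round(value - baseline, 2)


def relative_gain(value, baseline):
    """Relative change, as a percentage of the baseline."""
    return round(100.0 * (value - baseline) / baseline, 2)


# (label, value %, baseline %) taken from the results table above.
RESULTS = [
    ("Multimodal-CoT Large vs prior published best", 90.45, 86.54),
    ("Base with vision fusion vs two-stage baseline", 85.31, 78.57),
]

for label, value, baseline in RESULTS:
    print(
        f"{label}: {delta_pp(value, baseline):+.2f}pp, "
        f"{relative_gain(value, baseline):+.2f}% relative"
    )
```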

What To Try In 7 Days

Fine-tune a T5-base model in two stages: (1) generate rationales from image+text, (2) infer answers using rationale+image.

Fuse patch-level ViT features into the text encoder before decoding, not just image captions.

If you lack human rationales, generate pseudo-rationales with a large model (InstructBLIP/ChatGPT) and fine-tune Multimodal-CoT on them.
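The fusion step in the list above can be illustrated with a gated cross-attention sketch in plain Python. This summary does not spell out the fusion operator, so the design here is an assumption: each text token attends over the image patches, and a sigmoid gate blends the attended image vector back into the token. The scalar per-token gate, the weight vectors `w_text` and `w_image`, and the requirement that patches are already projected to the text hidden size are simplifications for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(text_tokens, image_patches):
    """Scaled dot-product attention: text tokens are queries, patches are keys and values."""
    dim = len(text_tokens[0])
    attended = []
    for query in text_tokens:
        weights = softmax([dot(query, p) / math.sqrt(dim) for p in image_patches])
        attended.append(
            [sum(w * p[i] for w, p in zip(weights, image_patches)) for i in range(dim)]
        )
    return attended


def gated_fuse(text_tokens, image_patches, w_text, w_image):
    """Blend each text token with its attended image vector via a sigmoid gate."""
    attended = cross_attend(text_tokens, image_patches)
    fused = []
    for h_text, h_img in zip(text_tokens, attended):
        gate = 1.0 / (1.0 + math.exp(-(dot(w_text, h_text) + dot(w_image, h_img))))
        fused.append([(1.0 - gate) * t + gate * v for t, v in zip(h_text, h_img)])
    return fused
```

The output has one vector per text token, so it can stand in for the text encoder output that the decoder consumes. In a real model the gate weights would be learned and the computation would run as batched tensor operations.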

Optimization Features

System Optimization

Keep vision extractor frozen (ViT) to save training cost

Training Optimization

SFT: Use pseudo-rationales from large models to avoid manual annotation
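One way to turn annotated or pseudo-rationale records into supervised targets for the two stages is sketched below. The record keys and prompt layout are hypothetical. Whether stage 2 is trained on reference rationales or on stage-1 outputs is a design choice this summary does not settle, so the sketch simply uses whatever rationale is attached to the record. Image features would be passed alongside both inputs.

```python
def build_training_pairs(record):
    """Return (stage1, stage2) text pairs from one record.

    `record` is a hypothetical dict with keys: question, options,
    rationale (human-written or pseudo), and answer.
    """
    prompt = f"Question: {record['question']}\nOptions: {', '.join(record['options'])}"
    # Stage 1 learns to produce the rationale.
    stage1 = {"input": prompt, "target": record["rationale"]}
    # Stage 2 learns to produce the answer given the prompt plus a rationale.
    stage2 = {
        "input": f"{prompt}\nRationale: {record['rationale']}",
        "target": record["answer"],
    }
    return stage1, stage2
```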

Inference Optimization

Separate rationale generation and answer inference to reuse rationales and reduce noise

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/amazon-science/mm-cot

Data URLs

ScienceQA dataset

A-OKVQA dataset

Risks & Boundaries

Limitations

High failure rate on commonsense-heavy questions (map reading, counting).

Performance remains far below that of the largest multimodal models on some out-of-domain tasks.

When Not To Use

Tasks that demand strong commonsense or symbolic counting without extra knowledge sources.

Applications that require state-of-the-art general multimodal understanding from very large models (e.g., GPT-4V/Gemini).

Failure Modes

Hallucinated rationales that mislead final answers when vision features are absent.

Correct or empty rationales that nevertheless do not change a wrong final answer.

Core Entities

Models

T5 (FLAN-Alpaca/FLAN-T5/UnifiedQA), ViT (frozen patch extractor), InstructBLIP, LLaVA, LLaMA-Adapter, GPT-3.5, GPT-4, ChatGPT

Metrics

Accuracy, RougeL

Datasets

ScienceQA, A-OKVQA, MMMU (evaluated zero-shot)

Benchmarks

ScienceQA, A-OKVQA, MMMU
