Train a vision-language model to read and reason across many images in one prompt

September 14, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

18

Authors

Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, Baobao Chang

Why It Matters For Business

If your product must reason over multiple images together (multi-photo chat, visual QA over albums, video snapshots), MMICL-style models reduce hallucinations and improve multi-image reasoning by adding explicit image tokens and multi-image instruction tuning.

Summary TLDR

MMICL is a method and dataset to give vision-language models (VLMs) true multi-image in‑context learning. Key ideas: (1) treat image and text tokens the same and feed them interleaved into an LLM; (2) add explicit image declaration tokens like “[IMG3]” so text can reference images unambiguously; (3) build a 5.8M-sample MIC dataset with multi-image, temporally/spatially linked examples and varied instruction templates. On many benchmarks MMICL improves multi-image reasoning and reduces language bias (e.g., +13 points on Winoground-style compositionality and +12 on RAVEN). Code and dataset are released.
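The image-declaration idea can be illustrated with a small prompt builder. This is a minimal sketch, not the authors' code: the function names, the 1-based numbering, and the one-declaration-per-line layout are all assumptions made for illustration. It covers only the text side; in a real model each declared token would be paired with that image's visual embeddings.

```python
import re


def declare_images(num_images, start=1):
    """Return one declaration string per image, e.g. 'image 1 is [IMG1]'."""
    return [f"image {j} is [IMG{j}]" for j in range(start, start + num_images)]


def build_prompt(num_images, question, start=1):
    """Prepend the image declarations to the question, one per line.

    The line-per-declaration layout is an assumption of this sketch.
    """
    return "\n".join(declare_images(num_images, start) + [question])


def undeclared_references(num_images, question, start=1):
    """List [IMGj] indices used in the question that were never declared."""
    declared = set(range(start, start + num_images))
    used = {int(j) for j in re.findall(r"\[IMG(\d+)\]", question)}
    return sorted(used - declared)


if __name__ == "__main__":
    print(build_prompt(2, "Which object appears in [IMG1] but not in [IMG2]?"))
    print(undeclared_references(2, "Compare [IMG1] with [IMG3]."))
```

Checking for undeclared references before sending a prompt is a cheap way to catch text that points at an image the model was never given.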

Problem Statement

Current VLMs struggle with user prompts that mix multiple images and text. They fail to (1) link words to specific images, (2) reason about spatial/temporal/logical relations across images, and (3) learn from multi-image in‑context examples. This limits zero-shot/few-shot performance on complex vision–language tasks.

Main Contribution

Model: MMICL architecture that treats image and text embeddings equally and feeds interleaved image-text tokens into a frozen LLM.

Context scheme: explicit image declarations (e.g., “image j is [IMGj]”) plus image proxy tokens to make text-to-image references precise.

Dataset: MIC, a 5.8M-sample multi-modal in-context learning dataset built from 16 training datasets (8 categories) and 18 test datasets (10 categories).

Training recipe: two-stage training. Stage I aligns visual features with the LLM using a Q-Former. Stage II performs multi-modal in-context tuning on MIC, keeping the image encoder and the LLM frozen and tuning only the projection layer and the query/value vectors.

Empirical: state-of-the-art average scores on broad VLM benchmarks (MME, MMBench) and large gains on multi-image reasoning benchmarks (Winoground, RAVEN).
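The stage II freezing rule from the training recipe can be sketched as a filter over parameter names. This is a hedged illustration, not the released training code: the parameter names (`projection.`, `.attn.q.`, `.attn.v.`) are hypothetical, and treating everything else, including the Q-Former, as frozen is an assumption of this sketch.

```python
def is_trainable(param_name):
    """Decide whether a parameter is tuned in stage II under this sketch.

    Hypothetical naming: the projection layer lives under 'projection.',
    and attention query/value weights contain '.attn.q.' or '.attn.v.'.
    """
    return (
        param_name.startswith("projection.")
        or ".attn.q." in param_name
        or ".attn.v." in param_name
    )


def split_params(param_names):
    """Partition parameter names into (trainable, frozen) lists."""
    trainable = [n for n in param_names if is_trainable(n)]
    frozen = [n for n in param_names if not is_trainable(n)]
    return trainable, frozen


if __name__ == "__main__":
    names = [
        "vision_encoder.block0.weight",
        "qformer.layer0.weight",
        "projection.weight",
        "llm.layer0.attn.q.weight",
        "llm.layer0.attn.k.weight",
        "llm.layer0.attn.v.weight",
        "llm.layer0.mlp.weight",
    ]
    trainable, frozen = split_params(names)
    print("trainable:", trainable)
    print("frozen:", frozen)
```

In a deep learning framework, the same predicate would typically drive a per-parameter gradient flag, so only the selected weights receive updates.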

Key Findings

MMICL improves matching of captions to images on compositional image/text puzzles (Winoground).

Numbers: Text 45 / Image 45 / Group 43 (MMICL FLAN-T5-XXL, Table 2)

MMICL raises nonverbal multi-image reasoning accuracy on the RAVEN IQ test.

Numbers: RAVEN accuracy 34% vs 22% for best baseline (Table 3, +12 points)

MMICL achieves the best reported average on the MME benchmark (per authors' comparisons).

Numbers: MME Total Avg 129.33 (MMICL, Table 1)

MIC dataset scale and fine-tuning budget used.

Numbers: MIC built with 5.8M samples; authors used ~10% for fine-tuning (Section 2.3)

MMICL reduces language bias on ScienceQA-IMG.

Numbers: Average 82.1%; gap between questions that need the image and those that do not = 0.9 (Table 5)
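For readers unfamiliar with the Winoground Text / Image / Group scores quoted in the findings, the sketch below shows how they are conventionally computed. Each example has two captions (c0, c1) and two images (i0, i1), with c0 matching i0 and c1 matching i1. The dictionary keys and the toy similarity values are hypothetical; how a given model produces the similarity scores is outside this sketch.

```python
def winoground_scores(examples):
    """Compute Winoground text, image, and group scores as percentages.

    Each example is a dict of model similarity scores with hypothetical
    keys 'c0_i0', 'c0_i1', 'c1_i0', 'c1_i1' (caption x image).
    """
    text_hits = image_hits = group_hits = 0
    for s in examples:
        # Text score: for each image, the matching caption must score higher.
        text_ok = s["c0_i0"] > s["c1_i0"] and s["c1_i1"] > s["c0_i1"]
        # Image score: for each caption, the matching image must score higher.
        image_ok = s["c0_i0"] > s["c0_i1"] and s["c1_i1"] > s["c1_i0"]
        text_hits += text_ok
        image_hits += image_ok
        # Group score: both conditions must hold on the same example.
        group_hits += text_ok and image_ok
    n = len(examples)
    return {
        "text": 100.0 * text_hits / n,
        "image": 100.0 * image_hits / n,
        "group": 100.0 * group_hits / n,
    }


if __name__ == "__main__":
    toy = [
        {"c0_i0": 0.9, "c0_i1": 0.1, "c1_i0": 0.2, "c1_i1": 0.8},
        {"c0_i0": 0.6, "c0_i1": 0.7, "c1_i0": 0.3, "c1_i1": 0.9},
    ]
    print(winoground_scores(toy))
```

Because the group score requires both conditions on the same example, it can never exceed the lower of the text and image scores.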

Results

Winoground group score (MMICL FLAN-T5-XXL)

Value: 43

Baseline: best previous group score ~39 (GPT4-V 39.25)

RAVEN accuracy

Value: 34%

Baseline: KOSMOS-1 22%

MME Total Avg

Value: 129.33

Baseline: other VLMs score lower (see Table 1)

MMBench overall score

Value: 65.24

Baseline: JiuTian 64.7

ScienceQA-IMG average

Value: 82.1

Baseline: InstructBLIP 71.3

MIC dataset size

Value: 5.8M samples (about 10% used for fine-tuning, per Section 2.3)

What To Try In 7 Days

Add simple image-proxy tokens to your prompt format (e.g., “image 2 is [IMG2]”) and test whether the model maps mentions to images better.

Construct a small in-context dataset of linked images (frames or crops) and few-shot tune the projection/query-value layers while freezing the LLM.

Evaluate language bias by splitting your QA data into questions that require the image and questions that do not, then comparing accuracy on the two splits; use image-declaration tuning to narrow the gap.
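The language-bias check in the last item can be sketched as a simple split-and-compare. This is an illustrative sketch only: the record fields `needs_image` and `correct` are hypothetical names for labels you would have to supply from your own QA data.

```python
def accuracy(records):
    """Percentage of records whose 'correct' flag is true."""
    if not records:
        raise ValueError("cannot compute accuracy on an empty split")
    return 100.0 * sum(bool(r["correct"]) for r in records) / len(records)


def language_bias_gap(records):
    """Compare accuracy on image-dependent vs image-independent questions.

    Each record is a dict with hypothetical boolean fields
    'needs_image' and 'correct'.
    """
    needs = [r for r in records if r["needs_image"]]
    no_need = [r for r in records if not r["needs_image"]]
    acc_needs = accuracy(needs)
    acc_no_need = accuracy(no_need)
    return {
        "needs_image": acc_needs,
        "no_image_needed": acc_no_need,
        "gap": abs(acc_needs - acc_no_need),
    }


if __name__ == "__main__":
    data = (
        [{"needs_image": True, "correct": c} for c in (True, True, True, False)]
        + [{"needs_image": False, "correct": True} for _ in range(4)]
    )
    print(language_bias_gap(data))
```

A large gap in favor of the image-independent split suggests the model is leaning on text priors rather than the images.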

Optimization Features

Training Optimization

  • two-stage training: stage I aligns visual features with the LLM via the Q-Former; stage II performs multi-modal ICL tuning (Sec. 2.4)

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Context length: the backbone LLM's context window limits the number of images per prompt; the authors used up to eight images per instance (Sec. T.2).
  • Training budget: the authors fine-tuned on ~10% of MIC because of compute limits; full-dataset results are unknown (Sec. 2.3).
  • Architecture scope: experiments use FLAN-T5 (encoder-decoder); effects on decoder-only LMs are not explored (Sec. T.3).
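The context-length limitation comes down to simple token budgeting. The sketch below estimates how many images fit in a prompt; every number in it (a 512-token context, 200 text tokens, 32 visual tokens per image, 6 tokens per declaration) is an illustrative assumption, not a figure from the paper.

```python
def max_images(context_tokens, text_tokens, tokens_per_image, declaration_tokens=0):
    """Estimate how many images fit in the remaining context budget.

    Each image is assumed to cost its visual tokens plus the tokens of its
    declaration (e.g. 'image 1 is [IMG1]'). All inputs are assumptions you
    must measure for your own tokenizer and visual front end.
    """
    per_image = tokens_per_image + declaration_tokens
    if per_image <= 0:
        raise ValueError("per-image token cost must be positive")
    remaining = context_tokens - text_tokens
    if remaining <= 0:
        return 0
    return remaining // per_image


if __name__ == "__main__":
    # Illustrative numbers only: 512-token context, 200 tokens of text,
    # 32 visual tokens and 6 declaration tokens per image.
    print(max_images(512, 200, 32, 6))
```

Running this kind of estimate before building multi-image prompts shows early whether a use case needs a longer-context backbone or fewer tokens per image.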

When Not To Use

  • When you need to support an unbounded number of images per prompt without model/context changes.
  • When your deployment uses a decoder-only LLM and you cannot adapt projection/query-value layers similarly.
  • When you cannot afford multi-stage tuning or lack curated multi-image examples.

Failure Modes

  • Residual hallucination: image declarations reduce but do not eliminate hallucinated objects (the paper evaluates hallucination but does not claim elimination).
  • Dependence on MIC quality: automated template rewriting and ChatGPT instruction refinement could introduce biases or noisy instructions.
  • Backbone dependence: improvements are shown with FLAN-T5 XL/XXL; transfer to other LLM families may need extra tuning.

Core Entities

Models

  • MMICL
  • BLIP-2
  • InstructBLIP
  • Flamingo
  • KOSMOS-1
  • Otter
  • Shikra

Metrics

  • Accuracy
  • Winoground image/text/group scores
  • MMBench overall score

Datasets

  • MIC (5.8M samples)
  • COCO
  • Flickr30K
  • VQAv2
  • VCR
  • MSRVTT
  • MSRVTT-QA
  • Winoground
  • RAVEN
  • ScienceQA-IMG
  • VizWiz

Benchmarks

  • MME
  • MMBench
  • Winoground
  • RAVEN
  • POPE
  • MM-VET