Practical guide: which design choices help when adding image input to LLMs

October 4, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.3

Cost Impact Score

0.6

Citation Count

0

Authors

Utsav Garg, Erhan Bas

Why It Matters For Business

If you need image-aware language models, pick a large frozen vision encoder, train a small vision head, and instruction-tune on diverse data. This gives practical improvements without retraining huge backbones.

Summary TLDR

This paper benchmarks five public recipes for turning a text LLM into a multimodal model (BLIP-2, InstructBLIP, LLaVA, MiniGPT-4, mPLUG-Owl). Main takeaways: bigger frozen vision encoders and training a small vision head help; training the LLM during instruction tuning gives extra gains; data diversity in instruction tuning matters more than sheer size; current multimodal variants still hallucinate and struggle with factuality. Authors back claims with controlled ablations on vision head, encoder size, alignment data, and whether the LLM is fine-tuned.

Problem Statement

How do common architectural and data choices affect the zero-shot and generalization performance of multimodal LLMs? The paper tests vision encoder size, whether to train a vision head, how much alignment/instruction data is needed, and whether to fine-tune the language decoder.

Main Contribution

Systematic comparison of five public multimodal LLM recipes (BLIP-2, InstructBLIP, LLaVA, MiniGPT-4, mPLUG-Owl) across captioning, VQA, MCQ, binary classification, and complex reasoning.

Ablation study isolating effects of vision head training, multimodal vs image-only head, vision encoder size, alignment data size, instruction data size, and LLM fine-tuning.

Practical, actionable recommendations: use larger frozen vision encoders, train a compact vision head, fine-tune the decoder or use adapters, and prioritize diverse instruction tuning data.

Highlighting gaps: multimodal models still hallucinate and current evaluation (GPT-4 judging) has limitations when it does not see images directly.

Key Findings

InstructBLIP (diverse instruction data) performs best across evaluated tasks.

Numbers: InstructBLIP scores 83.3 overall on LLaVA VQA and 123.65 CIDEr on NoCaps (Table 2)

Training a vision head (Q-Former) improves downstream scores versus no head.

Numbers: Overall 80.2 (trained Q-Former) vs 78.5 (no head); +1.7 overall (Table 3)

Using a larger frozen vision encoder (ViT-g) consistently raises performance.

Numbers: Overall 83.9 (ViT-g) vs 80.2 (ViT-L); +3.7 overall (Table 5)

Fine-tuning the language decoder (LLM) during instruction tuning yields large gains.

Numbers: Example: overall 78.5 (LLM trained) vs 66.7 (LLM frozen) in the same setup; +11.8 (Table 8)

Alignment and instruction data size show diminishing returns beyond modest amounts; diversity matters more.

Numbers: Alignment 129M vs 595K: overall 83.1 vs 83.9 (Table 6); Instruction 150K vs 80K: overall 78 vs 78.5 (Table 7)

Open-ended evaluation using GPT-4 has blind spots: the judge does not see images directly.

Numbers: The authors note that GPT-4 ranks text-only answers; they use Balanced Position Calibration and average over 5 generations (Section 2)
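The position-balanced, multi-generation scoring mentioned above can be sketched roughly as follows. This is a minimal illustration, not the authors' evaluation code: the `judge` callable, its signature, and the "candidate score divided by reference score" definition of the relative score are all assumptions made for the example. The idea is that each answer pair is judged in both orders so that any preference for the first-listed answer cancels out, and the scores are averaged over several judge generations.

```python
from statistics import mean
from typing import Callable, Tuple

# Hypothetical judge interface (an assumption, not a real API):
# (question, first_answer, second_answer) -> (score_first, score_second), each on a 1-10 scale.
Judge = Callable[[str, str, str], Tuple[float, float]]


def balanced_relative_score(judge: Judge, question: str, candidate: str,
                            reference: str, n_generations: int = 5) -> float:
    """Score candidate vs reference in both orders, averaged over n_generations judge calls."""
    cand_scores, ref_scores = [], []
    for _ in range(n_generations):
        # Order 1: candidate shown first.
        cand, ref = judge(question, candidate, reference)
        cand_scores.append(cand)
        ref_scores.append(ref)
        # Order 2: reference shown first, so a first-position bonus hits the other side.
        ref, cand = judge(question, reference, candidate)
        cand_scores.append(cand)
        ref_scores.append(ref)
    # Assumed definition of "relative score": candidate mean as a percentage of reference mean.
    return 100.0 * mean(cand_scores) / mean(ref_scores)


def toy_biased_judge(question: str, first: str, second: str) -> Tuple[float, float]:
    """Stand-in judge: both answers deserve 7, but the first-listed one gets +1."""
    return 8.0, 7.0


naive = 100.0 * 8.0 / 7.0  # single call, candidate listed first
balanced = balanced_relative_score(toy_biased_judge, "What is in the image?", "a cat", "a cat")
print(f"naive={naive:.1f} balanced={balanced:.1f}")
```

With the toy judge, the single-order score is inflated by the position bonus, while the balanced score treats the two identical answers as equal. A real judge is also noisy, which is what the averaging over generations is meant to dampen.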

Results

LLaVA VQA (Overall, GPT-4 relative score)

Value: InstructBLIP 83.3

Baseline: LLaVA 78.0

NoCaps (CIDEr)

Value: InstructBLIP 123.65

Baseline: LLaVA 67.75

Accuracy

Value: InstructBLIP 59.49%

Baseline: LLaVA 34.8%

Effect of vision encoder size (LLaVA VQA overall)

Value: ViT-g overall 83.9 vs ViT-L 80.2

Baseline: ViT-L

Effect of training vision head (LLaVA VQA overall)

Value: Trained Q-Former overall 80.2 vs no head 78.5

Baseline: no head
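To compare the ablations at a glance, the overall scores quoted in Key Findings can be turned into deltas and ranked. The short sketch below only re-does that arithmetic. It assumes the paired numbers are listed in the same order as the paired settings (for example, 129M alignment samples gives 83.1 and 595K gives 83.9). The ablations come from different setups, so the ranking is a rough guide rather than a controlled comparison.

```python
# Overall scores as quoted in the summary: (first setting, second setting).
reported = {
    "Decoder: LLM trained vs frozen (Table 8)": (78.5, 66.7),
    "Vision encoder: ViT-g vs ViT-L (Table 5)": (83.9, 80.2),
    "Vision head: trained Q-Former vs no head (Table 3)": (80.2, 78.5),
    "Alignment data: 129M vs 595K (Table 6)": (83.1, 83.9),
    "Instruction data: 150K vs 80K (Table 7)": (78.0, 78.5),
}

# Delta = first setting minus second setting, rounded to the precision of the inputs.
deltas = {name: round(first - second, 1) for name, (first, second) in reported.items()}

# Largest gain first; negative values mean the first-listed setting scored lower.
ranking = sorted(deltas.items(), key=lambda item: item[1], reverse=True)

for name, delta in ranking:
    print(f"{delta:+5.1f}  {name}")
```

The two data-size rows come out slightly negative, which is consistent with the finding that more alignment or instruction data alone did not help in these runs.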

Who Should Care

What To Try In 7 Days

Swap in a larger frozen image encoder (ViT-g) and benchmark; expect a few points improvement.

Add and train a compact vision head (Q-Former) rather than passing raw patches.

If feasible, fine-tune the decoder; otherwise add LoRA adapters for multimodal mode only.
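To give a feel for why LoRA adapters are the cheaper fallback to full decoder fine-tuning, here is a small worked calculation of trainable parameters for a single square projection matrix. The hidden size of 4096 matches LLaMA-7B; the rank of 8 is an illustrative assumption and is not taken from the paper.

```python
def full_matrix_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole weight matrix is fine-tuned."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in a LoRA update W + B @ A, with A of shape (rank, d_in) and B of shape (d_out, rank)."""
    return rank * (d_in + d_out)


hidden = 4096  # LLaMA-7B hidden size
rank = 8       # illustrative rank (assumption)

full = full_matrix_params(hidden, hidden)
lora = lora_params(hidden, hidden, rank)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

Under these assumptions the adapter trains well under 1% of the parameters of the matrix it wraps. The summary does not say how much of the fine-tuning gain adapters recover, so that still needs to be benchmarked.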

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluations rely on GPT-4 ranking that does not see images directly and can be noisy.
  • Models still hallucinate visual content and may assert nonexistent objects.
  • Results use public checkpoints and specific training recipes; other checkpoints could differ.
  • Data diversity, not just size, drives generalization; coverage gaps remain.

When Not To Use

  • For high-stakes factual decisions without human verification, because of hallucination risk.
  • If you cannot supply diverse multimodal instruction data but still expect broad out-of-distribution (OOD) generalization.
  • When deployment cost prevents using larger vision encoders.

Failure Modes

  • Hallucinating objects or attributes not present in the image.
  • Overfitting to task formats seen during instruction tuning and failing OOD.
  • Misleading evaluation from text-only judges leading to cherry-picked improvements.

Core Entities

Models

  • BLIP-2
  • InstructBLIP
  • LLaVA
  • MiniGPT-4
  • mPLUG-Owl
  • Vicuna-7B
  • LLaMA-7B
  • Q-Former
  • ViT-L
  • ViT-g
  • Perceiver Resampler

Metrics

  • GPT-4 relative score (1-10 ranking)
  • CIDEr
  • Accuracy
  • Log-likelihood (MCQ)
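A common way to use log-likelihood for multiple-choice questions is to score every answer option under the model and pick the highest-scoring one. The sketch below shows that selection step only. The `log_likelihood` callable and the toy score table are hypothetical stand-ins for a real model, and the summary does not state whether the paper length-normalizes scores, so none is applied here.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer (an assumption): log p(option | prompt) under the model.
Scorer = Callable[[str, str], float]


def pick_option(log_likelihood: Scorer, prompt: str,
                options: Sequence[str]) -> Tuple[int, List[float]]:
    """Return the index of the highest-scoring option and all option scores."""
    scores = [log_likelihood(prompt, option) for option in options]
    best = max(range(len(options)), key=lambda i: scores[i])
    return best, scores


def accuracy(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Fraction of questions where the picked option matches the answer key."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Toy stand-in for a model: fixed log-likelihoods per (prompt, option) pair.
TOY_SCORES = {
    ("Which object is left of the cup?", "spoon"): -1.2,
    ("Which object is left of the cup?", "plate"): -3.5,
    ("What color is the bus?", "red"): -2.0,
    ("What color is the bus?", "blue"): -0.7,
}


def toy_scorer(prompt: str, option: str) -> float:
    return TOY_SCORES[(prompt, option)]


questions = [
    ("Which object is left of the cup?", ["spoon", "plate"], 0),
    ("What color is the bus?", ["red", "blue"], 0),
]
predictions = [pick_option(toy_scorer, q, opts)[0] for q, opts, _ in questions]
answers = [answer for _, _, answer in questions]
print(predictions, accuracy(predictions, answers))
```

Because this procedure never asks the model to generate free text, it avoids answer-parsing errors, though it can favor shorter options when scores are not length-normalized.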

Datasets

  • LLaVA VQA (LLaVA-150K)
  • NoCaps (val)
  • ScienceQA (image subset)
  • Visual Spatial Reasoning (VSR)
  • COCO
  • CC3M
  • LAION

Benchmarks

  • LLaVA VQA
  • NoCaps
  • ScienceQA (Image)
  • VSR