Teach vision-language models to reason about user-pointed image regions using an LLM-distilled 1M corpus

December 8, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

1

Authors

Jae Sung Park, Jack Hessel, Khyathi Raghavi Chandu, Paul Pu Liang, Ximing Lu, Peter West, Youngjae Yu, Qiuyuan Huang, Jianfeng Gao, Ali Farhadi, Yejin Choi

Links

Abstract / PDF

Why It Matters For Business

LSKD lets products accept user-pointed regions (tap/click) instead of long referring text, improving region-level answers and reducing UX friction for multimodal apps while using existing VL architectures.

Summary TLDR

The paper builds a pipeline (LSKD) that uses a large language model (ChatGPT) to generate localized commonsense Q+A+R triples anchored to image regions (IDs or region descriptions). A learned critic filters low-quality generated items. Training vision-language models (BLIP-2 variants) on the resulting 1M localized instances improves zero-shot performance on several region-focused benchmarks (VCR, Sherlock, VisualCOMET) and transfers some gains to non-localized tasks. The method requires no architecture change; only Q-Former finetuning is used.

Problem Statement

Current vision-language interfaces accept full images but not direct 'pointed' region references. Asking users to write precise referring expressions is cumbersome and error-prone. The paper asks: can we cheaply create localized commonsense data (region-aware Q/A/rationales) from an LLM, filter it, and distill that knowledge into models so they accept regions-as-input?

Main Contribution

LSKD pipeline: automated verbalization -> LLM sampling -> supervised critic filtering -> finetune student VL model.

Localized Commonsense Knowledge Corpus: ~1M Q/A/rationale triples over ~169K images with region IDs or region descriptions.

Zero-shot state-of-the-art on several localized visual reasoning benchmarks after distillation.

Human evaluations showing that students with a large language backbone can match or beat the teacher in informativeness.

Key Findings

Large localized corpus (machine-generated) improves region-based zero-shot accuracy.

Numbers: VCR Q→AR: 28.0→33.4 (+5.4 points); Sherlock: 19.5→29.7 (+10.2 points)

Distillation improves non-localized tasks too.

Numbers: SNLI-VE: 33.4→40.3 (+6.9 points); Visual7W: 77.1→79.5 (+2.4 points)

Filtering generated data with a supervised critic sharply raises human acceptability.

Numbers: Human acceptance rate rises from ~45% (raw) to ~70% when keeping the top 20% by critic score (threshold ≈0.8)

Scaling synthetic corpus helps: 150K→1M instances yields consistent gains.

Numbers: LSKD 150K vs 1M: consistent improvements across evaluated tasks (Table 4)

Student model size and language backbone matter for generative quality.

Numbers: Human overall preference: Mini-GPT4+LSKD 49.1% vs ChatGPT+verbalizers 45%; BLIP-2(FlanT5)+LSKD 41.2% vs ChatGPT 45%

Results

VCR Q → A (zero-shot)

Value: 59.0% (BLIP-2 ViT-G + LSKD)

Baseline: 56.1% (BLIP-2 ViT-G)

VCR QA → R (zero-shot)

Value: 56.4% (BLIP-2 ViT-G + LSKD)

Baseline: 49.8% (BLIP-2 ViT-G)

VCR Q → AR (zero-shot)

Value: 33.4% (BLIP-2 ViT-G + LSKD)

Baseline: 28.0% (BLIP-2 ViT-G)

Sherlock Comparison (zero-shot)

Value: 29.7% (BLIP-2 ViT-G + LSKD)

Baseline: 19.5% (BLIP-2 ViT-G)

VisualCOMET Acc@50 (zero-shot)

Value: 40.3% (BLIP-2 ViT-G + LSKD)

Baseline: 39.0% (BLIP-2 ViT-G)

SNLI-VE (zero-shot)

Value: 40.3% (BLIP-2 ViT-G + LSKD)

Baseline: 33.4% (BLIP-2 ViT-G)

Visual7W Telling QA (zero-shot)

Value: 79.5% (BLIP-2 ViT-G + LSKD)

Baseline: 77.1% (BLIP-2 ViT-G)
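The gains quoted under Key Findings are differences in accuracy, so they are percentage points rather than relative percentages. The sketch below recomputes both from the figures listed above. The dictionary and function names are illustrative and do not come from the paper.

```python
# Zero-shot accuracies (percent) copied from the Results list above:
# benchmark -> (BLIP-2 ViT-G baseline, BLIP-2 ViT-G + LSKD).
RESULTS = {
    "VCR Q->A": (56.1, 59.0),
    "VCR QA->R": (49.8, 56.4),
    "VCR Q->AR": (28.0, 33.4),
    "Sherlock Comparison": (19.5, 29.7),
    "VisualCOMET Acc@50": (39.0, 40.3),
    "SNLI-VE": (33.4, 40.3),
    "Visual7W Telling QA": (77.1, 79.5),
}


def gains(baseline, value):
    """Return (absolute gain in points, relative gain in percent), rounded to 1 decimal."""
    absolute = round(value - baseline, 1)
    relative = round(100.0 * (value - baseline) / baseline, 1)
    return absolute, relative


if __name__ == "__main__":
    for name, (base, val) in RESULTS.items():
        abs_gain, rel_gain = gains(base, val)
        print(f"{name}: +{abs_gain} points ({rel_gain}% relative)")
```

For VCR Q→AR this gives +5.4 points, which is about a 19.3% relative improvement over the 28.0% baseline.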

Who Should Care

What To Try In 7 Days

Produce image verbalizations (global, local, QA pairs) for a small image set.

Prompt an instruction-tuned LLM to generate region-aware Q/A/rationales (3× per image).

Annotate ~20K examples to train a critic and filter top-quality outputs (threshold ≈0.8).
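The filtering step in the list above can be sketched as a simple score cutoff. This is a minimal illustration, assuming the critic has already assigned each generated item a score in [0, 1]. The `Generated` class, the field names, and the example items are hypothetical, and the critic model itself is not shown.

```python
from dataclasses import dataclass


@dataclass
class Generated:
    """One LLM-generated instance plus its critic score (hypothetical schema)."""
    question: str
    answer: str
    rationale: str
    critic_score: float  # assumed: higher means more likely acceptable


def filter_by_critic(items, threshold=0.8):
    """Keep only items whose critic score is at least `threshold`."""
    return [item for item in items if item.critic_score >= threshold]


def keep_rate(items, threshold=0.8):
    """Fraction of items that survive the cutoff (0.0 for an empty list)."""
    if not items:
        return 0.0
    return len(filter_by_critic(items, threshold)) / len(items)


# Made-up examples for illustration only.
samples = [
    Generated("Why is [0] smiling?", "They just won.", "[0] holds a trophy.", 0.95),
    Generated("What is [1] doing?", "Flying.", "No support in the image.", 0.40),
    Generated("Where is [2] going?", "To the bus.", "[2] walks toward a bus stop.", 0.81),
    Generated("How many [3]?", "Seven.", "Count is unclear.", 0.20),
    Generated("Is [4] tired?", "Probably.", "[4] is yawning.", 0.79),
]
kept = filter_by_critic(samples)
```

A stricter threshold trades corpus size for quality, so it is worth checking `keep_rate` on a held-out annotated set before committing to a cutoff.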

Optimization Features

Infra Optimization

  • Training using 4×80GB A100 GPUs (reports resource usage)

Model Optimization

  • Symbolic knowledge distillation (LLM → VL student)
  • Freeze image & language encoder; finetune Q-Former only

System Optimization

  • Use of pre-generated verbalizations to let LLM reason without multi-modal inputs
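As a rough illustration of how pre-generated verbalizations let a text-only LLM reason about an image, the sketch below assembles a prompt from a global caption, per-region captions keyed by region ID, and QA pairs. The prompt wording, the `[ID]` notation, and the function name are assumptions made for this example. The paper's actual prompt is not reproduced here.

```python
def build_prompt(global_caption, region_captions, qa_pairs):
    """Assemble a text-only prompt from image verbalizations.

    global_caption: str describing the whole image.
    region_captions: dict mapping integer region ID -> short description.
    qa_pairs: list of (question, answer) strings about the image.
    """
    lines = ["Image description: " + global_caption, "Regions:"]
    # List regions in ID order so the prompt is deterministic.
    for region_id, caption in sorted(region_captions.items()):
        lines.append(f"  [{region_id}] {caption}")
    if qa_pairs:
        lines.append("Known facts:")
        for question, answer in qa_pairs:
            lines.append(f"  Q: {question} A: {answer}")
    # Hypothetical instruction; not the wording used by the authors.
    lines.append(
        "Write one question, answer, and rationale that refer to regions by their [ID]."
    )
    return "\n".join(lines)


prompt = build_prompt(
    "a rainy street",
    {1: "a red umbrella", 0: "a man in a raincoat"},
    [("What is the weather?", "Rainy")],
)
```

The resulting string can be sent to any instruction-tuned LLM. Because the LLM never sees pixels, any error in the captions carries straight into the generated questions, which is why the critic filter matters.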

Training Optimization

  • Critic-based aggressive filtering to improve data quality
  • Region-based augmentation: shuffle region IDs and variable region counts
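One way to read the region-based augmentation above is sketched below: keep every region the text refers to, keep a random number of the remaining regions (variable region count), then renumber all kept regions with a random permutation (shuffled IDs). The `[ID]` text format, the box representation, and this exact sampling scheme are assumptions for illustration, not the authors' implementation.

```python
import random
import re

ID_PATTERN = re.compile(r"\[(\d+)\]")


def augment_regions(text, boxes, rng):
    """Shuffle region IDs and vary the region count.

    text: string that refers to regions as "[0]", "[1]", ...
    boxes: dict mapping old region ID -> box (any value).
           Assumes every ID mentioned in `text` is a key of `boxes`.
    rng: a random.Random instance, passed in for reproducibility.
    Returns (new_text, new_boxes) with consistent new IDs.
    """
    referenced = {int(m) for m in ID_PATTERN.findall(text)}
    extras = [i for i in boxes if i not in referenced]
    # Variable region count: keep a random number of unreferenced regions.
    kept_extras = rng.sample(extras, rng.randint(0, len(extras)))
    kept = sorted(referenced | set(kept_extras))
    # Shuffled IDs: give each kept region a new ID from a random permutation.
    new_ids = list(range(len(kept)))
    rng.shuffle(new_ids)
    mapping = dict(zip(kept, new_ids))
    # Single-pass substitution avoids collisions between old and new IDs.
    new_text = ID_PATTERN.sub(lambda m: f"[{mapping[int(m.group(1))]}]", text)
    new_boxes = {mapping[old]: boxes[old] for old in kept}
    return new_text, new_boxes


example_boxes = {0: (10, 10, 50, 80), 1: (60, 20, 90, 70), 2: (5, 60, 40, 95)}
new_text, new_boxes = augment_regions(
    "Why is [0] looking at [2]?", example_boxes, random.Random(0)
)
```

Whatever permutation is drawn, each ID in the rewritten text still points at the same box as before, so the model cannot rely on a fixed ID order or a fixed number of regions.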

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Dependence on verbalizers: off-the-shelf region captioners and detectors introduce errors that can lead LLMs to hallucinate.
  • Critic is imperfect: filtering raises quality but cannot remove all incorrect instances.
  • Coverage gaps: some question types (e.g., object counts) are underrepresented in the corpus.
  • Spatial and fine-grained expression reasoning remain weak; models sometimes misinterpret poses or expressions.

When Not To Use

  • When precise spatial geometry or exact measurements are required (e.g., depth or exact positions).
  • If you cannot afford the compute to generate and filter large synthetic corpora.
  • When you require fully human-verified annotations for safety-critical decisions.

Failure Modes

  • Hallucinated visual facts due to wrong verbalizations.
  • Incorrect grounding of region IDs to visual entities.
  • Overfitting to teacher biases (LLM assumptions) in synthetic reasoning.
  • Errors in fine-grained expression or spatial inference.

Core Entities

Models

  • ChatGPT (gpt-3.5-turbo)
  • BLIP-2 (ViT-L, ViT-G)
  • CLIP (ViT-B-16, ViT-L-14x336)
  • Mini-GPT4 (Vicuna-13B)
  • FlanT5-XXL
  • OFA-Huge
  • ViT-G encoder
  • Q-Former

Metrics

  • Accuracy
  • Acc@50
  • Precision/Recall/F1 (critic)
  • Human preference percentages

Datasets

  • Localized Commonsense Knowledge Corpus (1M LSKD instances, ~169K images)
  • Visual Genome
  • VCR
  • Sherlock
  • VisualCOMET
  • AOKVQA
  • SNLI-VE
  • Visual7W
  • LVIS
  • Localized Narratives
  • RefCOCO/RefCOCO+/RefCOCOg

Benchmarks

  • VCR (Q→A, QA→R, Q→AR)
  • Sherlock Comparison
  • VisualCOMET Acc@50
  • AOKVQA multiple choice
  • SNLI-VE
  • Visual7W Telling QA