Vision-Language Models Papers — Parsed & Scored for Practitioners

A concise roadmap to multimodal LLMs: architectures, training recipes, evaluation, hallucination, and extensions

0.60

0.40

0.60

85

MLLMs let products combine vision and language: build image-aware assistants, document parsers, or multimodal agents. Focus on data quality, connector design, and safe alignment to reduce hallucinations before shipping.

Key finding

MLLMs are typically built from three modules: a pre-trained modality encoder, a pre-trained LLM, and a connector between them.
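
The three-module layout can be illustrated with a toy example. This is a minimal sketch, assuming the connector is a single linear projection; the dimensions, the random weights, and the names `make_linear_connector` and `build_llm_input` are hypothetical and not taken from the survey.

```python
import random

def make_linear_connector(vision_dim, llm_dim, seed=0):
    """Return a function that projects encoder features into the LLM embedding space.

    Hypothetical stand-in for a trained connector: weights are random here.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(vision_dim)]
               for _ in range(llm_dim)]

    def project(patch_features):
        # One output vector of length llm_dim per input patch feature.
        return [[sum(w * x for w, x in zip(row, feat)) for row in weights]
                for feat in patch_features]

    return project

def build_llm_input(visual_tokens, text_embeddings):
    # Assumption: visual tokens are simply prepended to the text embeddings.
    return visual_tokens + text_embeddings

vision_dim, llm_dim = 4, 6
connector = make_linear_connector(vision_dim, llm_dim)

# Stand-ins for the frozen modality encoder output and the LLM token embeddings.
patch_features = [[0.1] * vision_dim, [0.2] * vision_dim, [0.3] * vision_dim]
text_embeddings = [[0.0] * llm_dim for _ in range(2)]

visual_tokens = connector(patch_features)
llm_input = build_llm_input(visual_tokens, text_embeddings)
print(len(llm_input), "tokens of width", len(llm_input[0]))
```

In this sketch only the connector holds new weights, which mirrors the common practice of training the connector while the encoder and LLM start from pre-trained checkpoints.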

A practical map of how knowledge graphs and multimodal AI fit together today and where to push next

0.60

0.50

0.60

28

Adding structured knowledge to multimodal systems improves accuracy, interpretability, and long-tail reasoning. That helps applications like search, recommendation, product QA, and compliance where factual grounding and rare facts matter.

Key finding

The survey covers more than 300 related papers.

Numbers: ‘over 300 articles’ (abstract)

HaELM: an LLM-based, low-cost evaluator to detect and analyze hallucinations in vision-language models

0.50

0.60

0.70

26

Hallucinations make multimodal systems unreliable and risky. HaELM offers a cheaper, local way to measure hallucination and run repeated checks without sending data to external APIs.

Key finding

Object-query tests trigger affirmation bias: models answer "yes" more than 80% of the time when asked about absent objects, yet the actual hallucination rate in their captions is below 10%.

Numbers: AY >80%; CH <10% (Figure 2, Appendix Tables 9-11)

Train a vision-language model to read and reason across many images in one prompt

0.60

0.70

0.50

18

If your product must reason over multiple images together (multi-photo chat, visual QA over albums, video snapshots), MMICL-style models reduce hallucinations and improve multi-image reasoning by adding explicit image tokens and multi-image instruction tuning.

Key finding

MMICL improves matching of captions to images on compositional image/text puzzles (Winoground).

Numbers: Text 45 / Image 45 / Group 43 (MMICL FLAN-T5-XXL, Table 2)
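
For readers unfamiliar with the three Winoground numbers, the sketch below shows how text, image, and group scores are usually computed from caption-image match scores. It assumes each example provides four similarity scores for two captions (c0, c1) and two images (i0, i1); the dictionary keys and the sample values are hypothetical, and how MMICL itself produces the match scores is not covered here.

```python
def winoground_scores(examples):
    """Compute text, image, and group scores as percentages.

    Each example is a dict of match scores keyed 'c0_i0', 'c0_i1', 'c1_i0', 'c1_i1',
    where (c0, i0) and (c1, i1) are the correct caption-image pairs.
    """
    text_hits = image_hits = group_hits = 0
    for s in examples:
        # Text score: for each image, the correct caption must outscore the wrong one.
        text_ok = s["c0_i0"] > s["c1_i0"] and s["c1_i1"] > s["c0_i1"]
        # Image score: for each caption, the correct image must outscore the wrong one.
        image_ok = s["c0_i0"] > s["c0_i1"] and s["c1_i1"] > s["c1_i0"]
        text_hits += text_ok
        image_hits += image_ok
        # Group score: both conditions must hold on the same example.
        group_hits += text_ok and image_ok
    n = len(examples)
    return {
        "text": 100 * text_hits / n,
        "image": 100 * image_hits / n,
        "group": 100 * group_hits / n,
    }

# Hypothetical match scores for two examples.
examples = [
    {"c0_i0": 0.9, "c0_i1": 0.2, "c1_i0": 0.3, "c1_i1": 0.8},  # passes both checks
    {"c0_i0": 0.5, "c0_i1": 0.6, "c1_i0": 0.4, "c1_i1": 0.7},  # passes text only
]
scores = winoground_scores(examples)
print(scores)
```

Because the group score requires both checks to pass on the same example, it can never exceed the lower of the text and image scores.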

Systematic evaluation of GPT-4V and LLaVA on 1000+ vision+text engineering design tasks

0.30

0.60

0.50

17

VLMs like GPT-4V can speed up low-value, repetitive visual tasks (sketch similarity, captioning with handwriting) and help populate searchable design catalogs, but they currently cannot replace engineering checks that need precise spatial, numeric, or manufacturability guarantees.

Key finding

GPT-4V matches or exceeds human raters on sketch-similarity triplet tests.

Numbers: Self-consistency 94%; transitive violations = 5 (best human = 5).

MiniGPT-5: fuse an LLM with Stable Diffusion using 'generative vokens' for interleaved image+text outputs

0.60

0.70

0.60

16

MiniGPT-5 lets a single system produce coherent text and images together, cutting the need for separate caption→image pipelines and reducing integration overhead, while training only a small set of parameters.

Key finding

Humans prefer MiniGPT-5 outputs over a two-stage baseline on multimodal story tasks.

Numbers: Language continuity 55.22% vs 34.89%; image quality 52.43% vs 37.79%; multimodal coherence 56.9% vs 28.88%

Survey of 126 multimodal LLMs: architectures, training recipes, benchmarks, and next steps

0.70

0.45

0.65

15

You can add vision, audio, or other modalities to existing LLMs cheaply by training small projectors or PEFT adapters, unlocking richer user interactions without retraining huge models.

Key finding

Most MM-LLMs add small adapters while keeping the core LLM frozen.

Numbers: Trainable params typically ≈2% (projectors only); PEFT can be <0.1%
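
The trainable-parameter share is simple arithmetic over component sizes. The sketch below uses hypothetical parameter counts chosen only to land near the 2% figure; they are not taken from any specific model in the survey.

```python
def trainable_fraction(param_counts, trainable_names):
    """Return the share of parameters that receive gradient updates."""
    total = sum(param_counts.values())
    trainable = sum(param_counts[name] for name in trainable_names)
    return trainable / total

# Hypothetical sizes: frozen LLM and encoder, trainable projector.
param_counts = {
    "llm": 7_000_000_000,
    "vision_encoder": 300_000_000,
    "projector": 150_000_000,
}
fraction = trainable_fraction(param_counts, ["projector"])
print(f"trainable share: {fraction:.2%}")
```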

STC connector + audio branch: stronger video and audio understanding for Video-LLMs

0.70

0.55

0.45

10

VideoLLaMA 2 improves video and audio understanding while keeping changes to the encoders and the LLM minimal; this lowers the data and compute needed to reach strong open-source performance and speeds integration into product pipelines.

Key finding

Adding the STC connector (RegStage + 3D conv) yields the best average video QA performance in the architecture sweep.

Numbers: Avg. acc. 45.1 (Table 1 green line)

Hyper Attention for efficient, long multi-image and long-video understanding

0.60

0.62

0.60

6

mPLUG-Owl3 shows you can run an 8B multimodal model that is both accurate on many image/video tasks and more efficient on long visual inputs — useful for product features that need long-video or multi-image understanding.

Key finding

mPLUG-Owl3 achieves state-of-the-art among 8B models on a wide benchmark suite.

Numbers: SOTA on 14 of 20 evaluated benchmarks

Survey of how large language models power modern video understanding (taxonomies, benchmarks, gaps)

0.70

0.60

0.70

6

Vid-LLMs let products auto-summarize, QA, and index video at human-like levels; adopting them can drastically cut manual review costs and unlock search/recommendation features across massive video catalogs.

Key finding

LLM-based video models now match or exceed many traditional systems on dense captioning benchmarks.

Numbers: ActivityNet CIDEr: Streaming GIT 41.2 (Table IV)

Fuse object-level driving vectors into an LLM to explain and predict driving actions

0.30

0.60

0.40

6

Grounding compact numeric scene vectors into an LLM yields interpretable, language-based explanations and improves action reasoning in simulation; this accelerates prototyping of explainable driving features but is not yet production-ready for closed-loop control.

Key finding

Pretraining the vector-to-language stage improves Driving QA scores.

Numbers: GPT score: 8.39 vs 7.48 (10k finetune set; +0.91 abs, about +12% relative)

FAVOR: frame-level audio+visual fusion and causal Q-Former to help LLMs understand speech, sounds and video together

0.60

5

FAVOR enables LLMs to reason over speech, sounds and video together at frame level, improving video QA and matching tasks that power search, content moderation, AV indexing, and multimedia assistants.

Key finding

FAVOR substantially improves video QA accuracy on the evaluated AVEB split.

Numbers: FAVOR 13B Video QA 49.3% vs InstructBLIP 13B 21.0% (Table 2)

AdaLink: non-intrusive input adapters that match full fine-tuning on many multimodal tasks

0.70

0.60

0.80

5

AdaLink cuts adaptation cost and serving complexity by tuning tiny adapters instead of full models, letting teams deploy many task-specific behaviors without copying huge models.

Key finding

AdaLink comes close to full fine-tuning on COCO captioning with an instruction-tuned base.

Numbers: CIDEr: AdaLink 146.3 vs FT 147.0 (δ -0.65)

CAVG: fuse GPT‑4 emotion signals, cross‑modal attention and region‑wise layer fusion to ground driving commands

0.60

0.65

0.45

4

CAVG improves accuracy of mapping spoken commands to visual regions while keeping deployable latency and reducing required labeled data, cutting annotation cost and enabling more natural human-AV interaction.

Key finding

CAVG achieves IoU0.5 = 74.6% on the Talk2Car test set.

Numbers: IoU0.5 = 74.6%
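
An IoU0.5 figure counts a prediction as correct when its box overlaps the ground-truth box with intersection-over-union above 0.5. The sketch below assumes axis-aligned boxes in `(x1, y1, x2, y2)` form and a strict `>` comparison; the function names and sample boxes are illustrative, not taken from the CAVG code.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def accuracy_at_iou(predictions, ground_truths, threshold=0.5):
    """Share of predictions whose IoU with the matching ground truth exceeds the threshold."""
    hits = sum(box_iou(p, g) > threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Hypothetical boxes: one exact match, one partial overlap with IoU = 1/3.
predictions = [(0, 0, 10, 10), (5, 0, 15, 10)]
ground_truths = [(0, 0, 10, 10), (0, 0, 10, 10)]
score = accuracy_at_iou(predictions, ground_truths)
print(score)
```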

Make visual tokens look like text tokens: learn a visual embedding table and probabilistic visual words

0.60

0.70

0.50

4

Ovis improves multimodal understanding without larger LLM backbones, letting teams get better vision+language performance by changing the visual tokenization architecture rather than scaling model size.

Key finding

Ovis architecture beats an otherwise-identical connector-based MLLM.

Numbers: avg +8.8% across benchmarks (Table 3)
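
The idea of probabilistic visual words over a learned embedding table can be sketched as follows: each image patch yields a probability distribution over a visual vocabulary, and its embedding is the probability-weighted average of the table rows. This is a simplified reading of the summary above, not the Ovis implementation; the vocabulary size, table values, and function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def probabilistic_visual_embedding(logits, embedding_table):
    """Expected embedding under the patch's distribution over visual words."""
    probs = softmax(logits)
    dim = len(embedding_table[0])
    return [sum(p * row[d] for p, row in zip(probs, embedding_table))
            for d in range(dim)]

# Hypothetical two-word visual vocabulary with 2-d embeddings.
embedding_table = [[1.0, 0.0], [0.0, 1.0]]
uniform_patch = probabilistic_visual_embedding([0.0, 0.0], embedding_table)
peaked_patch = probabilistic_visual_embedding([20.0, 0.0], embedding_table)
print(uniform_patch, peaked_patch)
```

A patch with a sharply peaked distribution behaves almost like a discrete token lookup, which is what makes the visual side resemble text-token embedding.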

ChemVLM: an open-source vision+LLM tuned for chemical images, exams, and property prediction

0.60

0.50

3

ChemVLM reduces manual image-to-structure work and improves multimodal chemistry question answering; it can speed tasks that mix diagrams and text, but requires substantial compute.

Key finding

ChemVLM achieves strong chemical OCR quality among multimodal LLMs.

Numbers: Avg Tanimoto similarity 71% on ChemOCR
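
Tanimoto similarity compares two molecular fingerprints as the size of their shared bits over the size of their combined bits. The sketch below assumes fingerprints are given as sets of on-bit indices; real pipelines would derive them from predicted and reference structures with a cheminformatics toolkit, which is omitted here, and the sample fingerprints are made up.

```python
def tanimoto(fingerprint_a, fingerprint_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(fingerprint_a), set(fingerprint_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    return len(a & b) / len(union)

# Hypothetical fingerprints for a predicted and a reference molecule.
predicted = {1, 2, 3, 4}
reference = {3, 4, 5, 6}
similarity = tanimoto(predicted, reference)
print(round(similarity, 4))
```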

Survey of visual-focused multimodal LLMs: architectures, training, tasks, datasets, and open problems

0.60

0.80

3

MLLMs let products understand and generate images and language together, enabling visual assistants, grounded search, and image editing workflows — but expect high compute, hallucination risk, and evaluation blind spots.

Key finding

Typical MLLM design is three parts: visual encoder, LLM backbone, and adapter.

Ferret-UI: a multimodal LLM tuned to find, name, and act on mobile UI elements using an 'any-resolution' image split

0.60

0.65

0.45

3

Specialized multimodal models like Ferret-UI give more accurate reads and grounded actions on mobile screens than generalist VLMs, reducing errors in automation, accessibility features, and UI testing workflows.

Key finding

Ferret-UI substantially improves elementary referring and grounding accuracy compared to base Ferret and often surpasses GPT-4V on mobile UI primitives.

Numbers: Referring (iPhone): 82.4 vs GPT-4V 61.3; Grounding (iPhone): 81.4 vs GPT-4V 70.3

Automate rich, low-hallucination image captions by combining vision experts with multi-modal and text LLMs

0.60

3

Image Textualization (IT) automates high-quality, detailed captions that improve downstream image generation and reduce hallucination in vision-language systems, lowering labeling costs and improving model usefulness in search, generation, and retrieval.

Key finding

IT captions are substantially more informative and closer to human references than raw MLLM captions on automatic caption metrics.

Numbers: BLEU-1: 11.35 → 23.78 (IT-LLaVA) and 11.35 → 46.79 (IT-GPT4-V) on combined GT

Use vision-language models to auto-generate and iteratively correct multimodal instruction data

0.70

0.60

0.50

3

VIGC can cheaply scale multimodal instruction data and improve model performance on perception and knowledge VQA tasks, reducing the need for costly human annotation while trimming hallucinations through an automated correction loop.

Key finding

Fine-tuning with VIGC COCO data improved LLaVA-7B overall score.

Numbers: Overall 81.0 → 85.8 (↑4.8)

GlassLLaVA: a vision-language model that interprets SEM images of glass using paper text and GPT-4–generated Q&A

0.40

0.60

0.40

3

Pairing image encoders with LLMs can automate interpretation of lab SEM images and speed defect triage, but the model needs context and domain-specific data to reach reliable accuracy.

Key finding

Context strongly improves answer quality.

Numbers: General: 68.84 (no context) → 92.56 (high context)

Generate short, unique text 'knowledge clues' with an LLM and use them to look up documents for multi-modal queries

0.60

0.70

2

You can replace multiple modality-specific retrievers with one LLM-based generative retriever that scales to millions of documents, improves precision, and needs only light fine-tuning, lowering engineering and data costs.

Key finding

GeMKR raises P@5 on OKVQA-GS112K to 49.1, beating ReViz-ICT (41.7).

Numbers: P@5: 49.1 vs 41.7 (Table 1)
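
P@5 is the share of the top five retrieved documents that are relevant, averaged over queries. The sketch below shows the per-query computation under that standard definition; the document IDs are hypothetical, and any benchmark-specific relevance rules for OKVQA-GS112K are not modeled.

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved document IDs that are in the relevant set."""
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k

def mean_precision_at_k(runs, k=5):
    """Average precision@k over (retrieved, relevant) pairs, one pair per query."""
    return sum(precision_at_k(r, rel, k) for r, rel in runs) / len(runs)

# Hypothetical ranked list and relevance labels for one query.
retrieved = ["d1", "d7", "d3", "d9", "d2", "d4"]
relevant = {"d1", "d3", "d4"}
p_at_5 = precision_at_k(retrieved, relevant)
print(p_at_5)
```

Note that `d4` is relevant but ranked sixth, so it does not count toward P@5.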

Reduce VLLM hallucinations by fine-tuning with AI-generated 'wrong' answers

0.60

0.70

2

POVID reduces image-driven hallucination and raises overall VLLM reliability while avoiding costly human preference annotation, enabling faster, cheaper deployment of multimodal assistants.

Key finding

POVID substantially reduces object-hallucination on captioning benchmarks.

Numbers: CHAIR S: 66.8 → 31.8 (absolute -35.0)
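
CHAIR S is the sentence-level variant of the CHAIR metric: the share of captions that mention at least one object not present in the image. The sketch below computes it alongside the instance-level variant, assuming object mentions have already been extracted from each caption; the extraction step and the sample objects are hypothetical.

```python
def chair_scores(mentioned_per_caption, present_per_image):
    """Return (CHAIR_s, CHAIR_i) from per-caption mentioned objects and per-image true objects."""
    hallucinated_captions = 0
    hallucinated_mentions = 0
    total_mentions = 0
    for mentioned, present in zip(mentioned_per_caption, present_per_image):
        hallucinated = mentioned - present  # objects named but not in the image
        hallucinated_captions += bool(hallucinated)
        hallucinated_mentions += len(hallucinated)
        total_mentions += len(mentioned)
    chair_s = hallucinated_captions / len(mentioned_per_caption)
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    return chair_s, chair_i

# Hypothetical object sets for three caption-image pairs.
mentioned = [{"dog", "frisbee"}, {"cat", "sofa"}, {"car"}]
present = [{"dog", "grass"}, {"cat", "sofa", "lamp"}, {"bus"}]
chair_s, chair_i = chair_scores(mentioned, present)
print(round(chair_s, 3), round(chair_i, 3))
```

Lower is better for both values, so the drop from 66.8 to 31.8 reported above means fewer captions contain any hallucinated object.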

A 187-task human-labeled dataset (1.66M instances) + two-stage tuning that needs only 1k GPT-4 examples to align VLM outputs

0.60

0.70

2

Investing in diverse, human-labeled vision tasks gives larger capability gains and less forgetting than mass synthetic labeling; a small alignment set (~1k GPT-4 examples) can deliver chat-style outputs while avoiding the cost and bias of large synthetic corpora.

Key finding

VISION-FLAN is large and diverse: 187 tasks and 1,664,261 instances.

Numbers: 187 tasks; 1,664,261 instances