A practical map of how knowledge graphs and multimodal AI fit together today and where to push next

Overview

Decision Snapshot: Needs Validation

The survey collects evidence across many tasks and datasets. It shows practical pipelines (retrieval→pruning→fusion→reasoning) that practitioners can adopt, but many high-performing methods rely on large pre-trained models and curated MMKGs, which carry engineering cost and dataset biases.

Citations: 28

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Zhuo Chen, Yichi Zhang, Yin Fang, Yuxia Geng, Lingbing Guo, Xiang Chen, Qian Li, Wen Zhang, Jiaoyan Chen, Yushan Zhu, Jiaqi Li, Xiaoze Liu, Jeff Z. Pan, Ningyu Zhang, Huajun Chen

Why It Matters For Business

Adding structured knowledge to multimodal systems improves accuracy, interpretability, and long-tail reasoning. That helps applications like search, recommendation, product QA, and compliance where factual grounding and rare facts matter.

Who Should Care

ML Engineer, Data Scientist, CTO, Product Manager

Summary TLDR

This 41-page survey reviews over 300 papers that connect knowledge graphs (KGs) with multi-modal learning. It splits the field into two angles: KG4MM (using KGs to help multimodal tasks like VQA, captioning, retrieval) and MM4KG (building multimodal KGs that contain images, text, audio, etc.). The review catalogs MMKG datasets, construction pipelines, representation and fusion methods, and task families (VQA, zero-shot image classification, entity alignment, KG completion). It highlights practical lessons: use sub-KG retrieval and pruning for scalable knowledge injection; combine dense retrievers with symbolic KG checks for reliability; treat images as attributes (A-MMKG) for fast progress, 

Problem Statement

KGs and multi-modal models have evolved separately. Practitioners need a clear, unified picture: how to build and use multi-modal KGs, how KGs help vision+language tasks, which datasets and benchmarks exist, and what gaps remain when applying LLMs and VLMs together with structured knowledge.

Main Contribution

Surveyed 300+ KG–multimodal works and created a unified taxonomy across two views: KG-driven multimodal (KG4MM) and multi-modal KGs (MM4KG).

Cataloged MMKG datasets, ontologies, and construction paradigms (A-MMKG vs N-MMKG) with a comparative table.
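
The two construction paradigms can be pictured as two data layouts. The sketch below is illustrative only and assumes that A-MMKG stores images as attribute values on an entity, while N-MMKG promotes each image to a node that can take part in triples. The class name, the `hasImage` and `sameSceneAs` relations, and the file paths are hypothetical and do not come from the survey.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AttributeEntity:
    """A-MMKG style (assumed): images hang off the entity as attribute values."""
    name: str
    image_uris: List[str] = field(default_factory=list)


# N-MMKG style (assumed): every image is itself a node in a triple.
Triple = Tuple[str, str, str]


def to_node_triples(entity: AttributeEntity) -> List[Triple]:
    """Turn attribute-style images into explicit (head, relation, tail) triples."""
    return [(entity.name, "hasImage", uri) for uri in entity.image_uris]


# Hypothetical entity with two hypothetical image paths.
eiffel = AttributeEntity(
    "Eiffel_Tower", ["img/eiffel_day.jpg", "img/eiffel_night.jpg"]
)
triples = to_node_triples(eiffel)

# Once images are nodes, image-to-image relations become expressible.
triples.append(("img/eiffel_day.jpg", "sameSceneAs", "img/eiffel_night.jpg"))
```

The attribute layout is simpler to populate from existing KGs, which fits the summary's point about fast progress. The node layout costs more to build but allows relations between images.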

Key Findings

The survey covers more than 300 related papers.

Numbers: ‘over 300 articles’ (abstract)

Practical Use: Rely on this paper as a broad starting map; use its bibliography to find recent methods, datasets, and benchmarks instead of re-searching from scratch.

Evidence Ref: Abstract

There are 20+ notable MMKG datasets and resources built from 2009–2023 (ImageNet→MMpedia/TIVA-KG).

Numbers: Table III lists 20+ MMKGs (2013–2023)

Practical Use: If you need image-grounded knowledge, pick an existing MMKG (e.g., DBP15K/MMpedia/TIVA-KG) instead of building one from zero.

Evidence Ref: Table III

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	64.28%	27.84% (Marino et al. retriever+ArticleNet)	+36.44 pp	OKVQA	GPT-4V reported 64.28% on OKVQA (Table IV)	Table IV
DBP15K entity alignment H@1	0.973	0.540 (HMEA)	+0.433	DBP15K ZH-EN (Table IX)	MEAformer and iterative techniques give H@1 ~0.97 (Table IX)	Table IX
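
For readers less familiar with the entity-alignment metric in the table, here is a minimal sketch of how H@1 (Hits@1) and MRR are typically computed from ranked candidate lists. The toy rankings and entity names are invented for illustration and are not DBP15K data; the function names are hypothetical.

```python
from typing import Dict, List


def hits_at_k(rankings: Dict[str, List[str]], gold: Dict[str, str], k: int = 1) -> float:
    """Fraction of queries whose gold entity appears in the top-k candidates."""
    hits = sum(1 for q, ranked in rankings.items() if gold[q] in ranked[:k])
    return hits / len(rankings)


def mean_reciprocal_rank(rankings: Dict[str, List[str]], gold: Dict[str, str]) -> float:
    """Average of 1/rank of the gold entity; a query contributes 0 if gold is missing."""
    total = 0.0
    for q, ranked in rankings.items():
        if gold[q] in ranked:
            total += 1.0 / (ranked.index(gold[q]) + 1)
    return total / len(rankings)


# Toy source-entity -> ranked target candidates (illustrative only).
rankings = {
    "zh:Beijing": ["en:Beijing", "en:Shanghai"],
    "zh:Shanghai": ["en:Beijing", "en:Shanghai"],
    "zh:Guangzhou": ["en:Shenzhen", "en:Hong_Kong"],
    "zh:Shenzhen": ["en:Shenzhen", "en:Guangzhou"],
}
gold = {
    "zh:Beijing": "en:Beijing",
    "zh:Shanghai": "en:Shanghai",
    "zh:Guangzhou": "en:Guangzhou",
    "zh:Shenzhen": "en:Shenzhen",
}

h1 = hits_at_k(rankings, gold, k=1)         # 2 of 4 queries rank gold first
mrr = mean_reciprocal_rank(rankings, gold)  # (1 + 1/2 + 0 + 1) / 4
```

An H@1 of 0.973 therefore means the correct counterpart is ranked first for roughly 97 of every 100 test entities.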

What To Try In 7 Days

Wire a simple retriever + KG lookup into an existing VQA or retrieval pipeline and compare accuracy vs baseline.

Attach surface-name and image-attribute features from DBpedia/IMGpedia to a small product dataset and measure alignment gains.

Run a small dense retriever (FAISS+DPR) and a lightweight pruning step to see if retrieved KG facts improve model answers on 50 hard queries.
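
The retrieve-then-prune experiment above can be prototyped before any real index is set up. The sketch below is a toy stand-in: plain cosine similarity over hand-written vectors replaces DPR embeddings and a FAISS index, and the facts, vectors, function names, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple

Fact = Tuple[str, str, str]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, top_k: int = 3) -> List[Tuple[float, Fact]]:
    """Score every indexed fact against the query and keep the top_k."""
    scored = [(cosine(query_vec, vec), fact) for fact, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def prune(scored: List[Tuple[float, Fact]], min_score: float = 0.5) -> List[Fact]:
    """Lightweight pruning: drop facts below a similarity threshold."""
    return [fact for score, fact in scored if score >= min_score]


# Toy index of (fact, embedding) pairs; real embeddings would come from an encoder.
index = [
    (("Eiffel_Tower", "locatedIn", "Paris"), [0.9, 0.1, 0.0]),
    (("Eiffel_Tower", "height_m", "330"), [0.7, 0.7, 0.0]),
    (("Banana", "color", "yellow"), [0.0, 0.1, 0.9]),
]
query_vec = [1.0, 0.2, 0.0]  # stand-in for an embedded question about the tower

kept = prune(retrieve(query_vec, index))
# Serialize surviving facts as extra context for the downstream model.
context = "; ".join(" ".join(fact) for fact in kept)
```

To compare against a baseline, run the same 50 queries once with an empty `context` and once with the pruned facts, then diff the answers.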

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/zjukg/KG-MM-Survey

Risks & Boundaries

Limitations

Survey focuses mainly on vision-language (image+text) settings; other modalities (audio, scientific modalities) receive less coverage.

Comparisons across methods are confounded by different knowledge sources, backbone models, and usage of gold supporting facts.

When Not To Use

When you need a lightweight, purely visual model without external knowledge — adding KGs increases complexity and storage.

When real-time, low-latency constraints forbid external retrieval or large VLM/LLM inference.

Failure Modes

Incorrect or noisy retrieval: irrelevant KG facts can mislead downstream reasoning.

Modality mismatch: images that do not represent the intended concept (aspect ambiguity) lead to wrong alignments.
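
One way to limit the first failure mode is a symbolic check that labels each retrieved or generated fact before it reaches the reasoning step. The sketch below is a minimal illustration under stated assumptions: the tiny KG, the entity types, the relation-domain table, and the three labels are all hypothetical and not taken from the survey.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical miniature KG plus simple schema information.
KG: Set[Triple] = {
    ("Eiffel_Tower", "locatedIn", "Paris"),
    ("Paris", "capitalOf", "France"),
}
ENTITY_TYPE: Dict[str, str] = {
    "Eiffel_Tower": "Landmark",
    "Paris": "City",
    "France": "Country",
}
RELATION_DOMAIN: Dict[str, str] = {"locatedIn": "Landmark", "capitalOf": "City"}


def check(triple: Triple) -> str:
    """Label a candidate fact as supported, type_violation, or unverified."""
    head, relation, _ = triple
    if triple in KG:
        return "supported"
    # The head entity's type must match the relation's expected domain.
    if ENTITY_TYPE.get(head) != RELATION_DOMAIN.get(relation):
        return "type_violation"
    return "unverified"


candidates: List[Triple] = [
    ("Eiffel_Tower", "locatedIn", "Paris"),   # present in the KG
    ("Eiffel_Tower", "capitalOf", "France"),  # a landmark cannot be a capital
    ("Paris", "capitalOf", "Italy"),          # well-typed but not in the KG
]
labels = [check(t) for t in candidates]
trusted = [t for t, label in zip(candidates, labels) if label == "supported"]
```

Whether "unverified" facts are dropped or passed along with lower weight is a design choice that depends on how complete the KG is.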

Core Entities

Models

CLIP, BLIP-2, ViLT, UNITER, LXMERT, VL-BERT, BERT, RoBERTa, T5, GPT-3, GPT-4/GPT-4V, LLaMA

Metrics

Accuracy, Exact Match, H@1 (Hit@1), MRR, F1, BLEU, CIDEr

Datasets

ImageNet, VisualGenome, DBpedia, DBP15K, MMKG (2019), IMGpedia, ImageGraph, GAIA, VisualSem, MMpedia, UKnow, Multi-OpenEA, VTKGs, TIVA-KG, M2 ConceptBase, FVQA, OKVQA, VQA2.0, A-OKVQA, AwA2, ImNet-A/ImNet-O

Benchmarks

OKVQA, FVQA, VQA2.0, DBP15K (entity alignment), FB15K-237 (KGC), M2E2 / TVEE (event extraction)
