Overview
The survey collects evidence across many tasks and datasets. It shows practical pipelines (retrieval→pruning→fusion→reasoning) that practitioners can adopt, but many high-performing methods rely on large pre-trained models and curated MMKGs, which carry engineering cost and dataset biases.
Citations: 28
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
Adding structured knowledge to multimodal systems improves accuracy, interpretability, and long-tail reasoning. That helps applications like search, recommendation, product QA, and compliance where factual grounding and rare facts matter.
Who Should Care
Summary TLDR
This 41-page survey reviews over 300 papers that connect knowledge graphs (KGs) with multi-modal learning. It splits the field into two angles: KG4MM (using KGs to help multimodal tasks like VQA, captioning, retrieval) and MM4KG (building multimodal KGs that contain images, text, audio, etc.). The review catalogs MMKG datasets, construction pipelines, representation and fusion methods, and task families (VQA, zero-shot image classification, entity alignment, KG completion). It highlights practical lessons: use sub-KG retrieval and pruning for scalable knowledge injection; combine dense retrievers with symbolic KG checks for reliability; treat images as attributes (A-MMKG) for fast progress,
Problem Statement
KGs and multi-modal models have evolved separately. Practitioners need a clear, unified picture: how to build and use multi-modal KGs, how KGs help vision+language tasks, which datasets and benchmarks exist, and what gaps remain when applying LLMs and VLMs together with structured knowledge.
Main Contribution
Surveyed 300+ KG–multimodal works and created a unified taxonomy across two views: KG-driven multimodal (KG4MM) and multi-modal KGs (MM4KG).
Cataloged MMKG datasets, ontologies, and construction paradigms (A-MMKG vs N-MMKG) with a comparative table.
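The A-MMKG vs N-MMKG distinction can be illustrated with a small data-structure sketch. It assumes the usual reading of the two paradigms: in an A-MMKG an image is stored as an attribute value of an entity, while in an N-MMKG the image is itself a node that can carry its own relations. All entity names, relation names, and file paths below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str


# A-MMKG (assumed reading): the image is a literal attribute value of an entity.
a_mmkg = [
    Triple("Eiffel_Tower", "locatedIn", "Paris"),
    Triple("Eiffel_Tower", "hasImage", "img/eiffel_01.jpg"),  # hypothetical path
]

# N-MMKG (assumed reading): the image is a node and can take part in relations.
n_mmkg = [
    Triple("Eiffel_Tower", "locatedIn", "Paris"),
    Triple("img:eiffel_01", "imageOf", "Eiffel_Tower"),
    Triple("img:eiffel_01", "similarTo", "img:tokyo_tower_07"),
]


def entities(triples, attribute_relations=frozenset({"hasImage"})):
    """Collect node identifiers, skipping tails of attribute relations."""
    nodes = set()
    for t in triples:
        nodes.add(t.head)
        if t.relation not in attribute_relations:
            nodes.add(t.tail)
    return nodes
```

In this sketch the A-MMKG image never becomes a graph node, which keeps construction simple, whereas the N-MMKG image node can be linked to other images or entities at the cost of a larger graph.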
Key Findings
The survey covers more than 300 related papers.
There are 20+ notable MMKG datasets and resources built from 2009–2023 (ImageNet→MMpedia/TIVA-KG).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 64.28% | 27.84% (Marino et al. retriever+ArticleNet) | +36.44 pp | OKVQA | GPT-4V reported 64.28% on OKVQA (Table IV) | Table IV |
| DBP15K entity alignment H@1 | 0.973 | 0.540 (HMEA) | +0.433 | DBP15K ZH-EN (Table IX) | MEAformer and iterative techniques give H@1 ~0.97 (Table IX) | Table IX |
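For readers less familiar with the metrics in the table, here is a minimal sketch of how H@1 (Hits@1) and the Delta column are typically computed. It assumes the standard definition of Hits@k as the fraction of test queries whose correct entity is ranked within the top k; the function names are hypothetical and the rank list in the usage note is made up for illustration, not taken from the survey.

```python
def hits_at_k(ranks, k=1):
    """Fraction of queries whose correct entity has a 1-based rank <= k."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


def delta(value, baseline, digits=4):
    """Absolute improvement over a baseline, rounded for display."""
    return round(value - baseline, digits)
```

For example, `hits_at_k([1, 2, 1, 5])` counts two of four queries as hits, and `delta(0.973, 0.540)` reproduces the entity-alignment delta reported in the table. For the accuracy row the same subtraction applies to percentages, so the result is in percentage points (pp).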
What To Try In 7 Days
Wire a simple retriever + KG lookup into an existing VQA or retrieval pipeline and compare accuracy vs baseline.
Attach surface-name and image-attribute features from DBpedia/IMGPedia to a small product dataset and measure alignment gains.
Run a small dense retriever (FAISS+DPR) and a lightweight pruning step to see if retrieved KG facts improve model answers on 50 hard queries.
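The retrieval→pruning→fusion steps in the list above can be prototyped before any model is involved. The following is a minimal sketch, not a definitive implementation: it swaps in a bag-of-words cosine scorer where a real setup would use a dense retriever such as DPR with a FAISS index, and the toy KG, the threshold value, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical toy KG; a real experiment would load facts from DBpedia or similar.
KG_FACTS = [
    ("Eiffel_Tower", "locatedIn", "Paris"),
    ("Eiffel_Tower", "designedBy", "Gustave_Eiffel"),
    ("Paris", "capitalOf", "France"),
    ("Mount_Fuji", "locatedIn", "Japan"),
]


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, facts, top_k=3):
    """Score every fact against the query and keep the top_k (stand-in for DPR+FAISS)."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(" ".join(f)))), f) for f in facts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def prune(scored, min_score=0.2):
    """Drop weakly matching facts so noisy retrieval does not reach the model."""
    return [fact for score, fact in scored if score >= min_score]


def build_prompt(question, facts):
    """Fuse the surviving facts with the question as plain-text context."""
    lines = [f"{h} {r} {t}" for h, r, t in facts]
    return "Facts:\n" + "\n".join(lines) + f"\nQuestion: {question}"


question = "Who designed the Eiffel Tower?"
kept = prune(retrieve(question, KG_FACTS))
prompt = build_prompt(question, kept)
```

The resulting `prompt` would be passed to whatever VQA or QA model is already in place. Comparing answers with and without the "Facts" block on the 50 hard queries gives the baseline comparison described above, and varying `min_score` shows how sensitive the result is to pruning.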
Risks & Boundaries
Limitations
Survey focuses mainly on vision-language (image+text) settings; other modalities (audio, scientific modalities) receive less coverage.
Comparisons across methods are confounded by different knowledge sources, backbone models, and usage of gold supporting facts.
When Not To Use
When you need a lightweight, purely visual model without external knowledge — adding KGs increases complexity and storage.
When real-time, low-latency constraints forbid external retrieval or large VLM/LLM inference.
Failure Modes
Incorrect or noisy retrieval: irrelevant KG facts can mislead downstream reasoning.
Modality mismatch: images that do not represent the intended concept (aspect ambiguity) lead to wrong alignments.

