Survey of multimodal RAG: methods, datasets, benchmarks, and open problems

February 12, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.2

Cost Impact Score

0.65

Citation Count

0

Authors

Mohammad Mahdi Abootorabi, Amirhosein Zobeiri, Mahdi Dehghani, Mohammadali Mohammadkhani, Bardia Mohammadi, Omid Ghahroodi, Mahdieh Soleymani Baghshah, Ehsaneddin Asgari

Why It Matters For Business

Multimodal RAG lets products ground language outputs in real images, audio, video, and documents, reducing hallucinations and enabling new services like evidence-backed visual Q&A and multimodal search.

Summary TLDR

This paper is a broad survey of multimodal Retrieval-Augmented Generation (RAG). It explains how models retrieve images, audio, video, and text from external sources and fuse them into LLM-based generation. The survey reviews retrieval strategies (dense, sparse, hybrid), fusion methods (score fusion, attention, unified projections), augmentation (iterative/adaptive retrieval), generation (in-context learning, chain-of-thought, instruction tuning), datasets and benchmarks, robustness and loss functions, agentic pipelines, and open challenges such as modality bias, long-context scaling, and poisoning attacks. The authors review more than 100 recent papers and maintain a public repository to help practitioners.

Problem Statement

Large language models hallucinate and go out of date because they rely on static parametric memory. RAG mitigates this by retrieving external knowledge at inference time. Extending RAG to multiple modalities (images, audio, video, documents) raises new retrieval, alignment, fusion, and evaluation challenges that lack a unified treatment. This survey maps the field, catalogs datasets and metrics, compares methods, and highlights gaps to guide applied work.

Main Contribution

Comprehensive review of multimodal RAG components: retrieval, fusion, augmentation, generation, training, and agents.

A structured taxonomy that groups recent papers by core technical ideas and application domains.

A curated resource list (datasets, benchmarks, code links) published in the project repository.

A frank discussion of open problems: robustness, long-context scaling, modality bias, and evaluation gaps.

Key Findings

The field is active and diverse: the survey reviews over 100 recent papers.

Numbers: 100+ papers reviewed

Evaluation uses many different metrics across retrieval and generation.

Numbers: about 60 distinct metrics reported

There exist extremely large pretraining corpora for multimodal work.

Numbers: LAION-5B cited: 5.85B image-text pairs

Multimodal RAG systems commonly over-rely on text and suffer modality bias.

Adversarial knowledge injections can break multimodal RAG.

Who Should Care

What To Try In 7 Days

Prototype a simple multimodal RAG: use a dense image-text retriever (CLIP) + top-K passage retrieval + an LLM prompt with retrieved context.

Run focused evals: measure Recall@K for retrieval and ROUGE/CIDEr or CLIPScore for generation on a small domain dataset.

Add source attribution to outputs and test for mismatches and modality bias (text-dominant outputs).
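The three steps above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not the survey's method: the 3-dimensional vectors stand in for real CLIP embeddings, the `[id]` citation convention is a hypothetical prompt format, and the `answer` string is a made-up model output used only to exercise the attribution check.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query: List[float], index: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Brute-force dense retrieval: score every item, keep the k best."""
    scored = [(item_id, cosine(query, vec)) for item_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def build_prompt(question: str, hits: List[Tuple[str, float]],
                 items: Dict[str, Tuple[str, str]]) -> str:
    """Put retrieved context in the prompt, tagged with id and modality."""
    lines = ["Answer using only the sources below and cite their ids in brackets."]
    for item_id, _score in hits:
        modality, text = items[item_id]
        lines.append(f"[{item_id}] ({modality}) {text}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)

def attribution_report(answer: str, hits: List[Tuple[str, float]],
                       items: Dict[str, Tuple[str, str]]) -> dict:
    """Flag cited ids that were never retrieved and show citation share per modality."""
    retrieved = {item_id for item_id, _ in hits}
    cited = re.findall(r"\[([A-Za-z0-9_\-]+)\]", answer)
    unknown = sorted({c for c in cited if c not in retrieved})
    counts = Counter(items[c][0] for c in cited if c in retrieved)
    total = sum(counts.values())
    shares = {m: n / total for m, n in counts.items()} if total else {}
    return {"unknown": unknown, "shares": shares}

# Toy data: id -> embedding, and id -> (modality, caption or passage).
index = {"img_cat": [0.9, 0.1, 0.0], "txt_cat": [0.8, 0.3, 0.0], "txt_tax": [0.0, 0.1, 0.9]}
items = {
    "img_cat": ("image", "Photo of a tabby cat on a sofa"),
    "txt_cat": ("text", "Tabby cats have striped coats."),
    "txt_tax": ("text", "Filing deadlines for 2024."),
}
hits = top_k([1.0, 0.2, 0.0], index, k=2)
prompt = build_prompt("What animal is shown?", hits, items)
answer = "It is a tabby cat [txt_cat], see also [txt_9]."  # hypothetical LLM output
report = attribution_report(answer, hits, items)
```

In this toy run the report lists `txt_9` as a citation with no retrieved source, and the modality shares are entirely text even though an image was retrieved, which is the text-dominant pattern the third step asks you to look for.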

Agent Features

Memory

  • episodic retrieval memory
  • external multimodal knowledge bases

Planning

  • plan-then-retrieve patterns
  • query decoupling for long videos

Tool Use

  • retrievers (MIPS/ScaNN)
  • re-rankers and LLM-based reranking

Frameworks

  • agentic RAG (discussed, e.g., Goldenretriever/PlanRAG)
  • self-guided multimodal retrieval pipelines

Architectures

  • vision-language backbones
  • dual-stream cross-attention
  • unified embedding projections

Collaboration

  • multi-agent coordination (surveyed trend)

Optimization Features

Token Efficiency

  • Query Dropout and context filtering to reduce token usage

Infra Optimization

  • TPU-KNN and distributed MIPS for large-scale retrieval

Model Optimization

  • dual-stream and cross-attention fusion to reduce redundant computation
  • projectors/MLPs to align embeddings

System Optimization

  • learned index structures
  • sparse-dense hybrid retrieval to trade storage and recall
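One common way to combine sparse and dense retrieval is weighted score fusion after normalizing each score list. The sketch below assumes min-max normalization and a single mixing weight `alpha`; both are illustrative choices rather than something the survey prescribes, and the scores are made up.

```python
from typing import Dict, List, Tuple

def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so sparse and dense values are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def fuse(sparse: Dict[str, float], dense: Dict[str, float],
         alpha: float = 0.5) -> List[Tuple[str, float]]:
    """Weighted sum of normalized scores; a doc missing from one list gets 0 there."""
    s, d = min_max(sparse), min_max(dense)
    fused = {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
             for doc in set(s) | set(d)}
    # Sort by fused score, breaking ties by id for a deterministic order.
    return sorted(fused.items(), key=lambda pair: (-pair[1], pair[0]))

sparse_scores = {"a": 12.0, "b": 4.0, "c": 8.0}   # e.g. BM25-style scores
dense_scores = {"b": 0.9, "c": 0.8, "d": 0.5}     # e.g. embedding similarities
ranking = fuse(sparse_scores, dense_scores, alpha=0.5)
```

Document `c` ranks first here because it scores reasonably in both lists, while `a` and `b` each win only one. Raising `alpha` shifts weight toward the dense retriever.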

Training Optimization

  • contrastive pretraining (InfoNCE)
  • hard-negative mining and mixup
  • noise-injected training for robustness
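For reference, the InfoNCE objective used in contrastive pretraining can be written for one direction (image to text) as the average of -log(exp(s_ii / τ) / Σ_j exp(s_ij / τ)), where s_ij is the similarity between image i and text j and the matching pair sits on the diagonal. The sketch below is a plain-Python illustration of that formula with in-batch negatives; the temperature of 0.07 and the similarity values are assumptions for the example, and CLIP-style training typically averages this with the text-to-image direction.

```python
import math
from typing import List

def info_nce(sim: List[List[float]], temperature: float = 0.07) -> float:
    """One-directional InfoNCE over a square similarity matrix.

    sim[i][j] is the similarity of image i and text j; positives are sim[i][i],
    and every other column in row i acts as an in-batch negative.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / n

aligned = [[0.9, 0.1], [0.2, 0.8]]    # matching pairs score highest
shuffled = [[0.1, 0.9], [0.8, 0.2]]   # matching pairs score lowest
loss_aligned = info_nce(aligned)
loss_shuffled = info_nce(shuffled)
```

The aligned batch yields a much smaller loss than the shuffled one. When every similarity in a row is equal, the loss equals log(n), the value for a model that cannot tell positives from negatives.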

Inference Optimization

  • approximate MIPS for sublinear retrieval
  • coarse-to-fine retrieval and re-ranking
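Coarse-to-fine retrieval can be sketched as two sorts: a cheap scorer prunes the corpus to a shortlist, then an expensive scorer reorders only that shortlist. In the example below, word overlap stands in for the cheap stage and a hard-coded score table stands in for an expensive re-ranker such as a cross-encoder or an LLM. Both are hypothetical placeholders.

```python
from typing import Callable, Dict, List, Sequence

def coarse_to_fine(candidates: Sequence[str],
                   coarse: Callable[[str], float],
                   fine: Callable[[str], float],
                   shortlist: int, k: int) -> List[str]:
    """Keep the `shortlist` best by the cheap score, then re-rank those by the costly one."""
    stage1 = sorted(candidates, key=coarse, reverse=True)[:shortlist]
    return sorted(stage1, key=fine, reverse=True)[:k]

docs: Dict[str, str] = {
    "d1": "red car on street",
    "d2": "blue car parked",
    "d3": "red apple on table",
    "d4": "stock market report",
}
query_words = set("red car".split())

def overlap(doc_id: str) -> float:
    """Cheap stage: number of query words that appear in the document."""
    return float(len(query_words & set(docs[doc_id].split())))

# Hypothetical re-ranker scores; a real system would call a model here.
fine_scores = {"d1": 0.95, "d2": 0.40, "d3": 0.55, "d4": 0.99}

result = coarse_to_fine(list(docs), overlap, fine_scores.__getitem__, shortlist=3, k=2)
```

Note that `d4` never reaches the re-ranker despite its high fine score, because the coarse stage drops it. The shortlist size therefore caps the recall of the whole pipeline and is the main knob for trading latency against quality.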

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Concise method descriptions due to space; not exhaustive for every subtechnique
  • No new experimental comparison across methods in this survey
  • Selection bias: focus on major venues and recent papers may miss niche domain work

When Not To Use

  • When you need reproducible benchmark numbers: use the original papers and benchmark suites instead
  • When compute or latency budgets prevent running retrieval and multimodal encoders in production
  • When data privacy prevents sending multimodal inputs to external retrievers without secure orchestration

Failure Modes

  • Hallucination when retrieval fails or returns irrelevant multimodal context
  • Modality bias: system falls back to text and ignores other modalities
  • Adversarial knowledge poisoning and OCR-error cascades that mislead retrieval and generation

Core Entities

Models

  • CLIP
  • BLIP
  • GPT-4
  • RA-CM3
  • MuRAG
  • REVEAL
  • RA-BLIP
  • MegaPairs
  • Raven
  • SAM-RAG

Metrics

  • Recall@K
  • MRR
  • ROUGE
  • BLEU
  • CIDEr
  • CLIPScore
  • FID
  • FAD
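The two retrieval metrics at the top of this list are simple to compute by hand. The sketch below assumes the standard definitions: Recall@K is the fraction of a query's relevant items that appear in the top K results, and MRR averages the reciprocal rank of the first relevant item. The query ids and relevance labels are made up for illustration.

```python
from typing import Dict, List, Set

def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items found in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int) -> Dict[str, float]:
    """Average both metrics over all queries in `runs`."""
    n = len(runs)
    recall = sum(recall_at_k(r, qrels.get(q, set()), k) for q, r in runs.items()) / n
    mrr = sum(reciprocal_rank(r, qrels.get(q, set())) for q, r in runs.items()) / n
    return {f"recall@{k}": recall, "mrr": mrr}

runs = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}   # ranked results per query
qrels = {"q1": {"b"}, "q2": {"z", "w"}}                # relevant ids per query
metrics = evaluate(runs, qrels, k=2)
```

For `q1` the single relevant item is at rank 2, so it counts fully toward Recall@2 and contributes 1/2 to MRR. For `q2` neither relevant item is in the top 2, and the first one appears at rank 3.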

Datasets

  • LAION-5B
  • LAION-400M
  • MINT-1T
  • MS-COCO
  • Flickr30K
  • VQA
  • OK-VQA
  • HowTo100M
  • MIMIC-CXR

Benchmarks

  • M2RAG
  • MRAG-Bench
  • RAG-Check
  • Dyn-VQA
  • MMBench

Context Entities

Models

  • Q-Former
  • DINOv2
  • UniIR
  • MARVEL

Metrics

  • CIDEr
  • SPICE
  • BERTScore
  • Inception Score

Datasets

  • WebVid
  • YouCook2
  • ActivityNet
  • AudioSet
  • AudioCaps

Benchmarks

  • VisDoM
  • OmniDocBench
  • MMLongBench-Doc