Survey of multimodal RAG: methods, datasets, benchmarks, and open problems

February 12, 2025 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The paper compiles strong evidence from many recent works but does not run new experiments, so it is valuable for planning and design rather than proving performance gains.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/0

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 65%

Production readiness: 60%

Novelty: 20%

Authors

Mohammad Mahdi Abootorabi, Amirhosein Zobeiri, Mahdi Dehghani, Mohammadali Mohammadkhani, Bardia Mohammadi, Omid Ghahroodi, Mahdieh Soleymani Baghshah, Ehsaneddin Asgari

Why It Matters For Business

Multimodal RAG lets products ground language outputs in real images, audio, video, and documents, reducing hallucinations and enabling new services like evidence-backed visual Q&A and multimodal search.

Who Should Care

Summary TLDR

This paper is a wide-angle survey of multimodal Retrieval-Augmented Generation (RAG). It explains how models retrieve images, audio, video, and text from external sources and fuse them into LLM-based generation. The survey reviews retrieval strategies (dense, sparse, hybrid), fusion methods (score fusion, attention, unified projections), augmentation (iterative/adaptive retrieval), generation (in-context learning, chain-of-thought, instruction tuning), datasets and benchmarks, robustness and loss functions, agentic pipelines, and open challenges like modality bias, long-context scaling, and poisoning attacks. The authors curate >100 recent papers and a public repository to help practitioners

Problem Statement

Large language models hallucinate and go stale because they rely on static parametric memory. RAG mitigates this by supplying external knowledge at inference time. Extending RAG to multiple modalities (images, audio, video, documents) raises new retrieval, alignment, fusion, and evaluation challenges that lack a unified treatment. This survey maps the field, catalogs datasets and metrics, compares methods, and highlights gaps to guide applied work.

Main Contribution

Comprehensive review of multimodal RAG components: retrieval, fusion, augmentation, generation, training, and agents.

A structured taxonomy that groups recent papers by core technical ideas and application domains.

Key Findings

The field is active and diverse: the survey reviews over 100 recent papers.

Numbers: 100+ papers reviewed

Practical Use: If you build a multimodal RAG system, expect many design options; reuse established retrieval+fusion patterns rather than inventing from scratch.

Evidence Ref: Contributions; Related Works

Evaluation uses many different metrics across retrieval and generation.

Numbers: about 60 distinct metrics reported

Practical Use: Pick a small, task-aligned metric set (e.g., Recall@K + ROUGE/CIDEr + CLIPScore) to avoid inconsistent comparisons.

Evidence Ref: C Evaluation and Metrics
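
As a rough illustration of the retrieval side of such a metric set, the sketch below computes Recall@K and MRR from ranked result lists. The function names, the toy ids, and the assumption that each ranked list contains unique ids are all illustrative; none of this is taken from the survey.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Toy evaluation set: (ranked retrieval output, ground-truth relevant ids) per query.
runs = [
    (["img_3", "doc_1", "img_7"], {"img_7", "doc_9"}),
    (["doc_2", "img_4", "doc_5"], {"doc_2"}),
]

# Average each metric over queries; MRR is the mean of the reciprocal ranks.
mean_recall = sum(recall_at_k(r, rel, k=2) for r, rel in runs) / len(runs)
mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
print(f"Recall@2={mean_recall:.3f}  MRR={mrr:.3f}")
```

Reporting the same K and the same averaging scheme across experiments is what keeps comparisons consistent; generation metrics such as ROUGE or CLIPScore would come from their own libraries and are not sketched here.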

What To Try In 7 Days

Prototype a simple multimodal RAG: use a dense image-text retriever (CLIP) + top-K passage retrieval + an LLM prompt with retrieved context.

Run focused evals: measure Recall@K for retrieval and ROUGE/CIDEr or CLIPScore for generation on a small domain dataset.

Add source attribution to outputs and test for mismatches and modality bias (text-dominant outputs).
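
The first and third steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the embeddings are hand-written toy vectors standing in for the output of a shared image-text encoder such as CLIP, the `Item` class and helper names are hypothetical, and the final LLM call is left out, so the code only builds the prompt with source ids attached.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Item:
    item_id: str                # carried into the prompt for source attribution
    modality: str               # e.g. "image" or "text"
    content: str                # caption, passage, or pointer to the asset
    embedding: Sequence[float]  # assumed to come from a shared image-text encoder


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_emb: Sequence[float], index: List[Item], k: int) -> List[Tuple[float, Item]]:
    """Brute-force dense retrieval: score every item, keep the k best."""
    scored = [(cosine(query_emb, item.embedding), item) for item in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(question: str, hits: List[Tuple[float, Item]]) -> str:
    """Put retrieved context in the prompt, each entry tagged with its source id."""
    lines = ["Answer using only the sources below and cite their ids in brackets.", ""]
    for _score, item in hits:
        lines.append(f"[{item.item_id}] ({item.modality}) {item.content}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


# Toy index; in a real prototype these vectors would be produced by the encoder.
index = [
    Item("img_1", "image", "photo of a red bicycle", [0.9, 0.1, 0.0]),
    Item("txt_1", "text", "Bicycles were popularised in the 19th century.", [0.6, 0.4, 0.1]),
    Item("img_2", "image", "chest X-ray, frontal view", [0.0, 0.2, 0.9]),
]
query_embedding = [1.0, 0.2, 0.0]  # toy stand-in for an encoded user query
hits = retrieve_top_k(query_embedding, index, k=2)
prompt = build_prompt("What colour is the bicycle?", hits)
print(prompt)
```

Keeping the ids in the prompt makes it possible to check afterwards whether the cited sources match what was retrieved, and which modalities the answer actually relied on.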

Agent Features

Memory
episodic retrieval memory; external multimodal knowledge bases
Planning
plan-then-retrieve patterns; query decoupling for long videos
Tool Use
retrievers (MIPS/ScaNN); re-rankers and LLM-based reranking
Frameworks
agentic RAG (discussed, e.g., Goldenretriever/PlanRAG); self-guided multimodal retrieval pipelines
Architectures
vision-language backbones; dual-stream cross-attention; unified embedding projections
Collaboration
multi-agent coordination (surveyed trend)

Optimization Features

Token Efficiency
Query Dropout and context filtering to reduce token usage
Infra Optimization
TPU-KNN and distributed MIPS for large-scale retrieval
Model Optimization
dual-stream and cross-attention fusion to reduce redundant computation; projectors/MLPs to align embeddings
System Optimization
learned index structures; sparse-dense hybrid retrieval to trade storage and recall
Training Optimization
contrastive pretraining (InfoNCE); hard-negative mining and mixup; noise-injected training for robustness
Inference Optimization
approximate MIPS for sublinear retrieval; coarse-to-fine retrieval and re-ranking
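
To make the coarse-to-fine pattern in the last item concrete, here is a minimal two-stage sketch. It rests on loud assumptions: the coarse stage scores only the first few embedding dimensions as a cheap proxy (a real system would more likely use an approximate MIPS index), and the fine stage uses the exact inner product as a stand-in for a heavier re-ranker. The function names, parameters, and toy vectors are illustrative only.

```python
from typing import Dict, List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def coarse_to_fine(
    query: Sequence[float],
    corpus: Dict[str, Sequence[float]],
    shortlist_size: int,
    k: int,
    coarse_dims: int = 2,
) -> List[Tuple[str, float]]:
    # Coarse stage: cheap score on a prefix of each vector, keep a shortlist.
    coarse_ranked = sorted(
        corpus,
        key=lambda doc_id: inner_product(query[:coarse_dims], corpus[doc_id][:coarse_dims]),
        reverse=True,
    )
    shortlist = coarse_ranked[:shortlist_size]

    # Fine stage: full-precision score, computed for the shortlist only.
    reranked = sorted(
        ((doc_id, inner_product(query, corpus[doc_id])) for doc_id in shortlist),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return reranked[:k]


query = [1.0, 0.0, 1.0, 0.0]
corpus = {
    "a": [0.9, 0.1, 0.0, 0.0],
    "b": [0.7, 0.0, 0.9, 0.0],
    "c": [0.1, 0.9, 0.9, 0.0],
    "d": [0.5, 0.5, 0.1, 0.0],
}
result = coarse_to_fine(query, corpus, shortlist_size=3, k=2)
print([doc_id for doc_id, _ in result])  # ids after re-ranking: b, then a
```

In this toy corpus item "c" has the second-highest full score but is dropped by the coarse stage, which shows the trade-off: a smaller shortlist saves computation in the fine stage at the cost of recall.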

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Concise method descriptions due to space; not exhaustive for every subtechnique

No new experimental comparison across methods in this survey

When Not To Use

When you need reproducible benchmark numbers (use the original papers and benchmark suites instead)

When compute or latency budgets prevent running retrieval and multimodal encoders in production

Failure Modes

Hallucination when retrieval fails or returns irrelevant multimodal context

Modality bias: system falls back to text and ignores other modalities
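
One rough way to watch for the modality-bias failure mode is to track which modalities the sources cited in generated answers belong to. The sketch below is only a heuristic signal, not a metric defined in the survey; the function name and the idea of logging one modality label per cited source are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def modality_share(cited_modalities: Iterable[str]) -> Dict[str, float]:
    """Share of cited sources per modality across a batch of answers."""
    counts = Counter(cited_modalities)
    total = sum(counts.values())
    return {modality: n / total for modality, n in counts.items()} if total else {}


# Hypothetical log: one label for every source cited in a batch of answers.
shares = modality_share(["text", "text", "image", "text"])
print(shares)  # {'text': 0.75, 'image': 0.25}
```

A share that stays heavily skewed toward text on questions that need visual or audio evidence suggests the system is ignoring the other modalities.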

Core Entities

Models

CLIP, BLIP, GPT-4, RA-CM3, MuRAG, REVEAL, RA-BLIP, Megapairs, Raven, SAM-RAG

Metrics

Recall@K, MRR, ROUGE, BLEU, CIDEr, CLIPScore, FID, FAD

Datasets

LAION-5B, LAION-400M, MINT-1T, MS-COCO, Flickr30K, VQA, OK-VQA, HowTo100M, MIMIC-CXR

Benchmarks

M2RAG, MRAG-Bench, RAG-Check, Dyn-VQA, MMBench

Context Entities

Models

Q-Former, DINOv2, Uni-IR, MARVEL

Metrics

CIDEr, SPICE, BERTScore, Inception Score

Datasets

WebVid, YouCook2, ActivityNet, AudioSet, AudioCaps

Benchmarks

VisDoM, OmniDocBench, MMLongBench-Doc