Survey of multimodal RAG: methods, datasets, benchmarks, and open problems

Overview

Decision Snapshot: Ready For Pilot

The paper compiles strong evidence from many recent works but runs no new experiments, so it is more useful for planning and design than for proving performance gains.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/0

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 65%

Production readiness: 60%

Novelty: 20%

Authors

Mohammad Mahdi Abootorabi, Amirhosein Zobeiri, Mahdi Dehghani, Mohammadali Mohammadkhani, Bardia Mohammadi, Omid Ghahroodi, Mahdieh Soleymani Baghshah, Ehsaneddin Asgari

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Multimodal RAG lets products ground language outputs in real images, audio, video, and documents, reducing hallucinations and enabling new services like evidence-backed visual Q&A and multimodal search.

Who Should Care

ML Engineer, Product Manager, CTO, Data Scientist

Summary TLDR

This paper is a wide-angle survey of multimodal Retrieval-Augmented Generation (RAG). It explains how models retrieve images, audio, video, and text from external sources and fuse them into LLM-based generation. The survey reviews retrieval strategies (dense, sparse, hybrid), fusion methods (score fusion, attention, unified projections), augmentation (iterative/adaptive retrieval), generation (in-context learning, chain-of-thought, instruction tuning), datasets and benchmarks, robustness and loss functions, agentic pipelines, and open challenges like modality bias, long-context scaling, and poisoning attacks. The authors curate more than 100 recent papers and a public repository to help practitioners.

Problem Statement

Large language models hallucinate and go stale because they rely on static parametric memory. RAG mitigates both problems by grounding generation in external knowledge. Extending RAG to multiple modalities (images, audio, video, documents) raises new retrieval, alignment, fusion, and evaluation challenges that lack a unified treatment. This survey maps the field, catalogs datasets and metrics, compares methods, and highlights gaps to guide applied work.

Main Contribution

Comprehensive review of multimodal RAG components: retrieval, fusion, augmentation, generation, training, and agents.

A structured taxonomy that groups recent papers by core technical ideas and application domains.

Key Findings

The field is active and diverse: the survey reviews over 100 recent papers.

Numbers: 100+ papers reviewed

Practical Use: If you build a multimodal RAG system, expect many design options; reuse established retrieval+fusion patterns rather than inventing from scratch.

Evidence Ref: Contributions; Related Works

Evaluation uses many different metrics across retrieval and generation.

Numbers: about 60 distinct metrics reported

Practical Use: Pick a small, task-aligned metric set (e.g., Recall@K + ROUGE/CIDEr + CLIPScore) to avoid inconsistent comparisons.

Evidence Ref: C Evaluation and Metrics
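As a rough illustration of the retrieval side of such a metric set, the sketch below computes Recall@K and MRR over a toy run. The data, function names, and the particular Recall@K definition (fraction of relevant items found in the top K; some papers use a hit-rate variant instead) are assumptions for illustration, not taken from the survey.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean(values):
    """Average over queries; 0.0 for an empty run."""
    return sum(values) / len(values) if values else 0.0


# Hypothetical toy run: (ranked retrieval results, set of relevant ids) per query.
runs = [
    (["img_3", "img_7", "img_1"], {"img_1", "img_7"}),
    (["doc_9", "doc_2", "doc_4"], {"doc_4"}),
]

recall_at_2 = mean([recall_at_k(ranked, rel, 2) for ranked, rel in runs])
mrr = mean([reciprocal_rank(ranked, rel) for ranked, rel in runs])
print(f"Recall@2={recall_at_2:.3f}  MRR={mrr:.3f}")
```

Fixing one definition of each metric up front and reusing it across experiments is what keeps comparisons consistent.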

What To Try In 7 Days

Prototype a simple multimodal RAG: use a dense image-text retriever (CLIP) + top-K passage retrieval + an LLM prompt with retrieved context.

Run focused evals: measure Recall@K for retrieval and ROUGE/CIDEr or CLIPScore for generation on a small domain dataset.

Add source attribution to outputs and test for mismatches and modality bias (text-dominant outputs).
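The three steps above can be sketched end to end with placeholders. This is a minimal sketch, assuming embeddings have already been produced by some image-text encoder such as CLIP; the vectors, ids, captions, and prompt wording below are all hypothetical, and no real model or LLM API is called.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, k=2):
    """Return the k index entries most similar to the query embedding."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return ranked[:k]


def build_prompt(question, hits):
    """Assemble an LLM prompt that tags every retrieved item with its source id."""
    lines = ["Answer using only the sources below and cite them by id."]
    for hit in hits:
        lines.append(f"[{hit['id']}] ({hit['modality']}) {hit['text']}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Hypothetical toy index; in practice "vec" would come from the encoder.
index = [
    {"id": "img_1", "modality": "image", "text": "caption: red bicycle by a wall", "vec": [0.9, 0.1, 0.0]},
    {"id": "doc_1", "modality": "text", "text": "Bicycle maintenance guide excerpt", "vec": [0.7, 0.3, 0.1]},
    {"id": "aud_1", "modality": "audio", "text": "transcript: weather forecast", "vec": [0.0, 0.2, 0.9]},
]

query_vec = [1.0, 0.0, 0.0]  # hypothetical query embedding
hits = retrieve(query_vec, index, k=2)
prompt = build_prompt("What colour is the bicycle?", hits)

# Crude modality-bias probe: share of retrieved context that is plain text.
text_share = sum(hit["modality"] == "text" for hit in hits) / len(hits)
print(prompt)
print(f"text share of context: {text_share:.2f}")
```

Tracking the text share of retrieved context over a test set is only one simple signal of modality bias; checking whether cited source ids actually support the answer still needs a separate attribution check.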

Agent Features

Memory

episodic retrieval memory

external multimodal knowledge bases

Planning

plan-then-retrieve patterns

query decoupling for long videos

Tool Use

retrievers (MIPS/ScaNN)

re-rankers and LLM-based reranking

Frameworks

agentic RAG (discussed, e.g., Goldenretriever/PlanRAG)

self-guided multimodal retrieval pipelines

Architectures

vision-language backbones

dual-stream cross-attention

unified embedding projections

Collaboration

multi-agent coordination (surveyed trend)

Optimization Features

Token Efficiency

Query Dropout and context filtering to reduce token usage

Infra Optimization

TPU-KNN and distributed MIPS for large-scale retrieval

Model Optimization

dual-stream and cross-attention fusion to reduce redundant computation

projectors/MLPs to align embeddings

System Optimization

learned index structures

sparse-dense hybrid retrieval to trade storage and recall
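One common way to combine sparse and dense retrieval is weighted score fusion after min-max normalization. The sketch below shows that variant only; the survey does not prescribe it, and the scores, document ids, and the `alpha` weight are illustrative assumptions.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(sparse, dense, alpha=0.5):
    """Weighted sum of normalized scores; alpha is the weight on the dense score.

    Documents missing from one retriever contribute 0.0 for that retriever.
    """
    sparse_n, dense_n = min_max(sparse), min_max(dense)
    docs = set(sparse_n) | set(dense_n)
    return {
        doc: alpha * dense_n.get(doc, 0.0) + (1 - alpha) * sparse_n.get(doc, 0.0)
        for doc in docs
    }


# Hypothetical scores: lexical (BM25-style) and embedding (cosine-style).
sparse = {"a": 12.0, "b": 4.0, "c": 8.0}
dense = {"a": 0.5, "b": 0.9, "d": 0.1}

fused = hybrid_scores(sparse, dense, alpha=0.5)
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)
```

Rank-based schemes such as reciprocal rank fusion are an alternative when raw scores from the two retrievers are not comparable even after normalization.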

Training Optimization

contrastive pretraining (InfoNCE)

hard-negative mining and mixup

noise-injected training for robustness
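For readers unfamiliar with InfoNCE, the sketch below computes a one-directional version of the loss from a similarity matrix in plain Python. It assumes candidate i is the positive for query i and every other candidate in the batch is a negative; the temperature value is an illustrative default, and real training code would use a tensor library and often a symmetric (both directions) variant.

```python
import math


def info_nce(sim_matrix, temperature=0.07):
    """Mean InfoNCE loss over a batch.

    sim_matrix[i][j] is the similarity between query i and candidate j;
    the diagonal entries are treated as the positive pairs.
    """
    losses = []
    for i, row in enumerate(sim_matrix):
        logits = [s / temperature for s in row]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the positive among all candidates.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Hypothetical similarity matrices for a batch of two image-text pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]      # positives score highest
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # negatives score highest
print(info_nce(aligned), info_nce(misaligned))
```

The loss is small when each positive outscores the in-batch negatives, which is why hard-negative mining matters: easy negatives contribute almost nothing to the gradient.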

Inference Optimization

approximate MIPS for sublinear retrieval

coarse-to-fine retrieval and re-ranking
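The coarse-to-fine pattern can be sketched as a cheap scorer that shortlists candidates followed by a costlier scorer that re-ranks only the shortlist. Both scorers below are toy stand-ins invented for illustration (in a real system the first stage would typically be approximate nearest-neighbour search and the second a cross-encoder or LLM-based reranker).

```python
def coarse_to_fine(query, corpus, cheap_score, expensive_score, shortlist=3, k=1):
    """Shortlist with a cheap scorer, then re-rank the shortlist with a costly one."""
    coarse = sorted(corpus, key=lambda doc: cheap_score(query, doc), reverse=True)[:shortlist]
    return sorted(coarse, key=lambda doc: expensive_score(query, doc), reverse=True)[:k]


def token_overlap(query, doc):
    """Toy first-stage scorer: number of shared tokens, ignoring order."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def bigram_overlap(query, doc):
    """Toy second-stage scorer: number of shared ordered token pairs."""
    def bigrams(text):
        tokens = text.lower().split()
        return set(zip(tokens, tokens[1:]))
    return len(bigrams(query) & bigrams(doc))


# Hypothetical caption corpus.
corpus = [
    "a dog chases a red ball",
    "red ball on a table",
    "a ball of red yarn",
    "city skyline at night",
]

top = coarse_to_fine("red ball", corpus, token_overlap, bigram_overlap, shortlist=3, k=2)
print(top)
```

The expensive scorer runs on only `shortlist` items rather than the whole corpus, which is where the latency saving comes from; the trade-off is that anything the first stage misses cannot be recovered by the second.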

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/llm-lab-org/Multimodal-RAG-Survey

Data URLs

https://arxiv.org/abs/2502.08826

https://arxiv.org/pdf/2502.08826v3

Risks & Boundaries

Limitations

Concise method descriptions due to space; not exhaustive for every subtechnique

No new experimental comparison across methods in this survey

When Not To Use

When you need reproducible benchmark numbers: use the original papers and benchmark suites instead

When compute or latency budgets prevent running retrieval and multimodal encoders in production

Failure Modes

Hallucination when retrieval fails or returns irrelevant multimodal context

Modality bias: system falls back to text and ignores other modalities

Core Entities

Models

CLIP, BLIP, GPT-4, RA-CM3, MuRAG, REVEAL, RA-BLIP, Megapairs, Raven, SAM-RAG

Metrics

Recall@K, MRR, ROUGE, BLEU, CIDEr, CLIPScore, FID, FAD

Datasets

LAION-5B, LAION-400M, MINT-1T, MS-COCO, Flickr30K, VQA, OK-VQA, HowTo100M, MIMIC-CXR

Benchmarks

M2RAG, MRAG-Bench, RAG-Check, Dyn-VQA, MMBench

Context Entities

Models

Q-Former, DINOv2, Uni-IR, MARVEL

Metrics

CIDEr, SPICE, BERTScore, Inception Score

Datasets

WebVid, YouCook2, ActivityNet, AudioSet, AudioCaps

Benchmarks

VisDoM, OmniDocBench, MMLongBench-Doc
