VideoRAG: index and search unlimited‑length videos with graph grounding plus multi‑modal retrieval

Overview

Decision Snapshot: Ready For Pilot

Evidence comes from LLM-based head-to-head judgments, quantitative 1–5 scoring, ablations, and a curated 164-video benchmark; evaluations rely on GPT-4o-mini and YouTube-sourced data, which limit generality.

Citations: 2

Evidence Strength: 0.75

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 70%

Authors

Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

VideoRAG enables searchable QA and summarization across many long videos, unlocking education, media-archive search, and customer-support video analytics without retraining large models.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead, Founder

Summary TLDR

VideoRAG is a retrieval-augmented generation system built to index and answer questions over collections of many extremely long videos. It uses a dual-channel index: (1) a graph of text entities built from VLM captions and ASR transcripts, and (2) multi‑modal embeddings (ImageBind-style) for direct visual matching. The authors release LongerVideos (164 videos, ~134.6 hours, 602 queries). In LLM-based head-to-head judgments using GPT-4o-mini, VideoRAG was chosen more often than the baselines (≈53% overall win rate against naive, graph-based, and lightweight RAG variants) and scores ~4.45/5 in quantitative comparisons against baselines. Ablations show that both graph grounding and visual indexing materially improve results. Code and the dataset are available (see Reproducibility).
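To make the dual-channel idea concrete, here is a minimal sketch under stated assumptions. The class and method names (`DualChannelIndex`, `add_clip`, `clips_for_entity`) are hypothetical and are not VideoRAG's API; the embeddings are hand-written toy vectors, where a real system would get them from a multi-modal encoder.

```python
from dataclasses import dataclass, field


@dataclass
class DualChannelIndex:
    """Toy stand-in for a dual-channel index (hypothetical, not VideoRAG's API)."""
    # Channel 1: entity name -> ids of clips whose captions/transcripts mention it.
    entity_to_clips: dict = field(default_factory=dict)
    # Channel 2: clip id -> embedding vector for direct visual matching.
    clip_embeddings: dict = field(default_factory=dict)

    def add_clip(self, clip_id, entities, embedding):
        for entity in entities:
            self.entity_to_clips.setdefault(entity.lower(), set()).add(clip_id)
        self.clip_embeddings[clip_id] = list(embedding)

    def clips_for_entity(self, entity):
        # Entities are shared across videos, which is what links clips together.
        return sorted(self.entity_to_clips.get(entity.lower(), set()))


index = DualChannelIndex()
index.add_clip("video1#00:00-00:30", ["Transformer", "attention"], [0.1, 0.9])
index.add_clip("video2#05:00-05:30", ["attention"], [0.2, 0.8])
shared = index.clips_for_entity("Attention")  # clips from both videos
```

The point of the sketch is only that one entity key can reach clips in different videos, while the embedding channel stays available for queries that the text channel cannot answer.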

Problem Statement

Current RAG systems focus on text and short clips. Real problems require reasoning across many long videos: (1) how to extract and merge multi‑modal knowledge (visual, audio, transcripts); (2) how to preserve semantic links across videos; (3) how to retrieve the most relevant clips quickly from an unbounded video corpus.

Main Contribution

VideoRAG: a dual-channel RAG system combining graph-based textual grounding with multi-modal embeddings to index unlimited-length videos

LongerVideos benchmark: 164 videos (~134.6 hours) and 602 curated queries for cross-video evaluation

Key Findings

VideoRAG wins more LLM head-to-head judgments than standard RAG baselines

Numbers: VideoRAG chosen 53.26% vs baselines' 46.74% (Overall Winner, Table 2)

Practical Use: If you need better answer quality on multi‑video queries, replace naive chunk‑based retrieval with VideoRAG's hybrid indexing to get modest but consistent improvements.

Evidence Ref: Table 2

Quantitative scoring rates VideoRAG well above other long-video methods

Numbers: VideoRAG overall score ≈ 4.45/5 vs lower scores (Table 3)

Practical Use: On evaluated benchmarks, VideoRAG produces higher-rated answers; expect stronger, more detailed responses when judged by an LLM-based evaluator.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Win-rate (LLM head-to-head overall winner)	VideoRAG 53.26% vs NaiveRAG 46.74%	NaiveRAG	≈ +6.5 pp	LongerVideos (all)	VideoRAG chosen more often in pairwise comparisons across categories	Table 2
Quantitative overall score (1–5 scale vs baseline)	4.45	NaiveRAG baseline reference	—	LongerVideos (all)	Score assigned by LLM judge comparing to baseline answers	Table 3
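The delta in the win-rate row is a plain difference in percentage points. A quick check using only the two figures quoted above (variable names are illustrative):

```python
videorag_win = 53.26  # % of pairwise judgments won by VideoRAG (Table 2, as quoted)
baseline_win = 46.74  # % of pairwise judgments won by the baseline

# Absolute gap in percentage points, not a relative improvement.
delta_pp = round(videorag_win - baseline_win, 2)

# In a two-way comparison the two shares should account for every judgment.
shares_sum = videorag_win + baseline_win

print(delta_pp)
```

The gap is 6.52 percentage points, which the table rounds to "≈ +6.5 pp".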

What To Try In 7 Days

Run VideoRAG code on a small video collection (e.g., 5–10 hours) to test cross-video QA

Add LLM-based entity extraction on transcripts to build a simple knowledge graph for your videos

Compare hybrid retrieval (text+visual) vs text-only retrieval on a few representative queries
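For the last item, a comparison harness can be very small. The sketch below is an assumption-laden toy: the vectors are hand-made, and the weighted blend with `alpha` is an illustrative fusion rule, not the scoring VideoRAG uses. In practice the text and visual vectors would come from a text embedding model and a multi-modal encoder.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query, clips, use_visual, alpha=0.5):
    """Rank clip ids by text similarity, optionally blended with visual similarity."""
    scores = {}
    for clip_id, clip in clips.items():
        score = cosine(query["text"], clip["text"])
        if use_visual:
            # Assumed fusion rule: simple weighted average of the two channels.
            score = alpha * score + (1 - alpha) * cosine(query["visual"], clip["visual"])
        scores[clip_id] = score
    return sorted(scores, key=scores.get, reverse=True)


# Toy vectors: clip_a matches the query text, clip_b matches it visually.
clips = {
    "clip_a": {"text": [1.0, 0.0], "visual": [0.0, 1.0]},
    "clip_b": {"text": [0.8, 0.6], "visual": [1.0, 0.0]},
}
query = {"text": [1.0, 0.0], "visual": [1.0, 0.0]}

text_only = rank(query, clips, use_visual=False)
hybrid = rank(query, clips, use_visual=True)
```

Here the two settings return the clips in opposite order. Queries where the rankings diverge like this are the ones worth inspecting by hand.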

Agent Features

Tool Use

LLMs for indexing, query reformulation, and filtering

VLMs for visual captioning

ASR for transcripts

ImageBind-style multi-modal encoder

Frameworks

Graph-based indexing + embedding-based retrieval

Optimization Features

Token Efficiency

Chunking and entity synthesis reduce LLM context burden
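A minimal sketch of the chunking half, assuming transcript segments arrive as `(start_s, end_s, text)` tuples and that a fixed 30-second window is acceptable. The function name and the window length are illustrative choices, not values taken from the paper.

```python
def chunk_segments(segments, window_s=30.0):
    """Group (start_s, end_s, text) transcript segments into fixed-length windows."""
    chunks = {}
    for start, end, text in segments:
        window = int(start // window_s)  # window index is chosen by segment start time
        chunks.setdefault(window, []).append(text)
    return [
        {"start_s": w * window_s, "end_s": (w + 1) * window_s, "text": " ".join(parts)}
        for w, parts in sorted(chunks.items())
    ]


segments = [
    (0.0, 4.0, "Welcome to the lecture."),
    (4.0, 12.0, "Today we cover retrieval."),
    (31.0, 40.0, "First, indexing."),
]
chunks = chunk_segments(segments)
```

Each chunk is small enough to hand to an LLM on its own, so no prompt ever needs the full transcript.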

System Optimization

Incremental graph construction for scalable knowledge updates
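One way to picture incremental construction is a merge step that folds each new video's extracted entities into a shared graph without rebuilding it. The sketch below assumes a plain dict-and-set graph and case-insensitive entity matching; `merge_into_graph` and the data layout are hypothetical, and the entities are toy examples.

```python
def merge_into_graph(graph, extracted):
    """Merge one video's extracted entities and relations into a shared graph in place."""
    for name, description in extracted["entities"].items():
        # Same entity seen in another video: reuse the node, keep both descriptions.
        node = graph["nodes"].setdefault(name.lower(), {"descriptions": []})
        if description not in node["descriptions"]:
            node["descriptions"].append(description)
    for source, relation, target in extracted["relations"]:
        # A set makes repeated relations across videos collapse into one edge.
        graph["edges"].add((source.lower(), relation, target.lower()))
    return graph


graph = {"nodes": {}, "edges": set()}
video_1 = {
    "entities": {"Transformer": "A neural network architecture.",
                 "Attention": "A weighting mechanism."},
    "relations": [("Transformer", "uses", "Attention")],
}
video_2 = {
    "entities": {"transformer": "Discussed again in a later toy video."},
    "relations": [("Transformer", "uses", "Attention")],
}
for extracted in (video_1, video_2):
    merge_into_graph(graph, extracted)
```

After both merges the graph has two nodes and one edge, and the shared node carries a description from each video. A production system would likely also need fuzzy entity matching and description summarization, which this sketch omits.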

Inference Optimization

Indexing into text chunks and embeddings to avoid reprocessing full videos at query time

LLM filtering reduces downstream generation load by pruning irrelevant clips
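The filtering step can be sketched as a function that takes a relevance judge as a parameter. In the toy below the judge is a keyword-overlap placeholder so the example runs offline; in a real pipeline `is_relevant` would wrap an LLM yes/no prompt. All names here are hypothetical.

```python
def filter_clips(query, candidates, is_relevant):
    """Keep only candidates the judge marks relevant."""
    return [clip for clip in candidates if is_relevant(query, clip["caption"])]


def keyword_judge(query, caption):
    # Placeholder for an LLM relevance call: true if any query word appears in the caption.
    query_words = set(query.lower().split())
    return bool(query_words & set(caption.lower().split()))


candidates = [
    {"id": "c1", "caption": "speaker explains graph indexing"},
    {"id": "c2", "caption": "intro music and title card"},
]
kept = filter_clips("how does graph indexing work", candidates, keyword_judge)
```

Only `c1` survives, so the generator sees one clip instead of two. The saving grows with the size of the candidate set.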

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/HKUDS/VideoRAG

Data URLs

https://github.com/HKUDS/VideoRAG (LongerVideos dataset referenced)

Risks & Boundaries

Limitations

Relies on VLM and ASR quality; errors propagate into the graph and retrieval

Evaluation uses an LLM judge (GPT-4o-mini) which can introduce preference bias

When Not To Use

Real-time or low-latency systems where per-query embedding/LLM steps are too slow

Private or sensitive video collections without consent for third-party processing

Failure Modes

Noisy transcripts lead to wrong entity nodes and bad retrievals

Visual captions miss key scene details, causing retrieval misses

Core Entities

Models

MiniCPM-V (quantized VLM)

Distil-Whisper (ASR)

ImageBind (multi-modal encoder)

text-embedding-3-small (OpenAI)

GPT-4o-mini (LLM judge/generator)

Metrics

Win-rate comparison (LLM judge)

5-point quantitative score (vs baseline)

Comprehensiveness, Empowerment, Trustworthiness, Depth, Density
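A win-rate of this kind is just each system's share of pairwise verdicts. A minimal sketch, assuming one verdict string per query; the verdicts below are toy values, not data from the paper.

```python
from collections import Counter


def win_rates(judgments):
    """Return each system's share of pairwise wins, in percent."""
    counts = Counter(judgments)
    total = sum(counts.values())
    return {system: round(100 * n / total, 2) for system, n in counts.items()}


# Toy verdicts from a hypothetical LLM judge, one per query.
judgments = ["VideoRAG", "NaiveRAG", "VideoRAG", "VideoRAG"]
rates = win_rates(judgments)
```

With these four verdicts the shares are 75% and 25%. LLM judges can be sensitive to answer order, so running each pair in both orders before counting is a reasonable precaution.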

Datasets

LongerVideos

Benchmarks

LongerVideos (this work)

Context Entities

Models

LLaMA-VID

VideoAgent

NotebookLM

GraphRAG

LightRAG

Metrics

Relative win-rate

Mean quantitative score

Datasets

MLVU (prior work)

LVBench (prior work)

Benchmarks

Long-video QA benchmarks referenced (MLVU, LVBench)
