VideoRAG: index and search unlimited‑length videos with graph grounding plus multi‑modal retrieval

February 3, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.4

Citation Count

2

Authors

Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang

Links

Abstract / PDF

Why It Matters For Business

VideoRAG enables searchable QA and summarization across many long videos, unlocking education, media-archive search, and customer-support video analytics without retraining large models.

Summary TLDR

VideoRAG is a retrieval-augmented generation system built to index and answer questions over collections of extremely long videos. It uses a dual-channel index: (1) a graph of text entities built from VLM captions and ASR transcripts, and (2) multi-modal embeddings (ImageBind-style) for direct visual matching. The authors release LongerVideos (164 videos, ~134.6 hours, 602 queries). In head-to-head judgments by GPT-4o-mini, VideoRAG was chosen more often than the RAG baselines (≈53% overall win rate against the NaiveRAG, GraphRAG, and LightRAG variants) and scores ~4.45/5 in quantitative comparisons against other long-video methods. Ablations show that both graph grounding and visual indexing materially improve results. Code and the dataset are released (GitHub link in the paper).

Problem Statement

Current RAG systems focus on text and short clips. Real applications require reasoning across many long videos, which raises three questions: (1) how to extract and merge multi-modal knowledge (visual, audio, transcripts); (2) how to preserve semantic links across videos; (3) how to retrieve the most relevant clips quickly from an unbounded video corpus.

Main Contribution

VideoRAG: a dual-channel RAG system combining graph-based textual grounding with multi-modal embeddings to index unlimited-length videos

LongerVideos benchmark: 164 videos (~134.6 hours) and 602 curated queries for cross-video evaluation

A multi-modal retrieval pipeline: text entity matching + visual embedding search + LLM filtering

Open-source code and dataset release (GitHub link in paper)
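
The three-stage retrieval pipeline listed above (text entity matching, visual embedding search, LLM filtering) can be sketched on toy in-memory data. This is a minimal illustration, not the released VideoRAG code: `Clip`, `entity_match`, `visual_search`, `hybrid_retrieve`, and the `keep` callback are hypothetical names, the two-dimensional embeddings are made up, and the LLM filter is replaced by a plain Python callable.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    clip_id: str
    entities: frozenset  # entity names from captions + ASR (assumed precomputed)
    embedding: tuple     # visual embedding (assumed precomputed, toy 2-D here)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def entity_match(clips, query_entities):
    """Textual channel: clips that share at least one entity with the query."""
    wanted = {e.lower() for e in query_entities}
    return [c for c in clips if wanted & {e.lower() for e in c.entities}]


def visual_search(clips, query_embedding, top_k=2):
    """Visual channel: top-k clips by cosine similarity to the query embedding."""
    ranked = sorted(clips, key=lambda c: cosine(c.embedding, query_embedding), reverse=True)
    return ranked[:top_k]


def hybrid_retrieve(clips, query_entities, query_embedding, keep, top_k=2):
    """Union both channels (de-duplicated, order preserved), then apply the filter."""
    candidates = {}
    for c in entity_match(clips, query_entities) + visual_search(clips, query_embedding, top_k):
        candidates.setdefault(c.clip_id, c)
    return [c for c in candidates.values() if keep(c)]


clips = [
    Clip("v1_c0", frozenset({"transformer", "attention"}), (1.0, 0.0)),
    Clip("v1_c1", frozenset({"optimizer"}), (0.0, 1.0)),
    Clip("v2_c0", frozenset({"diffusion"}), (0.9, 0.1)),
]
# `keep` stands in for the LLM relevance filter; here it accepts everything.
hits = hybrid_retrieve(clips, ["Attention"], (1.0, 0.0), keep=lambda c: True)
print([c.clip_id for c in hits])
```

The visual channel surfaces `v2_c0` even though it shares no entity with the query, which is the kind of match a text-only index would miss.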

Key Findings

VideoRAG wins more LLM head-to-head judgments than standard RAG baselines

Numbers: VideoRAG chosen 53.26% vs baselines' 46.74% (Overall Winner, Table 2)

Quantitative scoring rates VideoRAG well above other long-video methods

Numbers: VideoRAG overall score ≈ 4.45/5 vs lower scores (Table 3)

Both graph-based text grounding and visual retrieval are essential

Numbers: Ablation shows major performance drops when removing graph or vision components (Figure 2)

LongerVideos provides cross-video queries and scale for evaluation

Numbers: 164 videos, ~134.6 hours, 602 queries (Table 1)

Results

Win-rate (LLM head-to-head overall winner)

Value: VideoRAG 53.26% vs NaiveRAG 46.74%

Baseline: NaiveRAG

Quantitative overall score (1–5 scale vs baseline)

Value: 4.45

Baseline: NaiveRAG (reference baseline)

Dataset size

Value: 164 videos, ~134.6 hours, 602 queries
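
The win-rate figures above are shares of judge verdicts, which is why the two percentages sum to 100. A minimal sketch of that tally, assuming each comparison yields exactly one winner label; the verdict list is made up for illustration and `win_rates` is a hypothetical helper, not part of the paper's evaluation code.

```python
from collections import Counter


def win_rates(judgments):
    """Return each system's share of head-to-head wins, in percent."""
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


# Made-up verdicts, one per query; real runs would collect these from the LLM judge.
verdicts = ["VideoRAG", "NaiveRAG", "VideoRAG", "VideoRAG", "NaiveRAG"]
rates = win_rates(verdicts)
print(rates)
```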

Who Should Care

What To Try In 7 Days

Run VideoRAG code on a small video collection (e.g., 5–10 hours) to test cross-video QA

Add LLM-based entity extraction on transcripts to build a simple knowledge graph for your videos

Compare hybrid retrieval (text+visual) vs text-only retrieval on a few representative queries
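
For the entity-extraction step, a simple knowledge graph can start as two dictionaries: which chunks mention each entity, and which entities co-occur. The sketch below assumes the extractor is any callable that maps text to entity names; `keyword_extractor` and `VOCAB` are hypothetical stand-ins for an LLM call, and the transcript chunks are invented.

```python
from collections import defaultdict
from itertools import combinations


def build_entity_graph(chunks, extract_entities):
    """chunks: iterable of (chunk_id, text).

    Returns (mentions, edges): entity -> set of chunk ids, and
    (entity_a, entity_b) -> number of chunks where both appear.
    """
    mentions = defaultdict(set)
    edges = defaultdict(int)
    for chunk_id, text in chunks:
        ents = sorted({e.strip().lower() for e in extract_entities(text) if e.strip()})
        for e in ents:
            mentions[e].add(chunk_id)
        for a, b in combinations(ents, 2):  # sorted, so each pair has one canonical key
            edges[(a, b)] += 1
    return dict(mentions), dict(edges)


# Stand-in for an LLM extractor: a fixed vocabulary lookup (assumption for the demo).
VOCAB = ("gradient descent", "learning rate", "backpropagation")


def keyword_extractor(text):
    lowered = text.lower()
    return [term for term in VOCAB if term in lowered]


chunks = [
    ("lecture1_000", "Gradient descent depends on the learning rate."),
    ("lecture2_014", "Backpropagation feeds gradient descent."),
]
mentions, edges = build_entity_graph(chunks, keyword_extractor)
print(mentions["gradient descent"])
```

An entity that appears in chunks from different videos, like "gradient descent" here, is what links those videos in the graph.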

Agent Features

Tool Use

  • LLMs for indexing, query reformulation, and filtering
  • VLMs for visual captioning
  • ASR for transcripts
  • ImageBind-style multi-modal encoder

Frameworks

  • Graph-based indexing + embedding-based retrieval

Optimization Features

Token Efficiency

  • Chunking and entity synthesis reduce LLM context burden
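
A minimal sketch of the chunking idea, assuming ASR output arrives as `(start_s, end_s, text)` segments and that fixed-length windows are acceptable. The 30-second default, the segment format, and the function name are assumptions for illustration, not settings confirmed by this summary.

```python
def chunk_transcript(segments, window_s=30.0):
    """Group (start_s, end_s, text) ASR segments into fixed windows by start time."""
    buckets = {}
    for start, _end, text in segments:
        idx = int(start // window_s)
        buckets.setdefault(idx, []).append(text.strip())
    return [
        {"start_s": idx * window_s, "end_s": (idx + 1) * window_s, "text": " ".join(parts)}
        for idx, parts in sorted(buckets.items())
    ]


segments = [
    (0.0, 4.2, "Welcome."),
    (12.5, 20.0, "Today: RAG."),
    (31.0, 38.0, "First, indexing."),
]
transcript_chunks = chunk_transcript(segments)
for chunk in transcript_chunks:
    print(chunk)
```

Each chunk can then be summarized or sent to entity extraction on its own, so no single LLM call has to hold a full transcript.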

System Optimization

  • Incremental graph construction for scalable knowledge updates
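
Incremental construction means a newly indexed video only touches the nodes it mentions instead of triggering a rebuild. The sketch below is a simplification under a stated assumption: entities are merged by normalized name alone, whereas the summary describes LLM-based entity synthesis. `EntityIndex` and `add_video` are hypothetical names.

```python
class EntityIndex:
    """Toy entity store that grows one video at a time."""

    def __init__(self):
        # normalized name -> {"descriptions": [...], "videos": set()}
        self.nodes = {}

    def add_video(self, video_id, entities):
        """entities: iterable of (name, description). Returns newly created node keys."""
        created = []
        for name, description in entities:
            key = " ".join(name.lower().split())  # case- and whitespace-insensitive match
            node = self.nodes.get(key)
            if node is None:
                node = self.nodes[key] = {"descriptions": [], "videos": set()}
                created.append(key)
            if description not in node["descriptions"]:
                node["descriptions"].append(description)
            node["videos"].add(video_id)
        return created


index = EntityIndex()
first = index.add_video("v1", [("Vector Database", "Stores embeddings.")])
second = index.add_video(
    "v2",
    [("vector  database", "Used for retrieval."), ("Reranker", "Reorders hits.")],
)
print(first, second, index.nodes["vector database"]["videos"])
```

Name-only matching is exactly where incorrect unifications can creep in when two videos use the same term for different things, so a production version would need a stronger merge check.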

Inference Optimization

  • Indexing into text chunks and embeddings to avoid reprocessing full videos at query time
  • LLM filtering reduces downstream generation load by pruning irrelevant clips
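
The filtering step can be pictured as a threshold over a per-clip relevance score. In this sketch the scorer is a word-overlap function standing in for an LLM judgment; `filter_clips`, `overlap_relevance`, the clip dictionaries, and the thresholds are all assumptions for illustration.

```python
def filter_clips(clips, relevance, threshold=0.5):
    """Split clips into (kept, dropped) by a relevance score in [0, 1]."""
    kept, dropped = [], []
    for clip in clips:
        (kept if relevance(clip) >= threshold else dropped).append(clip)
    return kept, dropped


def overlap_relevance(query):
    """Stand-in scorer: fraction of query terms found in the clip caption."""
    terms = set(query.lower().split())

    def score(clip):
        words = set(clip["caption"].lower().split())
        return len(terms & words) / len(terms) if terms else 0.0

    return score


candidate_clips = [
    {"id": "c1", "caption": "speaker explains attention heads"},
    {"id": "c2", "caption": "audience applause"},
    {"id": "c3", "caption": "diagram of attention"},
]
score = overlap_relevance("attention heads")
kept, dropped = filter_clips(candidate_clips, score, threshold=0.5)
strict_kept, _ = filter_clips(candidate_clips, score, threshold=1.0)
print([c["id"] for c in kept], [c["id"] for c in strict_kept])
```

Raising the threshold shrinks the context passed to generation but also drops the partially matching clip `c3`, which mirrors the trade-off between pruning and recall.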

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on VLM and ASR quality; errors propagate into the graph and retrieval
  • Evaluation uses an LLM judge (GPT-4o-mini) which can introduce preference bias
  • Multi‑modal encoders and graph construction add compute and storage costs at scale
  • Entity merging across videos can produce incorrect unifications if context is ambiguous

When Not To Use

  • Real-time or low-latency systems where per-query embedding/LLM steps are too slow
  • Private or sensitive video collections without consent for third-party processing
  • Very small single-video tasks where full video LLMs are simpler

Failure Modes

  • Noisy transcripts lead to wrong entity nodes and bad retrievals
  • Visual captions miss key scene details, causing retrieval misses
  • LLM-based filtering can discard useful clips under conservative prompts

Core Entities

Models

  • MiniCPM-V (quantized VLM)
  • Distil-Whisper (ASR)
  • ImageBind (multi-modal encoder)
  • text-embedding-3-small (OpenAI)
  • GPT-4o-mini (LLM judge/generator)

Metrics

  • Win-rate comparison (LLM judge)
  • 5-point quantitative score (vs baseline)
  • Comprehensiveness, Empowerment, Trustworthiness, Depth, Density

Datasets

  • LongerVideos

Benchmarks

  • LongerVideos (this work)

Context Entities

Models

  • LLaMA-VID
  • VideoAgent
  • NotebookLM
  • GraphRAG
  • LightRAG

Metrics

  • Relative win-rate
  • Mean quantitative score

Datasets

  • MLVU (prior work)
  • LVBench (prior work)

Benchmarks

  • Long-video QA benchmarks referenced (MLVU, LVBench)