Survey of how large language models power modern video understanding (taxonomies, benchmarks, gaps)

December 29, 2023 | 9 min

Overview

Decision Snapshot: Needs Validation

Vid-LLMs are already practical for search, summarization, and QA when paired with strong vision backbones; production deployment needs work on hallucination control, longform scaling, and evaluation standards.

Citations: 6

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 14

Trust Signals

Findings with numeric evidence: 3/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Yolo Y. Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu

Links

Abstract / PDF / Code

Why It Matters For Business

Vid-LLMs let products auto-summarize, answer questions about, and index video with far less manual effort; adopting them can cut manual review costs and unlock search and recommendation features across large video catalogs.

Who Should Care

Summary TLDR

This paper surveys methods that combine large language models (LLMs) with video processing (Vid-LLMs). It defines three integration patterns (Analyzer×LLM, Embedder×LLM, Hybrid) and five LLM roles (Summarizer, Manager, Text Decoder, Regressor, Hidden Layer), and reviews architectures, training styles (training-free, adapters, full fine-tune), tasks, datasets, benchmarks, and evaluation practices up to June 2024. The survey highlights strong gains on QA and captioning benchmarks, the importance of frame sampling and visual backbones, recurring evaluation blind spots (GPT-based graders, version bias), and practical gaps in long-form video, fine-grained understanding, multimodal alignment, hallucination, and deployment.

Problem Statement

Video volume is exploding and hand labeling is infeasible. Traditional video models struggle with open-ended, multi-granularity reasoning. The field needs clear ways to marry strong LLM reasoning with video encoders so systems can summarize, answer, localize, and reason over videos at scale.

Main Contribution

A clear taxonomy of Vid-LLMs: Video Analyzer×LLM, Video Embedder×LLM, and (Analyzer+Embedder)×LLM

A functional breakdown of LLM roles in video systems: Summarizer, Manager, Text Decoder, Regressor, Hidden Layer

Key Findings

LLM-based video models now match or exceed many traditional systems on dense captioning benchmarks.

Numbers: ActivityNet CIDEr: Streaming GIT 41.2 (Table IV)

Practical Use: Use LLM-backed captioning pipelines for richer, human-style summaries; expect better caption quality when using large visual encoders and LLM decoders.

Evidence Ref: Table IV

Top Vid-LLMs score strongly on open-ended video QA, indicating improved reasoning over video content.

Numbers: MSVD-QA best: IG-VLM 76.7; many top models 70-76 (Table V)

Practical Use: For question-answering pipelines, prefer Vid-LLMs built on larger LLMs and strong visual encoders to improve zero-shot QA performance.

Evidence Ref: Table V

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Dense video captioning (ActivityNet) | CIDEr 41.2 | Non-LLM best ~33.1 (CM2) | +~8 | ActivityNet Captions | Streaming GIT achieves CIDEr 41.2 vs CM2 33.1 in Table IV | Table IV
Open-ended zero-shot video QA (MSVD-QA) | Accuracy 76.7 | - | - | MSVD-QA | IG-VLM reports MSVD-QA 76.7 (Table V) | Table V
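
The delta in the captioning row follows directly from the two CIDEr scores quoted from Table IV. A quick check is sketched below; the relative gain is not stated in the source and is computed here only as an illustration.

```python
# CIDEr scores as quoted above from Table IV of the survey.
streaming_git_cider = 41.2  # LLM-based system (Streaming GIT)
cm2_cider = 33.1            # best non-LLM baseline (CM2)

absolute_delta = streaming_git_cider - cm2_cider
relative_gain = absolute_delta / cm2_cider  # illustrative, not reported in the source

print(f"absolute delta: {absolute_delta:+.1f} CIDEr")
print(f"relative gain:  {relative_gain:.1%}")
```

This gives roughly +8.1 CIDEr, or about a 24.5% improvement over the non-LLM baseline.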

What To Try In 7 Days

Run a simple Analyzer×LLM prototype: transcribe and caption a sample video set, then feed text to an off-the-shelf LLM for QA.

Evaluate a Video Embedder×LLM demo using a CLIP/ViT encoder + a small LLM (Vicuna) and a linear projector.

Compare GPT-based automated scoring vs. 10 human judgments on a small validation set to measure evaluator bias before trusting automated metrics.
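
One way to run the grader comparison in the last step is to compute raw agreement and Cohen's kappa between the automated verdicts and the human ones. The sketch below assumes binary correct/incorrect verdicts; the labels are made-up placeholders, not data from the survey, and `agreement_and_kappa` is a hypothetical helper name.

```python
from collections import Counter

def agreement_and_kappa(auto_labels, human_labels):
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if len(auto_labels) != len(human_labels) or not auto_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(auto_labels)
    observed = sum(a == h for a, h in zip(auto_labels, human_labels)) / n
    auto_counts = Counter(auto_labels)
    human_counts = Counter(human_labels)
    # Chance agreement: probability that both raters independently pick the same label.
    expected = sum(auto_counts[label] * human_counts[label] for label in auto_counts) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # both raters used one identical label throughout
    return observed, (observed - expected) / (1 - expected)

# Hypothetical verdicts on 10 answers (1 = judged correct, 0 = judged incorrect).
gpt_grader = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
human = [1, 0, 1, 0, 1, 1, 0, 0, 1, 1]

observed, kappa = agreement_and_kappa(gpt_grader, human)
print(f"agreement={observed:.2f}, kappa={kappa:.2f}")
```

Kappa corrects for agreement that would occur by chance, so it is a stricter signal than raw agreement when most answers are judged correct. With only 10 items the estimate is noisy; treat it as a sanity check rather than a measurement.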

Agent Features

Memory: Short-term context via prompts; external temporary databases for longer context
Planning: LLM as Manager (task coordination and tool calling); LLM-driven multi-step retrieval and summarization
Tool Use: Invoke vision models/APIs; call analyzers (ASR, OCR, trackers); use temporary DBs for retrieval
Frameworks: Q-former; LoRA; projection layers and cross-attention bridges
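
To make the projection-layer idea concrete: a linear projector maps each visual token from the vision encoder's width to the LLM's embedding width so the LLM can consume it like a text token. The sketch below is pure Python with toy sizes and random weights, all of which are assumptions for illustration; a real system would learn the weights and use a tensor library.

```python
import random

def make_projector(in_dim, out_dim, seed=0):
    """Create toy weights (out_dim x in_dim) and a zero bias for a linear projector."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def project(visual_tokens, weights, bias):
    """Map each visual token (length in_dim) to the LLM embedding size (out_dim)."""
    return [
        [sum(w * x for w, x in zip(row, token)) + b for row, b in zip(weights, bias)]
        for token in visual_tokens
    ]

# Toy sizes: 4 visual tokens of width 8 projected to an LLM width of 16.
rng = random.Random(1)
visual_tokens = [[rng.random() for _ in range(8)] for _ in range(4)]
weights, bias = make_projector(in_dim=8, out_dim=16)
llm_inputs = project(visual_tokens, weights, bias)
print(len(llm_inputs), len(llm_inputs[0]))
```

The number of tokens is unchanged by the projection; only the width of each token changes. Token compression (for example with a Q-former) is a separate step.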
Is Agentic: Yes

Architectures: Video Analyzer × LLM; Video Embedder × LLM; (Analyzer + Embedder) × LLM
Collaboration: Orchestrates multiple analyzers and embedders; multi-turn interaction with analyzers
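
A minimal sketch of the Video Analyzer × LLM pattern, assuming the analyzers (ASR, captioner) have already produced text: their outputs are assembled into one prompt and handed to an LLM. The function names, the prompt wording, and the `llm` callable are hypothetical; no particular LLM client is implied.

```python
def build_qa_prompt(transcript, captions, question):
    """Assemble analyzer outputs (ASR transcript, timestamped captions) into one prompt."""
    caption_lines = [f"[{start:.0f}s-{end:.0f}s] {text}" for start, end, text in captions]
    return "\n".join([
        "You are answering a question about a video using only the evidence below.",
        "Transcript:",
        transcript.strip(),
        "Scene captions:",
        *caption_lines,
        f"Question: {question}",
        "Answer (say 'not enough evidence' if the text does not support an answer):",
    ])

def answer_question(llm, transcript, captions, question):
    """`llm` is any callable str -> str; plug in whichever LLM client you use."""
    return llm(build_qa_prompt(transcript, captions, question))

def echo_llm(prompt):
    """Stand-in LLM so the sketch runs without any external service."""
    return f"(stub) received {len(prompt.splitlines())} prompt lines"

captions = [(0, 5, "a person opens a laptop"), (5, 12, "the person types on the keyboard")]
prompt = build_qa_prompt("Let's get started with the demo.", captions, "What does the person do first?")
print(answer_question(echo_llm, "Let's get started with the demo.", captions, "What does the person do first?"))
```

Because the LLM only ever sees text, this design needs no LLM fine-tuning, and analyzer outputs can be cached and reused across questions. The explicit "not enough evidence" instruction is one simple guard against hallucinated answers.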

Optimization Features

Token Efficiency: Reduce visual tokens via frame sampling; projector/Q-former token compression
Infra Optimization: GPU/TPU clusters for heavy pretraining and fine-tuning; multi-GPU setups for large visual encoders
Model Optimization: LoRA; projection/Q-former for modality mapping
System Optimization: Modular architectures (separate encoders + LLM) for swap-in upgrades; cache and reuse analyzer outputs to cut repeated vision compute
Training Optimization: Connective adapter fine-tuning (freeze LLM/encoder); insertive adapter fine-tuning (insert into LLM); hybrid multi-stage adapter training; full LLM fine-tuning for maximal task adaptation
Inference Optimization: Training-free designs where analyzer produces text (no LLM fine-tune); frame sampling strategies to trade compute vs temporal coverage
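
The compute versus temporal-coverage trade-off of frame sampling can be sketched with uniform sampling: pick a fixed number of frames spread evenly across the clip, then multiply by the tokens each frame costs. The helper names and the 256 tokens-per-frame figure are assumptions for illustration, not values from the survey.

```python
def uniform_frame_indices(total_frames, num_samples):
    """Pick `num_samples` frame indices spread evenly over a clip (segment midpoints)."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    num_samples = min(num_samples, total_frames)
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]

def visual_token_count(num_frames, tokens_per_frame):
    """Visual tokens the LLM must attend to for the sampled frames."""
    return num_frames * tokens_per_frame

indices = uniform_frame_indices(total_frames=300, num_samples=8)
tokens = visual_token_count(len(indices), tokens_per_frame=256)  # 256 is a placeholder
print(indices, tokens)
```

Doubling the number of sampled frames doubles the visual token count, so long videos quickly exhaust the context window unless tokens are also compressed per frame.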

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Hallucination: LLMs sometimes invent facts not grounded in video

Evaluation instability: GPT-based graders change with model versions and prompts

When Not To Use

Real-time, low-latency edge applications without model compression

Tasks requiring guaranteed per-frame pixel-accurate tracking

Failure Modes

Output hallucination or inventing events

Temporal misalignment: wrong timestamps or event order
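
Temporal misalignment can be quantified with temporal IoU (tIoU), one of the metrics listed for this survey: the overlap between a predicted time span and the reference span, divided by their union. A minimal sketch, assuming segments are given as (start, end) pairs in seconds:

```python
def temporal_iou(pred, truth):
    """tIoU between two (start, end) segments given in seconds."""
    (ps, pe), (ts, te) = pred, truth
    if pe < ps or te < ts:
        raise ValueError("segments must satisfy start <= end")
    intersection = max(0.0, min(pe, te) - max(ps, ts))
    union = (pe - ps) + (te - ts) - intersection
    return intersection / union if union > 0 else 0.0

# Hypothetical prediction that starts 5 s late and ends 5 s late.
score = temporal_iou((10.0, 20.0), (15.0, 25.0))
print(f"tIoU = {score:.2f}")
```

A score of 1.0 means the predicted span matches the reference exactly and 0.0 means no overlap. tIoU does not capture wrong event order by itself, which needs a separate check over the sequence of predicted events.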

Core Entities

Models

Video-LLaMA, IG-VLM, PLLaVA, AVicuna, VTimeLLM, MiniGPT4-video, ST-LLM, Vid2Seq, Streaming GIT, GPT4-V / GPT4V

Metrics

CIDEr, METEOR, BLEU, ROUGE-L, Recall@K, tIoU, mAP

Datasets

ActivityNet Captions, MSRVTT-QA, MSVD-QA, Kinetics-400, SomethingSomethingV2, TVQA, NExT-QA, CinePile, EgoSchema

Benchmarks

MSRVTT-QA, MSVD-QA, ActivityNet-QA, MVBench, Video-Bench, Video-MME, CinePile, InfiniBench