Survey of how large language models power modern video understanding (taxonomies, benchmarks, gaps)

Overview

Decision Snapshot: Needs Validation

Vid-LLMs are already practical for search, summarization, and QA when paired with strong vision backbones; production deployment still needs work on hallucination control, long-form scaling, and evaluation standards.

Citations: 6

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 14

Trust Signals

Findings with numeric evidence: 3/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Yolo Y. Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu

Why It Matters For Business

Vid-LLMs let products automatically summarize, answer questions about, and index video at close to human-like quality; adopting them can substantially cut manual review costs and unlock search and recommendation features across large video catalogs.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

This paper surveys methods that combine large language models (LLMs) with video processing (Vid-LLMs). It defines three integration patterns (Analyzer×LLM, Embedder×LLM, Hybrid) and five LLM roles (Summarizer, Manager, Text Decoder, Regressor, Hidden Layer), and it reviews architectures, training styles (training-free, adapters, full fine-tuning), tasks, datasets, benchmarks, and evaluation practices up to June 2024. The survey highlights strong gains on QA and captioning benchmarks, the importance of frame sampling and visual backbones, recurring evaluation blind spots (GPT-based graders, version bias), and practical gaps in long-form understanding, fine-grained understanding, multimodal alignment, hallucination control, and deployment.

Problem Statement

Video volume is exploding, and hand labeling is infeasible at that scale. Traditional video models struggle with open-ended, multi-granularity reasoning. The field needs clear ways to pair strong LLM reasoning with video encoders so that systems can summarize, answer questions about, localize events in, and reason over videos at scale.

Main Contribution

A clear taxonomy of Vid-LLMs: Video Analyzer×LLM, Video Embedder×LLM, and (Analyzer+Embedder)×LLM

A functional breakdown of LLM roles in video systems: Summarizer, Manager, Text Decoder, Regressor, Hidden Layer

Key Findings

LLM-based video models now match or exceed many traditional systems on dense captioning benchmarks.

Numbers: ActivityNet CIDEr: Streaming GIT 41.2 (Table IV)

Practical Use: Use LLM-backed captioning pipelines for richer, human-style summaries; expect better caption quality when using large visual encoders and LLM decoders.

Evidence Ref: Table IV

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Dense video captioning (ActivityNet)	CIDEr 41.2	Non-LLM best ~33.1 (CM2)	~+8.1	ActivityNet Captions	Streaming GIT achieves CIDEr 41.2 vs. CM2 33.1 (Table IV)	Table IV
Open-ended zero-shot video QA (MSVD-QA)	Accuracy 76.7	—	—	MSVD-QA	IG-VLM reports 76.7 accuracy on MSVD-QA (Table V)	Table V

What To Try In 7 Days

Run a simple Analyzer×LLM prototype: transcribe and caption a sample video set, then feed text to an off-the-shelf LLM for QA.

Evaluate a Video Embedder×LLM demo using a CLIP/ViT encoder + a small LLM (Vicuna) and a linear projector.

Compare GPT-based automated scoring vs. 10 human judgments on a small validation set to measure evaluator bias before trusting automated metrics.
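
The first item above, the Analyzer×LLM prototype, can be sketched roughly as follows. This is a minimal illustration rather than a method taken from the survey: the `Segment` type, the `build_prompt` and `answer` helpers, the prompt wording, and the `llm` callable are all hypothetical names introduced here, and the ASR and captioning steps are assumed to have already produced timestamped text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """One analyzer output: a time span plus the text extracted from it."""
    start_s: float
    end_s: float
    source: str  # illustrative labels such as "asr" or "caption"
    text: str


def build_prompt(segments: List[Segment], question: str) -> str:
    """Serialize analyzer outputs in time order and append the question."""
    ordered = sorted(segments, key=lambda s: (s.start_s, s.source))
    notes = [
        f"[{s.start_s:.1f}-{s.end_s:.1f}s] {s.source}: {s.text}" for s in ordered
    ]
    return "\n".join(
        ["Answer using only the video notes below.", *notes,
         f"Question: {question}", "Answer:"]
    )


def answer(segments: List[Segment], question: str,
           llm: Callable[[str], str]) -> str:
    """`llm` is any text-in/text-out callable (assumption: plug in a real client)."""
    return llm(build_prompt(segments, question))


if __name__ == "__main__":
    notes = [
        Segment(4.0, 9.0, "caption", "a person chops onions"),
        Segment(0.0, 4.0, "asr", "today we are making soup"),
    ]
    # Stand-in LLM that only reports the prompt length.
    print(answer(notes, "What is being cooked?", lambda p: f"{len(p)} chars"))
```

Keeping timestamps in the prompt is a design choice that makes it easier to spot answers that are not grounded in any analyzer output.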

Agent Features

Memory

Short-term context via prompts

External temporary databases for longer context

Planning

LLM as Manager (task coordination and tool calling)

LLM-driven multi-step retrieval and summarization

Tool Use

Invoke vision models/APIs

Call analyzers (ASR, OCR, trackers)

Use temporary DBs for retrieval
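
The tool-use pattern above can be illustrated with a small dispatcher. This is a sketch under stated assumptions: the tool names, the `fake_asr` and `fake_ocr` stubs, and the `(tool_name, video_id)` plan format are hypothetical, and in a real system the plan would be parsed from the manager LLM's output rather than hard-coded.

```python
from typing import Callable, Dict, List, Tuple


# Hypothetical analyzer stubs; real ones would wrap ASR, OCR, or tracking models.
def fake_asr(video_id: str) -> str:
    return f"transcript of {video_id}"


def fake_ocr(video_id: str) -> str:
    return f"on-screen text of {video_id}"


TOOLS: Dict[str, Callable[[str], str]] = {"asr": fake_asr, "ocr": fake_ocr}


def run_plan(plan: List[Tuple[str, str]]) -> List[str]:
    """Run (tool_name, video_id) steps in order.

    Unknown tools are reported as error strings instead of raising, so the
    manager can see the failure and re-plan.
    """
    results = []
    for tool_name, video_id in plan:
        tool = TOOLS.get(tool_name)
        if tool is None:
            results.append(f"error: unknown tool '{tool_name}'")
        else:
            results.append(tool(video_id))
    return results


if __name__ == "__main__":
    print(run_plan([("asr", "v1"), ("tracker", "v1")]))
```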

Frameworks

Q-former

LoRA

Projection layers and cross-attention bridges

Is Agentic

Yes

Architectures

Video Analyzer × LLM

Video Embedder × LLM

(Analyzer + Embedder) × LLM

Collaboration

Orchestrates multiple analyzers and embedders

Multi-turn interaction with analyzers

Optimization Features

Token Efficiency

Reduce visual tokens via frame sampling

Projector/Q-former token compression
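
To make the frame-sampling trade-off concrete, the sketch below picks uniformly spaced frames and counts the visual tokens that would reach the LLM. The survey does not prescribe this scheme; uniform sampling is one common choice, the function names are hypothetical, and the figure of 256 tokens per frame in the usage example is an illustrative assumption.

```python
from typing import List


def uniform_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick frame indices at the centers of equal-length chunks of the video."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)  # never sample more than exist
    chunk = total_frames / num_samples
    return [int(chunk * i + chunk / 2) for i in range(num_samples)]


def visual_token_count(num_frames: int, tokens_per_frame: int) -> int:
    """Visual tokens sent to the LLM if each sampled frame keeps all its tokens."""
    return num_frames * tokens_per_frame


if __name__ == "__main__":
    idx = uniform_frame_indices(total_frames=100, num_samples=4)
    print(idx, visual_token_count(len(idx), tokens_per_frame=256))
```

Doubling the number of sampled frames doubles the visual token count, which is why projector or Q-former compression is often combined with sampling.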

Infra Optimization

GPU/TPU clusters for heavy pretraining and fine-tuning

Multi-GPU setups for large visual encoders

Model Optimization

LoRA

Projection/Q-former for modality mapping
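
A quick calculation shows why LoRA is listed as a model optimization. For a dense layer of shape d_out × d_in, LoRA trains two low-rank factors, A (rank × d_in) and B (d_out × rank), and leaves the original weights frozen. The layer sizes and rank in the usage example are illustrative assumptions, not values reported in the survey, and the function names are hypothetical.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the dense weight matrix that LoRA leaves frozen."""
    return d_in * d_out


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Share of the layer's weights that are actually updated under LoRA."""
    return lora_trainable_params(d_in, d_out, rank) / full_params(d_in, d_out)


if __name__ == "__main__":
    # Illustrative sizes: a 4096 x 4096 projection with rank 8.
    print(lora_trainable_params(4096, 4096, 8), trainable_fraction(4096, 4096, 8))
```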

System Optimization

Modular architectures (separate encoders + LLM) for swap-in upgrades

Cache and reuse analyzer outputs to cut repeated vision compute
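
Caching analyzer outputs can be as simple as keying results by analyzer name and a content hash of the video. The sketch below is one possible in-memory design, not something specified in the survey; the `AnalyzerCache` class and its `misses` counter are hypothetical, and a production system would likely use a persistent store.

```python
import hashlib
from typing import Callable, Dict, Tuple


class AnalyzerCache:
    """In-memory cache keyed by (analyzer name, SHA-256 of the video bytes)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], str] = {}
        self.misses = 0  # number of times an analyzer actually ran

    def get_or_compute(self, name: str, video_bytes: bytes,
                       analyzer: Callable[[bytes], str]) -> str:
        key = (name, hashlib.sha256(video_bytes).hexdigest())
        if key not in self._store:
            self.misses += 1  # the analyzer runs only on a cache miss
            self._store[key] = analyzer(video_bytes)
        return self._store[key]


if __name__ == "__main__":
    cache = AnalyzerCache()
    describe = lambda b: f"{len(b)} bytes"  # stand-in for a vision model
    cache.get_or_compute("asr", b"abc", describe)
    cache.get_or_compute("asr", b"abc", describe)  # served from the cache
    print(cache.misses)
```

Hashing the content rather than the filename means a re-uploaded or renamed copy of the same video still hits the cache.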

Training Optimization

Connective adapter fine-tuning (LLM and encoder frozen)

Insertive adapter fine-tuning (adapters inserted into the LLM)

Hybrid multi-stage adapter training

Full LLM fine-tuning for maximal task adaptation

Inference Optimization

Training-free designs where the analyzer produces text (no LLM fine-tuning)

Frame sampling strategies to trade compute against temporal coverage

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding

Risks & Boundaries

Limitations

Hallucination: LLMs sometimes invent facts not grounded in video

Evaluation instability: GPT-based graders change with model versions and prompts
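
One way to quantify evaluator instability is to compare an automated grader's labels against a small set of human judgments. The sketch below assumes binary correct/incorrect labels (1 or 0) and uses two simple statistics chosen for illustration; the function names are hypothetical, and the survey does not specify a particular procedure.

```python
from typing import Sequence


def _check(auto: Sequence[int], human: Sequence[int]) -> None:
    if len(auto) != len(human) or len(auto) == 0:
        raise ValueError("need two non-empty label lists of equal length")


def agreement_rate(auto: Sequence[int], human: Sequence[int]) -> float:
    """Fraction of items where the automated grader matches the human label."""
    _check(auto, human)
    return sum(a == h for a, h in zip(auto, human)) / len(auto)


def mean_bias(auto: Sequence[int], human: Sequence[int]) -> float:
    """Average of (auto - human); positive means the grader is more lenient."""
    _check(auto, human)
    return sum(a - h for a, h in zip(auto, human)) / len(auto)


if __name__ == "__main__":
    auto = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
    human = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0]
    print(agreement_rate(auto, human), mean_bias(auto, human))
```

Running the same comparison with two grader versions or two prompts on identical answers gives a rough view of how much scores drift between them.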

When Not To Use

Real-time, low-latency edge applications without model compression

Tasks requiring guaranteed per-frame pixel-accurate tracking

Failure Modes

Output hallucination or inventing events

Temporal misalignment: wrong timestamps or event order
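
Temporal misalignment can be measured with temporal IoU (tIoU), which is listed among the metrics in this summary: the overlap between a predicted and a reference time span, divided by their union. The sketch below is a minimal version for a single pair of spans; the `Span` alias and function name are hypothetical, and spans are assumed to be `(start, end)` in seconds with `start <= end`.

```python
from typing import Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds), assumed start <= end


def temporal_iou(pred: Span, truth: Span) -> float:
    """Intersection over union of two time spans; 0.0 when they do not overlap."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Predicted event at 10-20 s against a reference event at 15-25 s.
    print(temporal_iou((10.0, 20.0), (15.0, 25.0)))
```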

Core Entities

Models

Video-LLaMA, IG-VLM, PLLaVA, AVicuna, VTimeLLM, MiniGPT4-video, ST-LLM, Vid2Seq, Streaming GIT, GPT4-V / GPT4V

Metrics

CIDEr, METEOR, BLEU, ROUGE-L, Recall@K, tIoU, mAP

Datasets

ActivityNet Captions, MSRVTT-QA, MSVD-QA, Kinetics-400, SomethingSomethingV2, TVQA, NExT-QA, CinePile, EgoSchema

Benchmarks

MSRVTT-QA, MSVD-QA, ActivityNet-QA, MVBench, Video-Bench, Video-MME, CinePile, InfiniBench

Key Findings

Top Vid-LLMs score strongly on open-ended video QA, indicating improved reasoning over video content.
