Survey of how large language models power modern video understanding (taxonomies, benchmarks, gaps)

December 29, 2023 · 9 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

6

Authors

Yolo Y. Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu

Links

Abstract / PDF

Why It Matters For Business

Vid-LLMs let products auto-summarize, answer questions about, and index video at human-like levels; adopting them can drastically cut manual review costs and unlock search and recommendation features across massive video catalogs.

Summary TLDR

This paper surveys methods that combine large language models (LLMs) with video processing (Vid-LLMs). It defines three integration patterns (Analyzer×LLM, Embedder×LLM, Hybrid) and five LLM roles (Summarizer, Manager, Text Decoder, Regressor, Hidden Layer), and it reviews architectures, training styles (training-free, adapters, full fine-tuning), tasks, datasets, benchmarks, and evaluation practices up to June 2024. The survey highlights strong gains on QA and captioning benchmarks, the importance of frame sampling and visual backbones, recurring evaluation blind spots (GPT-based graders, version bias), and practical gaps in long-form and fine-grained understanding, multimodal alignment, hallucination, and deployment.

Problem Statement

Video volume is exploding and hand labeling is infeasible. Traditional video models struggle with open-ended, multi-granularity reasoning. The field needs clear ways to marry strong LLM reasoning with video encoders so systems can summarize, answer, localize, and reason over videos at scale.

Main Contribution

A clear taxonomy of Vid-LLMs: Video Analyzer×LLM, Video Embedder×LLM, and (Analyzer+Embedder)×LLM

A functional breakdown of LLM roles in video systems: Summarizer, Manager, Text Decoder, Regressor, Hidden Layer

A concise review of training strategies: training-free, adapter-based (connective/insertive), hybrid, and full fine-tuning

A synthesis of datasets, benchmarks, evaluation methods, and practical failure points including hallucination and evaluation bias

A prioritized list of open problems and industry-relevant directions (long-form, fine-grained, multimodal, deployment)

Key Findings

LLM-based video models now match or exceed many traditional systems on dense captioning benchmarks.

Numbers: ActivityNet CIDEr: Streaming GIT 41.2 (Table IV)

Top Vid-LLMs score strongly on open-ended video QA, indicating improved reasoning over video content.

Numbers: MSVD-QA best: IG-VLM 76.7; many top models 70–76 (Table V)

Temporal performance correlates with frame coverage: models that process many frames perform better on temporal tasks.

Numbers: High temporal performers often use 100+ frames (Sec IV.D)

Evaluation using GPT-style LLM graders is convenient but unstable and biased.

Hallucination remains a major failure mode for Vid-LLMs, driven by modality misalignment and inherent LLM tendencies.

Results

Dense video captioning (ActivityNet)

Value: CIDEr 41.2

Baseline: Non-LLM best ~33.1 (CM2)

Open-ended zero-shot video QA (MSVD-QA)

Value: Accuracy 76.7

Video-based generative performance (average human-like scores)

Value: Average ~3.0 (on 1–5 scale) for top models

Baseline: Lower-tier models ~2.0

Who Should Care

What To Try In 7 Days

Run a simple Analyzer×LLM prototype: transcribe and caption a sample video set, then feed text to an off-the-shelf LLM for QA.

Evaluate a Video Embedder×LLM demo using a CLIP/ViT encoder + a small LLM (Vicuna) and a linear projector.

Compare GPT-based automated scoring vs. 10 human judgments on a small validation set to measure evaluator bias before trusting automated metrics.
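
The evaluator comparison in the last step could be sketched as below. This is a minimal illustration, assuming the automated grader and the human raters score the same items on the same numeric scale; the function name `compare_graders` and the choice of statistics (mean bias, mean absolute error, Spearman rank correlation) are assumptions for illustration, not a procedure from the survey.

```python
from statistics import mean


def _ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def _pearson(x, y):
    """Pearson correlation; NaN when either input has zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def compare_graders(auto_scores, human_scores):
    """Summarize how an automated grader deviates from human judgments."""
    if len(auto_scores) != len(human_scores) or not auto_scores:
        raise ValueError("score lists must be non-empty and the same length")
    diffs = [a - h for a, h in zip(auto_scores, human_scores)]
    return {
        # Positive value: the automated grader scores higher than humans.
        "mean_bias": mean(diffs),
        "mean_abs_error": mean(abs(d) for d in diffs),
        # Spearman correlation is Pearson correlation computed on ranks.
        "spearman": _pearson(_ranks(auto_scores), _ranks(human_scores)),
    }
```

A large positive `mean_bias` together with a low `spearman` value would suggest the automated grader is both lenient and ranking answers differently from the human raters, which is a reason to keep human checks in the loop.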

Agent Features

Memory

  • Short-term context via prompts
  • External temporary databases for longer context

Planning

  • LLM as Manager (task coordination and tool calling)
  • LLM-driven multi-step retrieval and summarization

Tool Use

  • Invoke vision models/APIs
  • Call analyzers (ASR, OCR, trackers)
  • Use temporary DBs for retrieval
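
As a rough sketch of how analyzer outputs might reach the LLM in an Analyzer×LLM setup, the snippet below merges events from several analyzers into one timestamped text log and wraps it in a QA prompt. The names `AnalyzerEvent` and `build_prompt` and the prompt wording are hypothetical; no real ASR, OCR, or LLM API is called here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalyzerEvent:
    start_s: float  # start time of the event in seconds
    source: str     # which analyzer produced it, e.g. "ASR", "OCR", "caption"
    text: str       # the analyzer's textual output


def format_timestamp(seconds):
    """Format a time offset in seconds as MM:SS (fractions are truncated)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def build_prompt(events, question):
    """Build a text-only QA prompt from analyzer events sorted by time."""
    ordered = sorted(events, key=lambda e: e.start_s)
    lines = [
        f"[{format_timestamp(e.start_s)}] {e.source}: {e.text}" for e in ordered
    ]
    context = "\n".join(lines)
    return (
        "Answer using only the video log below. "
        "If the log does not contain the answer, say so.\n\n"
        f"Video log:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The resulting string can be passed to whichever LLM the prototype uses. Asking the model to stay within the log is one simple, assumed way to reduce ungrounded answers; it does not guarantee grounding.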

Frameworks

  • Q-former
  • LoRA
  • Projection layers and cross-attention bridges

Is Agentic

true

Architectures

  • Video Analyzer × LLM
  • Video Embedder × LLM
  • (Analyzer + Embedder) × LLM

Collaboration

  • Orchestrates multiple analyzers and embedders
  • Multi-turn interaction with analyzers

Optimization Features

Token Efficiency

  • Reduce visual tokens via frame sampling
  • Projector/Q-former token compression
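
The token arithmetic behind these two levers can be sketched as follows. The figures used in the usage note (224-pixel frames, 14-pixel patches, 32 compressed tokens per frame) are illustrative assumptions, not values reported in the survey, and the function names are hypothetical.

```python
def patches_per_frame(image_size, patch_size):
    """Number of patch tokens a ViT-style encoder emits for one square frame."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def visual_tokens(num_frames, tokens_per_frame):
    """Total visual tokens handed to the LLM for a sampled clip."""
    return num_frames * tokens_per_frame


def max_frames(token_budget, tokens_per_frame):
    """How many frames fit in a fixed visual-token budget."""
    return token_budget // tokens_per_frame
```

Under these assumed numbers, `patches_per_frame(224, 14)` gives 256 tokens per frame, so 100 frames cost 25,600 visual tokens. Compressing each frame to 32 tokens brings the same 100 frames down to 3,200, and a 2,048-token visual budget then covers 64 frames instead of 8.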

Infra Optimization

  • GPU/TPU clusters for heavy pretraining and fine-tuning
  • Multi-GPU setups for large visual encoders

Model Optimization

  • LoRA
  • Projection/Q-former for modality mapping
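
To make the LoRA saving concrete: in the standard LoRA formulation, the update to a weight matrix of shape d_out × d_in is the product of two low-rank matrices of shapes d_out × r and r × d_in. The sketch below only counts parameters; the 4096-wide layer and rank 8 in the usage note are illustrative assumptions, not settings taken from the survey.

```python
def full_params(d_in, d_out):
    """Parameters in a dense weight matrix of shape (d_out, d_in)."""
    return d_in * d_out


def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters in a LoRA update B @ A.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


def lora_fraction(d_in, d_out, rank):
    """Share of the dense matrix's parameter count that LoRA trains."""
    return lora_trainable_params(d_in, d_out, rank) / full_params(d_in, d_out)
```

For an assumed 4096 × 4096 projection at rank 8, LoRA trains 65,536 parameters against 16,777,216 in the dense matrix, which is 1/256 of the full count.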

System Optimization

  • Modular architectures (separate encoders + LLM) for swap-in upgrades
  • Cache and reuse analyzer outputs to cut repeated vision compute
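
One possible shape for caching analyzer outputs is a small on-disk store keyed by a hash of the video bytes plus the analyzer name and version. This is a sketch under stated assumptions: the class name `AnalyzerCache` is hypothetical, analyzer results are assumed to be JSON-serializable, and analyzer names and versions are assumed to be safe to use in file names.

```python
import hashlib
import json
from pathlib import Path


class AnalyzerCache:
    """On-disk cache of analyzer outputs, one JSON file per (video, analyzer)."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _key_path(self, video_bytes, analyzer_name, analyzer_version):
        # Hash the content so renamed copies of the same video share an entry.
        digest = hashlib.sha256(video_bytes).hexdigest()
        return self.cache_dir / f"{digest}_{analyzer_name}_{analyzer_version}.json"

    def get_or_compute(self, video_bytes, analyzer_name, analyzer_version, compute):
        """Return the cached result, or run compute(video_bytes) and store it."""
        path = self._key_path(video_bytes, analyzer_name, analyzer_version)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
        result = compute(video_bytes)
        path.write_text(json.dumps(result), encoding="utf-8")
        return result
```

Including the analyzer version in the key means that upgrading an analyzer triggers recomputation rather than silently serving stale output.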

Training Optimization

  • Connective adapter fine-tuning (freeze LLM/encoder)
  • Insertive adapter fine-tuning (insert into LLM)
  • Hybrid multi-stage adapter training
  • Full LLM fine-tuning for maximal task adaptation

Inference Optimization

  • Training-free designs where analyzer produces text (no LLM fine-tune)
  • Frame sampling strategies to trade compute vs temporal coverage
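
A simple baseline for the frame-sampling trade-off is uniform sampling: split the video into equal segments and take the middle frame of each. The sketch below is one common strategy, not a method the survey prescribes, and the function name is hypothetical.

```python
def uniform_frame_indices(total_frames, num_samples):
    """Pick num_samples frame indices spread evenly across the video.

    The video is split into num_samples equal segments and the frame at the
    center of each segment is selected. If the video has fewer frames than
    requested, every frame is returned.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if num_samples >= total_frames:
        return list(range(total_frames))
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]
```

Raising `num_samples` improves temporal coverage at the cost of more visual tokens and encoder compute, which is the trade-off the bullet above describes.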

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Hallucination: LLMs sometimes invent facts not grounded in video
  • Evaluation instability: GPT-based graders change with model versions and prompts
  • Long-form video understanding still weak and compute-heavy
  • Multimodal alignment gaps across audio, visual, and text
  • High compute and dataset needs for fine-grained spatiotemporal tasks

When Not To Use

  • Real-time, low-latency edge applications without model compression
  • Tasks requiring guaranteed per-frame pixel-accurate tracking
  • Sensitive video data where sharing to external LLMs breaches privacy
  • High-assurance decision-making without added grounding/verification

Failure Modes

  • Output hallucination or inventing events
  • Temporal misalignment: wrong timestamps or event order
  • Bias from skewed video/text training data
  • Evaluator bias: automated scorers reward GPT-like phrasing
  • Over-reliance on single-frame signals for video-level questions

Core Entities

Models

  • Video-LLaMA
  • IG-VLM
  • PLLaVA
  • AVicuna
  • VTimeLLM
  • MiniGPT4-video
  • ST-LLM
  • Vid2Seq
  • Streaming GIT
  • GPT-4V

Metrics

  • CIDEr
  • METEOR
  • BLEU
  • ROUGE-L
  • Recall@K
  • tIoU
  • mAP
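
Of the metrics above, tIoU is the easiest to compute by hand: it is the overlap between a predicted and a ground-truth time segment divided by their union. The sketch below also shows one common use of Recall@K in temporal grounding, where a query counts as a hit if any of the top K predicted segments reaches a tIoU threshold. The 0.5 default threshold and the function names are assumptions for illustration.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments given in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    if pred_end < pred_start or gt_end < gt_start:
        raise ValueError("segment end must not precede its start")
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_k(ranked_preds, gt, k, threshold=0.5):
    """Return 1.0 if any of the top-k predictions meets the tIoU threshold."""
    return float(any(temporal_iou(p, gt) >= threshold for p in ranked_preds[:k]))
```

For example, a prediction of (0, 10) against a ground truth of (5, 15) overlaps for 5 seconds out of a 15-second union, giving a tIoU of 1/3.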

Datasets

  • ActivityNet Captions
  • MSRVTT-QA
  • MSVD-QA
  • Kinetics-400
  • SomethingSomethingV2
  • TVQA
  • NExT-QA
  • CinePile
  • EgoSchema

Benchmarks

  • MSRVTT-QA
  • MSVD-QA
  • ActivityNet-QA
  • MVBench
  • Video-Bench
  • Video-MME
  • CinePile
  • InfiniBench