Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
6
Why It Matters For Business
Vid-LLMs let products automatically summarize, answer questions about, and index video. Adopting them can cut manual review costs and unlock search and recommendation features across large video catalogs.
Summary TLDR
This paper surveys methods that combine large language models (LLMs) with video processing (Vid-LLMs). It defines three integration patterns (Analyzer×LLM, Embedder×LLM, Hybrid) and five LLM roles (Summarizer, Manager, Text Decoder, Regressor, Hidden Layer), and it reviews architectures, training styles (training-free, adapters, full fine-tuning), tasks, datasets, benchmarks, and evaluation practices up to June 2024. The survey highlights strong gains on QA and captioning benchmarks, the importance of frame sampling and visual backbones, recurring evaluation blind spots (GPT-based graders, version bias), and practical gaps: long-form and fine-grained understanding, multimodal alignment, hallucination, and deployment.
Problem Statement
Video volume is exploding and hand labeling is infeasible. Traditional video models struggle with open-ended, multi-granularity reasoning. The field needs clear ways to combine strong LLM reasoning with video encoders so systems can summarize, answer questions about, localize events in, and reason over videos at scale.
Main Contribution
A clear taxonomy of Vid-LLMs: Video Analyzer×LLM, Video Embedder×LLM, and (Analyzer+Embedder)×LLM
A functional breakdown of LLM roles in video systems: Summarizer, Manager, Text Decoder, Regressor, Hidden Layer
A concise review of training strategies: training-free, adapter-based (connective/insertive), hybrid, and full fine-tuning
A synthesis of datasets, benchmarks, evaluation methods, and practical failure points including hallucination and evaluation bias
A prioritized list of open problems and industry-relevant directions (long-form, fine-grained, multimodal, deployment)
Key Findings
LLM-based video models now match or exceed many traditional systems on dense captioning benchmarks.
Top Vid-LLMs score strongly on open-ended video QA, indicating improved reasoning over video content.
Temporal performance correlates with frame coverage: models that process many frames perform better on temporal tasks.
Evaluation using GPT-style LLM graders is convenient but unstable and biased.
Hallucination remains a major failure mode for Vid-LLMs, driven by modality misalignment and inherent LLM tendencies.
Results
Dense video captioning (ActivityNet)
Open-ended zero-shot Video QA (MSVD-QA)
Video-based generative performance (average human-like scores)
Who Should Care
What To Try In 7 Days
Run a simple Analyzer×LLM prototype: transcribe and caption a sample video set, then feed text to an off-the-shelf LLM for QA.
Evaluate a Video Embedder×LLM demo using a CLIP/ViT encoder + a small LLM (Vicuna) and a linear projector.
Compare GPT-based automated scoring vs. 10 human judgments on a small validation set to measure evaluator bias before trusting automated metrics.
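The grader comparison in the last step can be sketched in a few lines of Python. This is a minimal illustration, not the survey's protocol: the function names, the 1-5 rating scale, and the sample scores are all assumptions made for the example. It reports the mean signed gap (does the automated grader score higher than humans?), the mean absolute gap, and the Pearson correlation between the two sets of scores.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (vx * vy) ** 0.5


def evaluator_bias(auto_scores, human_scores):
    """Summarize how an automated grader deviates from human ratings."""
    diffs = [a - h for a, h in zip(auto_scores, human_scores)]
    return {
        "mean_bias": mean(diffs),  # > 0 means the grader is more lenient
        "mean_abs_gap": mean(abs(d) for d in diffs),
        "pearson": pearson(auto_scores, human_scores),
    }


# Hypothetical 1-5 ratings for the same 10 model answers (made-up data).
auto = [4, 5, 4, 3, 5, 4, 4, 5, 3, 4]
human = [3, 5, 3, 3, 4, 4, 2, 5, 3, 3]
report = evaluator_bias(auto, human)
print(report)
```

A large positive mean bias or a low correlation on such a sample would be a reason to recalibrate the grading prompt, or to keep humans in the loop, before relying on automated scores.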
Agent Features
Memory
- Short-term context via prompts
- External temporary databases for longer context
Planning
- LLM as Manager (task coordination and tool calling)
- LLM-driven multi-step retrieval and summarization
Tool Use
- Invoke vision models/APIs
- Call analyzers (ASR, OCR, trackers)
- Use temporary DBs for retrieval
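A rough sketch of the tool-use pattern above, assuming the LLM (acting as Manager) emits a plan as a list of (tool name, argument) steps. The tool registry, the stub analyzers, and the hard-coded plan are hypothetical placeholders: a real system would call actual ASR/OCR models and obtain the plan from an LLM response.

```python
def run_plan(plan, tools):
    """Execute (tool_name, argument) steps in order and collect observations."""
    observations = []
    for tool_name, argument in plan:
        tool = tools.get(tool_name)
        if tool is None:
            # Surface unknown tools to the manager instead of crashing.
            observations.append((tool_name, "error: unknown tool"))
            continue
        observations.append((tool_name, tool(argument)))
    return observations


# Stub analyzers standing in for real models (hypothetical, canned outputs).
def fake_asr(clip_id):
    return f"transcript of {clip_id}"


def fake_ocr(clip_id):
    return f"on-screen text of {clip_id}"


tools = {"asr": fake_asr, "ocr": fake_ocr}

# In a real system this plan would be parsed from the LLM's output.
plan = [("asr", "clip_01"), ("ocr", "clip_01"), ("tracker", "clip_01")]
observations = run_plan(plan, tools)
```

The observations would then be fed back to the LLM as text, which is what allows multi-turn interaction with analyzers.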
Frameworks
- Q-former
- LoRA
- Projection layers and cross-attention bridges
Is Agentic
true
Architectures
- Video Analyzer × LLM
- Video Embedder × LLM
- (Analyzer + Embedder) × LLM
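To make the Video Analyzer × LLM pattern concrete, here is a minimal sketch of the step that turns analyzer outputs into an LLM prompt. The `Segment` type, the prompt wording, and the sample notes are assumptions for illustration; the LLM call itself is deliberately left out, since any text-only model could consume the resulting string.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    source: str   # e.g. "asr", "caption", "ocr"
    text: str


def fmt_ts(seconds):
    """Format seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def build_prompt(segments, question):
    """Serialize timestamped analyzer outputs into a text-only QA prompt."""
    ordered = sorted(segments, key=lambda s: (s.start, s.end))
    lines = [
        f"[{fmt_ts(s.start)}-{fmt_ts(s.end)}] ({s.source}) {s.text}"
        for s in ordered
    ]
    return (
        "Answer using only the timestamped video notes below.\n"
        "If the notes do not contain the answer, say so.\n\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}\nAnswer:"
    )


# Hypothetical analyzer outputs for a short clip.
segments = [
    Segment(65, 70, "asr", "Now add the eggs."),
    Segment(0, 5, "caption", "A person stands in a kitchen."),
]
prompt = build_prompt(segments, "What is added after the video starts?")
```

Because the LLM only ever sees text in this pattern, it needs no video-specific fine-tuning, but anything the analyzers miss is invisible to it.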
Collaboration
- Orchestrates multiple analyzers and embedders
- Multi-turn interaction with analyzers
Optimization Features
Token Efficiency
- Reduce visual tokens via frame sampling
- Projector/Q-former token compression
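The two token-efficiency levers above can be illustrated with a small sketch. The helper names and the per-frame token counts (256 patch tokens, 32 compressed tokens) are illustrative assumptions, not figures from the survey.

```python
def uniform_frame_indices(total_frames, num_samples):
    """Pick the center frame of each of num_samples equal-length chunks."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def visual_token_count(num_frames, tokens_per_frame):
    """Visual tokens the LLM must attend to for one video."""
    return num_frames * tokens_per_frame


indices = uniform_frame_indices(total_frames=100, num_samples=4)
# indices == [12, 37, 62, 87]

raw_tokens = visual_token_count(8, 256)        # 2048 tokens without compression
compressed_tokens = visual_token_count(8, 32)  # 256 tokens after compression
```

Sampling fewer frames and compressing each frame both shrink the visual token budget, but only compression preserves temporal coverage, which matters for temporal tasks.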
Infra Optimization
- GPU/TPU clusters for heavy pretraining and fine-tuning
- Multi-GPU setups for large visual encoders
Model Optimization
- LoRA
- Projection/Q-former for modality mapping
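As a rough illustration of projection-based modality mapping, the sketch below applies a single linear layer to map visual feature vectors into the LLM's embedding width. It uses plain Python lists to stay dependency-free; the dimensions, the random initialization, and the function names are assumptions, and practical systems would typically use a tensor library layer such as `torch.nn.Linear` and train the weights.

```python
import random


def make_projector(in_dim, out_dim, seed=0):
    """Create an untrained linear projector (weight matrix and bias)."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    weight = [
        [rng.uniform(-scale, scale) for _ in range(in_dim)]
        for _ in range(out_dim)
    ]
    bias = [0.0] * out_dim
    return weight, bias


def project(tokens, weight, bias):
    """Map each visual token (length in_dim) to the LLM width (out_dim)."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


# Toy sizes: 3 visual tokens of width 4 mapped to an LLM width of 6.
visual_tokens = [[0.1, 0.2, 0.3, 0.4], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
weight, bias = make_projector(in_dim=4, out_dim=6)
llm_inputs = project(visual_tokens, weight, bias)
```

The projected vectors are what gets placed alongside the text token embeddings at the LLM input; in connective adapter setups only this mapping is trained while the encoder and LLM stay frozen.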
System Optimization
- Modular architectures (separate encoders + LLM) for swap-in upgrades
- Cache and reuse analyzer outputs to cut repeated vision compute
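Caching analyzer outputs can be as simple as keying results by analyzer name and a hash of the video content. The following is a minimal in-memory sketch; the `AnalyzerCache` class and its interface are hypothetical, and a production system would more likely persist results to disk or a database.

```python
import hashlib


class AnalyzerCache:
    """In-memory cache for analyzer outputs, keyed by analyzer and content."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(analyzer_name, video_bytes):
        digest = hashlib.sha256(video_bytes).hexdigest()
        return f"{analyzer_name}:{digest}"

    def get_or_compute(self, analyzer_name, video_bytes, analyzer_fn):
        """Return the cached output, running analyzer_fn only on a miss."""
        key = self._key(analyzer_name, video_bytes)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = analyzer_fn(video_bytes)
        self._store[key] = result
        return result
```

With this in place, asking several questions about the same video re-runs only the LLM step, not the vision or speech models.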
Training Optimization
- Connective adapter fine-tuning (freeze LLM/encoder)
- Insertive adapter fine-tuning (insert into LLM)
- Hybrid multi-stage adapter training
- Full LLM fine-tuning for maximal task adaptation
Inference Optimization
- Training-free designs where analyzer produces text (no LLM fine-tune)
- Frame sampling strategies to trade compute vs temporal coverage
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Hallucination: LLMs sometimes invent facts not grounded in video
- Evaluation instability: GPT-based graders change with model versions and prompts
- Long-form video understanding still weak and compute-heavy
- Multimodal alignment gaps across audio, visual, and text
- High compute and dataset needs for fine-grained spatiotemporal tasks
When Not To Use
- Real-time, low-latency edge applications without model compression
- Tasks requiring guaranteed per-frame pixel-accurate tracking
- Sensitive video data where sharing to external LLMs breaches privacy
- High-assurance decision-making without added grounding/verification
Failure Modes
- Hallucinated output, including invented events
- Temporal misalignment: wrong timestamps or event order
- Bias from skewed video/text training data
- Evaluator bias: automated scorers reward GPT-like phrasing
- Over-reliance on single-frame signals for video-level questions
Core Entities
Models
- Video-LLaMA
- IG-VLM
- PLLaVA
- AVicuna
- VTimeLLM
- MiniGPT4-video
- ST-LLM
- Vid2Seq
- Streaming GIT
- GPT-4V
Metrics
- CIDEr
- METEOR
- BLEU
- ROUGE-L
- Recall@K
- tIoU
- mAP
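Of the metrics listed, tIoU (temporal intersection over union) and Recall@K are simple enough to sketch directly. The version below assumes segments are (start, end) pairs in seconds and that Recall@K for localization counts a query as a hit when any of the top K predictions reaches a tIoU threshold. The 0.5 default is a common choice, not something fixed by the survey.

```python
def tiou(pred, gt):
    """Temporal IoU between two (start, end) segments in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def hit_at_k(ranked_preds, gt, k, threshold=0.5):
    """True if any of the top-k predicted segments matches the ground truth."""
    return any(tiou(p, gt) >= threshold for p in ranked_preds[:k])


def recall_at_k(all_ranked_preds, all_gts, k, threshold=0.5):
    """Fraction of queries with at least one hit in the top k."""
    hits = [
        hit_at_k(preds, gt, k, threshold)
        for preds, gt in zip(all_ranked_preds, all_gts)
    ]
    return sum(hits) / len(hits)


overlap = tiou((10, 20), (15, 25))  # 5 s overlap over a 15 s union
```

Here `overlap` is 1/3, so this prediction would count as a miss at a 0.5 threshold but as a hit at 0.3.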
Datasets
- ActivityNet Captions
- MSRVTT-QA
- MSVD-QA
- Kinetics-400
- SomethingSomethingV2
- TVQA
- NExT-QA
- CinePile
- EgoSchema
Benchmarks
- MSRVTT-QA
- MSVD-QA
- ActivityNet-QA
- MVBench
- Video-Bench
- Video-MME
- CinePile
- InfiniBench

