Overview
Vid-LLMs are already practical for search, summarization, and QA when paired with strong vision backbones, but production deployment still needs work on hallucination control, long-form scaling, and evaluation standards.
Citations: 6
Evidence Strength: 0.80
Confidence: 0.86
Risk Signals: 14
Trust Signals
Findings with numeric evidence: 3/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
Vid-LLMs let products auto-summarize, QA, and index video at human-like levels; adopting them can drastically cut manual review costs and unlock search/recommendation features across massive video catalogs.
Who Should Care
Summary TLDR
This paper surveys methods that combine large language models (LLMs) with video processing (Vid-LLMs). It defines three integration patterns (Analyzer×LLM, Embedder×LLM, Hybrid) and five LLM roles (Summarizer, Manager, Text Decoder, Regressor, Hidden Layer), and it reviews architectures, training styles (training-free, adapters, full fine-tuning), tasks, datasets, benchmarks, and evaluation practices up to June 2024. The survey highlights strong gains on QA and captioning benchmarks, the importance of frame sampling and visual backbones, recurring evaluation blind spots (GPT-based graders, version bias), and practical gaps: long-form video, fine-grained understanding, multimodal alignment, hallucination, and deployment.
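Since the summary singles out frame sampling as important, here is a minimal sketch of one common baseline, uniform sampling, which takes the middle frame of each equal-length segment. The function name and the midpoint choice are illustrative assumptions, not a method taken from the survey.

```python
def uniform_frame_indices(num_frames: int, num_samples: int) -> list[int]:
    """Return evenly spread frame indices (hypothetical helper).

    The video is split into `num_samples` equal-length segments and the
    middle frame of each segment is selected.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    # Never ask for more frames than the video contains.
    num_samples = min(num_samples, num_frames)
    segment = num_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


if __name__ == "__main__":
    # A 100-frame clip reduced to 4 frames for the visual encoder.
    print(uniform_frame_indices(100, 4))
```

The sample budget trades context length against temporal coverage, which is one reason long-form video remains a gap.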
Problem Statement
Video volume is exploding and hand labeling is infeasible. Traditional video models struggle with open-ended, multi-granularity reasoning. The field needs clear ways to marry strong LLM reasoning with video encoders so systems can summarize, answer, localize, and reason over videos at scale.
Main Contribution
A clear taxonomy of Vid-LLMs: Video Analyzer×LLM, Video Embedder×LLM, and (Analyzer+Embedder)×LLM
A functional breakdown of LLM roles in video systems: Summarizer, Manager, Text Decoder, Regressor, Hidden Layer
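To make the Video Analyzer×LLM pattern concrete, the sketch below turns analyzer outputs (timestamped captions and optional speech transcripts) into a text prompt for an LLM. Everything here is an assumption for illustration: the `Segment` type, the prompt wording, and the `llm` callable are hypothetical placeholders for whatever captioner, ASR system, and language model are actually used.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Segment:
    """One analyzed span of video (hypothetical structure)."""
    start_s: float
    end_s: float
    caption: str          # text from a captioning model (the "analyzer")
    transcript: str = ""  # text from speech recognition, if any


def build_prompt(segments: list[Segment], question: str) -> str:
    """Serialize analyzer output into a plain-text prompt."""
    lines = ["You are given timestamped descriptions of a video."]
    for seg in sorted(segments, key=lambda s: s.start_s):
        line = f"[{seg.start_s:.0f}s-{seg.end_s:.0f}s] {seg.caption}"
        if seg.transcript:
            line += f' | speech: "{seg.transcript}"'
        lines.append(line)
    lines.append(f"Question: {question}")
    # Asking the model to stay within the text is a simple hallucination guard.
    lines.append("Answer using only the descriptions above; say 'not shown' if unsure.")
    return "\n".join(lines)


def answer(segments: list[Segment], question: str, llm: Callable[[str], str]) -> str:
    """Send the prompt to any text-in, text-out LLM wrapper supplied by the caller."""
    return llm(build_prompt(segments, question))


if __name__ == "__main__":
    demo = [
        Segment(5, 12, "He sits at a desk", "Let's get started."),
        Segment(0, 5, "A man opens a door"),
    ]
    # Stub LLM that just echoes the prompt, so the sketch runs offline.
    print(answer(demo, "What does the man do first?", llm=lambda p: p))
```

In this pattern the LLM never sees pixels, so answer quality is bounded by what the analyzer captures. The Video Embedder×LLM pattern instead feeds projected visual features to the model.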
Key Findings
LLM-based video models now match or exceed many traditional systems on dense captioning benchmarks.
Top Vid-LLMs score strongly on open-ended video QA, indicating improved reasoning over video content.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Dense video captioning (ActivityNet) | CIDEr 41.2 | CIDEr 33.1 (CM2, best non-LLM) | +8.1 | ActivityNet Captions | Streaming GIT achieves CIDEr 41.2 vs CM2 33.1 in Table IV | Table IV |
| Open-ended zero-shot Video QA (MSVD-QA) | Accuracy 76.7 | — | — | MSVD-QA | IG-VLM reports MSVD-QA 76.7 (Table V) | Table V |
What To Try In 7 Days
Run a simple Analyzer×LLM prototype: transcribe and caption a sample video set, then feed text to an off-the-shelf LLM for QA.
Evaluate a Video Embedder×LLM demo using a CLIP/ViT encoder + a small LLM (Vicuna) and a linear projector.
Compare GPT-based automated scoring vs. 10 human judgments on a small validation set to measure evaluator bias before trusting automated metrics.
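For the last item, one simple way to quantify evaluator bias is to compare the automated grader's correct/incorrect labels with human labels on the same answers, using raw agreement and Cohen's kappa. This is a sketch under assumptions: the survey does not prescribe this statistic, and the labels below are made-up placeholders, not results.

```python
def agreement_and_kappa(auto: list[int], human: list[int]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two label sequences."""
    if len(auto) != len(human) or not auto:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(auto)
    observed = sum(a == h for a, h in zip(auto, human)) / n
    # Chance agreement: probability both raters pick the same label independently.
    labels = set(auto) | set(human)
    expected = sum((auto.count(l) / n) * (human.count(l) / n) for l in labels)
    if expected == 1.0:
        # Both raters used one identical label everywhere; kappa is undefined.
        return observed, float("nan")
    return observed, (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Placeholder labels for 10 answers: 1 = judged correct, 0 = judged incorrect.
    gpt_grades = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
    human_grades = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
    agreement, kappa = agreement_and_kappa(gpt_grades, human_grades)
    print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

Ten items give only a rough signal. Re-running the comparison after any grader model or prompt change helps surface the version drift noted under Limitations.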
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Risks & Boundaries
Limitations
Hallucination: LLMs sometimes invent facts not grounded in video
Evaluation instability: GPT-based graders change with model versions and prompts
When Not To Use
Real-time, low-latency edge applications without model compression
Tasks requiring guaranteed per-frame pixel-accurate tracking
Failure Modes
Output hallucination or inventing events
Temporal misalignment: wrong timestamps or event order

