Survey of how large language models power modern video understanding (taxonomies, benchmarks, gaps)
Vid-LLMs let products automatically summarize, answer questions about, and index video with human-like quality; adopting them can substantially cut manual review costs and unlock search and recommendation features across large video catalogs.
Key finding
LLM-based video models now match or exceed many traditional systems on dense captioning benchmarks.
Numbers: on ActivityNet, Streaming GIT reaches a CIDEr score of 41.2 (Table IV).

