VideoLLaMA 2 improves video and audio understanding while keeping changes to the encoders and the large language model minimal. This lowers the data and compute needed to reach strong open-source performance and speeds integration into product pipelines.
Key finding
Adding the STC (Spatial-Temporal Convolution) connector, built from RegStage blocks and a 3D convolution, yields the best average video QA accuracy in the architecture sweep.
Numbers: average accuracy 45.1 (Table 1, green line)
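To make the role of the 3D convolution concrete, here is a minimal sketch of how a strided 3D convolution shrinks the number of visual tokens passed to the language model. It uses only the standard convolution output-size formula. The frame count, patch grid, kernel, and stride are illustrative assumptions and are not taken from the paper; the function names are hypothetical.

```python
def conv_out(n, kernel, stride, padding=0):
    """Output length along one axis for a standard convolution (dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1


def tokens_after_3d_conv(frames, height, width,
                         kernel=(2, 2, 2), stride=(2, 2, 2), padding=(0, 0, 0)):
    """Return the (T, H, W) grid and token count after one 3D convolution.

    All default values are assumptions chosen for illustration only.
    """
    dims = (frames, height, width)
    out = tuple(
        conv_out(n, k, s, p)
        for n, k, s, p in zip(dims, kernel, stride, padding)
    )
    return out, out[0] * out[1] * out[2]


# Hypothetical input: 8 frames, each encoded as a 24 x 24 patch grid.
tokens_before = 8 * 24 * 24
grid_after, tokens_after = tokens_after_3d_conv(8, 24, 24)
reduction = tokens_before / tokens_after
```

Under these assumed settings, downsampling by a stride of 2 along time, height, and width cuts the token count by a factor of 8, which is the kind of saving a spatio-temporal connector is meant to provide before tokens reach the language model.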

