STC connector + audio branch: stronger video and audio understanding for Video-LLMs

June 11, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The STC connector and audio branch are practical changes that improve benchmark results. Evaluations cover many tasks but rely partly on GPT-assisted judging and short fine-tuning schedules.

Citations: 10

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 70%

Novelty: 55%

Authors

Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing

Links

Abstract / PDF / Code

Why It Matters For Business

VideoLLaMA 2 improves video and audio understanding while keeping changes to the pre-trained encoders and the LLM minimal. This lowers the data and compute needed to reach strong open-source performance and speeds integration into product pipelines.

Summary TLDR

VideoLLaMA 2 upgrades Video-LLMs with a Spatial-Temporal Convolution (STC) connector (RegStage blocks + 3D conv) and a jointly trained audio branch (BEATs encoder). The model is trained in multiple stages covering video, audio, and joint audio-video data from large public datasets. It outperforms prior open-source Video-LLMs on several video QA and caption benchmarks and shows strong audio question-answering gains with far fewer training hours than some competing audio models. Code and models are released.

Problem Statement

Video-LLMs struggle to capture short- and long-range temporal patterns and often ignore synchronous audio. This limits accuracy on video QA and audio-video tasks and makes models inefficient when temporally redundant frame tokens are passed to LLM decoders.

Main Contribution

Spatial-Temporal Convolution (STC) connector: uses RegStage blocks and 3D convolution to compress video features into fewer tokens while preserving local spatial-temporal detail.

Audio Branch with BEATs encoder and MLP projector: jointly trained to add audio understanding and audio-visual fusion.
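To make the token compression concrete, here is a minimal sketch of the arithmetic only, not the authors' code. It assumes a 24 x 24 patch grid per frame (336 / 14 for CLIP ViT-L/14 at 336 px) and a downsampling factor of 2 along time, height, and width. The function name `stc_token_count` and the stride values are illustrative assumptions.

```python
def stc_token_count(num_frames, image_size=336, patch_size=14,
                    t_stride=2, s_stride=2):
    """Count visual tokens before and after 3D downsampling.

    The strides are assumed values for illustration, not confirmed settings.
    """
    grid = image_size // patch_size          # patches per side, 24 for 336/14
    before = num_frames * grid * grid        # tokens straight from the encoder
    after = (num_frames // t_stride) * (grid // s_stride) ** 2
    return before, after


for frames in (8, 16):
    before, after = stc_token_count(frames)
    print(f"{frames} frames: {before} -> {after} tokens")
```

Under these assumptions, 8 and 16 input frames compress to 576 and 1152 tokens, which is consistent with the 576–1152 token range quoted for the ablations.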

Key Findings

Adding STC connector (RegStage + 3D conv) yields the best average video QA performance in the architecture sweep.

Numbers: Avg. acc. 45.1 (Table 1 green line)

Practical Use: Use 3D conv + local conv blocks (RegStage) in the connector to improve temporal fusion while keeping token count low.

Evidence Ref: Table 1

VideoLLaMA 2 improves multiple-choice video QA accuracy vs prior open-source models.

Numbers: EgoSchema: 51.7% vs LLaVA-NeXT-Video 43.9% (VideoLLaMA2-7B, 16 frames)

Practical Use: If you need better video QA, adopt STC and the training recipe; expect ~7–8 pts accuracy gain on EgoSchema-like benchmarks.

Evidence Ref: Table 5

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 51.7% | LLaVA-NeXT-Video 43.9% | +7.8 pp | EgoSchema (multi-choice VQA) | VideoLLaMA2 (7B, 16 frames) vs LLaVA-NeXT-Video (Table 5) | Table 5
Accuracy | 68.90% | Qwen-Audio 57.90% | +11.0 pp | Clotho-AQA | VideoLLaMA2-A (7B) with ~4k hours vs Qwen-Audio (137k hrs) (Table 7) | Table 7
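The deltas in the rows above are plain percentage-point differences. The sketch below recomputes them from the reported values and also shows the relative gain and the ratio of audio training hours. The helper names `delta_pp` and `relative_gain` are illustrative, not from the paper.

```python
def delta_pp(value, baseline):
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(value - baseline, 1)


def relative_gain(value, baseline):
    """Relative improvement over the baseline, as a fraction."""
    return (value - baseline) / baseline


# (model accuracy, baseline accuracy) as reported in the results rows
results = {
    "EgoSchema": (51.7, 43.9),     # VideoLLaMA2-7B vs LLaVA-NeXT-Video
    "Clotho-AQA": (68.9, 57.9),    # VideoLLaMA2-A vs Qwen-Audio
}

for name, (value, baseline) in results.items():
    print(f"{name}: +{delta_pp(value, baseline)} pp, "
          f"{relative_gain(value, baseline):.1%} relative")

# Audio training data quoted in the results: ~4k hours vs 137k hours
print(f"Qwen-Audio used about {137_000 / 4_000:.0f}x more audio hours")
```

Reporting both the absolute and the relative gain helps when comparing benchmarks whose baselines sit at very different accuracy levels.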

What To Try In 7 Days

Swap your video adapter for a small STC-like module (3D conv + local conv blocks) and measure VQA gains.

Add a lightweight audio encoder (BEATs) plus an MLP projector and fine-tune jointly on a few AV datasets.

Run quick benchmarks on EgoSchema and Clotho-AQA to validate gains on your video and audio tasks.
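For the quick benchmark step, a small scoring helper is usually enough to compare an existing adapter against an STC-like one. The sketch below assumes model replies are free text containing an option letter A to E and that gold answers are single letters. The function names and the reply format are hypothetical, and the official benchmark scripts may parse answers differently.

```python
import re


def extract_choice(text):
    """Return the first standalone option letter (A-E) in a reply, or None."""
    match = re.search(r"\b([A-E])\b", text.strip())
    return match.group(1) if match else None


def multiple_choice_accuracy(replies, answers):
    """Fraction of replies whose extracted letter matches the gold answer."""
    if len(replies) != len(answers):
        raise ValueError("replies and answers must have the same length")
    correct = sum(extract_choice(r) == a for r, a in zip(replies, answers))
    return correct / len(answers) if answers else 0.0


# Toy data for illustration only
replies = ["(B) The person is cooking.", "Answer: C", "I am not sure."]
answers = ["B", "C", "A"]
print(f"accuracy: {multiple_choice_accuracy(replies, answers):.1%}")
```

Note that this simple pattern would misread a reply starting with the article "A", so constraining the model to answer with a letter only is the safer setup.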

Optimization Features

Token Efficiency
3D downsampling reduces the number of spatial-temporal tokens (e.g., 576–1152 tokens in ablations)
Model Optimization
Keep pre-trained visual/audio encoders frozen to reduce optimization scope
Use STC connector to compress tokens before LLM to lower LLM compute
Training Optimization

Multi-stage training: pretrain connector/projector, multi-task fine-tune, then joint AV fine-tune

Large global batch sizes (pretrain 1024, fine-tune 2048) and short epoch counts
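Global batch sizes this large are normally reached through gradient accumulation on smaller hardware. The sketch below shows that arithmetic for the two batch sizes quoted above. The per-device batch size and device count are hypothetical values chosen for illustration, not the authors' setup.

```python
def grad_accum_steps(global_batch, per_device_batch, num_devices):
    """Accumulation steps needed to reach a target global batch size."""
    per_step = per_device_batch * num_devices
    if global_batch % per_step != 0:
        raise ValueError("global batch must be a multiple of per-step batch")
    return global_batch // per_step


# Global batch sizes from the summary; hardware numbers are hypothetical.
STAGES = {"pretrain": 1024, "fine-tune": 2048}
PER_DEVICE_BATCH = 4
NUM_DEVICES = 8

for stage, global_batch in STAGES.items():
    steps = grad_accum_steps(global_batch, PER_DEVICE_BATCH, NUM_DEVICES)
    print(f"{stage}: {steps} accumulation steps")
```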

Inference Optimization
Fixed frame sampling (8 or 16 frames) to control token budget
Greedy decoding for QA; low-temperature sampling for captions
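Fixed frame sampling can be sketched as picking a constant number of evenly spaced frame indices per clip. The version below takes the midpoint of each equal-length segment. Uniform spacing is an assumption here, since the summary only states the frame counts, and `sample_frame_indices` is a hypothetical helper name.

```python
def sample_frame_indices(total_frames, num_samples=8):
    """Pick num_samples frame indices spread evenly across a clip.

    Each index is the midpoint of one of num_samples equal segments.
    Clips shorter than num_samples return every available frame.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if total_frames <= num_samples:
        return list(range(total_frames))
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


print(sample_frame_indices(240, 8))
print(len(sample_frame_indices(240, 16)))
```

Because the sample count is fixed, the token budget passed to the LLM stays constant regardless of clip length.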

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

LLM backbone search is limited; authors used Mistral, Mixtral, and Qwen2 but did not explore others.

Pretraining and fine-tuning run for very few epochs (1 pretrain, up to 3 fine-tune), which may limit final quality.

When Not To Use

If you need per-frame dense token reasoning across hundreds of frames; STC compresses tokens and may remove very fine-grained frame-level detail.

If you require fully reproducible public-data pipelines: parts of training data are large curated subsets and not bundled with the repo.

Failure Modes

Temporal misalignment when audio and frames are imperfectly synced or missing (authors pad silent tracks with zeros).

Hallucinated or overly generic captions due to greedy decoding and limited fine-tuning epochs.
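The zero-padding of missing audio mentioned above can be sketched as follows. This is an illustrative helper, not the authors' preprocessing code. The 16 kHz sample rate and the name `audio_or_silence` are assumptions.

```python
def audio_or_silence(waveform, target_len):
    """Return a fixed-length waveform, using zeros when audio is missing.

    waveform: list of float samples, or None if the clip has no audio track.
    """
    if waveform is None:
        return [0.0] * target_len
    clipped = list(waveform[:target_len])
    return clipped + [0.0] * (target_len - len(clipped))


SAMPLE_RATE = 16_000            # assumed sample rate, not taken from the source
target = SAMPLE_RATE * 2        # two-second clip for illustration
silent = audio_or_silence(None, target)
short = audio_or_silence([0.1, -0.2, 0.3], target)
print(len(silent), len(short), short[:4])
```

A zero-filled track is indistinguishable from genuine silence, so pipelines that care about this failure mode may want to carry a separate has-audio flag alongside the waveform.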

Core Entities

Models

VideoLLaMA 2 (7B, 8x7B, 72B, VideoLLaMA2.1)

STC Connector (RegStage + 3D Conv)

BEATs audio encoder

CLIP ViT-L/14 (clip-large-336)

Mistral-7B-Instruct, Mixtral-8x7B-Instruct, Qwen2-7B/72B-Instruct

Metrics

Accuracy

GPT-assisted correctness scores (OE-VQA / VC)

Human-like ChatGPT scores for caption correctness and detail

Task-specific scores (MUSIC-QA, VGGSound)

Datasets

Panda-70M (filtered subset 2.8M)

WebVid-10M (4M used)

VIDAL-10M (2.8M used)

InternVid-10M (650K used)

CC-3M (595K used)

WavCaps, AudioCaps, Clotho, VGGSound, TUT2017, TUT2016, VocalSound, MusicCaps

AVQA, AVQA-music, AVSD, SthSthv2, Kinetics-710, MSVD, ActivityNet, EgoQA

Benchmarks

EgoSchema, PerceptionTest, MV-Bench, VideoMME, MSVC (captioning)

MSVD-QA, ActivityNet-QA, Video-ChatGPT

Clotho-AQA, TUT2017, VocalSound

MUSIC-QA, AVSD, VGGSound