STC connector + audio branch: stronger video and audio understanding for Video-LLMs

June 11, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.55

Cost Impact Score

0.45

Citation Count

10

Authors

Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing

Links

Abstract / PDF

Why It Matters For Business

VideoLLaMA 2 improves video and audio understanding while keeping changes to the pre-trained encoders and the LLM minimal. This lowers the data and compute needed to reach strong open-source performance and speeds integration into product pipelines.

Summary TLDR

VideoLLaMA 2 upgrades Video-LLMs with a Spatial-Temporal Convolution (STC) connector (RegStage blocks + 3D conv) and a jointly trained audio branch (BEATs encoder). The model is trained in multiple stages (video, audio, then joint audio-video) on large public datasets. It outperforms prior open-source Video-LLMs on several video QA and caption benchmarks and shows strong audio question-answering results with far fewer training hours than some competing audio models. Code and models are released.

Problem Statement

Video-LLMs struggle to capture short- and long-range temporal patterns and often ignore synchronous audio. This limits accuracy on video QA and audio-video tasks and makes models inefficient when temporally redundant frame tokens are passed to LLM decoders.

Main Contribution

Spatial-Temporal Convolution (STC) connector: uses RegStage blocks and 3D convolution to compress and preserve local spatial-temporal details into fewer tokens.

Audio Branch with BEATs encoder and MLP projector: jointly trained to add audio understanding and audio-visual fusion.

Multi-stage training recipe: large weakly-labeled pretraining, multi-task fine-tuning, then audio-video joint training using many public datasets.

Empirical evaluation: outperforms open-source peers on multiple video QA and caption benchmarks and gives strong audio QA results with far fewer training hours.

Release: models and code are public to support follow-up work.
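
As a rough illustration of the token compression the STC connector relies on, the sketch below applies a strided 3D convolution to a toy grid of tokens using only the Python standard library. It is not the released implementation: the real connector uses learned weights on feature vectors and wraps the convolution in RegStage blocks, whereas this sketch assumes uniform weights, scalar tokens, and a hypothetical function name `conv3d_downsample`.

```python
from itertools import product


def conv3d_downsample(grid, kernel=(2, 2, 2)):
    """Toy 3D convolution whose stride equals its kernel size.

    grid is a nested list indexed [t][h][w] with one float per token.
    ASSUMPTION: uniform weights (a mean filter) stand in for learned weights.
    """
    kt, kh, kw = kernel
    T, H, W = len(grid), len(grid[0]), len(grid[0][0])
    if T % kt or H % kh or W % kw:
        raise ValueError("each dimension must be divisible by the kernel size")
    weight = 1.0 / (kt * kh * kw)
    out = [[[0.0] * (W // kw) for _ in range(H // kh)] for _ in range(T // kt)]
    for t, h, w in product(range(T // kt), range(H // kh), range(W // kw)):
        acc = 0.0
        # Sum the kt x kh x kw spatio-temporal neighbourhood of this output cell.
        for dt, dh, dw in product(range(kt), range(kh), range(kw)):
            acc += weight * grid[t * kt + dt][h * kh + dh][w * kw + dw]
        out[t][h][w] = acc
    return out


# Toy input: 4 frames of 4x4 tokens; every token in frame t has value t.
frames = [[[float(t)] * 4 for _ in range(4)] for t in range(4)]
pooled = conv3d_downsample(frames)
shape = (len(pooled), len(pooled[0]), len(pooled[0][0]))
print(shape, pooled[0][0][0], pooled[1][0][0])
```

Each output cell mixes tokens from two adjacent frames and a 2x2 spatial patch, so 64 input tokens become 8 while every output still carries local spatial and temporal context.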

Key Findings

Adding STC connector (RegStage + 3D conv) yields the best average video QA performance in the architecture sweep.

Numbers: Avg. acc. 45.1 (Table 1, green line)

VideoLLaMA 2 improves multiple-choice video QA accuracy vs prior open-source models.

Numbers: EgoSchema 51.7% vs LLaVA-NeXT-Video 43.9% (VideoLLaMA2-7B, 16 frames)

Audio-only question answering performs strongly despite much less audio training data.

Numbers: Clotho-AQA 68.9% (VideoLLaMA2-A, 4k hrs) vs Qwen-Audio 57.9% (137k hrs)

Joint audio-visual training further raises AVQA and music QA scores.

Numbers: MUSIC-QA 79.2, VGGSound 70.9 (VideoLLaMA2-AV 7B)

Results

EgoSchema accuracy

Value: 51.7%

Baseline: LLaVA-NeXT-Video 43.9%

Clotho-AQA accuracy

Value: 68.9%

Baseline: Qwen-Audio 57.9%

MSVC captioning correctness (GPT eval)

Value: 2.57 (score)

Baseline: GPT4-V 2.70

Who Should Care

What To Try In 7 Days

Swap your video adapter for a small STC-like module (3D conv + local conv blocks) and measure VQA gains.

Add a lightweight audio encoder (BEATs) plus an MLP projector and fine-tune jointly on a few AV datasets.

Run quick benchmarks on EgoSchema and Clotho-AQA to validate gains on your video and audio tasks.

Optimization Features

Token Efficiency

  • 3D downsampling reduces spatial-temporal tokens (e.g., # tokens 576–1152 in ablations)
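
The token counts quoted above are consistent with simple arithmetic, sketched here under stated assumptions. The 336-pixel input and 14-pixel patches come from the CLIP ViT-L/14 (clip-large-336) encoder listed under Core Entities. The 2x2x2 downsampling factor is an assumption, chosen because it reproduces 576 and 1152 tokens for 8 and 16 frames. The function names are hypothetical.

```python
def patch_tokens_per_frame(image_size=336, patch_size=14):
    """Number of patch tokens a ViT emits for one square frame."""
    side = image_size // patch_size
    return side * side


def tokens_after_downsample(num_frames, image_size=336, patch_size=14,
                            factor=(2, 2, 2)):
    """Token count after 3D downsampling by `factor` = (time, height, width).

    ASSUMPTION: factor (2, 2, 2) is illustrative, not confirmed by this summary.
    """
    ft, fh, fw = factor
    side = image_size // patch_size
    return (num_frames // ft) * (side // fh) * (side // fw)


for frames in (8, 16):
    raw = frames * patch_tokens_per_frame()
    compressed = tokens_after_downsample(frames)
    print(f"{frames} frames: {raw} raw tokens -> {compressed} after downsampling")
```

Under these assumptions, 8 frames shrink from 4608 to 576 tokens and 16 frames from 9216 to 1152, an 8x reduction in what the LLM has to process.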

Model Optimization

  • Keep pre-trained visual/audio encoders frozen to reduce optimization scope
  • Use STC connector to compress tokens before LLM to lower LLM compute

Training Optimization

  • Multi-stage training: pretrain connector/projector, multi-task fine-tune, then joint AV fine-tune
  • Large global batch sizes (pretrain 1024, fine-tune 2048) and short epoch counts
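
One way to write the staged recipe down is as a small configuration table, sketched below. The batch sizes and epoch limits for the first two stages are the ones listed above; values this summary does not state are left as `None`. The module names, the choice of which modules train during fine-tuning, and the `accumulation_steps` helper are assumptions for illustration, not the authors' configuration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Pre-trained encoders stay frozen in every stage (see Model Optimization).
FROZEN = ("visual_encoder", "audio_encoder")


@dataclass(frozen=True)
class Stage:
    name: str
    global_batch: Optional[int]   # None = not stated in this summary
    max_epochs: Optional[int]     # None = not stated in this summary
    trainable: Tuple[str, ...]    # ASSUMPTION: illustrative module names


STAGES = (
    Stage("pretrain", 1024, 1, ("stc_connector", "audio_projector")),
    # ASSUMPTION: fine-tuning also updates the LLM.
    Stage("multitask_finetune", 2048, 3, ("stc_connector", "audio_projector", "llm")),
    Stage("joint_av_finetune", None, None, ("stc_connector", "audio_projector", "llm")),
)


def accumulation_steps(global_batch, per_device_batch, num_devices):
    """Gradient-accumulation steps needed to reach a target global batch."""
    per_step = per_device_batch * num_devices
    if global_batch % per_step:
        raise ValueError("global batch must be a multiple of per_device_batch * num_devices")
    return global_batch // per_step


for stage in STAGES:
    if stage.global_batch is not None:
        steps = accumulation_steps(stage.global_batch, per_device_batch=8, num_devices=8)
        print(stage.name, "accumulation steps on 8 devices:", steps)
```

The per-device batch of 8 and the 8 devices are placeholder hardware numbers. The point is that a global batch of 1024 or 2048 is reachable on modest hardware through gradient accumulation.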

Inference Optimization

  • Fixed frame sampling (8 or 16 frames) to control token budget
  • Greedy decoding for QA; low-temperature sampling for captions
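
Fixed frame sampling can be sketched as picking a constant number of evenly spaced frame indices regardless of clip length. The segment-midpoint strategy and the function name below are assumptions. This summary only states that 8 or 16 frames are used.

```python
def sample_frame_indices(total_frames, num_samples=8):
    """Pick `num_samples` evenly spaced frame indices from a clip.

    ASSUMPTION: the clip is split into equal segments and the middle frame
    of each segment is taken; other uniform schemes are equally plausible.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if total_frames <= num_samples:
        # Short clip: keep every frame rather than repeating any.
        return list(range(total_frames))
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


indices_8 = sample_frame_indices(160, num_samples=8)
indices_16 = sample_frame_indices(160, num_samples=16)
print(indices_8)
print(indices_16)
```

Because the number of sampled frames is fixed, the token budget passed to the LLM stays constant whether the clip lasts ten seconds or ten minutes. The trade-off is coarser temporal coverage on long videos.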

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • LLM backbone search is limited; authors used Mistral, Mixtral, and Qwen2 but did not explore others.
  • Pretraining and fine-tuning run for very few epochs (1 pretrain, up to 3 fine-tune), which may limit final quality.
  • Some evaluations rely on GPT-assisted judgments, which can introduce judge bias and noise.

When Not To Use

  • If you need dense per-frame token reasoning across hundreds of frames: STC compresses tokens and may remove very fine-grained frame-level detail.
  • If you require fully reproducible public-data pipelines: parts of training data are large curated subsets and not bundled with the repo.

Failure Modes

  • Temporal misalignment when audio and frames are imperfectly synced or missing (authors pad silent tracks with zeros).
  • Hallucinated or overly generic captions due to greedy decoding and limited fine-tuning epochs.
  • Benchmark judge bias from GPT-assisted binary scoring can over- or under-estimate real-world correctness.
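
The zero-padding of silent tracks mentioned in the first failure mode can be sketched as follows. The function name, the 16 kHz sample rate, and the fixed clip length are assumptions for illustration. The summary only states that missing audio is padded with zeros.

```python
def audio_or_silence(waveform, num_samples):
    """Return exactly `num_samples` audio samples.

    A missing track (None) becomes all zeros; a short track is right-padded
    with zeros; a long track is truncated.
    """
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    if waveform is None:
        return [0.0] * num_samples
    clipped = list(waveform[:num_samples])
    return clipped + [0.0] * (num_samples - len(clipped))


SAMPLE_RATE = 16_000  # ASSUMPTION: illustrative sample rate
silent_clip = audio_or_silence(None, SAMPLE_RATE * 2)    # 2 s of silence
short_clip = audio_or_silence([0.1, -0.2, 0.3], 5)       # padded to 5 samples
print(len(silent_clip), short_clip)
```

Padding keeps tensor shapes uniform across a batch, but the model cannot tell a truly silent scene from a video with no audio track unless that distinction is signalled separately.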

Core Entities

Models

  • VideoLLaMA 2 (7B, 8x7B, 72B, VideoLLaMA2.1)
  • STC Connector (RegStage + 3D Conv)
  • BEATs audio encoder
  • CLIP ViT-L/14 (clip-large-336)
  • Mistral-7B-Instruct, Mixtral-8x7B-Instruct, Qwen2-7B/72B-Instruct

Metrics

  • Accuracy
  • GPT-assisted correctness scores (OE-VQA / VC)
  • Human-like ChatGPT scores for caption correctness and detail
  • Task-specific scores (MUSIC-QA, VGGSound)

Datasets

  • Panda-70M (filtered subset 2.8M)
  • WebVid-10M (4M used)
  • VIDAL-10M (2.8M used)
  • InternVid-10M (650K used)
  • CC-3M (595K used)
  • WavCaps, AudioCaps, Clotho, VGGSound, TUT2017, TUT2016, VocalSound, MusicCaps
  • AVQA, AVQA-music, AVSD, SthSthv2, Kinetics-710, MSVD, ActivityNet, EgoQA

Benchmarks

  • EgoSchema, PerceptionTest, MV-Bench, VideoMME, MSVC (captioning)
  • MSVD-QA, ActivityNet-QA, Video-ChatGPT
  • Clotho-AQA, TUT2017, VocalSound
  • MUSIC-QA, AVSD, VGGSound