FAVOR: frame-level audio+visual fusion and causal Q-Former to help LLMs understand speech, sounds and video together

October 9, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

5

Authors

Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang

Why It Matters For Business

FAVOR enables LLMs to reason over speech, sounds and video together at frame level, improving video QA and matching tasks that power search, content moderation, AV indexing, and multimedia assistants.

Summary TLDR

This paper introduces FAVOR, a method that fuses audio (speech and non-speech sounds) and visual frames at high temporal resolution and maps them into a language model's input space. Its key components are a temporal synchronisation step, sliding windows, and a causal Q-Former (a Q-Former with causal attention) that produces a compact set of query tokens per window. A diversity loss reduces redundancy among those tokens. On the AVEB benchmark (11 tasks, single- and cross-modal), FAVOR matches single-modality baselines and yields large gains on cross-modal reasoning: for example, 49.3% Video QA accuracy vs 21.0% for InstructBLIP, an absolute gain of 28.3 points on the evaluated Video QA split. The method is computationally tunable (window size, FPS) and benefits speech+vision co-reasoning.

Problem Statement

Existing multimodal LLMs often treat video as a few sampled images and audio as a fixed spectrogram. They miss frame-level temporal alignment and speech understanding, which reduces performance on tasks that need fine-grained timing or speech+vision co-reasoning.

Main Contribution

FAVOR framework: frame-level audio-visual synchronisation, sliding windows, and alignment to LLM token space.

Causal Q-Former: adds a causal self-attention module to capture temporal causal relations across frames.

Diversity loss: penalises redundant query outputs so windows yield diverse information.

AVEB benchmark: 11 tasks (6 single-modal, 5 cross-modal) to test audio-visual perception and co-reasoning.
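The causal attention idea behind the causal Q-Former can be illustrated with a minimal sketch. This is not the paper's implementation: it uses a single head, identity projections instead of learned weights, and omits the Q-Former's learned queries and cross-attention. The function name `causal_self_attention` is hypothetical. It only shows the masking rule, where the output for frame t depends on frames 0..t and never on later frames.

```python
import math

def causal_self_attention(frames):
    """Single-head causal self-attention with identity projections (illustrative only).

    frames: list of T vectors (lists of floats), one per time step.
    The output at position t is a softmax-weighted mix of frames 0..t.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for t, query in enumerate(frames):
        visible = frames[: t + 1]  # causal mask: future frames are not visible
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in visible]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, visible))
            for d in range(dim)
        ])
    return outputs

frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = causal_self_attention(frames)
```

Because of the mask, replacing the last frame leaves the outputs for the earlier frames unchanged, which is the property that preserves temporal order.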

Key Findings

FAVOR substantially improves video QA accuracy on the evaluated AVEB split.

Numbers: FAVOR 13B Video QA 49.3% vs InstructBLIP 13B 21.0% (Table 2)

FAVOR delivers strong audio-visual matching (AVM).

Numbers: FAVOR 13B AVM 77.1% vs Video-LLaMA 7B 52.3% (Table 3)

Audio-visual sound-source detection (AVSSD) also improves, though by a smaller margin than AVM.

Numbers: FAVOR 13B AVSSD 51.1% vs Video-LLaMA 7B 41.9% (Table 3)

Speech recognition accuracy is competitive with strong ASR baselines.

Numbers: FAVOR audio-only WER 2.7% vs Whisper large-v2 2.9% (Table 2)

Ablations show causal encoder, sliding windows and synchronisation are key.

Numbers: Removing the causal encoder drops Video QA 49.3%→42.8%; removing synchronisation drops ISQA 32.3%→17.2% (Table 4)

Diversity loss trades off coverage vs hallucination.

Numbers: A high diversity factor increases hallucination and insertion errors in WER; a small λ helps spread queries (Figure 4, Sec.
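One way to picture a diversity loss of this kind is as a penalty on how similar the query outputs of a window are to each other, weighted by λ and added to the task loss. The sketch below assumes mean pairwise cosine similarity as the penalty; that choice, the function names, and the default `lam=0.1` are illustrative assumptions, not the paper's exact formulation or settings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def diversity_penalty(queries):
    """Mean pairwise cosine similarity among the query outputs of one window."""
    n = len(queries)
    if n < 2:
        return 0.0
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(queries[i], queries[j]) for i, j in pairs) / len(pairs)

def total_loss(task_loss, queries, lam=0.1):
    """Task loss plus the weighted penalty; lam plays the role of λ (hypothetical default)."""
    return task_loss + lam * diversity_penalty(queries)

redundant = [[1.0, 0.0], [2.0, 0.0]]   # two queries pointing the same way
spread = [[1.0, 0.0], [0.0, 1.0]]      # two orthogonal queries
```

Under this assumed penalty, `redundant` scores 1.0 and `spread` scores 0.0, so a larger λ pushes harder toward dissimilar queries, which matches the trade-off described above.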

Results

Video QA accuracy: 49.3% (baseline: InstructBLIP 13B, 21.0%)

AVM accuracy: 77.1% (baseline: Video-LLaMA 7B, 52.3%)

AVSSD accuracy: 51.1% (baseline: Video-LLaMA 7B, 41.9%)

ASR WER (audio-only): 2.7% (baseline: Whisper large-v2, 2.9%)

AVSR WER (audio-visual speech recognition): 8.1% (baseline: Whisper large-v2, 8.3%)
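For readers less familiar with WER, the metric is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a generic implementation, not the paper's scoring script; the function name `wer_breakdown` is hypothetical. It also splits errors into substitutions, deletions and insertions, the categories used when discussing insertion and deletion errors.

```python
def wer_breakdown(reference, hypothesis):
    """Word error rate with a substitution/deletion/insertion breakdown."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Each cell holds (errors, substitutions, deletions, insertions)
    # for aligning ref[:i] with hyp[:j].
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        curr = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                curr.append(prev[j - 1])  # words match, no new error
                continue
            e, s, d, n = prev[j - 1]
            substitution = (e + 1, s + 1, d, n)
            e, s, d, n = prev[j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = curr[j - 1]
            insertion = (e + 1, s, d, n + 1)
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    errors, subs, dels, inss = prev[-1]
    return {"wer": errors / len(ref), "sub": subs, "del": dels, "ins": inss}

extra_word = wer_breakdown("the cat sat", "the cat sat down")  # one insertion
missing_word = wer_breakdown("the cat sat", "the sat")         # one deletion
```

A hallucinated word shows up as an insertion, and a skipped word as a deletion.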

Who Should Care

What To Try In 7 Days

Prototype frame-level sync: sample video at 2 FPS and align audio frames per video frame.

Add a causal self-attention block over per-frame multimodal tokens to capture temporal order.

Use sliding windows to control token budget and preserve local causality for long clips.
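A prototype of the first and third steps above could start from index bookkeeping alone, before any model is involved. The sketch below assumes video sampled at 2 FPS and an audio encoder that emits 50 frames per second, so 25 audio frames fall under each video frame; that ratio and the function names `synchronise` and `sliding_windows` are assumptions for illustration.

```python
def synchronise(num_video_frames, num_audio_frames, audio_per_video=25):
    """For each video frame, list the indices of the audio frames in its time span.

    audio_per_video=25 assumes 50 Hz audio frames and 2 FPS video (an assumption).
    """
    groups = []
    for v in range(num_video_frames):
        start = v * audio_per_video
        end = min(start + audio_per_video, num_audio_frames)
        groups.append(list(range(start, end)))  # empty if the audio ends early
    return groups

def sliding_windows(items, window, stride=None):
    """Split a sequence into windows of `window` items; stride defaults to the window size."""
    stride = stride or window
    return [items[i:i + window] for i in range(0, len(items), stride)]

# A 2-second clip: 4 video frames at 2 FPS, 90 audio frames (audio slightly short).
aligned = synchronise(num_video_frames=4, num_audio_frames=90)
windows = sliding_windows(aligned, window=2)
```

Each window then holds a few synchronised audio-visual steps, which is the unit a per-window module would consume. With the default stride the windows do not overlap, and the last window may be shorter than the rest.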

Agent Features

Tool Use

  • LoRA

Architectures

  • causal Q-Former
  • sliding-window query projection
  • LoRA

Optimization Features

Token Efficiency

  • controls output queries via window size and N (queries per window)
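A rough way to estimate the resulting token budget is sketched below. It assumes non-overlapping windows that each emit N query tokens, so the total is the number of windows times N. The function name `llm_token_budget` and the example values are hypothetical and are not settings reported in the paper.

```python
import math

def llm_token_budget(duration_s, fps, window_frames, queries_per_window):
    """Estimate how many query tokens reach the LLM for one clip.

    Assumes non-overlapping windows, each producing `queries_per_window` tokens.
    """
    num_frames = math.ceil(duration_s * fps)
    num_windows = math.ceil(num_frames / window_frames)
    return num_windows * queries_per_window

# Hypothetical settings: a 30-second clip, 10-frame windows, 32 queries per window.
at_2_fps = llm_token_budget(30, fps=2, window_frames=10, queries_per_window=32)
at_4_fps = llm_token_budget(30, fps=4, window_frames=10, queries_per_window=32)
```

Under these assumptions, doubling the frame rate doubles the number of windows and therefore the number of tokens passed to the LLM.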

Model Optimization

  • LoRA

Training Optimization

  • multi-task instruction fine-tuning
  • diversity loss to spread query outputs
  • sliding-window training to handle variable length

Inference Optimization

  • sliding windows to trade tokens vs coverage

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Higher frame rates and larger windows increase token count and compute quickly.
  • Diversity loss must be tuned; large weight causes hallucination and worse WER.
  • The model depends on pre-trained encoders (Whisper, InstructBLIP) and on paired audio-visual data for best results.
  • Not all AVEB tasks are seen during training (some are zero-shot), so reported performance mixes trained and zero-shot evaluations.

When Not To Use

  • If you only need a standalone state-of-the-art ASR or image-only system, since specialised models may be simpler.
  • When the compute budget cannot handle the additional LLM tokens from high FPS or long windows.
  • If the training data lacks synchronised audio-visual pairs, since synchronisation is central to the gains.

Failure Modes

  • Hallucination increases with aggressive diversity loss (high λ).
  • Missing synchronisation reduces co-reasoning and can drop ISQA/AVM performance sharply.
  • Too-large windows hurt ASR monotonic alignment and increase deletion errors.

Core Entities

Models

  • LoRA
  • Vicuna-7B
  • Vicuna-13B
  • Whisper large-v2
  • InstructBLIP (BLIP-2 + Q-Former)
  • Video-LLaMA

Metrics

  • WER
  • SPIDEr
  • CIDEr
  • METEOR
  • Accuracy

Datasets

  • AVEB (this paper)
  • LibriSpeech
  • AudioCaps
  • Flickr30k
  • TextVQA
  • GQA
  • NExT-QA
  • How2
  • Ego4D
  • VGGSS
  • SpokenCOCO
  • COCO

Benchmarks

  • AVEB