FAVOR: frame-level audio+visual fusion and causal Q-Former to help LLMs understand speech, sounds and video together

October 9, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The approach shows clear gains on cross-modal tasks, and ablations confirm the key components. However, code and model checkpoints are not yet released, and the experiments rely on many public datasets and on compute.

Citations: 5

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang

Why It Matters For Business

FAVOR enables LLMs to reason over speech, sounds and video together at frame level, improving video QA and matching tasks that power search, content moderation, AV indexing, and multimedia assistants.

Who Should Care

Summary TLDR

This paper introduces FAVOR, a method that fuses audio (speech + sounds) and visual frames at high temporal resolution and maps them into a language model input space. Key pieces are: a temporal synchronisation step, sliding windows, and a causal Q-Former (causal attention) that produces compact query tokens per window. A diversity loss reduces redundancy among the output queries. On the AVEB benchmark (11 tasks, single- and cross-modal), FAVOR matches single-modality baselines and yields large gains on cross-modal reasoning: e.g., 49.3% Video QA accuracy vs 21.0% for InstructBLIP (absolute +28.3 points) on the evaluated Video QA split. The method is computationally tunable (window size, FPS) and benefits speech+vision co-reasoning.

Problem Statement

Existing multimodal LLMs often treat video as a few sampled images and audio as a fixed spectrogram. They miss frame-level temporal alignment and speech understanding, which reduces performance on tasks that need fine-grained timing or speech+vision co-reasoning.

Main Contribution

FAVOR framework: frame-level audio-visual synchronisation, sliding windows, and alignment to LLM token space.

Causal Q-Former: adds a causal self-attention module to capture temporal causal relations across frames.
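The causal attention idea can be illustrated with a small sketch. It is not the paper's causal Q-Former: it is a generic single-head causal self-attention over per-frame feature vectors, written in plain Python, and it assumes identity query/key/value projections (a real module would learn them). The function name `causal_self_attention` is hypothetical.

```python
import math


def causal_self_attention(frames):
    """Single-head causal self-attention over per-frame feature vectors.

    Assumption: identity Q/K/V projections, for illustration only.
    frames: list of T vectors (lists of floats), one per frame.
    Returns T vectors; output t mixes frames 0..t and never later ones.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for t, query in enumerate(frames):
        # Causal mask: frame t may only attend to frames 0..t.
        visible = frames[: t + 1]
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in visible]
        # Numerically stable softmax over the visible scores.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible frame vectors.
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, visible)) for d in range(dim)]
        )
    return outputs
```

Because of the mask, changing a later frame leaves the outputs for all earlier frames untouched, which is the property that lets the model encode temporal order.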

Key Findings

FAVOR substantially improves video QA accuracy on the evaluated AVEB split.

Numbers: FAVOR 13B Video QA 49.3% vs InstructBLIP 13B 21.0% (Table 2)

Practical Use: If you need better video causal reasoning (what happens next, speech+visual cues), add frame-level synchronisation and causal attention; expect large absolute gains on similar QA sets.

Evidence Ref: Table 2, Section 5.1

FAVOR delivers strong audio-visual matching (AVM).

Numbers: FAVOR 13B AVM 77.1% vs Video-LLaMA 7B 52.3% (Table 3)

Practical Use: For tasks that require checking whether audio describes a video or image, synchronised frame-level fusion is effective and can raise matching accuracy substantially.

Evidence Ref: Table 3, Section 5.1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 49.3% | InstructBLIP 13B 21.0% | +28.3 pp | AVEB Video QA test (NExT-QA subset) | Table 2 shows FAVOR 13B 49.3% vs InstructBLIP 13B 21.0% | Table 2
Accuracy | 77.1% | Video-LLaMA 7B 52.3% | +24.8 pp | AVEB AVM | Table 3 shows FAVOR 13B 77.1% vs Video-LLaMA 7B 52.3% | Table 3

What To Try In 7 Days

Prototype frame-level sync: sample video at 2 FPS and align audio frames per video frame.

Add a causal self-attention block over per-frame multimodal tokens to capture temporal order.

Use sliding windows to control token budget and preserve local causality for long clips.
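The synchronisation and windowing steps above can be prototyped with index arithmetic alone. The sketch below is a minimal illustration, not the paper's implementation: the helper names are hypothetical, and the audio feature rate of 50 frames per second used in the usage note is an assumed example value that depends on the audio encoder you pick.

```python
def align_audio_to_video(num_video_frames, video_fps, audio_fps):
    """Map each video frame to the half-open range of audio frames it spans.

    Assumption: both streams start at t = 0 and run at constant rates.
    Returns a list of (audio_start, audio_end) index pairs, one per video frame.
    """
    per_frame = audio_fps / video_fps  # audio frames per video frame
    return [
        (round(i * per_frame), round((i + 1) * per_frame))
        for i in range(num_video_frames)
    ]


def sliding_windows(num_frames, window):
    """Split frame indices into consecutive, non-overlapping windows.

    Returns (start, end) pairs; the last window may be shorter than `window`.
    """
    return [
        (start, min(start + window, num_frames))
        for start in range(0, num_frames, window)
    ]
```

With video at 2 FPS and an assumed 50 audio frames per second, `align_audio_to_video(3, 2, 50)` gives `[(0, 25), (25, 50), (50, 75)]`, and `sliding_windows(10, 4)` gives `[(0, 4), (4, 8), (8, 10)]`. Each window's synchronised frames would then be passed through the causal attention block and projected to query tokens.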

Agent Features

Tool Use
LoRA
Architectures
causal Q-Former, sliding-window query projection, LoRA

Optimization Features

Token Efficiency
controls output queries via window size and N (queries per window)
Model Optimization
LoRA
Training Optimization
multi-task instruction fine-tuning; diversity loss to spread query outputs; sliding-window training to handle variable length
Inference Optimization
sliding windows to trade tokens vs coverage; Accuracy
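To see how window size and N trade off against token count, a rough budget estimate helps. The sketch below assumes non-overlapping windows and that every window emits exactly N query tokens; the function name and the example numbers are hypothetical, not values from the paper.

```python
import math


def llm_token_budget(duration_s, fps, window_frames, queries_per_window):
    """Estimate how many query tokens a clip contributes to the LLM input.

    Assumptions: non-overlapping windows, a fixed number of queries (N)
    per window, and a partially filled final window still costing N tokens.
    """
    num_frames = math.ceil(duration_s * fps)
    num_windows = math.ceil(num_frames / window_frames)
    return num_windows * queries_per_window
```

For example, a 30-second clip at 2 FPS with 10-frame windows and N = 2 yields 12 tokens; doubling the frame rate to 4 FPS doubles that to 24, which is why higher FPS raises compute quickly.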

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Higher frame rates and larger windows increase token count and compute quickly.

The diversity loss must be tuned; a large weight causes hallucination and worse WER.
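The role of the diversity weight can be made concrete with a sketch. This summary does not give the paper's exact formula, so the code below assumes one common choice: the mean pairwise cosine similarity between the output queries of a window, added to the task loss with weight λ (`lam`). All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def diversity_penalty(queries):
    """Mean pairwise cosine similarity between distinct queries in one window.

    Assumed form of the penalty: lower values mean less redundant queries.
    """
    n = len(queries)
    if n < 2:
        return 0.0
    total = sum(
        cosine(queries[i], queries[j])
        for i in range(n)
        for j in range(n)
        if i != j
    )
    return total / (n * (n - 1))


def total_loss(task_loss, queries, lam):
    """Task loss plus the weighted diversity penalty."""
    return task_loss + lam * diversity_penalty(queries)
```

Under this assumed form, identical queries give a penalty of 1 and orthogonal queries give 0. A large `lam` lets the penalty dominate the task loss, which is consistent with the tuning sensitivity noted above.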

When Not To Use

If you only need a standalone state-of-the-art ASR or image-only system, specialised models may be simpler.

When the compute budget cannot handle the additional LLM tokens from high FPS or long windows.

Failure Modes

Hallucination increases with aggressive diversity loss (high λ).

Missing synchronisation reduces co-reasoning and can drop ISQA/AVM performance sharply.

Core Entities

Models

LoRA, Vicuna-7B, Vicuna-13B, Whisper large-v2, InstructBLIP (BLIP-2 + Q-Former), Video-LLaMA

Metrics

WER, SPIDEr, CIDEr, METEOR, Accuracy

Datasets

AVEB (this paper), LibriSpeech, AudioCaps, Flickr30k, TextVQA, GQA, NExT-QA, How2, Ego4D, VGGSS, SpokenCOCO, COCO

Benchmarks

AVEB