FAVOR: frame-level audio+visual fusion and causal Q-Former to help LLMs understand speech, sounds and video together

Overview

Decision Snapshot: Needs Validation

The approach shows clear gains on cross-modal tasks, and ablations confirm the key components. However, code and model checkpoints are not yet released, and the experiments depend on many public datasets and compute resources.

Citations: 5

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang

Links

Abstract / PDF

Why It Matters For Business

FAVOR enables LLMs to reason over speech, sounds and video together at frame level, improving video QA and matching tasks that power search, content moderation, AV indexing, and multimedia assistants.

Who Should Care

ML Engineer, Product Manager, Data Scientist

Summary TLDR

This paper introduces FAVOR, a method that fuses audio (speech and sounds) and visual frames at high temporal resolution and maps them into a language model's input space. The key pieces are a temporal synchronisation step, sliding windows, and a causal Q-Former (with causal attention) that produces compact query tokens per window. A diversity loss reduces redundancy. On the AVEB benchmark (11 tasks, single- and cross-modal), FAVOR matches single-modality baselines and yields large gains on cross-modal reasoning: for example, 49.3% Video QA accuracy vs 21.0% for InstructBLIP (+28.3 percentage points) on the evaluated Video QA split. The method is computationally tunable (window size, FPS) and benefits speech+visual

Problem Statement

Existing multimodal LLMs often treat video as a few sampled images and audio as a fixed spectrogram. They miss frame-level temporal alignment and speech understanding, which reduces performance on tasks that need fine-grained timing or speech+vision co-reasoning.

Main Contribution

FAVOR framework: frame-level audio-visual synchronisation, sliding windows, and alignment to LLM token space.

Causal Q-Former: adds a causal self-attention module to capture temporal causal relations across frames.
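To make the idea of causal attention across frames concrete, here is a minimal sketch in plain Python. It is not the paper's Q-Former module: it assumes a single attention head, uses the raw frame vectors as queries, keys and values (no learned projections), and the function name `causal_self_attention` is hypothetical. It only illustrates the masking rule, namely that the output for frame t depends on frames 0..t and never on later ones.

```python
import math


def causal_self_attention(frames):
    """Single-head self-attention where frame t attends only to frames 0..t.

    `frames` is a list of equal-length float vectors, one per frame.
    Assumption: queries, keys and values are the raw inputs, with no
    learned projections, to keep the sketch minimal.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for t, query in enumerate(frames):
        # Causal mask: only positions <= t are visible to position t.
        visible = frames[: t + 1]
        scores = [
            scale * sum(q * k for q, k in zip(query, key)) for key in visible
        ]
        # Numerically stable softmax over the visible positions.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the visible value vectors.
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, visible)) for d in range(dim)]
        )
    return outputs


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_self_attention(frames)
# The first frame can only attend to itself, so it is returned unchanged.
print(out[0])
```

A quick way to check causality is to change a later frame and confirm that the outputs for earlier frames stay identical.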

Key Findings

FAVOR substantially improves video QA accuracy on the evaluated AVEB split.

Numbers: FAVOR 13B Video QA 49.3% vs InstructBLIP 13B 21.0% (Table 2)

Practical Use: If you need better video causal reasoning (what happens next, speech+visual cues), add frame-level synchronisation and causal attention; expect large absolute gains on similar QA sets.

Evidence Ref: Table 2, Section 5.1

FAVOR delivers strong audio-visual matching (AVM).

Numbers: FAVOR 13B AVM 77.1% vs Video-LLaMA 7B 52.3% (Table 3)

Practical Use: For tasks that require checking whether audio describes a video or image, synchronised frame-level fusion is effective and can raise matching accuracy substantially.

Evidence Ref: Table 3, Section 5.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	49.3%	InstructBLIP 13B 21.0%	+28.3 pp	AVEB Video QA test (NExT-QA subset)	Table 2 shows FAVOR 13B 49.3% vs InstructBLIP 13B 21.0%	Table 2
Accuracy	77.1%	Video-LLaMA 7B 52.3%	+24.8 pp	AVEB AVM	Table 3 FAVOR 13B 77.1% vs Video-LLaMA 52.3%	Table 3
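The deltas in the table are absolute differences in percentage points, not relative improvements. The short check below recomputes them from the reported values; the helper names `delta_pp` and `relative_gain` are illustrative only. Note that the AVM row compares a 13B model against a 7B baseline, so that gap is not a like-for-like size comparison.

```python
def delta_pp(value, baseline):
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(value - baseline, 1)


def relative_gain(value, baseline):
    """Ratio of the reported value to the baseline value."""
    return value / baseline


# Values as reported in the Results table above.
print(delta_pp(49.3, 21.0))       # 28.3 (Video QA, FAVOR 13B vs InstructBLIP 13B)
print(delta_pp(77.1, 52.3))       # 24.8 (AVM, FAVOR 13B vs Video-LLaMA 7B)
print(relative_gain(49.3, 21.0))  # roughly 2.35x
print(relative_gain(77.1, 52.3))  # roughly 1.47x
```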

What To Try In 7 Days

Prototype frame-level sync: sample video at 2 FPS and align audio frames per video frame.

Add a causal self-attention block over per-frame multimodal tokens to capture temporal order.

Use sliding windows to control token budget and preserve local causality for long clips.
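The synchronisation and windowing steps above can be prototyped on indices alone, before any encoder is involved. The sketch below is one possible starting point, not the authors' pipeline. The 2 FPS video rate comes from the text; the 50 audio feature frames per second, the function names and the window size are assumptions chosen for illustration.

```python
def align_audio_to_video(num_video_frames, num_audio_frames,
                         video_fps=2.0, audio_fps=50.0):
    """For each video frame, list the audio frame indices in its time span.

    Assumption: both streams start at t = 0 and have constant frame rates.
    """
    aligned = []
    for v in range(num_video_frames):
        start = int(round(v / video_fps * audio_fps))
        end = int(round((v + 1) / video_fps * audio_fps))
        # Clip to the audio actually available.
        aligned.append(list(range(start, min(end, num_audio_frames))))
    return aligned


def sliding_windows(items, window_size, stride=None):
    """Split a sequence into windows; stride defaults to non-overlapping."""
    stride = window_size if stride is None else stride
    return [items[i:i + window_size] for i in range(0, len(items), stride)]


# 2 seconds of video at 2 FPS and 2 seconds of audio at 50 frames per second.
aligned = align_audio_to_video(num_video_frames=4, num_audio_frames=100)
print([len(a) for a in aligned])           # [25, 25, 25, 25]
print(sliding_windows(list(range(5)), 2))  # [[0, 1], [2, 3], [4]]
```

Each window of synchronised frames would then be passed to the query module, so the window size directly sets how much temporal context each group of queries sees.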

Agent Features

Tool Use

LoRA

Architectures

causal Q-Former, sliding-window query projection, LoRA

Optimization Features

Token Efficiency

controls output queries via window size and N (queries per window)
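As a rough budgeting aid, the number of query tokens passed to the LLM can be estimated from clip length, frame rate, window size and N. The sketch below assumes non-overlapping windows and a padded final window; the function name and the example settings (30 s clip, 10-frame windows, N = 32) are hypothetical and not taken from the paper.

```python
import math


def llm_query_tokens(duration_s, fps, window_frames, queries_per_window):
    """Estimate query tokens sent to the LLM for one clip.

    Assumption: windows do not overlap and a partial last window still
    emits a full set of N queries.
    """
    frames = math.ceil(duration_s * fps)
    windows = math.ceil(frames / window_frames)
    return windows * queries_per_window


# Doubling the frame rate doubles the token count at a fixed window size.
print(llm_query_tokens(30, 2, 10, 32))  # 192
print(llm_query_tokens(30, 4, 10, 32))  # 384
```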

Model Optimization

LoRA

Training Optimization

multi-task instruction fine-tuning, diversity loss to spread query outputs, sliding-window training to handle variable length
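A diversity loss of this kind is often written as a penalty on the similarity between query outputs. The sketch below uses mean pairwise cosine similarity within one window, which is one common formulation and an assumption here; the paper's exact definition is not reproduced. The names `diversity_loss` and `total_loss` are hypothetical, and `lam` stands for the loss weight λ.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def diversity_loss(queries):
    """Mean pairwise cosine similarity between distinct query outputs.

    Lower values mean the queries point in more different directions.
    Requires at least two non-zero query vectors.
    """
    n = len(queries)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(queries[i], queries[j]) for i, j in pairs) / len(pairs)


def total_loss(task_loss, queries, lam):
    """Task loss plus the weighted diversity penalty (lam is the weight)."""
    return task_loss + lam * diversity_loss(queries)


print(diversity_loss([[1.0, 0.0], [1.0, 0.0]]))  # 1.0, fully redundant queries
print(diversity_loss([[1.0, 0.0], [0.0, 1.0]]))  # 0.0, orthogonal queries
```

Keeping the weight as an explicit parameter makes it easy to sweep, which matters because the penalty has to be balanced against the task loss.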

Inference Optimization

sliding windows to trade tokens vs coverage, Accuracy

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Higher frame rates and larger windows increase token count and compute quickly.

The diversity loss must be tuned; a large weight causes hallucination and worsens WER.

When Not To Use

If you only need a standalone state-of-the-art ASR or image-only system, specialised models may be simpler.

When the compute budget cannot handle the additional LLM tokens from high FPS or long windows.

Failure Modes

Hallucination increases with aggressive diversity loss (high λ).

Missing synchronisation reduces co-reasoning and can drop ISQA/AVM performance sharply.

Core Entities

Models

LoRA, Vicuna-7B, Vicuna-13B, Whisper large-v2, InstructBLIP (BLIP-2 + Q-Former), Video-LLaMA

Metrics

WER, SPIDEr, CIDEr, METEOR, Accuracy

Datasets

AVEB (this paper), LibriSpeech, AudioCaps, Flickr30k, TextVQA, GQA, NExT-QA, How2, Ego4D, VGGSS, SpokenCOCO, COCO

Benchmarks

AVEB
