Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
5
Why It Matters For Business
FAVOR enables LLMs to reason over speech, sounds and video together at frame level, improving video QA and matching tasks that power search, content moderation, AV indexing, and multimedia assistants.
Summary TLDR
This paper introduces FAVOR, a method that fuses audio (speech and sounds) and visual frames at high temporal resolution and maps them into a language model's input space. Its key pieces are a temporal synchronisation step, sliding windows, and a causal Q-Former (a Q-Former with causal attention) that produces a compact set of query tokens per window. A diversity loss reduces redundancy among those tokens. On the AVEB benchmark (11 tasks, single- and cross-modal), FAVOR matches single-modality baselines and yields large gains on cross-modal reasoning: for example, ~49.3% Video QA accuracy vs 21.0% for InstructBLIP (+28.3 absolute points) on the evaluated Video QA split. The method is computationally tunable (window size, FPS) and benefits speech+vision
Problem Statement
Existing multimodal LLMs often treat video as a few sampled images and audio as a fixed spectrogram. They miss frame-level temporal alignment and speech understanding, which reduces performance on tasks that need fine-grained timing or speech+vision co-reasoning.
Main Contribution
FAVOR framework: frame-level audio-visual synchronisation, sliding windows, and alignment to LLM token space.
Causal Q-Former: adds a causal self-attention module to capture temporal causal relations across frames.
Diversity loss: penalises redundant query outputs so windows yield diverse information.
AVEB benchmark: 11 tasks (6 single-modal, 5 cross-modal) to test audio-visual perception and co-reasoning.
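The causal self-attention idea named above can be sketched in a few lines. This is a minimal illustration, not the paper's module: it assumes single-head attention with no learned projections, and the function name `causal_self_attention` is hypothetical. It shows only the masking rule, where the output at frame t depends on frames 0..t and never on later frames.

```python
import math

def causal_self_attention(frames):
    """Single-head self-attention over per-frame vectors with a causal mask.

    `frames` is a list of equal-length float lists, one per time step.
    Queries, keys and values are the raw inputs (no learned projections),
    which is a simplification for illustration only.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for t, query in enumerate(frames):
        # Causal mask: step t may only attend to steps 0..t.
        visible = frames[: t + 1]
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in visible]
        # Numerically stable softmax over the visible steps.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the visible value vectors.
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, visible))
            for d in range(dim)
        ])
    return outputs
```

Because of the mask, changing a later frame leaves the outputs for earlier frames unchanged, which is the property a causal encoder relies on to preserve temporal order.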
Key Findings
FAVOR substantially improves video QA accuracy on the evaluated AVEB split.
FAVOR delivers strong audio-visual matching (AVM).
Audio-visual sound-source detection improves, but the size of the gain depends on the baseline.
Speech recognition accuracy is competitive with strong ASR baselines.
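Ablations show that the causal encoder, sliding windows, and synchronisation are all key components.
The diversity loss trades coverage against hallucination: a larger weight spreads the query outputs but increases hallucination.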
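To make the diversity-loss trade-off concrete, here is a hedged sketch. The summary only says the loss penalises redundant query outputs, so the exact form below (mean pairwise cosine similarity) is an assumption, as are the function names and the default weight `lam`, which stands in for λ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors, with a small epsilon for safety."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def diversity_penalty(queries):
    """Mean cosine similarity over all distinct pairs of query outputs.

    High values mean the queries carry near-duplicate information.
    This form is an assumption, not the paper's stated formula.
    """
    n = len(queries)
    if n < 2:
        return 0.0
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += cosine(queries[i], queries[j])
            pairs += 1
    return total / pairs

def total_loss(task_loss, queries, lam=0.1):
    """Task loss plus the weighted redundancy penalty (lam is illustrative)."""
    return task_loss + lam * diversity_penalty(queries)
```

With this form, identical queries give a penalty near 1 and orthogonal queries give 0, so raising `lam` pushes the queries apart more strongly.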
Results
Reported metrics: Accuracy (three entries); ASR WER (audio-only); AVSR WER (audio-visual speech recognition).
Who Should Care
What To Try In 7 Days
Prototype frame-level sync: sample video at 2 FPS and align the audio frames to each video frame.
Add a causal self-attention block over per-frame multimodal tokens to capture temporal order.
Use sliding windows to control token budget and preserve local causality for long clips.
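The first and third steps above can be prototyped without any model code. The sketch below is a rough starting point under stated assumptions: the audio encoder rate of 50 frames per second is an assumed default rather than a figure from this summary, and both function names are hypothetical.

```python
def align_audio_to_video(duration_s, video_fps=2.0, audio_fps=50.0):
    """For each video frame, list the audio-frame indices in its time slot.

    audio_fps=50.0 is an assumed encoder output rate; adjust to your encoder.
    """
    n_video = int(duration_s * video_fps)
    n_audio = int(duration_s * audio_fps)
    groups = []
    for v in range(n_video):
        # Audio frames whose start time falls inside video frame v's slot.
        start = int(v * audio_fps / video_fps)
        end = min(int((v + 1) * audio_fps / video_fps), n_audio)
        groups.append(list(range(start, end)))
    return groups

def sliding_windows(items, window, stride=None):
    """Split a sequence into windows; stride defaults to non-overlapping."""
    stride = stride or window
    return [items[i:i + window] for i in range(0, len(items), stride)]

# Example: a 4-second clip at 2 FPS gives 8 synchronised frames,
# grouped into windows of 5 frames (the last window is shorter).
frame_groups = align_audio_to_video(4.0)
windows = sliding_windows(frame_groups, window=5)
```

Each window would then be passed to the causal block separately, which keeps the token budget bounded for long clips.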
Agent Features
Tool Use
- LoRA
Architectures
- causal Q-Former
- sliding-window query projection
- LoRA
Optimization Features
Token Efficiency
- controls the number of output queries via window size and N (queries per window)
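As a back-of-the-envelope check on the token budget, the number of query tokens sent to the LLM can be estimated from clip length, FPS, window size, and N. This sketch assumes non-overlapping windows and a fixed N per window; the function name and the example values are illustrative, not settings from the paper.

```python
import math

def llm_token_count(duration_s, fps, window_frames, queries_per_window):
    """Estimate query tokens for a clip, assuming non-overlapping windows."""
    n_frames = math.ceil(duration_s * fps)
    n_windows = math.ceil(n_frames / window_frames)
    return n_windows * queries_per_window

# Illustrative values only: a 60-second clip, 10-frame windows, N = 32.
tokens_2fps = llm_token_count(60, 2, 10, 32)  # 120 frames, 12 windows
tokens_4fps = llm_token_count(60, 4, 10, 32)  # 240 frames, 24 windows
```

Under these assumptions, doubling the frame rate doubles the token count (384 vs 768 here), which is why FPS and window size are the main cost levers.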
Model Optimization
- LoRA
Training Optimization
- multi-task instruction fine-tuning
- diversity loss to spread query outputs
- sliding-window training to handle variable length
Inference Optimization
- sliding windows to trade tokens vs coverage
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Higher frame rates and larger windows increase token count and compute quickly.
- Diversity loss must be tuned; large weight causes hallucination and worse WER.
- Model depends on pre-trained encoders (Whisper, InstructBLIP) and paired audio-visual data for best results.
- Not all AVEB splits are trained on (some are evaluated zero-shot), so reported performance mixes trained and zero-shot evaluations.
When Not To Use
- If you only need a standalone state-of-the-art ASR or image-only system, specialised models may be simpler.
- When compute budget cannot handle additional LLM tokens from high FPS or long windows.
- If training data lacks synchronous audio-visual pairs, since synchronisation is central to the gains.
Failure Modes
- Hallucination increases with aggressive diversity loss (high λ).
- Missing synchronisation reduces co-reasoning and can drop ISQA/AVM performance sharply.
- Too-large windows hurt ASR monotonic alignment and increase deletion errors.
Core Entities
Models
- LoRA
- Vicuna-7B
- Vicuna-13B
- Whisper large-v2
- InstructBLIP (BLIP-2 + Q-Former)
- Video-LLaMA
Metrics
- WER
- SPIDEr
- CIDEr
- METEOR
- Accuracy
Datasets
- AVEB (this paper)
- LibriSpeech
- AudioCaps
- Flickr30k
- TextVQA
- GQA
- NExT-QA
- How2
- Ego4D
- VGGSS
- SpokenCOCO
- COCO
Benchmarks
- AVEB

