Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
If your product interprets emotion from audio and video, fine-tuning with AVEm-DPO improves correctness and cuts hallucinated justifications, making downstream outputs more trustworthy in user-facing interfaces.
Summary TLDR
The paper introduces EmoReAlM, a 4,000-question benchmark to test audiovisual emotion reasoning and hallucination, and AVEm-DPO, a multimodal direct preference optimization method that (1) enforces prompt-conditioned modality grounding and (2) penalizes text-only priors. On EmoReAlM and standard emotion datasets (DFEW, RAVDESS, EMER), AVEm-DPO improves zero-shot reasoning and reduces hallucinated or spuriously linked cues versus baseline MLLMs. The method trains with ~41k auto-generated preference pairs and keeps language-model capabilities via lightweight LoRA tuning.
Problem Statement
Multimodal large language models (MLLMs) make two key errors in emotion understanding: they (1) ground emotions in irrelevant audiovisual cues (spurious associations) and (2) invent audiovisual cues to justify emotions (hallucinations). Existing benchmarks lack focused tests for these problems, and naive preference tuning can ignore multimodal inputs and overfit to text prompts.
Main Contribution
EmoReAlM: a human-verified benchmark of 4,000 multiple-choice questions testing audiovisual emotion reasoning, modality agreement, spurious cue associations and hallucination stress tests.
AVEm-DPO: a multimodal direct preference optimization method that (i) builds prompt-based modality preferences to force grounding in the relevant modality and (ii) adds text-prior debiasing to penalize responses explainable from text alone.
An empirical study showing AVEm-DPO reduces hallucinations and spurious associations and improves zero-shot emotion reasoning on EmoReAlM, DFEW, RAVDESS and EMER.
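The two ingredients of AVEm-DPO described above can be illustrated with a minimal sketch. The dpo_loss function is the standard DPO objective for one preference pair. The text_prior_penalty function is NOT the paper's formula: it is a hypothetical hinge term, assumed here only to show one way a response that is just as likely without audio or video could be penalized. All function and parameter names are illustrative.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities of the chosen (w) and rejected (l)
    responses under the trainable policy and the frozen reference model.
    """
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return -log_sigmoid(margin)


def text_prior_penalty(logp_full: float, logp_text_only: float,
                       weight: float = 1.0) -> float:
    """Hypothetical text-prior term (an assumption, not the paper's loss).

    Positive when the chosen response is more likely with the audio and
    video removed than with them present, i.e. explainable from text alone.
    """
    return weight * max(0.0, logp_text_only - logp_full)


# Total loss for one pair under this sketch's assumptions.
total = dpo_loss(-1.0, -3.0, -2.0, -2.0) + text_prior_penalty(-5.0, -4.0)
```

In a real setup the log-probabilities would come from the model with and without the audiovisual inputs; plain floats are used here so the arithmetic is easy to follow.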
Key Findings
AVEm-DPO gives large zero-shot gains on the EmoReAlM benchmark.
AVEm-DPO drastically reduces hallucinated and spurious cue errors in stress tests.
Text-prior debiasing (TPD) is essential to stop hallucinations.
Preference training uses ~41k auto-generated pairs, with quality checked by human spot checks.
AVEm-DPO improves zero-shot emotion recognition on standard datasets (DFEW, RAVDESS, EMER).
Results
Accuracy
Accuracy
EmoReAlM Modality Agreement (F1)
Stress-Test (audio F1)
DFEW Unweighted Average Recall (UAR)
Who Should Care
What To Try In 7 Days
Run the EmoReAlM benchmark on your multimodal model to measure spurious cues and hallucination rates.
Generate ~10k preference pairs with an LLM, validate a small subset, and apply DPO-style tuning with a LoRA adapter.
Add a text-only penalty during preference tuning to reduce language-driven hallucinations.
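For the second step above, a small sketch of how preference pairs might be stored and how a reproducible subset could be drawn for manual validation. The PreferencePair fields, the spot_check_sample helper, and the 10% review fraction are all assumptions made for illustration; the source does not specify a record format or a sampling rate.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical record for one auto-generated preference pair."""
    clip_id: str
    prompt: str
    target_modality: str  # "audio" or "visual": the modality the prompt asks about
    chosen: str           # response grounded in the target modality
    rejected: str         # response with a spurious or hallucinated cue


def spot_check_sample(pairs, fraction=0.1, seed=0):
    """Return a reproducible random subset of pairs for human review."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not pairs:
        return []
    k = max(1, round(len(pairs) * fraction))
    return random.Random(seed).sample(pairs, k)


pairs = [
    PreferencePair(f"clip{i}", "What does the speaker's tone suggest?",
                   "audio", "grounded answer", "hallucinated answer")
    for i in range(50)
]
to_review = spot_check_sample(pairs, fraction=0.1, seed=1)
```

Fixing the seed keeps the reviewed subset stable across runs, so reviewer notes can be tied back to the same pairs.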
Agent Features
Tool Use
- LLMs (GPT-4o, Gemini-2.5) for data creation and evaluation
Frameworks
- Direct Preference Optimization (DPO)
- Bradley-Terry preference model
- LoRA
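For reference, the link between the Bradley-Terry model and DPO listed above can be written out in its standard text-only form. Here x is the input, y_w and y_l are the preferred and rejected responses, r is a reward function, and Z(x) is a normalizer. This is the general formulation, not the AVEm-DPO-specific objective, which the summary does not spell out.

```latex
% Bradley-Terry: probability that y_w is preferred over y_l
P(y_w \succ y_l \mid x) = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}

% DPO's implicit reward, defined by the policy and a frozen reference model
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Substituting: the \beta \log Z(x) terms cancel in the difference
P(y_w \succ y_l \mid x) = \sigma\!\left(
  \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)
```

Because Z(x) cancels, the preference likelihood can be maximized directly on the policy without training a separate reward model.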
Optimization Features
Training Optimization
- Direct preference optimization with multimodal and response-level preferences
- LoRA
Reproducibility
Code Urls
Data Urls
- EmoReAlM will be released at https://avere-iclr.github.io; underlying videos must be obtained from original dataset sources
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- EmoReAlM inherits cultural and source biases from DFEW (authors note).
- Benchmark and experiments use short clips (~2–10s); long-form video emotion reasoning is untested.
- Model still struggles with some classes (e.g., disgust) due to limited samples.
- Preference data are automatically generated; manual verification is partial (spot checks only).
When Not To Use
- Do not deploy in sensitive real-world contexts (healthcare, hiring, law enforcement) without domain validation.
- Avoid relying on this approach for long-form video emotion inference without further testing.
Failure Modes
- Residual hallucinations when text priors are strong or preference data is noisy.
- Spurious audio cue associations persist in some edge cases.
- Performance depends on quality of auto-generated preference pairs and LLM annotators.
- Class imbalance or scarce emotion labels (e.g., disgust) can cause poor recall for some emotions.
Core Entities
Models
- AVEm-DPO
- DPO
- Vista-DPO
- EmotionLLaMA
- EmotionLLaMA⋆
- Our base model
- Qwen-2.5 Omni
- VITA-1.5
- VideoLLaMA2
Metrics
- Accuracy
- F1
- Unweighted Average Recall (UAR)
- Weighted Average Recall (WAR)
- Precision
- Recall
- Spurious cue score
- Hallucination score
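Two of the metrics above are easy to confuse, so here is a minimal sketch of how UAR and WAR are usually computed in emotion recognition work: UAR is the plain mean of per-class recall, while WAR weights each class recall by its support and therefore equals overall accuracy. The function name and the toy labels are illustrative, not taken from the paper.

```python
from collections import Counter


def uar_war(y_true, y_pred):
    """Return (UAR, WAR) for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    support = Counter(y_true)  # number of samples per true class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = {c: correct[c] / n for c, n in support.items()}
    uar = sum(recalls.values()) / len(recalls)   # every class counts equally
    war = sum(correct.values()) / len(y_true)    # support-weighted recall
    return uar, war


# Toy imbalanced case: a model that always predicts the majority class.
y_true = ["happy"] * 8 + ["disgust"] * 2
y_pred = ["happy"] * 10
uar, war = uar_war(y_true, y_pred)
```

In this toy case WAR is 0.8 while UAR is only 0.5, which is why UAR is the more informative number when rare classes such as disgust are underrepresented.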
Datasets
- EmoReAlM
- DFEW
- RAVDESS
- EMER
- MAFW
- MER2025
- MER2023
Benchmarks
- EmoReAlM
- AVHBench (mentioned)

