A multimodal preference-tuning recipe (AVEm-DPO) that cuts emotion-related hallucinations and spurious cue links in audiovisual LLMs

February 4, 2026 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Ashutosh Chaubey, Jiacheng Pang, Maksim Siniukov, Mohammad Soleymani

Links

Abstract / PDF

Why It Matters For Business

If your product interprets emotion from audio+video, fine-tuning with AVEm-DPO improves correctness and cuts hallucinated justifications, making downstream outputs more trustworthy in user-facing interfaces.

Summary TLDR

The paper introduces EmoReAlM, a 4,000-question benchmark to test audiovisual emotion reasoning and hallucination, and AVEm-DPO, a multimodal direct preference optimization method that (1) enforces prompt-conditioned modality grounding and (2) penalizes text-only priors. On EmoReAlM and standard emotion datasets (DFEW, RAVDESS, EMER), AVEm-DPO improves zero-shot reasoning and reduces hallucinated or spuriously linked cues versus baseline MLLMs. The method trains with ~41k auto-generated preference pairs and keeps language-model capabilities via lightweight LoRA tuning.

Problem Statement

Multimodal large language models (MLLMs) make two key errors in emotion understanding: they (1) ground emotions in irrelevant audiovisual cues (spurious associations) and (2) invent audiovisual cues to justify emotions (hallucinations). Existing benchmarks lack focused tests for these problems, and naive preference tuning can ignore the multimodal inputs and overfit to text prompts.

Main Contribution

EmoReAlM: a human-verified benchmark of 4,000 multiple-choice questions testing audiovisual emotion reasoning, modality agreement, spurious cue associations and hallucination stress tests.

AVEm-DPO: a multimodal direct preference optimization method that (i) builds prompt-based modality preferences to force grounding in the relevant modality and (ii) adds text-prior debiasing to penalize responses explainable from text alone.

An empirical study showing AVEm-DPO reduces hallucinations and spurious associations and improves zero-shot emotion reasoning on EmoReAlM, DFEW, RAVDESS and EMER.
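
The two ingredients of AVEm-DPO described above can be illustrated with a minimal sketch. It is not the authors' objective: the standard DPO loss is well known, but the way the text-only term is formed here (preferring a response scored with audiovisual input over the same response scored from the text prompt alone), along with the names `text_prior_term`, `combined_loss`, and `lam`, are assumptions made for illustration. Inputs are scalar sequence log-probabilities so the example runs with the standard library only.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities
    under the policy (pol_*) and the frozen reference model (ref_*)."""
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def text_prior_term(pol_av, pol_text_only, ref_av, ref_text_only, beta=0.1):
    """ASSUMED debiasing term (not from the paper): the chosen response
    should gain more likelihood with audio+video than from text alone."""
    return dpo_loss(pol_av, pol_text_only, ref_av, ref_text_only, beta)


def combined_loss(resp, prior, lam=1.0):
    """Response-level preference plus lam times the assumed text-prior term."""
    return dpo_loss(**resp) + lam * text_prior_term(**prior)


# Made-up log-probabilities, for illustration only.
resp = dict(pol_chosen=-12.0, pol_rejected=-15.0,
            ref_chosen=-13.0, ref_rejected=-14.0)
prior = dict(pol_av=-12.0, pol_text_only=-12.5,
             ref_av=-13.0, ref_text_only=-13.0)
loss = combined_loss(resp, prior)
```

When the policy and reference agree on every quantity, each term equals log 2; the loss falls below that as the policy separates chosen from rejected responses and audiovisual from text-only conditioning.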

Key Findings

AVEm-DPO gives large zero-shot gains on the EmoReAlM benchmark.

Numbers: Audio accuracy 69.2% -> 77.9%; visual accuracy 85.3% -> 92.5%

AVEm-DPO drastically reduces hallucinated and spurious cue errors in stress tests.

Numbers: Stress-test audio F1 50.3% -> 80.9%; visual F1 59.9% -> 94.6%

Text-prior debiasing (TPD) is essential to stop hallucinations.

Numbers: Hallucination metric falls from 97.6 (full AVEm-DPO) to 77.8 without TPD

Preference training data scale and quality: ~41k auto-generated pairs and human spot checks.

Numbers: 41,687 preference samples; human check (n=1000) majority-correct rates: chosen 91.2%, video-relevant 89.5%, emotion-rev

AVEm-DPO improves zero-shot emotion recognition on standard datasets.

Numbers: DFEW UAR 56.78% -> 58.54%; RAVDESS UAR 53.59% -> 58.66%
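
The gains listed above are easier to compare as absolute point differences. The short calculation below uses only the figures reported in this section; the dictionary keys are labels chosen here for readability.

```python
# (baseline, AVEm-DPO) pairs copied from the Key Findings above, in percent.
reported = {
    "EmoReAlM audio accuracy": (69.2, 77.9),
    "EmoReAlM visual accuracy": (85.3, 92.5),
    "Stress-test audio F1": (50.3, 80.9),
    "Stress-test visual F1": (59.9, 94.6),
    "DFEW UAR": (56.78, 58.54),
    "RAVDESS UAR": (53.59, 58.66),
}


def absolute_gain(baseline: float, tuned: float) -> float:
    """Difference in percentage points, rounded to two decimals."""
    return round(tuned - baseline, 2)


gains = {name: absolute_gain(b, t) for name, (b, t) in reported.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

The stress-test F1 scores move by roughly 30 to 35 points, while the two standard recognition benchmarks move by about 2 to 5 points.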

Results

EmoReAlM Accuracy (audio)

Value: 77.9%

Baseline: Our base 69.2%

EmoReAlM Accuracy (visual)

Value: 92.5%

Baseline: Our base 85.3%

EmoReAlM Modality Agreement (F1)

Value: 60.0

Baseline: Our base 34.6

Stress-Test (audio F1)

Value: 80.9

Baseline: Our base 50.3

DFEW Unweighted Average Recall (UAR)

Value: 58.54%

Baseline: Our base 56.78%

Who Should Care

What To Try In 7 Days

Run the EmoReAlM benchmark on your multimodal model to measure spurious cues and hallucination rates.

Generate ~10k preference pairs with an LLM, validate a small subset, and apply DPO-style tuning with a LoRA adapter.

Add a text-only penalty during preference tuning to reduce language-driven hallucinations.
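
The second step above (generate pairs, then validate a small subset) could be organized along these lines. This is a sketch under assumptions: the `PreferencePair` fields, the `sample_for_review` helper, and the strict-majority rule are hypothetical names and choices for illustration, not the paper's data format or annotation protocol.

```python
import random
from dataclasses import dataclass


@dataclass
class PreferencePair:
    # Hypothetical schema; the source does not specify field names.
    video_id: str
    prompt: str
    chosen: str
    rejected: str


def sample_for_review(pairs, n, seed=0):
    """Draw a reproducible random subset for human spot checks."""
    rng = random.Random(seed)
    return rng.sample(pairs, min(n, len(pairs)))


def majority_correct_rate(votes_per_item):
    """Fraction of items that a strict majority of annotators marked correct.

    votes_per_item: list of lists of booleans, one inner list per item.
    """
    if not votes_per_item:
        return 0.0
    passed = sum(1 for votes in votes_per_item if sum(votes) * 2 > len(votes))
    return passed / len(votes_per_item)


# Placeholder pairs standing in for LLM-generated data.
pairs = [
    PreferencePair(f"clip_{i}", "What emotion is expressed, and why?",
                   "chosen answer", "rejected answer")
    for i in range(100)
]
review_set = sample_for_review(pairs, n=10)
rate = majority_correct_rate([[True, True, False], [False, False, True]])
```

Fixing the seed keeps the review subset stable across runs, so annotator disagreements can be traced back to the same items.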

Agent Features

Tool Use

  • LLMs (GPT-4o, Gemini-2.5) for data creation and evaluation

Frameworks

  • Direct Preference Optimization (DPO)
  • Bradley-Terry preference model
  • LoRA
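
For reference, the first two frameworks listed above are connected by a standard derivation, reproduced here in its generic textbook form. The notation (x for the input, y_w and y_l for the preferred and rejected responses, r for a reward, β for the temperature) is conventional and is not taken from the paper; in the multimodal setting x would bundle the audio, video, and text prompt.

```latex
% Bradley-Terry: probability that response y_w is preferred over y_l for input x
P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}

% DPO uses the implicit reward
%   r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
% (up to a term that depends only on x and cancels in the difference)
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x, y_w, y_l)}\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)\right]
```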

Optimization Features

Training Optimization

  • Direct preference optimization with multimodal and response-level preferences
  • LoRA

Reproducibility

Data Urls

  • EmoReAlM will be released at https://avere-iclr.github.io; underlying videos must be obtained from original dataset sources

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • EmoReAlM inherits cultural and source biases from DFEW (authors note).
  • Benchmark and experiments use short clips (~2–10s); long-form video emotion reasoning is untested.
  • Model still struggles with some classes (e.g., disgust) due to limited samples.
  • Preference data are automatically generated; full manual verification is partial (spot checks only).

When Not To Use

  • Do not deploy in sensitive real-world contexts (healthcare, hiring, law enforcement) without domain validation.
  • Avoid relying on this approach for long-form video emotion inference without further testing.

Failure Modes

  • Residual hallucinations when text priors are strong or preference data is noisy.
  • Spurious audio cue associations persist in some edge cases.
  • Performance depends on quality of auto-generated preference pairs and LLM annotators.
  • Class imbalance or scarce emotion labels (e.g., disgust) can cause poor recall for some emotions.

Core Entities

Models

  • AVEm-DPO
  • DPO
  • Vista-DPO
  • EmotionLLaMA
  • EmotionLLaMA⋆
  • Our base model
  • Qwen-2.5 Omni
  • VITA-1.5
  • VideoLLaMA2

Metrics

  • Accuracy
  • F1
  • Unweighted Average Recall (UAR)
  • Weighted Average Recall (WAR)
  • Precision
  • Recall
  • Spurious cue score
  • Hallucination score
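
Two of the metrics above are easy to confuse. Under the usual definitions in emotion recognition, UAR is the mean of per-class recalls (every class counts equally), while WAR weights each class recall by its support, which makes it equal to overall accuracy. The sketch below follows those definitions; the function name and toy labels are illustrative and not taken from the paper's evaluation code.

```python
from collections import Counter


def uar_and_war(y_true, y_pred):
    """Return (UAR, WAR) for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    support = Counter(y_true)  # number of true samples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = {label: hits[label] / count for label, count in support.items()}
    uar = sum(recalls.values()) / len(recalls)
    war = sum(recalls[label] * support[label] for label in support) / len(y_true)
    return uar, war


# Toy imbalanced example: the rare class is always missed.
y_true = ["happy", "happy", "happy", "disgust"]
y_pred = ["happy", "happy", "happy", "happy"]
uar, war = uar_and_war(y_true, y_pred)
```

Here WAR is 0.75 while UAR is 0.5, which is why UAR is the more informative number when classes such as disgust are scarce.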

Datasets

  • EmoReAlM
  • DFEW
  • RAVDESS
  • EMER
  • MAFW
  • MER2025
  • MER2023

Benchmarks

  • EmoReAlM
  • AVHBench (mentioned)