Fine-tune LLaMA2 with context and video descriptions to improve emotion recognition in conversations

October 17, 2023 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

9

Authors

Yazhou Zhang, Mengyao Wang, Youxi Wu, Prayag Tiwari, Qiuchi Li, Benyou Wang, Jing Qin

Why It Matters For Business

Fine-tuning an open 7B LLM with emotion and context data gives SOTA emotion detection while staying cheap to train, enabling faster builds of emotion-aware agents and analytics.

Summary TLDR

The authors fine-tune open-source LLaMA-family models (7B) on curated multimodal emotion dialogues and video descriptions to create DialogueLLM. The model is trained with LoRA and takes one or two previous utterances plus automatically generated video descriptions as prompt context. On three emotion-recognition-in-conversation (ERC) benchmarks (MELD, IEMOCAP, EmoryNLP) it reaches state-of-the-art scores against 15 baselines and against LLMs that were not fine-tuned. Training is reproducible: DialogueLLM-7B can be trained with LoRA on a single 40GB A100 in about 5 hours.

Problem Statement

General LLMs lack task-specific emotional knowledge and rarely use video cues. That limits accuracy on emotion recognition in conversations (ERC). This paper asks whether fine-tuning an open LLM with context plus visual descriptions improves ERC and remains practical to reproduce.

Main Contribution

An emotion- and context-tuned LLM (DialogueLLM) built by instruction fine-tuning LLaMA 2-7B on 2,411 multimodal dialogues (≈24.3K utterances).

Use of automatic video descriptions (ERNIE Bot) as supplementary knowledge in instruction prompts to inject visual cues without multimodal model retraining.

State-of-the-art results on three ERC benchmarks (MELD, IEMOCAP, EmoryNLP) while keeping training cheap via LoRA (about 5 hours on one 40GB A100).
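The prompt-side idea in the contributions above (a short dialogue context plus a text description of the video) can be sketched as follows. This is a minimal illustration, not the authors' template: the function name `build_erc_prompt`, the field labels, and the seven-class MELD label list are assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Assumed label set for illustration (the seven MELD emotion classes).
EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]


def build_erc_prompt(
    history: List[Tuple[str, str]],
    target: Tuple[str, str],
    video_description: Optional[str] = None,
    max_context: int = 2,
) -> str:
    """Assemble an instruction prompt for one target utterance.

    history: earlier (speaker, utterance) pairs, oldest first.
    target: the (speaker, utterance) pair to classify.
    video_description: optional text produced by a captioning model.
    max_context: number of previous utterances to keep.
    """
    # history[-0:] would return everything, so handle zero explicitly.
    context = history[-max_context:] if max_context > 0 else []
    lines = [
        "Instruction: Classify the emotion of the target utterance.",
        "Options: " + ", ".join(EMOTIONS),
    ]
    if context:
        lines.append("Context:")
        lines.extend(f"  {speaker}: {text}" for speaker, text in context)
    if video_description:
        lines.append("Video description: " + video_description)
    lines.append(f"Target: {target[0]}: {target[1]}")
    lines.append("Emotion:")
    return "\n".join(lines)


if __name__ == "__main__":
    dialogue = [("A", "I lost the keys."), ("B", "Again?"), ("A", "It was not my fault.")]
    print(build_erc_prompt(dialogue, ("B", "Sure it wasn't."), "B rolls their eyes."))
```

Because the visual signal arrives as plain text, the same text-only model and tokenizer can consume it, which is what avoids multimodal retraining.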

Key Findings

DialogueLLM achieved state-of-the-art accuracy and F1 on three ERC benchmarks after emotion/context fine-tuning.

Numbers: MELD Acc 71.96%, F1 71.90%; IEMOCAP Acc 70.62%, F1 69.93%; EmoryNLP Acc 41.88%, F1 40.05%

Visual descriptions substantially contributed to MELD performance.

Numbers: MELD Acc drops from 71.91% to 60.80% when video descriptions are removed (≈11.1 pp loss)

LoRA low-rank tuning mattered for the small instruction dataset.

Numbers: MELD Acc falls from 71.91% to 66.17% without LoRA (≈5.7 pp loss)

Short context (1–2 previous utterances) helps; longer context adds noise and hurts performance.

Numbers: Small improvement up to context=2; performance decreases for context ≥3 (see Fig. 9)

Training cost is small and reproducible on a single data-center GPU.

Numbers: DialogueLLM-7B fine-tuned via LoRA on one 40GB A100 in ~5 hours
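The ablation figures above can be checked with a few lines of arithmetic. The sketch below uses only the MELD accuracies quoted in the ablation lines (71.91% for the full model); the helper names are illustrative.

```python
# MELD accuracies (in %) as quoted in the ablation findings above.
FULL_MODEL_ACC = 71.91
ABLATIONS = {
    "without video descriptions": 60.80,
    "without LoRA": 66.17,
}


def pp_drop(full: float, ablated: float) -> float:
    """Absolute drop in percentage points."""
    return round(full - ablated, 2)


def relative_drop(full: float, ablated: float) -> float:
    """Drop as a percentage of the full model's accuracy."""
    return round(100.0 * (full - ablated) / full, 2)


if __name__ == "__main__":
    for name, acc in ABLATIONS.items():
        print(name, pp_drop(FULL_MODEL_ACC, acc), "pp,",
              relative_drop(FULL_MODEL_ACC, acc), "% relative")
```

Percentage points measure the absolute gap; the relative figure shows how much of the full model's accuracy each component accounts for.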

Results

MELD

  • Accuracy: 71.96% (baseline: SOTA baselines)
  • Weighted-F1: 71.90% (baseline: SOTA baselines)

IEMOCAP

  • Accuracy: 70.62% (baseline: SOTA baselines)
  • Weighted-F1: 69.93% (baseline: SOTA baselines)

EmoryNLP

  • Accuracy: 41.88% (baseline: SOTA baselines)
  • Weighted-F1: 40.05% (baseline: SOTA baselines)

Who Should Care

What To Try In 7 Days

Use LoRA to fine-tune an open 7B LLM on your labeled ERC data; expect to run on one 40GB A100 in hours.

Add automatic video-to-text descriptions to prompts when you have video to boost accuracy without multimodal retraining.

Limit prompt context to 1–2 prior turns; longer histories can add noise and cost.
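A rough way to see why the LoRA step is cheap: instead of updating a full d_out × d_in weight matrix, LoRA trains two low-rank factors of shapes d_out × r and r × d_in. The sketch below counts trainable parameters for one projection matrix. The hidden size 4096 matches LLaMA-7B; the rank of 8 is an assumption for illustration, since this summary does not state the rank used.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters updated when fine-tuning the whole weight matrix."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    hidden = 4096   # LLaMA-7B hidden size
    rank = 8        # assumed rank, not taken from the paper
    full = full_params(hidden, hidden)
    lora = lora_trainable_params(hidden, hidden, rank)
    print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

With these numbers each adapted matrix trains 65,536 parameters instead of 16,777,216, which is why optimizer state and gradients fit comfortably on one GPU.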

Optimization Features

Token Efficiency

  • Max context length set to 4096 tokens

Model Optimization

  • LoRA

System Optimization

  • LoRA

Training Optimization

  • AdamW optimizer with cosine LR schedule
  • SwiGLU activation
  • Batch size 128, gradient clipping 1.0
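The schedule and clipping settings listed above can be written out in plain Python to show what they compute. This is a generic sketch of a cosine learning-rate schedule and global-norm gradient clipping, not the authors' training script; the peak learning rate, warmup length, and step count in the usage example are placeholders, as the summary does not give them.

```python
import math
from typing import List


def cosine_lr(step: int, total_steps: int, peak_lr: float,
              warmup_steps: int = 0, min_lr: float = 0.0) -> float:
    """Linear warmup (optional) followed by cosine decay to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads: List[float], max_norm: float = 1.0) -> List[float]:
    """Rescale the gradient vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


if __name__ == "__main__":
    # Placeholder values: 1e-4 peak LR and 100 steps are assumptions.
    for step in (0, 25, 50, 75, 100):
        print(step, cosine_lr(step, total_steps=100, peak_lr=1e-4))
    print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))
```

In a real run these would normally come from the training framework's built-in scheduler and clipping utilities; the functions here only make the formulas explicit.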

Inference Optimization

  • Use short context (1–2 utterances) to reduce compute

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on automatically generated video descriptions; errors in descriptions can mislead predictions.
  • Does not use explicit speaker modeling; speaker-specific traits are unaddressed.
  • Gains are smaller on some datasets (EmoryNLP), and class imbalance biases predictions toward neutral.

When Not To Use

  • When you lack video or reliable visual descriptions and cannot generate quality text from video.
  • If you need fine-grained speaker modeling or long-range conversational state across many turns.
  • When absolute interpretability of multimodal fusion is required.

Failure Modes

  • Model over-predicts 'neutral' on imbalanced datasets (neutral-heavy distributions).
  • Confuses closely related emotions (anger vs disgust; surprise vs excitement).
  • Adding many few-shot examples or very long context can reduce performance due to noise.

Core Entities

Models

  • DialogueLLM-7B
  • LLaMA 2-7B
  • LLaMA-7B
  • Alpaca
  • GPT-4

Metrics

  • Accuracy
  • Weighted-F1
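For reference, the two metrics can be computed as below. This is a small pure-Python sketch of the standard definitions (weighted-F1 is the per-class F1 averaged with weights proportional to each class's share of the true labels); the function names and the toy labels are illustrative.

```python
from collections import Counter
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of utterances whose predicted label matches the gold label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def weighted_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Per-class F1 averaged with weights equal to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n / total
    return score


if __name__ == "__main__":
    gold = ["neutral", "neutral", "neutral", "joy"]
    pred = ["neutral", "neutral", "neutral", "neutral"]
    print(accuracy(gold, pred), weighted_f1(gold, pred))
```

In the toy example a model that always answers "neutral" scores 0.75 accuracy but a lower weighted-F1, because the missed "joy" class contributes an F1 of zero. That gap is why both numbers are reported on neutral-heavy datasets.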

Datasets

  • MELD
  • IEMOCAP
  • EmoryNLP
  • SECEU

Benchmarks

  • Emotion Recognition in Conversations (ERC)