Fine-tune LLaMA2 with context and video descriptions to improve emotion recognition in conversations

October 17, 2023 · 6 min

Overview

Decision Snapshot: Ready For Pilot

The method is practical: it uses an open LLM base, LoRA, and public ERC datasets. Results are strong on standard benchmarks, but EmoryNLP scores remain modest, and the quality of the video descriptions can limit gains.

Citations: 9

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Yazhou Zhang, Mengyao Wang, Youxi Wu, Prayag Tiwari, Qiuchi Li, Benyou Wang, Jing Qin

Why It Matters For Business

Fine-tuning an open 7B LLM with emotion and context data gives SOTA emotion detection while staying cheap to train, enabling faster builds of emotion-aware agents and analytics.

Summary TLDR

The authors fine-tune an open-source LLaMA-family model (7B) on curated multimodal emotion dialogues and video descriptions to create DialogueLLM. The model, trained with LoRA, takes one or two previous utterances plus automatically generated video descriptions as prompt context. On three emotion-recognition-in-conversation (ERC) benchmarks (MELD, IEMOCAP, EmoryNLP) it reports state-of-the-art scores against 15 baselines and against LLMs without fine-tuning. Training is cheap to reproduce: DialogueLLM-7B can be trained with LoRA on a single 40GB A100 in about 5 hours.

Problem Statement

General LLMs lack task-specific emotional knowledge and rarely use video cues. That limits accuracy on emotion recognition in conversations (ERC). This paper asks whether fine-tuning an open LLM with context plus visual descriptions improves ERC and remains practical to reproduce.

Main Contribution

An emotion- and context-tuned LLM (DialogueLLM) built by instruction-finetuning LLaMA 2-7B on 2,411 multimodal dialogues (≈24.3K utterances).

Use of automatic video descriptions (ERNIE Bot) as supplementary knowledge in instruction prompts to inject visual cues without multimodal model retraining.
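A minimal sketch of how context and a video description might be combined into one instruction prompt. The template wording, the function name `build_erc_prompt`, and the example dialogue are illustrative assumptions, not the authors' exact prompt format. The label set is the seven MELD emotion classes.

```python
from typing import Optional, Sequence, Tuple

# The seven MELD emotion classes; swap in your own label set as needed.
MELD_LABELS = ("neutral", "joy", "surprise", "anger", "sadness", "disgust", "fear")


def build_erc_prompt(
    history: Sequence[Tuple[str, str]],
    target: Tuple[str, str],
    video_description: Optional[str] = None,
    max_context_turns: int = 2,
    labels: Sequence[str] = MELD_LABELS,
) -> str:
    """Assemble a prompt from recent turns, an optional video description,
    and the target utterance. Template wording is hypothetical."""
    if max_context_turns < 0:
        raise ValueError("max_context_turns must be non-negative")
    # Keep only the most recent turns (a slice of [-0:] would keep everything).
    context = list(history)[-max_context_turns:] if max_context_turns else []

    lines = [
        "Instruction: Classify the emotion of the target utterance.",
        "Options: " + ", ".join(labels),
    ]
    if context:
        lines.append("Context:")
        lines.extend(f"  {speaker}: {text}" for speaker, text in context)
    if video_description:
        lines.append("Video description: " + video_description.strip())
    speaker, text = target
    lines.append(f"Target utterance: {speaker}: {text}")
    lines.append("Emotion:")
    return "\n".join(lines)


# Made-up dialogue, for illustration only.
prompt = build_erc_prompt(
    history=[("A", "I lost my keys."), ("B", "Again?"), ("A", "Yes, again.")],
    target=("B", "You have got to be kidding me."),
    video_description="The speaker frowns and throws up their hands.",
)
print(prompt)
```

With `max_context_turns=2`, only the two most recent turns reach the prompt, and the video description is plain text, so no multimodal encoder is needed at this stage.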

Key Findings

DialogueLLM achieved state-of-the-art accuracy and F1 on three ERC benchmarks after emotion/context fine-tuning.

Numbers: MELD Acc 71.96%, F1 71.90; IEMOCAP Acc 70.62%, F1 69.93; EmoryNLP Acc 41.88%, F1 40.05

Practical Use: Fine-tuning an open 7B LLM with task-specific emotion/context data can outperform many specialized ERC models on standard benchmarks; try instruction-finetuning before building new architectures.

Evidence Ref: Table 3; Results section

Visual descriptions substantially contributed to MELD performance.

Numbers: MELD Acc drops from 71.91% to 60.80% when video descriptions are removed (≈11.1 pp loss)

Practical Use: Add automatic video-to-text descriptions to prompts when video is available; they can materially improve emotion labels without training a multimodal fusion model.

Evidence Ref: Table 4 (ablation); Ablation Test section

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 71.96% | SOTA baselines | improvement reported | MELD test | Table 3; Results section | Table 3
Weighted-F1 | 71.90% | SOTA baselines | improvement reported | MELD test | Table 3; Results section | Table 3

What To Try In 7 Days

Use LoRA to fine-tune an open 7B LLM on your labeled ERC data; expect to run on one 40GB A100 in hours.

Add automatic video-to-text descriptions to prompts when you have video to boost accuracy without multimodal retraining.

Limit prompt context to 1–2 prior turns; longer histories can add noise and cost.
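To see why LoRA makes the first step cheap, here is a rough parameter count. It uses the LLaMA-7B-class shape (hidden size 4096, 32 layers). The rank of 8 and the choice to adapt only the query and value projections are assumptions for illustration; the summary above does not state the authors' LoRA settings.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA adapter.

    The adapter adds two low-rank matrices: A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


# LLaMA-7B-class shape.
HIDDEN = 4096
LAYERS = 32

# Assumptions for illustration only (not taken from the paper).
RANK = 8
ADAPTED_MATRICES_PER_LAYER = 2  # e.g. query and value projections

per_matrix = lora_trainable_params(HIDDEN, HIDDEN, RANK)
total = per_matrix * ADAPTED_MATRICES_PER_LAYER * LAYERS
full_per_matrix = HIDDEN * HIDDEN

print(f"LoRA params per adapted matrix: {per_matrix:,}")  # 65,536
print(f"Total trainable LoRA params:    {total:,}")       # 4,194,304
print(f"Share of each adapted matrix:   {per_matrix / full_per_matrix:.2%}")
```

Under these assumptions only about 4.2M parameters receive gradients, which is consistent with training fitting on a single 40GB GPU.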

Optimization Features

Token Efficiency
Max context length set to 4096 tokens
Model Optimization
LoRA
System Optimization
LoRA
Training Optimization
AdamW optimizer with cosine LR schedule; SwiGLU activation; batch size 128; gradient clipping 1.0
Inference Optimization
Use short context (1–2 utterances) to reduce compute
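A plain-Python sketch of two of the training settings listed above: a cosine learning-rate schedule and gradient clipping at norm 1.0. The base learning rate, warmup length, and step count are hypothetical values, since the summary does not give them. A real training loop would use the framework's built-in scheduler and clipping utilities instead.

```python
import math
from typing import List


def cosine_lr(step: int, total_steps: int, base_lr: float,
              warmup_steps: int = 0, min_lr: float = 0.0) -> float:
    """Linear warmup (optional) followed by cosine decay from base_lr to min_lr."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads: List[float], max_norm: float = 1.0) -> List[float]:
    """Rescale gradients so their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


# Hypothetical settings for illustration.
BASE_LR = 1e-4
TOTAL_STEPS = 100

for step in (0, 50, 100):
    print(step, cosine_lr(step, TOTAL_STEPS, BASE_LR))

print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))  # norm 5.0 is scaled down to 1.0
```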

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Relies on automatically generated video descriptions; errors in descriptions can mislead predictions.

Does not use explicit speaker modeling; speaker-specific traits are unaddressed.

When Not To Use

When you lack video or reliable visual descriptions and cannot generate quality text from video.

If you need fine-grained speaker modeling or long-range conversational state across many turns.

Failure Modes

Model over-predicts 'neutral' on imbalanced datasets (neutral-heavy distributions).

Confuses closely related emotions (anger vs disgust; surprise vs excitement).
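Both failure modes can be checked on your own predictions with a few lines of code. The sketch below compares how often a label is predicted against how often it occurs in the gold labels, and lists the most frequent (gold, predicted) mistakes. The function names and the toy labels are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def label_share(labels: Sequence[str], label: str) -> float:
    """Fraction of items carrying the given label."""
    return sum(1 for x in labels if x == label) / len(labels) if labels else 0.0


def top_confusions(gold: Sequence[str], preds: Sequence[str],
                   k: int = 3) -> List[Tuple[Tuple[str, str], int]]:
    """Most frequent (gold, predicted) pairs among the errors."""
    if len(gold) != len(preds):
        raise ValueError("gold and preds must have the same length")
    errors = Counter((g, p) for g, p in zip(gold, preds) if g != p)
    return errors.most_common(k)


# Toy labels, for illustration only.
gold = ["anger", "anger", "disgust", "neutral", "joy"]
preds = ["disgust", "disgust", "anger", "neutral", "neutral"]

# A predicted share well above the gold share suggests over-prediction.
print("neutral gold share:", label_share(gold, "neutral"))
print("neutral pred share:", label_share(preds, "neutral"))
print(top_confusions(gold, preds))
```

If the predicted share of 'neutral' is far above its gold share, class re-weighting or resampling during fine-tuning is worth testing.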

Core Entities

Models

DialogueLLM-7B, LLaMA 2-7B, LLaMA-7B, Alpaca, GPT-4

Metrics

Accuracy, Weighted-F1
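For reference, a dependency-free sketch of the two metrics. Weighted-F1 here is the per-class F1 averaged with weights equal to each class's share of the gold labels, which is the usual definition. The paper's exact evaluation script is not shown in this summary, so treat this as an assumption about how the scores are computed.

```python
from collections import Counter
from typing import Sequence


def accuracy(gold: Sequence[str], preds: Sequence[str]) -> float:
    """Fraction of predictions that match the gold label."""
    if len(gold) != len(preds) or not gold:
        raise ValueError("gold and preds must be non-empty and equal in length")
    return sum(1 for g, p in zip(gold, preds) if g == p) / len(gold)


def weighted_f1(gold: Sequence[str], preds: Sequence[str]) -> float:
    """Per-class F1 averaged with weights proportional to gold support."""
    if len(gold) != len(preds) or not gold:
        raise ValueError("gold and preds must be non-empty and equal in length")
    support = Counter(gold)
    total = len(gold)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for g, p in zip(gold, preds) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, preds) if g != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * n / total
    return score


# Toy example: class "a" has F1 = 2/3, class "b" has F1 = 4/5.
gold = ["a", "a", "b", "b"]
preds = ["a", "b", "b", "b"]
print(accuracy(gold, preds))     # 0.75
print(weighted_f1(gold, preds))  # 0.5 * 2/3 + 0.5 * 4/5
```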

Datasets

MELD, IEMOCAP, EmoryNLP, SECEU

Benchmarks

Emotion Recognition in Conversations (ERC)