Fine-tune LLaMA2 with context and video descriptions to improve emotion recognition in conversations

Overview

Decision Snapshot: Ready For Pilot

The method is practical: it uses an open LLM base, LoRA, and public ERC datasets. Results are strong on standard benchmarks, but EmoryNLP scores remain modest and the quality of the video descriptions can limit gains.

Citations: 9

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Yazhou Zhang, Mengyao Wang, Youxi Wu, Prayag Tiwari, Qiuchi Li, Benyou Wang, Jing Qin

Links

Abstract / PDF / Data

Why It Matters For Business

Fine-tuning an open 7B LLM with emotion and context data gives SOTA emotion detection while staying cheap to train, enabling faster builds of emotion-aware agents and analytics.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO, Founder

Summary TLDR

The authors fine-tune open-source LLaMA-family models (7B) with curated multimodal emotion dialogues and video descriptions to create DialogueLLM. The model, trained with LoRA, integrates one or two previous utterances plus automatically generated video descriptions as prompt context. On three emotion-recognition-in-conversation (ERC) benchmarks (MELD, IEMOCAP, EmoryNLP) it reaches state-of-the-art scores versus 15 baselines and unfine-tuned LLMs. The training is reproducible: DialogueLLM-7B can be trained with LoRA on a single 40GB A100 in about 5 hours.

Problem Statement

General LLMs lack task-specific emotional knowledge and rarely use video cues. That limits accuracy on emotion recognition in conversations (ERC). This paper asks whether fine-tuning an open LLM with context plus visual descriptions improves ERC and remains practical to reproduce.

Main Contribution

An emotion- and context-tuned LLM (DialogueLLM) built by instruction-finetuning LLaMA 2-7B on 2,411 multimodal dialogues (≈24.3K utterances).

Use of automatic video descriptions (ERNIE Bot) as supplementary knowledge in instruction prompts to inject visual cues without multimodal model retraining.
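The prompt idea described above can be sketched as a small helper. This is a minimal illustration, not the authors' template: the section headers, the `Utterance` type, and the `build_erc_prompt` function are hypothetical names, and the label set shown is MELD's seven emotions.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

# MELD's seven emotion labels; replace with the label set of your dataset.
MELD_LABELS = ("neutral", "joy", "surprise", "anger", "sadness", "disgust", "fear")


@dataclass
class Utterance:
    speaker: str
    text: str


def build_erc_prompt(
    history: List[Utterance],
    target: Utterance,
    video_description: Optional[str] = None,
    max_context: int = 2,
    labels: Sequence[str] = MELD_LABELS,
) -> str:
    """Assemble an instruction prompt (hypothetical template, not the paper's)."""
    # Keep only the most recent turns; history[-0:] would return everything.
    context = history[-max_context:] if max_context > 0 else []
    lines = [
        "### Instruction:",
        "Classify the emotion of the target utterance. "
        f"Answer with one of: {', '.join(labels)}.",
        "",
    ]
    if context:
        lines.append("### Context:")
        lines += [f"{u.speaker}: {u.text}" for u in context]
        lines.append("")
    if video_description:
        # Visual cues enter as plain text, so no multimodal encoder is needed.
        lines += ["### Video description:", video_description.strip(), ""]
    lines += [
        "### Target utterance:",
        f"{target.speaker}: {target.text}",
        "",
        "### Emotion:",
    ]
    return "\n".join(lines)
```

Passing `video_description=None` produces the text-only prompt, which makes it easy to run the same kind of with/without comparison on your own data.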

Key Findings

DialogueLLM achieved state-of-the-art accuracy and F1 on three ERC benchmarks after emotion/context fine-tuning.

Numbers: MELD Acc 71.96%, F1 71.90; IEMOCAP Acc 70.62%, F1 69.93; EmoryNLP Acc 41.88%, F1 40.05

Practical Use: Fine-tuning an open 7B LLM with task-specific emotion/context data can outperform many specialized ERC models on standard benchmarks; try instruction-finetuning before building new architectures.

Evidence Ref: Table 3; Results section

Visual descriptions substantially contributed to MELD performance.

Numbers: MELD Acc drops from 71.91% to 60.80% when video descriptions are removed (≈11.1 pp loss)

Practical Use: Add automatic video-to-text descriptions to prompts when video is available; they can materially improve emotion labels without training a multimodal fusion model.

Evidence Ref: Table 4 (ablation); Ablation Test section
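To make the ablation figure concrete, the short sketch below separates the absolute drop in percentage points from the relative drop. The two accuracies are the ones quoted above; the function name `ablation_delta` is illustrative only.

```python
def ablation_delta(full_acc: float, ablated_acc: float) -> tuple:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    abs_pp = full_acc - ablated_acc
    rel_pct = abs_pp / full_acc * 100.0
    return abs_pp, rel_pct


# MELD accuracy with and without video descriptions, as reported above.
abs_pp, rel_pct = ablation_delta(71.91, 60.80)
print(f"absolute drop: {abs_pp:.2f} pp, relative drop: {rel_pct:.1f}%")
```

The absolute drop is the roughly 11.1 pp quoted above; relative to the full model's accuracy the loss is larger, around 15%.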

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	71.96%	SOTA baselines	improvement reported	MELD test	Table 3; Results section	Table 3
Weighted-F1	71.90%	SOTA baselines	improvement reported	MELD test	Table 3; Results section	Table 3

What To Try In 7 Days

Use LoRA to fine-tune an open 7B LLM on your labeled ERC data; expect to run on one 40GB A100 in hours.

Add automatic video-to-text descriptions to prompts when you have video to boost accuracy without multimodal retraining.

Limit prompt context to 1–2 prior turns; longer histories can add noise and cost.
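As a rough guide to why LoRA fits on a single GPU, the sketch below counts the adapter parameters it adds. LLaMA 2-7B has a hidden size of 4096 and 32 transformer layers; the rank and the choice of which projection matrices to adapt are assumptions here, since this summary does not state the values used in the paper.

```python
HIDDEN_SIZE = 4096  # LLaMA 2-7B hidden dimension
NUM_LAYERS = 32     # LLaMA 2-7B transformer blocks


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two low-rank factors, A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_trainable_params(
    rank: int,
    adapted_per_layer: int = 2,  # assumption: two square attention projections per layer
    hidden: int = HIDDEN_SIZE,
    layers: int = NUM_LAYERS,
) -> int:
    """Total adapter parameters when each adapted matrix is hidden x hidden."""
    return layers * adapted_per_layer * lora_params(hidden, hidden, rank)


for r in (8, 16, 64):  # illustrative ranks, not taken from the paper
    print(r, lora_trainable_params(r))
```

With rank 8 on two projections per layer this comes to 4,194,304 trainable parameters, a small fraction of a percent of the roughly 6.7 billion weights in the base model, which stay frozen.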

Optimization Features

Token Efficiency

Max context length set to 4096 tokens

Model Optimization

LoRA

System Optimization

LoRA

Training Optimization

AdamW optimizer with cosine LR schedule

SwiGLU activation

Batch size 128, gradient clipping 1.0
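The cosine schedule and gradient clipping listed above can be illustrated in plain Python. This is a framework-free sketch: the peak learning rate, warmup length, and step counts are placeholder assumptions, and only the clipping threshold of 1.0 comes from the summary.

```python
import math
from typing import List, Sequence


def cosine_lr(step: int, total_steps: int, peak_lr: float,
              warmup_steps: int = 0, min_lr: float = 0.0) -> float:
    """Linear warmup (optional) followed by cosine decay to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads: Sequence[float], max_norm: float = 1.0) -> List[float]:
    """Rescale the gradient vector so its L2 norm does not exceed max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)
    scale = max_norm / total
    return [g * scale for g in grads]


# Placeholder values for illustration only.
for s in (0, 250, 500, 750, 1000):
    print(s, cosine_lr(s, total_steps=1000, peak_lr=1e-4))
print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))
```

In a real training loop both pieces are normally provided by the framework's scheduler and clipping utilities rather than written by hand.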

Inference Optimization

Use short context (1–2 utterances) to reduce compute

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://github.com/declare-lab/MELD

https://sail.usc.edu/iemocap/

https://github.com/emorynlp

Risks & Boundaries

Limitations

Relies on automatically generated video descriptions; errors in descriptions can mislead predictions.

Does not use explicit speaker modeling; speaker-specific traits are unaddressed.

When Not To Use

When you lack video or reliable visual descriptions and cannot generate quality text from video.

If you need fine-grained speaker modeling or long-range conversational state across many turns.

Failure Modes

Model over-predicts 'neutral' on imbalanced datasets (neutral-heavy distributions).

Confuses closely related emotions (anger vs disgust; surprise vs excitement).
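Both failure modes can be checked on your own predictions with a few lines of code. The sketch below is an assumption-laden illustration: the helper names are hypothetical and the toy labels are invented for the example, not taken from the paper.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def overprediction_ratio(gold: Sequence[str], pred: Sequence[str],
                         label: str = "neutral") -> float:
    """Predicted count divided by gold count for one label; above 1 means over-predicted."""
    gold_n = sum(1 for g in gold if g == label)
    if gold_n == 0:
        raise ValueError(f"label {label!r} never occurs in gold")
    pred_n = sum(1 for p in pred if p == label)
    return pred_n / gold_n


def top_confusions(gold: Sequence[str], pred: Sequence[str],
                   n: int = 3) -> List[Tuple[Tuple[str, str], int]]:
    """Most frequent (gold, predicted) pairs among the errors."""
    errors = Counter((g, p) for g, p in zip(gold, pred) if g != p)
    return errors.most_common(n)


# Invented toy data for illustration.
gold = ["anger", "anger", "disgust", "neutral", "joy", "surprise"]
pred = ["disgust", "disgust", "anger", "neutral", "neutral", "neutral"]
print(overprediction_ratio(gold, pred))
print(top_confusions(gold, pred, n=2))
```

A ratio well above 1 for 'neutral', or a confusion list dominated by pairs such as anger/disgust, suggests the model shows the same weaknesses on your data.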

Core Entities

Models

DialogueLLM-7B, LLaMA 2-7B, LLaMA-7B, Alpaca, GPT-4

Metrics

Accuracy, Weighted-F1
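For readers unfamiliar with the second metric, here is a minimal sketch of weighted F1: the per-class F1 scores averaged with weights proportional to each class's share of the gold labels. The function name and toy data are illustrative; in practice a library implementation would normally be used.

```python
from collections import Counter
from typing import Sequence


def weighted_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Support-weighted average of per-class F1 scores."""
    support = Counter(gold)
    total = len(gold)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n / total
    return score


# Toy example: a model that always answers "neutral" on a neutral-heavy set.
gold = ["neutral", "neutral", "neutral", "joy"]
pred = ["neutral"] * 4
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(accuracy, weighted_f1(gold, pred))
```

In this toy case accuracy is 0.75 while weighted F1 is about 0.64, which is why reporting both is useful on imbalanced emotion datasets.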

Datasets

MELD, IEMOCAP, EmoryNLP, SECEU

Benchmarks

Emotion Recognition in Conversations (ERC)

