Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
9
Why It Matters For Business
Fine-tuning an open 7B LLM on emotion and context data yields state-of-the-art emotion recognition at low training cost (about 5 hours on one 40GB A100), which shortens the path to emotion-aware agents and analytics.
Summary TLDR
The authors fine-tune an open-source 7B LLaMA-family model on curated multimodal emotion dialogues and video descriptions to create DialogueLLM. The model, trained with LoRA, takes one or two previous utterances plus automatically generated video descriptions as prompt context. On three emotion-recognition-in-conversation (ERC) benchmarks (MELD, IEMOCAP, EmoryNLP) it reaches state-of-the-art scores against 15 baselines and against LLMs that were not fine-tuned. The training is reproducible: DialogueLLM-7B can be trained with LoRA on a single 40GB A100 in about 5 hours.
Problem Statement
General LLMs lack task-specific emotional knowledge and rarely use video cues. That limits accuracy on emotion recognition in conversations (ERC). This paper asks whether fine-tuning an open LLM with context plus visual descriptions improves ERC and remains practical to reproduce.
Main Contribution
An emotion- and context-tuned LLM (DialogueLLM) built by instruction fine-tuning LLaMA 2-7B on 2,411 multimodal dialogues (≈24.3K utterances).
Automatic video descriptions (generated with ERNIE Bot) used as supplementary knowledge in instruction prompts, injecting visual cues without retraining a multimodal model.
State-of-the-art results on three ERC benchmarks (MELD, IEMOCAP, EmoryNLP) while keeping training cheap via LoRA (about 5 hours on one 40GB A100).
Key Findings
DialogueLLM achieved state-of-the-art accuracy and F1 on three ERC benchmarks after emotion/context fine-tuning.
Visual descriptions substantially contributed to MELD performance.
LoRA low-rank tuning mattered because the instruction dataset is small.
Short context (1–2 previous utterances) helps; longer context adds noise and hurts performance.
Training cost is small and reproducible: about 5 hours on a single 40GB A100.
Results
[Results table: Accuracy and Weighted-F1 for each of the three ERC benchmarks; values not preserved]
Who Should Care
What To Try In 7 Days
Use LoRA to fine-tune an open 7B LLM on your labeled ERC data; expect training to take hours on one 40GB A100.
If you have video, add automatically generated video-to-text descriptions to the prompts to boost accuracy without multimodal retraining.
Limit prompt context to 1–2 prior turns; longer histories can add noise and cost.
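The prompt recipe in these three steps can be sketched as a small builder. This is a minimal illustration, not the authors' template: the function name `build_erc_prompt`, the instruction wording, and the `EMOTIONS` label list (MELD's seven classes) are all assumptions.

```python
from typing import List, Optional, Tuple

# Assumed label set (MELD's seven emotions); swap in your dataset's labels.
EMOTIONS = ["neutral", "joy", "surprise", "anger", "sadness", "disgust", "fear"]


def build_erc_prompt(
    history: List[Tuple[str, str]],
    target: Tuple[str, str],
    video_description: Optional[str] = None,
    max_context: int = 2,
) -> str:
    """Build an instruction prompt for one target utterance.

    history: earlier (speaker, utterance) pairs, oldest first.
    target: the (speaker, utterance) pair to classify.
    """
    # Keep only the last `max_context` turns; guard 0 because history[-0:]
    # would return the whole list.
    context = history[-max_context:] if max_context > 0 else []
    lines = [
        "Classify the emotion of the target utterance.",
        "Answer with one label from: " + ", ".join(EMOTIONS) + ".",
    ]
    # The video description is optional text-only visual context.
    if video_description:
        lines.append("Video description: " + video_description.strip())
    if context:
        lines.append("Context:")
        lines.extend(f"{speaker}: {text}" for speaker, text in context)
    lines.append("Target utterance:")
    lines.append(f"{target[0]}: {target[1]}")
    lines.append("Emotion:")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = build_erc_prompt(
        history=[("A", "You forgot again."), ("B", "I said I was sorry.")],
        target=("A", "Sorry is not enough."),
        video_description="A crosses her arms and looks away.",
    )
    print(demo)
```

Keeping the video description as plain text means the same builder works whether or not a clip is available; pass `None` to omit it.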
Optimization Features
Token Efficiency
- Max context length set to 4096 tokens
Model Optimization
- LoRA
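To make the LoRA saving concrete, the sketch below counts trainable parameters for a single weight matrix. It assumes a 4096 x 4096 projection (4096 is the hidden size of the 7B LLaMA models) and a rank of 8; the rank is an illustrative choice, not a setting reported in this summary, and `lora_param_counts` is a hypothetical helper name.

```python
from typing import Tuple


def lora_param_counts(d_out: int, d_in: int, rank: int) -> Tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    LoRA freezes the d_out x d_in weight W and trains two small factors,
    B (d_out x rank) and A (rank x d_in), so the learned update is B @ A.
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


if __name__ == "__main__":
    # Assumed values: hidden size 4096, illustrative rank 8.
    full, lora = lora_param_counts(4096, 4096, rank=8)
    print(f"full: {full:,}  lora: {lora:,}  fraction: {lora / full:.4%}")
```

Under these assumptions the adapter trains well under 1% of the parameters of each adapted matrix, which is what makes single-GPU training plausible.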
System Optimization
- LoRA
Training Optimization
- AdamW optimizer with cosine LR schedule
- SwiGLU activation (part of the LLaMA architecture rather than a training setting)
- Batch size 128, gradient clipping 1.0
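The schedule and clipping settings listed here can be sketched without any framework. The peak learning rate and step count below are placeholders, since this summary does not report them; only the cosine shape and the clipping threshold of 1.0 come from the list above.

```python
import math
from typing import List


def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Learning rate at `step` under cosine decay from peak_lr to min_lr."""
    progress = min(1.0, max(0.0, step / max(1, total_steps)))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads: List[float], max_norm: float = 1.0) -> List[float]:
    """Rescale gradients so their global L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


if __name__ == "__main__":
    # Placeholder hyperparameters, not values from the paper.
    for step in (0, 250, 500, 750, 1000):
        print(step, cosine_lr(step, total_steps=1000, peak_lr=3e-4))
    print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))
```

In a real training loop the optimizer (AdamW here) would consume the clipped gradients and the scheduled rate at every step; most frameworks provide both pieces built in.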
Inference Optimization
- Use short context (1–2 utterances) to reduce compute
Reproducibility
Data Urls
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Relies on automatically generated video descriptions; errors in descriptions can mislead predictions.
- Does not use explicit speaker modeling; speaker-specific traits are unaddressed.
- Gains are smaller on some datasets (EmoryNLP), and class imbalance biases predictions toward 'neutral'.
When Not To Use
- When you lack video or reliable visual descriptions and cannot generate quality text from video.
- If you need fine-grained speaker modeling or long-range conversational state across many turns.
- When absolute interpretability of multimodal fusion is required.
Failure Modes
- Model over-predicts 'neutral' on imbalanced datasets (neutral-heavy distributions).
- Confuses closely related emotions (anger vs disgust; surprise vs excitement).
- Adding many few-shot examples or very long context can reduce performance due to noise.
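The first two failure modes can be checked on your own predictions with a small diagnostic. The helpers below are hypothetical (they are not part of the paper's tooling) and assume gold and predicted labels are plain strings in matching order.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def prediction_bias(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """Map each gold label to (predicted count / gold count).

    Values above 1.0 mean the label is predicted more often than it occurs.
    """
    gold_counts = Counter(gold)
    pred_counts = Counter(pred)
    return {label: pred_counts[label] / n for label, n in gold_counts.items()}


def top_confusions(
    gold: Sequence[str], pred: Sequence[str], k: int = 3
) -> List[Tuple[Tuple[str, str], int]]:
    """Return the k most frequent (gold, predicted) pairs among the errors."""
    pairs = Counter((g, p) for g, p in zip(gold, pred) if g != p)
    return pairs.most_common(k)


if __name__ == "__main__":
    gold = ["neutral", "neutral", "joy", "anger", "disgust"]
    pred = ["neutral", "neutral", "neutral", "disgust", "anger"]
    print(prediction_bias(gold, pred))
    print(top_confusions(gold, pred))
```

A ratio well above 1.0 for 'neutral' points to the imbalance problem, while the confusion pairs show whether errors cluster on related emotions such as anger and disgust.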
Core Entities
Models
- DialogueLLM-7B
- LLaMA 2-7B
- LLaMA-7B
- Alpaca
- GPT-4
Metrics
- Accuracy
- Weighted-F1
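Both metrics are simple to compute by hand. The sketch below uses the standard definition of weighted F1 (per-class F1 averaged with weights equal to each class's share of the gold labels); it is a reference implementation for illustration, not the authors' evaluation script.

```python
from collections import Counter
from typing import Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of utterances whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def weighted_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Per-class F1 averaged with weights proportional to gold support."""
    support = Counter(gold)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(gold)


if __name__ == "__main__":
    gold = ["neutral", "neutral", "joy", "anger"]
    pred = ["neutral", "joy", "joy", "neutral"]
    print(accuracy(gold, pred), weighted_f1(gold, pred))
```

Weighted F1 is the more informative number on neutral-heavy data, because accuracy alone can look good for a model that mostly predicts the majority class.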
Datasets
- MELD
- IEMOCAP
- EmoryNLP
- SECEU
Benchmarks
- Emotion Recognition in Conversations (ERC)

