Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
55
Why It Matters For Business
Off-the-shelf LLMs can replace expensive labeling for basic sentiment tasks and speed up pilot projects, but structured extraction and safety-sensitive detection still need specialist models or human review.
Summary TLDR
This paper runs a broad, practical evaluation of LLMs (Flan-T5, Flan-UL2, text-davinci-003, ChatGPT) against a fine-tuned T5 (770M) across 13 sentiment tasks on 26 datasets. Findings: in the zero-shot setting, LLMs match or nearly match fine-tuned models on simple sentiment classification but fail at structured aspect-level extraction; in few-shot settings, they strongly beat small models. The authors introduce SENTIEVAL, a unified, prompt-robust benchmark, and show that prompt design, format compliance, and context-length limits remain key blockers.
Problem Statement
Can current large language models reliably solve the full range of sentiment-analysis problems, from basic polarity classification to aspect-level extraction and nuanced subjective analysis, and how do they compare with smaller, task-trained models in zero-shot and few-shot settings?
Main Contribution
Systematic evaluation of LLMs on 13 sentiment tasks across 26 datasets, covering sentiment classification, ABSA, and multifaceted subjective tasks.
Empirical finding that LLMs perform well zero-shot on simple classification but underperform on structured, fine-grained tasks.
Demonstration that LLMs outperform small fine-tuned models in few-shot in-context learning when labeled data is scarce.
Identification of practical evaluation issues: prompt sensitivity, format compliance, and context-length limits.
Release of SENTIEVAL: a benchmark with diverse prompts and mixed few-shot/zero-shot queries to produce more robust comparisons.
Key Findings
LLMs match fine-tuned small models on simple sentiment classification in the zero-shot setting.
LLMs lag on fine-grained aspect-based sentiment extraction without task-specific training.
LLMs outperform small models in few-shot settings when labeled data is scarce.
Prompt design and output format strongly affect results, especially for structured tasks.
ChatGPT shows weaker performance on hate/irony/offensive detection compared to other LLMs.
A unified, prompt-diverse benchmark (SENTIEVAL) reveals persistent gaps on complex tasks.
Results
SC average (zero-shot)
Yelp-5 (document-level, zero-shot)
Comparative opinions (CS19, zero-shot)
ABSA average (zero-shot micro-F1)
SENTIEVAL overall (exact match)
Who Should Care
What To Try In 7 Days
Run a quick zero-shot pilot: apply ChatGPT or Flan-UL2 to your binary sentiment labels and compare to existing classifiers.
If you have <100 labels per class, test LLM few-shot prompts before training a specialist model.
For aspect extraction, check LLM outputs against a small labeled validation set, or route them through human review, before trusting automation.
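The first two pilot steps can be sketched as follows. This is a minimal illustration, not the paper's setup: the prompt wording, the `build_prompt`, `normalize`, and `pilot_accuracy` helpers, and the `fake_llm` stand-in are all hypothetical. In a real pilot, `fake_llm` would be replaced by a call to whichever model you are testing, and the resulting accuracy compared with your existing classifier on the same labeled examples.

```python
from typing import Callable, Sequence

LABELS = ("positive", "negative")


def build_prompt(text: str, demos: Sequence[tuple[str, str]] = ()) -> str:
    """Compose a zero-shot prompt (no demos) or a few-shot prompt (with demos)."""
    blocks = [f"Classify the sentiment of each review as {' or '.join(LABELS)}."]
    for demo_text, demo_label in demos:
        blocks.append(f"Review: {demo_text}\nSentiment: {demo_label}")
    blocks.append(f"Review: {text}\nSentiment:")
    return "\n\n".join(blocks)


def normalize(raw: str) -> str:
    """Map a free-form reply onto a known label; anything else is 'invalid'."""
    reply = raw.strip().lower()
    for label in LABELS:
        if reply.startswith(label):
            return label
    return "invalid"


def pilot_accuracy(
    texts: Sequence[str],
    gold: Sequence[str],
    llm: Callable[[str], str],
    demos: Sequence[tuple[str, str]] = (),
) -> float:
    """Accuracy of the model's normalized replies against gold labels."""
    preds = [normalize(llm(build_prompt(t, demos))) for t in texts]
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; keys on one word only."""
    last_review = prompt.rsplit("Review: ", 1)[1]
    return " Positive." if "great" in last_review else "negative"


texts = ["great food", "slow service", "great staff, terrible food"]
gold = ["positive", "negative", "negative"]
demos = [("Loved it.", "positive"), ("Never again.", "negative")]

zero_shot = pilot_accuracy(texts, gold, fake_llm)
few_shot = pilot_accuracy(texts, gold, fake_llm, demos)
print(f"zero-shot accuracy: {zero_shot:.2f}, few-shot accuracy: {few_shot:.2f}")
```

Counting replies that normalize to `invalid` separately from wrong labels is worth doing, since format violations and genuine misclassifications call for different fixes.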
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- High sensitivity to prompt wording; a single prompt can misrepresent model ability.
- LLMs often fail to produce the exact structured format required by ABSA metrics.
- Context length limits constrain few-shot scaling and document-level tasks.
- Observed RLHF over-alignment can reduce detection accuracy for offensive or hateful content.
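One way to guard against the prompt-sensitivity limitation is to score the same examples under several prompt phrasings and report the spread rather than a single number. The sketch below assumes predictions have already been collected per prompt; the `prompt_spread` helper and the prompt names are hypothetical, and the paper's own aggregation may differ.

```python
from statistics import mean, pstdev


def accuracy(gold: list[str], pred: list[str]) -> float:
    """Fraction of predictions that equal the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def prompt_spread(gold: list[str], preds_by_prompt: dict[str, list[str]]) -> dict:
    """Summarize how much accuracy moves when only the prompt wording changes."""
    scores = {name: accuracy(gold, preds) for name, preds in preds_by_prompt.items()}
    values = list(scores.values())
    return {
        "per_prompt": scores,
        "mean": mean(values),
        "std": pstdev(values),  # population std over the prompts tried
        "range": max(values) - min(values),
    }


gold = ["pos", "neg", "pos", "neg"]
preds_by_prompt = {
    "prompt_a": ["pos", "neg", "pos", "neg"],
    "prompt_b": ["pos", "pos", "pos", "neg"],
    "prompt_c": ["pos", "pos", "pos", "pos"],
}
summary = prompt_spread(gold, preds_by_prompt)
print(summary)
```

A large `range` relative to the gap between two models suggests that a single-prompt comparison between them is not trustworthy.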
When Not To Use
- When you need exact, structured aspect/opinion triples or quadruples without human validation.
- As the sole detector for safety-sensitive content (hate/offensive) without specialized controls.
- When long documents exceed the model's few-shot context window and you need end-to-end fine-grained labeling.
Failure Modes
- Outputs that violate the required format, causing automatic-evaluation penalties.
- Prompt-induced variance: different natural phrasings can change results dramatically.
- Over-alignment bias: reduced detection of hate/offensive content in some LLMs.
- Degraded performance when few-shot context becomes too long or noisy.
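The format-violation failure mode can be measured directly by parsing each reply strictly and counting how many survive. The sketch below assumes the model was asked to answer with a Python-style list of (aspect, opinion, polarity) tuples; that output format, the polarity set, and the `parse_triplets` and `compliance_rate` helpers are illustrative assumptions, not the paper's exact specification.

```python
import ast

VALID_POLARITIES = {"positive", "negative", "neutral"}


def parse_triplets(raw: str):
    """Return a set of (aspect, opinion, polarity) triplets, or None on any format violation."""
    try:
        # literal_eval only accepts Python literals, so it never executes model output.
        parsed = ast.literal_eval(raw.strip())
    except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
        return None
    if not isinstance(parsed, list):
        return None
    triplets = set()
    for item in parsed:
        is_triplet = (
            isinstance(item, (tuple, list))
            and len(item) == 3
            and all(isinstance(part, str) for part in item)
        )
        if not is_triplet:
            return None
        aspect, opinion, polarity = (part.strip().lower() for part in item)
        if polarity not in VALID_POLARITIES:
            return None
        triplets.add((aspect, opinion, polarity))
    return triplets


def compliance_rate(replies: list[str]) -> float:
    """Share of replies that parse into the required structure."""
    return sum(parse_triplets(r) is not None for r in replies) / len(replies)


replies = [
    '[("battery life", "great", "positive")]',  # compliant
    "The battery life is great.",                # free text, not a list
    '[("screen", "dim", "bad")]',                # unknown polarity label
    "[]",                                        # compliant: no triplets found
]
print(compliance_rate(replies))
```

Reporting the compliance rate next to the task metric helps separate "the model could not find the triplet" from "the model found it but wrote it in the wrong shape".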
Core Entities
Models
- Flan-T5 (13B)
- Flan-UL2 (20B)
- text-davinci-003 (text-003, 175B)
- ChatGPT (gpt-3.5-turbo)
- T5-large (770M, fine-tuned SLM baseline)
Metrics
- Accuracy
- Micro-F1
- Macro-F1
- F1 (irony class)
- exact-match (SENTIEVAL)
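For readers less familiar with these metrics, the sketch below shows one common way to compute them: micro-F1 over sets of extracted tuples (as is typical for extraction tasks), macro-F1 over class labels, and a case-insensitive exact match. The function names and toy data are hypothetical, and the paper's scoring scripts may differ in detail, for example in how replies are normalized before matching.

```python
def micro_f1_sets(gold_sets: list[set], pred_sets: list[set]) -> float:
    """Micro-F1: pool true positives and counts over all examples, then compute F1 once."""
    tp = sum(len(g & p) for g, p in zip(gold_sets, pred_sets))
    n_pred = sum(len(p) for p in pred_sets)
    n_gold = sum(len(g) for g in gold_sets)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Macro-F1: compute F1 per gold class, then take the unweighted mean."""
    scores = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def exact_match(gold: list[str], pred: list[str]) -> float:
    """Share of replies equal to the reference after trimming and lowercasing."""
    return sum(g.strip().lower() == p.strip().lower() for g, p in zip(gold, pred)) / len(gold)


# Extraction: one of three predicted tuples has the wrong polarity.
gold_sets = [{("food", "positive")}, {("service", "negative"), ("price", "neutral")}]
pred_sets = [{("food", "positive")}, {("service", "negative"), ("price", "negative")}]
extraction_f1 = micro_f1_sets(gold_sets, pred_sets)

# Classification: always predicting the majority class is 75% accurate
# but scores poorly on macro-F1 because the minority class gets F1 = 0.
gold = ["hate", "not", "not", "not"]
pred = ["not", "not", "not", "not"]
classification_f1 = macro_f1(gold, pred)

em = exact_match(["positive", "negative"], [" Positive", "neutral"])
print(extraction_f1, classification_f1, em)
```

The classification example illustrates why macro-F1 is a common choice for imbalanced detection tasks: it exposes a model that ignores the minority class even when accuracy looks acceptable.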
Datasets
- IMDb
- Yelp-2
- Yelp-5
- MR
- SST-2
- SST-5
- Lap14
- Rest14
- Rest15
- Rest16
- Laptop14
- UABSA (SemEval)
- ASTE datasets
- ASQP (Rest15/Rest16)
- HatEval
- Irony18
- OffensEval
- Stance16
- CS19 (comparative)
- Emotion20
- Implicit (Lap+Res)
Benchmarks
- SENTIEVAL

