Overview
Evidence comes from 26 datasets across 13 tasks and from mixed automatic and human evaluation; conclusions are robust for simple tasks but weaker for structured extraction and safety-sensitive tasks.
Citations: 55
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
Off-the-shelf LLMs can replace expensive labeling for basic sentiment tasks and speed up pilot projects, but structured extraction and safety-sensitive detection still need specialist models or human review.
Who Should Care
Summary TLDR
This paper runs a broad, practical evaluation of LLMs (Flan-T5, Flan-UL2, text-davinci-003, ChatGPT) against a fine-tuned T5 (770M) across 13 sentiment tasks on 26 datasets. Findings: in zero-shot settings, LLMs match or nearly match fine-tuned models on simple sentiment classification but fail at structured aspect-level extraction; in few-shot settings, they strongly beat small models. The authors introduce SENTIEVAL, a unified, prompt-robust benchmark, and show that prompt design, format compliance, and context-length limits remain key blockers.
Problem Statement
Can current large language models reliably solve the full range of sentiment-analysis problems — from basic polarity classification to aspect-level extraction and nuanced subjective analysis — and how do they compare to smaller, task-trained models in zero-shot and few-shot settings?
Main Contribution
Systematic evaluation of LLMs on 13 sentiment tasks across 26 datasets, covering sentiment classification, ABSA, and multifaceted subjective tasks.
Empirical finding that LLMs perform well zero-shot on simple classification but underperform on structured, fine-grained tasks.
Key Findings
LLMs match or nearly match fine-tuned small models on simple sentiment classification in the zero-shot setting (ChatGPT 78.24 vs. fine-tuned T5 80.65 average accuracy).
LLMs lag on fine-grained aspect-based sentiment extraction without task-specific training.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| SC average (zero-shot) | ChatGPT 78.24 vs T5 80.65 (accuracy avg over SC datasets) | T5 large (fine-tuned) | -2.41 | SC datasets (see Table 1) | Table 2 average SC block | — |
| Yelp-5 (document-level, zero-shot) | ChatGPT 52.40% vs T5 65.60% (accuracy) | T5 large (fine-tuned) | -13.20 | Yelp-5 test set (sampled) | Table 2, SC Document-Level rows | — |
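
The Delta column is the LLM score minus the fine-tuned baseline score. As a quick sanity check, the arithmetic for the two rows above can be reproduced with a short sketch; the `delta` helper and the `rows` list are illustrative names, not something from the paper.

```python
# Accuracy values copied from the two table rows above:
# (row label, ChatGPT zero-shot accuracy, fine-tuned T5 accuracy).
rows = [
    ("SC average (zero-shot)", 78.24, 80.65),
    ("Yelp-5 (document-level, zero-shot)", 52.40, 65.60),
]


def delta(llm_score, baseline_score):
    """Return the LLM score minus the baseline score, rounded to 2 decimals."""
    return round(llm_score - baseline_score, 2)


# Hypothetical helper structure: map each row label to its delta.
deltas = {name: delta(llm, base) for name, llm, base in rows}

for name, d in deltas.items():
    print(f"{name}: {d:+.2f}")
```

A negative delta means the zero-shot LLM trails the fine-tuned baseline on that row.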
What To Try In 7 Days
Run a quick zero-shot pilot: apply ChatGPT or Flan-UL2 to your binary sentiment labels and compare to existing classifiers.
If you have <100 labels per class, test LLM few-shot prompts before training a specialist model.
For aspect extraction, run LLM outputs through a small validation set or human review before trusting automation.
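A minimal sketch of the pilot comparison described in the steps above, assuming you have already collected LLM predictions and your existing classifier's predictions for the same labeled examples. All data below is made up for illustration, and the names `accuracy` and `disagreements` are hypothetical helpers, not part of any library.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one labeled example")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def disagreements(preds_a, preds_b):
    """Indices where two systems disagree; candidates for human review."""
    return [i for i, (a, b) in enumerate(zip(preds_a, preds_b)) if a != b]


# Hypothetical pilot data: gold labels plus two systems' outputs.
gold = ["pos", "neg", "pos", "neg", "pos"]
llm_preds = ["pos", "neg", "pos", "pos", "pos"]       # e.g. zero-shot LLM output
existing_preds = ["pos", "neg", "neg", "pos", "pos"]  # e.g. current classifier

llm_acc = accuracy(llm_preds, gold)
existing_acc = accuracy(existing_preds, gold)
review_queue = disagreements(llm_preds, existing_preds)

print(f"LLM accuracy:      {llm_acc:.2f}")
print(f"Existing accuracy: {existing_acc:.2f}")
print(f"Examples to review by hand: {review_queue}")
```

Routing only the disagreements to a reviewer keeps the human-review cost of a one-week pilot small.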
Risks & Boundaries
Limitations
High sensitivity to prompt wording; a single prompt can misrepresent model ability.
LLMs often fail to produce the exact structured format required by ABSA metrics.
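One way to quantify the format problem is to measure how often raw model outputs parse into the structure the metric expects. The sketch below assumes a hypothetical target format, a JSON list of `[aspect, opinion, polarity]` triples; the actual format required by a given ABSA benchmark may differ, and `parse_triples` and `compliance_rate` are illustrative names.

```python
import json

# Assumed label set for illustration; adjust to the dataset in use.
VALID_POLARITIES = {"positive", "negative", "neutral"}


def parse_triples(raw):
    """Return a list of (aspect, opinion, polarity) tuples, or None if malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list):
        return None
    triples = []
    for item in data:
        is_triple = (
            isinstance(item, list)
            and len(item) == 3
            and all(isinstance(x, str) for x in item)
        )
        if not is_triple or item[2] not in VALID_POLARITIES:
            return None
        triples.append(tuple(item))
    return triples


def compliance_rate(outputs):
    """Fraction of raw outputs that parse into well-formed triples."""
    if not outputs:
        raise ValueError("need at least one output")
    return sum(parse_triples(o) is not None for o in outputs) / len(outputs)


# Hypothetical raw model outputs.
outputs = [
    '[["pizza", "delicious", "positive"]]',  # well-formed
    "The pizza is delicious, so positive.",  # free text, not JSON
    '[["service", "slow", "bad"]]',          # label outside the allowed set
]
rate = compliance_rate(outputs)
print(f"Format compliance: {rate:.2f}")
```

Tracking this rate separately from task accuracy shows whether a low ABSA score comes from wrong answers or from unparseable ones.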
When Not To Use
When you need exact, structured aspect/opinion triples or quadruples without human validation.
As the sole detector for safety-sensitive content (hate/offensive) without specialized controls.
Failure Modes
Outputs that violate the required format, causing automatic-evaluation penalties.
Prompt-induced variance: different natural phrasings can change results dramatically.
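To gauge prompt-induced variance in your own setting, one option is to score the same model on the same data under several prompt phrasings and report the spread rather than a single number. This is a minimal sketch; the accuracy values are invented for illustration and `summarize_prompt_scores` is a hypothetical helper.

```python
from statistics import mean, stdev


def summarize_prompt_scores(scores):
    """Summarize one model's scores across different prompt phrasings."""
    if len(scores) < 2:
        raise ValueError("need scores from at least two prompts")
    return {
        "mean": mean(scores),
        "stdev": stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
        "range": max(scores) - min(scores),
    }


# Hypothetical accuracies for one model under three prompt wordings.
scores_by_prompt = [0.70, 0.80, 0.90]
summary = summarize_prompt_scores(scores_by_prompt)

print(f"mean={summary['mean']:.2f} stdev={summary['stdev']:.2f} "
      f"range={summary['range']:.2f}")
```

A wide range relative to the gap between two models suggests that a single-prompt comparison is not reliable.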

