LLMs excel at simple sentiment tasks but struggle with fine-grained, structured sentiment extraction

Overview

Decision Snapshot: Needs Validation

Evidence comes from 26 datasets across 13 tasks and from mixed automatic and human evaluation; conclusions are robust for simple tasks but weaker for structured extraction and safety-sensitive tasks.

Citations: 55

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 40%

Authors

Wenxuan Zhang, Yue Deng, Bing Liu, Sinno Jialin Pan, Lidong Bing

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Off-the-shelf LLMs can replace expensive labeling for basic sentiment tasks and speed up pilot projects, but structured extraction and safety-sensitive detection still need specialist models or human review.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO

Summary TLDR

This paper runs a broad, practical evaluation of LLMs (Flan-T5, Flan-UL2, text-davinci-003, ChatGPT) against a fine-tuned T5-large (770M) across 13 sentiment tasks on 26 datasets. Findings: LLMs match or nearly match fine-tuned models on simple sentiment classification in zero-shot settings, fail at structured aspect-level extraction, and strongly beat small models in few-shot settings. The authors introduce SENTIEVAL, a unified, prompt-robust benchmark, and show that prompt design, format compliance, and context-length limits remain key blockers.

Problem Statement

Can current large language models reliably solve the full range of sentiment-analysis problems, from basic polarity classification to aspect-level extraction and nuanced subjective analysis, and how do they compare to smaller, task-trained models in zero-shot and few-shot settings?

Main Contribution

Systematic evaluation of LLMs on 13 sentiment tasks across 26 datasets, covering sentiment classification, ABSA, and multifaceted subjective tasks.

Empirical finding that LLMs perform well zero-shot on simple classification but underperform on structured, fine-grained tasks.

Key Findings

LLMs match fine-tuned small models on simple sentiment classification in zero-shot.

Numbers: ChatGPT ≈97% of T5 performance on SC tasks (paper text).

Practical Use: Use off-the-shelf LLMs (e.g., ChatGPT or Flan-UL2) for binary/trinary sentiment labeling to avoid labeling costs.

Evidence Ref: Sec 4.3, Table 2

LLMs lag on fine-grained aspect-based sentiment extraction without task-specific training.

Numbers: ABSA zero-shot average: ChatGPT 37.09 vs fine-tuned T5 61.06 (micro-F1, evaluated datasets).

Practical Use: Don't rely on zero-shot LLM outputs for structured aspect/opinion extraction; prefer task-trained models or human review.

Evidence Ref: Table 2 (ABSA rows)
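The ABSA numbers above are micro-F1 scores. As a rough illustration of why strict extraction metrics are unforgiving, here is a minimal sketch of micro-averaged F1 over exact-match tuples. The `(aspect, polarity)` tuple layout and the toy sentences are assumptions made for this example; they are not the paper's data or its evaluation script.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]  # (aspect term, polarity); this layout is an assumption


def micro_f1(gold: Iterable[Set[Pair]], pred: Iterable[Set[Pair]]) -> float:
    """Micro-averaged F1 over exact-match tuples, pooled across all sentences."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # a prediction counts only if every field matches
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy data: the second predicted pair names "display" instead of "screen".
gold = [{("battery", "positive"), ("screen", "negative")}, {("service", "negative")}]
pred = [{("battery", "positive"), ("display", "negative")}, {("service", "negative")}]
score = micro_f1(gold, pred)
print(round(score, 4))
```

In this toy case a near-synonym ("display" for "screen") earns no credit at all, which is one way a model that understands the sentiment can still score poorly on exact-match extraction.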

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
SC average (zero-shot)	ChatGPT 78.24 vs T5 80.65 (accuracy avg over SC datasets)	T5 large (fine-tuned)	-2.41	SC datasets (see Table 1)	Table 2 average SC block	—
Yelp-5 (document-level, zero-shot)	ChatGPT 52.40% vs T5 65.60% (accuracy)	T5 large (fine-tuned)	-13.20	Yelp-5 test set (sampled)	Table 2, SC Document-Level rows	—
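The Delta column and the "≈97% of T5" figure can both be rechecked from the two scores in each row. The sketch below does that arithmetic; the scores are copied from the table above, while the `compare` helper is a hypothetical name introduced here.

```python
from typing import Dict, Tuple

# Scores copied from the results table: (ChatGPT zero-shot, fine-tuned T5).
rows: Dict[str, Tuple[float, float]] = {
    "SC average": (78.24, 80.65),
    "Yelp-5": (52.40, 65.60),
}


def compare(llm: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute delta in points, LLM score as a percentage of baseline)."""
    return round(llm - baseline, 2), round(100 * llm / baseline, 1)


summary = {name: compare(*scores) for name, scores in rows.items()}
for name, (delta, pct) in summary.items():
    print(f"{name}: delta {delta:+.2f} points, {pct:.1f}% of baseline")
```

The SC average works out to about 97% of the baseline, consistent with the figure quoted under Key Findings, while Yelp-5 reaches only about 80%, so the average hides a sizable gap on five-class document-level sentiment.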

What To Try In 7 Days

Run a quick zero-shot pilot: apply ChatGPT or Flan-UL2 to your binary sentiment labels and compare to existing classifiers.

If you have <100 labels per class, test LLM few-shot prompts before training a specialist model.

For aspect extraction, run LLM outputs through a small validation set or human review before trusting automation.
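The first two steps above could be prototyped with a small harness like the following. It is a sketch under stated assumptions: the prompt wording is illustrative and not the paper's, `complete` stands for whatever text-completion client you already use, and `fake_complete` is a hypothetical offline stand-in so the example runs without network access.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (text, label)


def build_prompt(text: str, shots: Sequence[Example] = ()) -> str:
    """Assemble a zero-shot prompt, or a few-shot one if `shots` is non-empty.

    The wording is illustrative only, not the prompt used in the paper.
    """
    lines = ["Classify the sentiment of the review as positive or negative."]
    for shot_text, shot_label in shots:
        lines.append(f"Review: {shot_text}\nSentiment: {shot_label}")
    lines.append(f"Review: {text}\nSentiment:")
    return "\n\n".join(lines)


def accuracy(predict: Callable[[str], str], data: Sequence[Example]) -> float:
    """Fraction of examples where the normalized prediction equals the label."""
    hits = sum(predict(text).strip().lower() == label for text, label in data)
    return hits / len(data)


def make_llm_predictor(
    complete: Callable[[str], str], shots: Sequence[Example] = ()
) -> Callable[[str], str]:
    """Wrap a text-completion callable (your own API client) as a classifier."""
    return lambda text: complete(build_prompt(text, shots))


def fake_complete(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; keyword rule, offline."""
    review = prompt.rsplit("Review: ", 1)[1]
    return " Positive" if "great" in review else " negative"


pilot = [
    ("great phone", "positive"),
    ("broke in a week", "negative"),
    ("not great", "negative"),
]
llm_acc = accuracy(make_llm_predictor(fake_complete), pilot)
baseline_acc = accuracy(lambda text: "positive", pilot)  # trivial constant baseline
print(f"LLM {llm_acc:.2f} vs baseline {baseline_acc:.2f}")
```

Passing the model call in as a plain callable keeps the harness independent of any vendor SDK, and supplying a handful of labeled `shots` switches the same code path from zero-shot to few-shot.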

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/DAMO-NLP-SG/LLM-Sentiment

Data URLs

https://github.com/DAMO-NLP-SG/LLM-Sentiment

Risks & Boundaries

Limitations

High sensitivity to prompt wording; a single prompt can misrepresent model ability.

LLMs often fail to produce the exact structured format required by ABSA metrics.

When Not To Use

When you need exact, structured aspect/opinion triples or quadruples without human validation.

As the sole detector for safety-sensitive content (hate/offensive) without specialized controls.

Failure Modes

Outputs that violate the required format, causing automatic-evaluation penalties.

Prompt-induced variance: different natural phrasings can change results dramatically.
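For the format-violation failure mode, one practical guard is to parse model output strictly and track how often it fails before any scoring happens. The sketch below assumes a hypothetical target format, a Python-style list of `(aspect, polarity)` tuples; the format, the `parse_pairs` helper, and the sample outputs are all illustrative and not taken from the paper.

```python
import ast
from typing import List, Optional, Tuple

POLARITIES = {"positive", "negative", "neutral"}


def parse_pairs(raw: str) -> Optional[List[Tuple[str, str]]]:
    """Parse output like "[('battery', 'positive')]"; return None if malformed.

    The target format is an assumption chosen for illustration.
    """
    try:
        value = ast.literal_eval(raw.strip())
    except (ValueError, SyntaxError, TypeError):
        return None  # free text or broken syntax
    if not isinstance(value, list):
        return None
    pairs: List[Tuple[str, str]] = []
    for item in value:
        ok = (
            isinstance(item, tuple)
            and len(item) == 2
            and all(isinstance(part, str) for part in item)
            and item[1] in POLARITIES
        )
        if not ok:
            return None  # wrong shape or an unknown polarity label
        pairs.append(item)
    return pairs


# Made-up outputs: one compliant, one free text, one with an invalid label.
outputs = [
    "[('battery', 'positive'), ('screen', 'negative')]",
    "The battery is positive and the screen is negative.",
    "[('service', 'good')]",
]
parsed = [parse_pairs(o) for o in outputs]
violation_rate = sum(p is None for p in parsed) / len(outputs)
print(f"format violations: {violation_rate:.0%}")
```

Reporting the violation rate separately from the task metric helps distinguish a model that extracted the wrong content from one that extracted the right content in the wrong shape.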

Core Entities

Models

Flan-T5 (13B), Flan-UL2 (20B), text-davinci-003 (text-003, 175B), ChatGPT (gpt-3.5-turbo), T5-large (770M, fine-tuned SLM baseline)

Metrics

Accuracy, micro_f1, macro_f1, f1 (irony), exact-match (SENTIEVAL)

Datasets

IMDb, Yelp-2, Yelp-5, MR, SST-2, SST-5, Twitter, Lap14, Rest14, Rest15, Rest16, Laptop14, UABSA (SemEval), ASTE datasets, ASQP (Rest15/Rest16), HatEval, Irony18, OffensEval, Stance16, CS19 (comparative), Emotion20, Implicit (Lap+Res)

Benchmarks

SENTIEVAL
