LLMs excel at simple sentiment tasks but struggle with fine-grained, structured sentiment extraction

May 24, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

55

Authors

Wenxuan Zhang, Yue Deng, Bing Liu, Sinno Jialin Pan, Lidong Bing

Links

Abstract / PDF

Why It Matters For Business

Off-the-shelf LLMs can replace expensive labeling for basic sentiment tasks and speed up pilot projects, but structured extraction and safety-sensitive detection still need specialist models or human review.

Summary TLDR

This paper runs a broad, practical evaluation of LLMs (Flan-T5, Flan-UL2, text-davinci-003, ChatGPT) against a fine-tuned T5 (770M) across 13 sentiment tasks on 26 datasets. Findings: in the zero-shot setting, LLMs match or nearly match fine-tuned models on simple sentiment classification but fall well short on structured aspect-level extraction; in few-shot settings, they strongly beat small models trained on the same handful of examples. The authors introduce SENTIEVAL, a unified, prompt-robust benchmark, and show that prompt design, format compliance, and context-length limits remain key blockers.

Problem Statement

Can current large language models reliably solve the full range of sentiment-analysis problems — from basic polarity classification to aspect-level extraction and nuanced subjective analysis — and how do they compare to smaller, task-trained models in zero-shot and few-shot settings?

Main Contribution

Systematic evaluation of LLMs on 13 sentiment tasks across 26 datasets, covering sentiment classification, ABSA, and multifaceted subjective tasks.

Empirical finding that LLMs perform well zero-shot on simple classification but underperform on structured, fine-grained tasks.

Demonstration that LLMs outperform small fine-tuned models in few-shot in-context learning when labeled data is scarce.

Identification of practical evaluation issues: prompt sensitivity, format compliance, and context-length limits.

Release of SENTIEVAL: a benchmark with diverse prompts and mixed few-shot/zero-shot queries to produce more robust comparisons.

Key Findings

LLMs nearly match fine-tuned small models on simple sentiment classification in the zero-shot setting.

Numbers: ChatGPT reaches ≈97% of T5 performance on SC tasks (paper text).

LLMs lag on fine-grained aspect-based sentiment extraction without task-specific training.

Numbers: ABSA zero-shot average: ChatGPT 37.09 vs fine-tuned T5 61.06 (micro-F1, evaluated datasets).

LLMs outperform small models in few-shot settings when labeled data is scarce.

Numbers: Across k-shot tests (1/5/10), LLMs consistently beat T5 trained on the same few examples (Table 4).

Prompt design and output format strongly affect results, especially for structured tasks.

Numbers: Human eval: ABSA 'relaxed' acceptance up to 68.33% vs strict 58.33% (UABSA); prompt variance shown in Figure 2.

ChatGPT shows weaker performance on hate/irony/offensive detection compared to other LLMs.

Numbers: Zero-shot HatEval: ChatGPT 50.92 vs text-003 67.79; Irony/Offensive also lower (Table 2).

A unified, prompt-diverse benchmark (SENTIEVAL) reveals persistent gaps on complex tasks.

Numbers: SENTIEVAL exact-match: ChatGPT 47.55, Flan-UL2 38.82, text-003 36.64 (overall).

Results

SC average (zero-shot)

Value: ChatGPT 78.24 vs T5 80.65 (accuracy, averaged over SC datasets)

Baseline: T5-large (fine-tuned)

Yelp-5 (document-level, zero-shot)

Value: ChatGPT 52.40% vs T5 65.60% (accuracy)

Baseline: T5-large (fine-tuned)

Comparative opinions (CS19, zero-shot)

Value: ChatGPT 72.80% vs T5 80.35% (accuracy)

Baseline: T5-large (fine-tuned)

ABSA average (zero-shot micro-F1)

Value: ChatGPT 37.09 vs T5 61.06

Baseline: T5-large (fine-tuned)

SENTIEVAL overall (exact match)

Value: ChatGPT 47.55, Flan-UL2 38.82, text-003 36.64

Baseline: Flan-UL2 / text-003
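The zero-shot figures in the Results above are easier to compare as a fraction of the fine-tuned baseline. The following is a small arithmetic sketch using only the scores reported in this summary; the function name `relative_performance` is made up for illustration and does not come from the paper.

```python
# Zero-shot ChatGPT vs fine-tuned T5 scores, copied from the Results above.
reported = {
    "SC average (accuracy)": (78.24, 80.65),
    "Yelp-5 (accuracy)": (52.40, 65.60),
    "CS19 comparative (accuracy)": (72.80, 80.35),
    "ABSA average (micro-F1)": (37.09, 61.06),
}


def relative_performance(llm_score: float, baseline_score: float) -> float:
    """Return the LLM score as a fraction of the baseline score."""
    return llm_score / baseline_score


for task, (chatgpt, t5) in reported.items():
    ratio = relative_performance(chatgpt, t5)
    gap = t5 - chatgpt
    print(f"{task}: {ratio:.1%} of T5 (gap of {gap:.2f} points)")
```

The SC ratio works out to roughly 97%, which is consistent with the figure quoted under Key Findings, while the ABSA ratio is closer to 61%.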

Who Should Care

What To Try In 7 Days

Run a quick zero-shot pilot: apply ChatGPT or Flan-UL2 to a labeled sample of your binary sentiment data and compare the results against your existing classifiers.

If you have <100 labels per class, test LLM few-shot prompts before training a specialist model.

For aspect extraction, run LLM outputs through a small validation set or human review before trusting automation.
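A zero-shot pilot of the kind suggested above could be wired up roughly as follows. This is a minimal sketch, not the paper's evaluation code: the prompt wording, the label set, and the `call_model` callable (a stand-in for whatever LLM client you use) are all assumptions made for illustration. Unparseable outputs are counted separately because format compliance is one of the issues the paper highlights.

```python
from typing import Callable, Iterable, Optional, Tuple

LABELS = ("positive", "negative")  # assumed binary label set


def build_zero_shot_prompt(text: str) -> str:
    """Build a simple zero-shot prompt (wording is illustrative only)."""
    return (
        "Classify the sentiment of the review as positive or negative.\n"
        f"Review: {text}\n"
        "Sentiment:"
    )


def parse_label(raw: str) -> Optional[str]:
    """Map free-form output to a label; None if zero or several labels appear."""
    lowered = raw.lower()
    found = [label for label in LABELS if label in lowered]
    return found[0] if len(found) == 1 else None


def pilot_accuracy(
    examples: Iterable[Tuple[str, str]],
    call_model: Callable[[str], str],
) -> Tuple[float, float]:
    """Return (accuracy, share of unparseable outputs) over labeled examples."""
    total = correct = unparseable = 0
    for text, gold in examples:
        total += 1
        pred = parse_label(call_model(build_zero_shot_prompt(text)))
        if pred is None:
            unparseable += 1
        elif pred == gold:
            correct += 1
    if total == 0:
        raise ValueError("no examples provided")
    return correct / total, unparseable / total


def fake_model(prompt: str) -> str:
    """Keyword-based stand-in for a real LLM call, used only for the demo."""
    review = prompt.split("Review: ", 1)[1]
    return "Positive." if "great" in review else "I think it is negative"


examples = [
    ("great food", "positive"),
    ("slow service", "negative"),
    ("great value", "positive"),
]
accuracy, unparseable_rate = pilot_accuracy(examples, fake_model)
print(accuracy, unparseable_rate)
```

Swapping `fake_model` for a real client call leaves the rest unchanged, so the same harness can score an existing classifier on the same sample for a like-for-like comparison.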

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • High sensitivity to prompt wording; a single prompt can misrepresent model ability.
  • LLMs often fail to produce the exact structured format required by ABSA metrics.
  • Context length limits constrain few-shot scaling and document-level tasks.
  • Observed RLHF over-alignment can reduce detection accuracy for offensive or hateful content.
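The format-compliance limitation can be made concrete with a small parser that separates outputs an automatic metric would accept from outputs a human might still accept. The sketch below assumes a hypothetical target format, a Python-style list of (aspect, polarity) pairs; it is not the paper's exact output format or its human-evaluation protocol, and the names `parse_strict` and `parse_relaxed` are illustrative.

```python
import ast
import re

POLARITIES = {"positive", "negative", "neutral"}

# One "aspect <separator> polarity" pair per line or semicolon-separated segment.
SEGMENT = re.compile(r"^\W*(.+?)\W+(positive|negative|neutral)\W*$", re.IGNORECASE)


def parse_strict(raw):
    """Accept only a literal list of (aspect, polarity) string pairs, else None."""
    try:
        value = ast.literal_eval(raw.strip())
    except (ValueError, SyntaxError):
        return None
    if not isinstance(value, list):
        return None
    pairs = []
    for item in value:
        if not (isinstance(item, (tuple, list)) and len(item) == 2):
            return None
        aspect, polarity = item
        if not (isinstance(aspect, str) and isinstance(polarity, str)):
            return None
        if polarity.lower() not in POLARITIES:
            return None
        pairs.append((aspect.strip().lower(), polarity.lower()))
    return pairs


def parse_relaxed(raw):
    """Fall back to line-by-line matching when the strict format is violated."""
    strict = parse_strict(raw)
    if strict is not None:
        return strict
    pairs = []
    for segment in re.split(r"[\n;]+", raw):
        match = SEGMENT.match(segment.strip())
        if match:
            pairs.append((match.group(1).strip().lower(), match.group(2).lower()))
    return pairs


compliant = "[('food', 'positive'), ('service', 'negative')]"
loose = "Food: Positive\n- service -> negative"
print(parse_strict(compliant))
print(parse_strict(loose))
print(parse_relaxed(loose))
```

Tracking how often `parse_strict` returns `None` gives a rough compliance rate, which helps separate formatting failures from genuine extraction errors before trusting an automatic score.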

When Not To Use

  • When you need exact, structured aspect/opinion triples or quadruples without human validation.
  • As the sole detector for safety-sensitive content (hate/offensive) without specialized controls.
  • When long documents exceed the model's few-shot context window and you need end-to-end fine-grained labeling.

Failure Modes

  • Outputs that violate the required format, causing automatic-evaluation penalties.
  • Prompt-induced variance: different natural phrasings can change results dramatically.
  • Over-alignment bias: reduced detection of hate/offensive content in some LLMs.
  • Degraded performance when few-shot context becomes too long or noisy.
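One way to guard against the context-length failure mode is to cap the number of demonstrations by a token budget when assembling a few-shot prompt. The sketch below is an illustration under stated assumptions: `approx_tokens` counts whitespace-separated words as a crude proxy for a real tokenizer, and the prompt layout and function names are hypothetical rather than taken from the paper.

```python
from typing import List, Sequence, Tuple


def approx_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (not a real tokenizer)."""
    return len(text.split())


def build_few_shot_prompt(
    instruction: str,
    demos: Sequence[Tuple[str, str]],
    query: str,
    max_tokens: int,
) -> Tuple[str, int]:
    """Add demonstrations in order until the budget is reached.

    Returns the prompt and the number of demonstrations that fit.
    """
    query_block = f"Review: {query}\nSentiment:"
    used = approx_tokens(instruction) + approx_tokens(query_block)
    if used > max_tokens:
        raise ValueError("instruction and query alone exceed the budget")
    kept: List[str] = []
    for text, label in demos:
        block = f"Review: {text}\nSentiment: {label}"
        cost = approx_tokens(block)
        if used + cost > max_tokens:
            break  # stop at the first demonstration that does not fit
        kept.append(block)
        used += cost
    return "\n\n".join([instruction, *kept, query_block]), len(kept)


demos = [("great", "positive"), ("bad", "negative"), ("fine", "neutral")]
prompt, n_kept = build_few_shot_prompt(
    "Classify the sentiment.", demos, "ok food", max_tokens=16
)
print(n_kept)
print(prompt)
```

In practice the word count would be replaced with the target model's own tokenizer, and the budget would leave room for the generated answer.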

Core Entities

Models

  • Flan-T5 (13B)
  • Flan-UL2 (20B)
  • text-davinci-003 (text-003, 175B)
  • ChatGPT (gpt-3.5-turbo)
  • T5-large (770M, fine-tuned SLM baseline)

Metrics

  • Accuracy
  • Micro-F1
  • Macro-F1
  • F1 (irony class)
  • exact-match (SENTIEVAL)
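For readers less familiar with these metrics, the sketch below shows one common way to compute micro-F1 over extracted tuples, macro-F1 over class labels, and exact match over normalized strings. These are standard textbook definitions written for illustration; the paper's own scoring scripts may differ in details such as normalization, and the function names here are assumptions.

```python
def micro_f1(pred_sets, gold_sets):
    """Micro-F1 over extracted tuples, pooling counts across all sentences."""
    tp = sum(len(p & g) for p, g in zip(pred_sets, gold_sets))
    n_pred = sum(len(p) for p in pred_sets)
    n_gold = sum(len(g) for g in gold_sets)
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


def macro_f1(preds, golds):
    """Unweighted mean of per-class F1, with classes taken from the gold labels."""
    scores = []
    for label in sorted(set(golds)):
        tp = sum(p == label and g == label for p, g in zip(preds, golds))
        fp = sum(p == label and g != label for p, g in zip(preds, golds))
        fn = sum(p != label and g == label for p, g in zip(preds, golds))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def exact_match(preds, golds):
    """Share of outputs equal to the reference after trimming and lowercasing."""
    hits = sum(p.strip().lower() == g.strip().lower() for p, g in zip(preds, golds))
    return hits / len(golds)


pred_sets = [{("food", "positive")}, {("service", "negative"), ("price", "neutral")}]
gold_sets = [{("food", "positive")}, {("service", "negative")}]
print(micro_f1(pred_sets, gold_sets))
print(macro_f1(["pos", "neg", "neg", "pos"], ["pos", "pos", "neg", "neg"]))
print(exact_match(["Positive ", "negative"], ["positive", "neutral"]))
```

Because micro-F1 on tuples only credits exact set matches, a single spurious or malformed tuple lowers precision, which is part of why format violations are penalized so heavily on structured tasks.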

Datasets

  • IMDb
  • Yelp-2
  • Yelp-5
  • MR
  • SST-2
  • SST-5
  • Twitter
  • Lap14
  • Rest14
  • Rest15
  • Rest16
  • Laptop14
  • UABSA (SemEval)
  • ASTE datasets
  • ASQP (Rest15/Rest16)
  • HatEval
  • Irony18
  • OffensEval
  • Stance16
  • CS19 (comparative)
  • Emotion20
  • Implicit (Lap+Res)

Benchmarks

  • SENTIEVAL