Iteratively prompt an LLM to produce filtered, diverse ABSA training data that rivals manual labels

June 29, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.45

Cost Impact Score

0.65

Citation Count

1

Authors

Qihuang Zhong, Haiyun Li, Luyao Zhuang, Juhua Liu, Bo Du

Links

Abstract / PDF

Why It Matters For Business

IDG can produce usable labeled ABSA data from unlabeled text, lowering annotation cost and quickly bootstrapping sentiment models in new domains.

Summary TLDR

The paper presents IDG, a three-stage pipeline that uses an LLM (GPT‑3.5‑turbo) to extract domain aspects from unlabeled text, expand them, generate single- and multi-aspect sentence-aspect-polarity triplets via iterative prompting, and filter the outputs with an LLM-based discriminator. On four SemEval ABSA benchmarks, synthetic data from IDG matches or improves the performance of five baseline ABSA models. Key wins: training on generated data alone often approaches training on manual labels; mixing generated and original data yields consistent gains (up to +4.01% F1); and both the discriminator and multi-aspect generation materially help. The method requires access to an LLM plus careful aspect extraction and filtering.

Problem Statement

Aspect-based sentiment models need many labeled sentence–aspect–polarity examples, but manual annotation is expensive. Existing augmentation methods either tweak words or paraphrase, and they still suffer from poor fluency or low diversity, or require labeled seeds. Directly prompting LLMs is promising but leads to hallucinations and low-quality pseudo labels. The goal is to produce diverse, fluent, high-quality ABSA training data from an unlabeled corpus using LLMs while controlling hallucination.

Main Contribution

IDG: a three-stage, iterative LLM pipeline (aspect extraction/extension, iterative generation, LLM-based evaluation/filtering) to produce pseudo-labeled ABSA data from unlabeled text.

A self-reflection discriminator that uses the LLM as a judge plus automatic scoring to remove low-quality outputs.

Comprehensive evaluation on four SemEval ABSA benchmarks showing that generated data can match or improve on manual labels, and that it improves multiple baseline ABSA models when mixed with real data.
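The three stages listed above can be wired together roughly as in the sketch below. It is not the authors' implementation: the `Triplet` type, the `run_idg` function, and the four callables passed into it are hypothetical placeholders standing in for LLM calls, and the rule "keep a sample when its judge score is at or above the threshold" is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Triplet:
    """One pseudo-labeled ABSA example: sentence, aspect term, polarity."""
    sentence: str
    aspect: str
    polarity: str  # e.g. "positive", "negative", or "neutral"


def run_idg(
    corpus: Iterable[str],
    extract_aspects: Callable[[str], List[str]],        # hypothetical LLM call: text -> aspects
    extend_aspects: Callable[[List[str]], List[str]],   # hypothetical LLM call: aspects -> new aspects
    generate: Callable[[List[str]], List[Triplet]],     # hypothetical LLM call: aspects -> triplets
    judge_score: Callable[[Triplet], float],            # hypothetical LLM-as-judge score
    threshold: float,
) -> List[Triplet]:
    # Stage 1a: extract aspects from unlabeled text, de-duplicated case-insensitively.
    seen: Dict[str, str] = {}
    for text in corpus:
        for aspect in extract_aspects(text):
            seen.setdefault(aspect.lower(), aspect)
    aspects = list(seen.values())

    # Stage 1b: extend the aspect set, skipping aspects already extracted.
    aspects += [a for a in extend_aspects(aspects) if a.lower() not in seen]

    # Stage 2: generate candidate triplets for the aspect set.
    candidates = generate(aspects)

    # Stage 3: keep only candidates the judge scores at or above the threshold (assumed rule).
    return [t for t in candidates if judge_score(t) >= threshold]
```

Passing the LLM calls in as plain callables keeps the control flow testable with stubs before any API budget is spent.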

Key Findings

IDG-generated data can match or exceed manual training data on ABSA models.

Numbers: R-GAT: Laptop14 F1 73.92→76.18 (+2.26); Rest14 F1 80.74→82.04 (+1.30)

Mixing IDG synthetic data with original labeled data consistently improves models.

Numbers: Up to +4.01% F1 when mixing generated + original data on evaluated models

Filtering generated samples is critical for final model quality.

Numbers: ASGCN F1 drops 77.71→70.99 (−6.72) without discriminator

Generating multi-aspect sentences improves ABSA training over single-aspect only.

Numbers: ASGCN: Acc 76.09→80.62 (+4.53), F1 72.42→77.71 (+5.29); R-GAT F1 +6.96

Aspect extraction benefits from few-shot demonstrations and affects final performance.

Numbers: Aspect F1 on Laptop14: zero-shot 47.41 → few-shot random 58.13 (+10.72)
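The difference between zero-shot and few-shot aspect extraction is whether labeled demonstrations are placed in the prompt. A minimal sketch of building such a prompt with randomly sampled demonstrations follows. The instruction wording, the `build_aspect_prompt` helper, and the demonstration format are illustrative assumptions, not the prompts used in the paper.

```python
import random
from typing import List, Sequence, Tuple

Demo = Tuple[str, List[str]]  # (sentence, aspect terms it contains)


def build_aspect_prompt(sentence: str, demos: Sequence[Demo], k: int = 0, seed: int = 0) -> str:
    """Return a zero-shot (k=0) or few-shot (k>0) aspect-extraction prompt."""
    # Hypothetical instruction text; adapt to your domain.
    lines = ["Extract the aspect terms from the sentence. Answer with a comma-separated list."]

    # Randomly pick up to k demonstrations; a fixed seed keeps the prompt reproducible.
    rng = random.Random(seed)
    chosen = rng.sample(list(demos), k=min(k, len(demos)))
    for demo_sentence, aspects in chosen:
        lines.append(f"Sentence: {demo_sentence}")
        lines.append(f"Aspects: {', '.join(aspects)}")

    # The target sentence goes last, with the answer slot left open for the LLM.
    lines.append(f"Sentence: {sentence}")
    lines.append("Aspects:")
    return "\n".join(lines)
```

Calling it with `k=0` gives the zero-shot prompt, so both settings can be compared on the same held-out sentences.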

Results

Accuracy

Value: 80.25

Baseline: 78.37 (R-GAT base)

R-GAT F1 (Laptop14)

Value: 76.18

Baseline: 73.92 (R-GAT base)

R-GAT F1 (Rest14)

Value: 82.04

Baseline: 80.74 (R-GAT base)

Mixing generated + original F1 gain

Value: up to 4.01

Baseline: original-only

Aspect extraction F1 (Laptop14, few-shot random)

Value: 58.13

Baseline: 47.41 (zero-shot)

ASGCN F1 without discriminator

Value: 70.99

Baseline: 77.71 (with discriminator)

Who Should Care

What To Try In 7 Days

Run IDG on your domain unlabeled corpus to generate ~1× training data and train a BERT-based ABSA model.

Enable few-shot examples for aspect extraction to raise aspect F1 quickly.

Include the discriminator (LLM-as-judge + score threshold) before training to avoid noisy samples harming performance.
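The discriminator step in the last item can be sketched as a small filter. This is an illustration under assumptions: `parse_score` and `filter_by_score` are hypothetical helpers, the judge is assumed to reply with free text containing a number, and samples scoring at or above the threshold are kept. The default of 6 mirrors the T=6 setting the authors report as best; the score scale and the direction of the comparison are assumptions here.

```python
import re
from typing import Callable, List, Optional, TypeVar

S = TypeVar("S")


def parse_score(reply: str) -> Optional[float]:
    """Pull the first number out of a judge reply; return None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else None


def filter_by_score(
    samples: List[S],
    judge: Callable[[S], str],   # hypothetical LLM-as-judge call returning free text
    threshold: float = 6.0,      # T=6 per the summary; tune for your data
) -> List[S]:
    """Keep samples whose parsed judge score is at or above the threshold."""
    kept = []
    for sample in samples:
        score = parse_score(judge(sample))
        # Replies without a parseable score are dropped rather than trusted.
        if score is not None and score >= threshold:
            kept.append(sample)
    return kept
```

Logging how many samples survive at each threshold helps spot overfiltering before it shrinks the training set too far.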

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Requires access to a high-quality LLM (authors use GPT‑3.5‑turbo); API cost and privacy may limit adoption.
  • Performance depends on accuracy of extracted aspects; gold aspects give a clear upper bound.
  • Filtering threshold T needs tuning (authors find T=6 best); overfiltering reduces effective data.
  • Evaluations are on SemEval restaurant/laptop datasets; cross-domain generalization needs more tests.

When Not To Use

  • You already have ample, high-quality labeled ABSA data; manual labels may be better.
  • LLM use is disallowed for privacy or compliance reasons.
  • You lack the compute or budget for repeated LLM calls for generation and self-reflection.

Failure Modes

  • LLM hallucination produces wrong aspect–polarity pairs that degrade training if not filtered.
  • Repetitive low-diversity outputs without iterative feedback reduce model gains.
  • Overly strict filtering removes too much data and hurts downstream learning.
  • Poor few-shot or domain demonstrations cause weak aspect extraction and noisy generation.

Core Entities

Models

  • GPT-3.5-turbo (LLM for generation and judging)
  • BERT-base-uncased (backbone for downstream ABSA)
  • ATAE-LSTM
  • ASGCN
  • BERT-SPC
  • R-GAT (used heavily in comparisons)
  • KGAN

Metrics

  • Accuracy
  • F1
  • Precision
  • Recall
  • Macro-F1
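For reference, macro-F1 averages the per-class F1 scores with equal weight, so a rare polarity class counts as much as a frequent one. A small pure-Python sketch follows. The `macro_f1` helper and the three-way label set are assumptions for illustration; in practice a library routine such as scikit-learn's `f1_score(..., average="macro")` would normally be used.

```python
from typing import Sequence


def macro_f1(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    labels: Sequence[str] = ("positive", "negative", "neutral"),  # assumed label set
) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no support and no predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

With gold labels `["positive", "positive", "negative", "neutral"]` and predictions `["positive", "negative", "negative", "neutral"]`, the per-class F1 scores are 2/3, 2/3, and 1, giving a macro-F1 of 7/9 (about 0.778).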

Datasets

  • Laptop14 (SemEval2014)
  • Restaurant14 (SemEval2014)
  • Restaurant15 (SemEval2015)
  • Restaurant16 (SemEval2016)

Benchmarks

  • SemEval 2014/2015/2016 ABSA benchmarks