Iteratively prompt an LLM to produce filtered, diverse ABSA training data that rivals manual labels

Overview

Decision Snapshot: Needs Validation

The method is experimentally validated across four standard ABSA datasets with multiple baselines, but it depends on a closed LLM API and on aspect-extraction quality.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 65%

Production readiness: 60%

Novelty: 45%

Authors

Qihuang Zhong, Haiyun Li, Luyao Zhuang, Juhua Liu, Bo Du

Why It Matters For Business

IDG can produce usable labeled ABSA data from unlabeled text, lowering annotation cost and quickly bootstrapping sentiment models in new domains.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

The paper presents IDG, a three-stage pipeline that uses an LLM (GPT‑3.5‑turbo) to extract domain aspects from unlabeled text, expand them, generate single- and multi-aspect sentence-aspect-polarity triplets via iterative prompting, and filter outputs with an LLM-based discriminator. On four SemEval ABSA benchmarks, synthetic data from IDG matches or improves performance of five baseline ABSA models. Key wins: generated-only training often approaches manual labels; mixing generated + original data yields consistent gains (up to +4.01% F1); discriminator and multi-aspect generation materially help. The method requires access to an LLM and careful aspect extraction and filtering.
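The three stages summarized above can be sketched as a small skeleton. This is a minimal sketch under stated assumptions, not the authors' implementation: the `LLM` callable, the `Sample` type, the prompt wording, and the `fake_llm` stand-in are all hypothetical, and aspect expansion, multi-aspect generation, and the iterative feedback loop are left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical interface: any function that maps a prompt to the model's reply.
LLM = Callable[[str], str]


@dataclass(frozen=True)
class Sample:
    sentence: str
    aspect: str
    polarity: str  # "positive", "negative", or "neutral"


def extract_aspects(llm: LLM, corpus: Sequence[str]) -> List[str]:
    """Stage 1: collect unique aspect terms from unlabeled sentences."""
    aspects: List[str] = []
    for text in corpus:
        reply = llm(f"List the aspect terms in this review, comma-separated:\n{text}")
        for term in reply.split(","):
            term = term.strip().lower()
            if term and term not in aspects:
                aspects.append(term)
    return aspects


def generate(llm: LLM, aspects: Sequence[str]) -> List[Sample]:
    """Stage 2: request one sentence per (aspect, polarity) pair."""
    samples = []
    for aspect in aspects:
        for polarity in ("positive", "negative", "neutral"):
            sentence = llm(f"Write one review sentence that is {polarity} about '{aspect}'.")
            samples.append(Sample(sentence.strip(), aspect, polarity))
    return samples


def judge_accepts(llm: LLM, sample: Sample) -> bool:
    """Stage 3: keep a sample only if the LLM judge answers yes."""
    verdict = llm(
        f"Does the sentence express {sample.polarity} sentiment toward "
        f"'{sample.aspect}'? Answer yes or no.\nSentence: {sample.sentence}"
    )
    return verdict.strip().lower().startswith("yes")


def run_pipeline(llm: LLM, corpus: Sequence[str]) -> List[Sample]:
    aspects = extract_aspects(llm, corpus)
    candidates = generate(llm, aspects)
    return [s for s in candidates if judge_accepts(llm, s)]


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in so the skeleton runs without an API key."""
    if prompt.startswith("List"):
        return "battery, screen"
    if prompt.startswith("Write"):
        return "Generated sentence."
    return "yes" if "express positive" in prompt else "no"
```

Passing the model as a plain callable keeps the skeleton independent of any particular API client, so `fake_llm` can be swapped for a real client wrapper without touching the stages.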

Problem Statement

Aspect-based sentiment models need many labeled sentence–aspect–polarity examples, but manual annotation is expensive. Existing augmentation methods either tweak words or paraphrase, and they still suffer from poor fluency, low diversity, or a need for labeled seeds. Prompting LLMs directly is promising but leads to hallucinations and low-quality pseudo labels. The goal is to produce diverse, fluent, high-quality ABSA training data from an unlabeled corpus using LLMs while controlling hallucination.

Main Contribution

IDG: a three-stage, iterative LLM pipeline (aspect extraction/extension, iterative generation, LLM-based evaluation/filtering) to produce pseudo-labeled ABSA data from unlabeled text.

A self-reflection discriminator that uses the LLM as a judge plus automatic scoring to remove low-quality outputs.
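The judge-plus-score idea can be illustrated with a simple threshold filter. This is a sketch only: the 1 to 10 scale, the default threshold of 6.0, the prompt wording, and the `judge` callable are assumptions made for illustration, since the summary does not state the paper's actual scoring rubric.

```python
import re
from typing import Callable, List, Optional, Sequence, Tuple

Triplet = Tuple[str, str, str]  # (sentence, aspect, polarity)


def parse_score(reply: str) -> Optional[float]:
    """Return the first number found in the judge's reply, or None."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else None


def filter_by_judge(
    judge: Callable[[str], str],
    samples: Sequence[Triplet],
    threshold: float = 6.0,  # hypothetical cut-off, tune on held-out data
) -> List[Triplet]:
    """Keep samples whose judge score reaches the threshold."""
    kept = []
    for sentence, aspect, polarity in samples:
        prompt = (
            "Rate from 1 to 10 how well this sentence expresses "
            f"{polarity} sentiment toward '{aspect}'.\nSentence: {sentence}"
        )
        score = parse_score(judge(prompt))
        # Unparseable replies are treated as failures and dropped.
        if score is not None and score >= threshold:
            kept.append((sentence, aspect, polarity))
    return kept
```

Dropping samples with unparseable scores is a conservative choice: it trades some recall for a lower chance of letting noisy labels into training.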

Key Findings

IDG-generated data can match or exceed manual training data on ABSA models.

Numbers: R-GAT F1 on Laptop14 73.92→76.18 (+2.26); on Rest14 80.74→82.04 (+1.30)

Practical Use: You can train ABSA models without labeled data in some domains and expect near-manual performance; use IDG when annotations are costly.

Evidence Ref: Table IV; Table II

Mixing IDG synthetic data with original labeled data consistently improves models.

Numbers: Up to +4.01% F1 when mixing generated + original data on evaluated models

Practical Use: If you have some labeled data, add IDG data to get a reliable boost; it's low-cost augmentation.

Evidence Ref: Section IV-B, Table II
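Mixing synthetic and original data amounts to building one combined training set. The sketch below shows one way to do that; the `ratio` parameter, the fixed seed, and the function name are hypothetical, and the summary does not say what mixing proportion the paper used.

```python
import random
from typing import List, Sequence, Tuple

Triplet = Tuple[str, str, str]  # (sentence, aspect, polarity)


def mix_training_data(
    original: Sequence[Triplet],
    generated: Sequence[Triplet],
    ratio: float = 1.0,  # synthetic samples per original sample (assumption)
    seed: int = 0,
) -> List[Triplet]:
    """Combine all original samples with a random subset of generated ones."""
    rng = random.Random(seed)
    n_synthetic = min(len(generated), int(ratio * len(original)))
    sampled = rng.sample(list(generated), n_synthetic)
    mixed = list(original) + sampled
    rng.shuffle(mixed)  # avoid blocks of purely synthetic data within an epoch
    return mixed
```

Treating `ratio` as a hyperparameter and sweeping a few values against a held-out labeled set is a reasonable way to find how much synthetic data helps in a given domain.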

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
R-GAT Accuracy	80.25	78.37 (R-GAT base)	+1.88	Laptop14 (Generated data)	Table IV: R-GAT + IDG Acc 80.25 vs baseline 78.37	Table IV
R-GAT F1	76.18	73.92 (R-GAT base)	+2.26	Laptop14 (Generated data)	Table IV: R-GAT + IDG F1 76.18 vs baseline 73.92	Table IV
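The reported deltas are simple differences between the IDG result and the baseline, which can be checked directly. The snippet below recomputes them with exact decimal arithmetic; the Rest14 row uses the figures quoted under Key Findings rather than the table, and the row labels are just descriptive names chosen here.

```python
from decimal import Decimal

# (label, value with IDG data, baseline), copied from the figures above.
reported = [
    ("R-GAT Accuracy, Laptop14", "80.25", "78.37"),
    ("R-GAT F1, Laptop14", "76.18", "73.92"),
    ("R-GAT F1, Rest14", "82.04", "80.74"),
]


def delta(value: str, baseline: str) -> Decimal:
    """Exact difference between two scores given as decimal strings."""
    return Decimal(value) - Decimal(baseline)


deltas = {name: delta(value, baseline) for name, value, baseline in reported}

for name, d in deltas.items():
    sign = "+" if d >= 0 else ""
    print(f"{name}: {sign}{d}")
```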

What To Try In 7 Days

Run IDG on an unlabeled corpus from your domain to generate ~1× training data, then train a BERT-based ABSA model on it.

Enable few-shot examples for aspect extraction to raise aspect F1 quickly.

Include the discriminator (LLM-as-judge + score threshold) before training to avoid noisy samples harming performance.
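For the few-shot aspect-extraction step, the prompt can be assembled from a handful of worked examples. This is a minimal sketch: the prompt template, the example sentences, and both function names are hypothetical and would need adapting to the target domain and LLM.

```python
from typing import List, Sequence, Tuple

FewShot = Tuple[str, List[str]]  # (sentence, its aspect terms)


def build_aspect_prompt(text: str, examples: Sequence[FewShot]) -> str:
    """Build a few-shot prompt that ends where the model should answer."""
    lines = ["Extract the aspect terms from each review sentence."]
    for sentence, aspects in examples:
        lines.append(f"Sentence: {sentence}")
        lines.append(f"Aspects: {', '.join(aspects)}")
    lines.append(f"Sentence: {text}")
    lines.append("Aspects:")
    return "\n".join(lines)


def parse_aspects(reply: str) -> List[str]:
    """Split a comma-separated reply into normalized aspect terms."""
    return [a.strip().lower() for a in reply.split(",") if a.strip()]


# Hypothetical in-domain examples; replace with a few hand-checked sentences.
EXAMPLES: List[FewShot] = [
    ("The battery life is great but the keyboard feels cheap.", ["battery life", "keyboard"]),
    ("Shipping was slow.", ["shipping"]),
]
```

Calling `build_aspect_prompt("The screen is too dim.", EXAMPLES)` yields a prompt whose last line is `Aspects:`, so the model's completion can be passed straight to `parse_aspects`.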

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Requires access to a high-quality LLM (authors use GPT‑3.5‑turbo); API cost and privacy may limit adoption.

Performance depends on accuracy of extracted aspects; gold aspects give a clear upper bound.
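Because downstream quality hinges on the extracted aspects, it helps to measure extraction against a small gold set before generating data. The sketch below computes set-level precision, recall, and F1; exact matching after lowercasing is an assumption made here, and the paper's own matching criterion may differ.

```python
from typing import Iterable, Tuple


def aspect_prf(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 of extracted aspects against gold aspects."""
    pred_set = {a.strip().lower() for a in predicted}
    gold_set = {a.strip().lower() for a in gold}
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, predicting `battery`, `screen`, and `price` against gold aspects `battery`, `screen`, `keyboard`, and `fan` gives two true positives, so precision is 2/3, recall is 1/2, and F1 is 4/7.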

When Not To Use

You already have ample, high-quality labeled ABSA data; manual labels may be better.

When LLM use is disallowed for privacy or compliance reasons.

Failure Modes

LLM hallucination produces wrong aspect–polarity pairs that degrade training if not filtered.

Without iterative feedback, outputs become repetitive and low in diversity, which reduces model gains.
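One inexpensive guard against repetitive outputs is to drop near-duplicate sentences before training. The sketch below uses token-level Jaccard similarity with a hypothetical threshold of 0.8; it is a generic heuristic swapped in for illustration, not the diversity mechanism described in the paper.

```python
from typing import List, Sequence


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two sentences, in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def drop_near_duplicates(sentences: Sequence[str], max_similarity: float = 0.8) -> List[str]:
    """Keep a sentence only if it is not too similar to any already kept one."""
    kept: List[str] = []
    for sentence in sentences:
        if all(jaccard(sentence, other) < max_similarity for other in kept):
            kept.append(sentence)
    return kept
```

The pairwise comparison is quadratic in the number of kept sentences, which is fine for a few thousand samples; larger sets would call for hashing-based methods.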

Core Entities

Models

GPT-3.5-turbo (LLM for generation and judging), BERT-base-uncased (backbone for downstream ABSA), ATAE-LSTM, ASGCN, BERT-SPC, R-GAT (used heavily in comparisons), KGAN

Metrics

Accuracy, F1, Precision, Recall, Macro-F1
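Accuracy and macro-F1 over the polarity classes can be computed from scratch as follows. This is an illustrative sketch: the three-class label set and the choice to score a class with no predictions and no gold instances as 0.0 are assumptions, not details taken from the paper.

```python
from typing import Sequence

LABELS = ("positive", "negative", "neutral")


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the gold polarity."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str] = LABELS) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never appears.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)
```

Macro-F1 weights each polarity equally, so it penalizes a model that ignores the usually rare neutral class more than accuracy does.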

Datasets

Laptop14 (SemEval 2014), Restaurant14 (SemEval 2014), Restaurant15 (SemEval 2015), Restaurant16 (SemEval 2016)

Benchmarks

SemEval 2014/2015/2016 ABSA benchmarks

