RGPT: recurrent boosting of LLMs lifts text-classification accuracy by ~1% per benchmark

February 12, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

RGPT shows consistent small gains across four public datasets and ablations, but is limited by compute cost, a narrow dataset sweep, and potential overfitting from repeated fine-tuning.

Citations: 11

Evidence Strength: 0.75

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 50%

Novelty: 60%

Authors

Yazhou Zhang, Mengyao Wang, Chenyu Ren, Qiuchi Li, Prayag Tiwari, Benyou Wang, Jing Qin


Why It Matters For Business

If you use LLMs for classification tasks, boosting plus recurrent ensembling added roughly 1% absolute accuracy on the four benchmarks tested. That is useful for high-stakes labeling or automation where small gains pay off, but expect higher compute and training cost.

Summary TLDR

The paper introduces RGPT, a boosting-style framework that repeatedly fine-tunes large language models (LLMs) on reweighted hard training samples and recurrently ensembles the resulting learners. On four benchmarks (SST-2, MR, AG News, Ohsumed), RGPT yields consistent accuracy gains of roughly 0.9–1.9% per dataset over strong pre-trained language model (PLM) and LLM baselines, and it beats the average of three human annotators on a small mixed test set. Key trade-off: the accuracy gains come with higher compute from multiple fine-tuning rounds, and the dataset scope is limited.

Problem Statement

General LLMs and prompt methods are strong but inconsistent on standard text-classification tasks. The paper asks whether we can push classification performance further by building a specialized LLM via iterative reweighting and ensembling of fine-tuned base learners.

Main Contribution

RGPT: a boosting framework that repeatedly fine-tunes LLM base learners on reweighted training samples and ensembles them via recurrent prompts.

A recurrent ensembling scheme that feeds prior learners' predictions and error rates into later learners as context.
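
The reweighting idea can be illustrated with a minimal sketch. This summary does not give the paper's exact update rule, so the code below swaps in the classic AdaBoost (SAMME) update as an assumption; the function name `update_weights` and its parameters are hypothetical.

```python
import math

def update_weights(weights, correct, num_classes=2):
    """One AdaBoost (SAMME)-style round (an assumed stand-in, not RGPT's exact rule).

    weights: current per-sample weights.
    correct: booleans, True where the current learner classified the sample correctly.
    Returns (new_weights, error_rate, learner_weight).
    """
    total = sum(weights)
    # Weighted error rate of the current learner.
    err = sum(w for w, ok in zip(weights, correct) if not ok) / total
    # Clamp so the logarithm below stays finite for perfect or always-wrong learners.
    err = min(max(err, 1e-10), 1 - 1e-10)
    # SAMME learner weight; the second term is zero when num_classes == 2.
    alpha = math.log((1 - err) / err) + math.log(num_classes - 1)
    # Upweight misclassified samples, leave correct ones unchanged, then renormalize.
    raw = [w if ok else w * math.exp(alpha) for w, ok in zip(weights, correct)]
    norm = sum(raw)
    return [w / norm for w in raw], err, alpha
```

With four equally weighted samples and one mistake, the misclassified sample ends up carrying half of the total weight, so the next learner's fine-tuning focuses on it.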

Key Findings

RGPT improves accuracy over strong baselines on four standard datasets.

Numbers: SST-2 +0.88%; AG News +1.21%; Ohsumed +1.47%; MR +1.88%

Practical Use: If you can afford extra fine-tuning rounds, recurrent boosting can yield ~1% absolute accuracy gains on common text-classification benchmarks.

Evidence Ref: Table 2 (main results)

A small pool of learners (K = 7) yields near-best performance.

Numbers: Performance rose to 91.99% (avg) at K=7, with a plateau after 7–8 learners

Practical Use: Start with ~5–7 boosted learners to balance accuracy and compute before diminishing returns kick in.

Evidence Ref: Sec. 4.4 and Fig. 3
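
One way to act on the plateau finding is to stop adding learners once the marginal gain drops below a threshold. The sketch below is an assumption about how that might be done, not the paper's procedure; `pick_num_learners`, `min_gain`, and the accuracy values in the usage note are hypothetical and not taken from the paper.

```python
def pick_num_learners(acc_by_k, min_gain=0.1):
    """Return the smallest evaluated K after which the next step gains < min_gain points.

    acc_by_k: dict mapping number of learners K to validation accuracy (in %).
    """
    ks = sorted(acc_by_k)
    for prev, nxt in zip(ks, ks[1:]):
        # Stop at the first step whose improvement is below the threshold.
        if acc_by_k[nxt] - acc_by_k[prev] < min_gain:
            return prev
    # No plateau observed within the evaluated range.
    return ks[-1]
```

For example, with illustrative (made-up) validation accuracies `{1: 90.0, 3: 91.0, 5: 91.6, 7: 91.9, 9: 91.95}` and `min_gain=0.1`, the function returns 7, because going from 7 to 9 learners adds less than 0.1 points.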

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 98.68 ± 0.2% | ERNIE ~97.80% (best) / T5 97.50% / CARP 97.39% | +0.88% | SST-2 test | Table 2 reports RGPT 98.68% vs best baseline ~97.80% | Table 2
Accuracy | 97.61 ± 0.3% | CARP 96.40% | +1.21% | AG News test | Table 2 reports RGPT 97.61% vs CARP 96.40% | Table 2

What To Try In 7 Days

Fine-tune one LLM base on your labeled data and measure baseline accuracy.

Implement sample reweighting: upweight misclassified examples and fine-tune a second learner.

Ensemble 5–7 fine-tuned learners via the recurrent prompt trick (append each prior learner's prediction and error rate to the next learner's prompt) and compare accuracy and runtime.
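
The recurrent prompt step in the last item could be sketched as follows. The prompt wording, the function name `build_recurrent_prompt`, and the `history` structure are all hypothetical; this summary does not reproduce the paper's actual prompt template.

```python
def build_recurrent_prompt(text, history, labels):
    """Build a classification prompt that carries earlier learners' outputs as context.

    text: the input to classify.
    history: list of (predicted_label, error_rate) pairs from earlier learners, oldest first.
    labels: the allowed label names.
    """
    lines = [f"Classify the text into one of: {', '.join(labels)}."]
    for i, (pred, err) in enumerate(history, start=1):
        # Each earlier learner contributes its prediction and its training error rate.
        lines.append(f"Learner {i} predicted '{pred}' (training error rate {err:.2%}).")
    lines.append(f"Text: {text}")
    lines.append("Label:")
    return "\n".join(lines)
```

Each new learner would receive this prompt, and its own prediction and error rate would be appended to `history` before the next learner is queried. Comparing accuracy and runtime with and without the history lines gives a quick read on whether the recurrent context helps on your data.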

Agent Features

Tool Use
ChatGPT for synthetic sample generation
Frameworks
boosting, recurrent ensembling
Architectures
recurrently ensembled LLMs, boosting-style ensemble

Optimization Features

Infra Optimization
uses GPUs (8×A100 noted for training cost estimates)
Training Optimization
sample reweighting by error rate, iterative fine-tuning of base learners

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

High compute due to multiple fine-tuning rounds (iterative boosting).

Evaluated on four datasets only; generality to other domains is untested.

When Not To Use

When you lack GPU resources for multiple fine-tuning jobs.

When latency or inference cost must be minimal.

Failure Modes

Amplified bias or overfitting after repeated fine-tuning on narrow or synthetic samples.

Poor performance on highly imbalanced or fine-grained label sets (as seen in the Ohsumed errors).

Core Entities

Models

RGPT (this work), LLaMA 2, LLaMA 2-7B, LLaMA 2-13B, ChatGLM 2, GPT-4, Alpaca, RoBERTa, DeBERTa, T5, XLNet

Metrics

Accuracy, Macro-F1, Error rate, Annotation time (minutes)

Datasets

SST-2, MR, AG News, Ohsumed, IMDB, R8, DBPedia