RGPT: recurrent boosting of LLMs lifts text-classification accuracy by ~1% per benchmark

Overview

Decision Snapshot: Ready For Pilot

RGPT shows consistent small gains across four public datasets and ablations, but is limited by compute cost, a narrow dataset sweep, and potential overfitting from repeated fine-tuning.

Citations: 11

Evidence Strength: 0.75

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 50%

Novelty: 60%

Authors

Yazhou Zhang, Mengyao Wang, Chenyu Ren, Qiuchi Li, Prayag Tiwari, Benyou Wang, Jing Qin

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you use LLMs for classification tasks, boosting plus recurrent ensembling added roughly 1% absolute accuracy on the four benchmarks tested. That is useful for high-stakes labeling or automation where small gains pay off, but expect higher compute and training cost.

Who Should Care

ML Engineer, Product Manager, CTO

Summary TLDR

The paper introduces RGPT, a boosting-style framework that repeatedly fine-tunes and recurrently ensembles large language models (LLMs) by reweighting hard training samples. On four benchmarks (SST-2, MR, AG News, Ohsumed) RGPT yields consistent accuracy gains (roughly 0.9–1.9% per dataset) over many strong PLM and LLM baselines and beats the average of three human annotators on a small mixed test set. Key trade-offs: the accuracy gains are consistent, but they come with higher compute from multiple fine-tuning rounds, and the dataset scope is limited.

Problem Statement

General LLMs and prompt methods are strong but inconsistent on standard text-classification tasks. The paper asks whether we can push classification performance further by building a specialized LLM via iterative reweighting and ensembling of fine-tuned base learners.

Main Contribution

RGPT: a boosting framework that repeatedly fine-tunes LLM base learners on reweighted training samples and ensembles them via recurrent prompts.

A recurrent ensembling scheme that feeds prior learners' predictions and error rates into later learners as context.
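The recurrent ensembling idea above can be illustrated with a small prompt builder. This is a minimal sketch, not the paper's actual template: the names `PriorOutput` and `build_recurrent_prompt`, and the wording of the prompt, are hypothetical and chosen only to show how earlier learners' predictions and error rates might be passed to a later learner as context.

```python
from dataclasses import dataclass


@dataclass
class PriorOutput:
    """Output of an earlier learner (hypothetical structure)."""
    prediction: str    # label predicted by the earlier learner
    error_rate: float  # that learner's error rate on the training set


def build_recurrent_prompt(text: str, labels: list[str],
                           history: list[PriorOutput]) -> str:
    """Assemble a prompt that carries prior predictions and error rates.

    The prompt wording is an assumption for illustration only.
    """
    lines = [f"Classify the text into one of: {', '.join(labels)}."]
    for i, prior in enumerate(history, start=1):
        lines.append(
            f"Learner {i} predicted '{prior.prediction}' "
            f"(training error rate {prior.error_rate:.2f})."
        )
    lines.append(f"Text: {text}")
    lines.append("Label:")
    return "\n".join(lines)


prompt = build_recurrent_prompt(
    "A warm, funny, and quietly moving film.",
    ["positive", "negative"],
    [PriorOutput("positive", 0.12)],
)
print(prompt)
```

Each later learner would see one extra "Learner i predicted ..." line per earlier learner, so the prompt grows linearly with the number of learners.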

Key Findings

RGPT improves accuracy over strong baselines on four standard datasets.

Numbers: SST-2 +0.88%; AG News +1.21%; Ohsumed +1.47%; MR +1.88%

Practical Use: If you can afford extra fine-tuning rounds, recurrent boosting can yield ~1% absolute accuracy gains on common text-classification benchmarks.

Evidence Ref: Table 2 (main results)

A small pool of learners (K = 7) yields near-best performance.

Numbers: Performance rose to 91.99% (avg) at K=7, then plateaued after 7–8 learners

Practical Use: Start with ~5–7 boosted learners to balance accuracy and compute before diminishing returns kick in.

Evidence Ref: Sec. 4.4 and Fig. 3
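One way to pick K on your own data is to stop at the first ensemble size where adding more learners no longer helps by a chosen margin. The sketch below assumes you have already measured validation accuracy at several values of K. The helper name `smallest_k_at_plateau`, the `min_gain` threshold, and the accuracy values in the example are all hypothetical and are not taken from the paper.

```python
def smallest_k_at_plateau(acc_by_k: dict[int, float], min_gain: float = 0.1) -> int:
    """Return the smallest K after which the next step gains less than min_gain.

    acc_by_k maps ensemble size K to validation accuracy in percent.
    If accuracy never flattens, the largest measured K is returned.
    """
    ks = sorted(acc_by_k)
    for prev_k, next_k in zip(ks, ks[1:]):
        if acc_by_k[next_k] - acc_by_k[prev_k] < min_gain:
            return prev_k
    return ks[-1]


# Made-up validation accuracies for illustration only.
example_curve = {1: 90.0, 3: 91.0, 5: 91.6, 7: 91.9, 9: 91.95}
chosen_k = smallest_k_at_plateau(example_curve, min_gain=0.1)
print(chosen_k)
```

A larger `min_gain` favors smaller, cheaper ensembles; a smaller one keeps adding learners for marginal accuracy.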

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	98.68 ± 0.2%	ERNIE 97.80% (best baseline); T5 97.50%; CARP 97.39%	+0.88%	SST-2 test	Table 2 reports RGPT 98.68% vs best baseline ~97.80%	Table 2
Accuracy	97.61 ± 0.3%	CARP 96.40%	+1.21%	AG News test	Table 2 reports RGPT 97.61% vs CARP 96.40%	Table 2
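The Delta column is the absolute difference between RGPT and the best listed baseline, in percentage points. The short check below recomputes it from the two rows above and compares each delta with the reported ± spread on the RGPT score. The dictionary layout and helper name are illustrative, and comparing a delta with a one-sided spread is only a rough sanity check, not a significance test.

```python
# Values copied from the two table rows above (accuracy in percent).
results = {
    "SST-2": {"rgpt": 98.68, "baseline": 97.80, "spread": 0.2},
    "AG News": {"rgpt": 97.61, "baseline": 96.40, "spread": 0.3},
}


def delta_points(row: dict) -> float:
    """Absolute accuracy gain in percentage points, rounded to 2 decimals."""
    return round(row["rgpt"] - row["baseline"], 2)


deltas = {name: delta_points(row) for name, row in results.items()}

# Rough check: is the gain larger than the reported spread on RGPT's score?
exceeds_spread = {
    name: deltas[name] > row["spread"] for name, row in results.items()
}

for name in results:
    print(f"{name}: +{deltas[name]:.2f} points, exceeds spread: {exceeds_spread[name]}")
```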

What To Try In 7 Days

Fine-tune one LLM base on your labeled data and measure baseline accuracy.

Implement sample reweighting: upweight misclassified examples and fine-tune a second learner.

Ensemble 5–7 fine-tuned learners via the recurrent prompt trick (append prior prediction+error) and compare accuracy and runtime.
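The reweighting step in the list above can be prototyped before any fine-tuning is wired in. The sketch below uses the standard multi-class AdaBoost (SAMME) update as a stand-in; the source does not spell out RGPT's exact weighting rule, so treat this as an assumption. The function name `reweight` is hypothetical, and `correct` would come from evaluating your current fine-tuned learner on the training set.

```python
import math


def reweight(weights: list[float], correct: list[bool], n_classes: int = 2):
    """One SAMME-style boosting update (stand-in, not RGPT's confirmed rule).

    weights: current per-sample weights.
    correct: whether the current learner classified each sample correctly.
    Returns (new normalized weights, weighted error rate, learner weight alpha).
    """
    total = sum(weights)
    err = sum(w for w, ok in zip(weights, correct) if not ok) / total
    # Clamp to avoid log(0) when the learner is perfect or always wrong.
    err = min(max(err, 1e-10), 1 - 1e-10)
    alpha = math.log((1 - err) / err) + math.log(n_classes - 1)
    # Upweight only the misclassified samples, then renormalize.
    raised = [w if ok else w * math.exp(alpha) for w, ok in zip(weights, correct)]
    norm = sum(raised)
    return [w / norm for w in raised], err, alpha


# Four samples, uniform weights, the last one misclassified.
new_weights, err, alpha = reweight([0.25] * 4, [True, True, True, False])
print(new_weights, err, alpha)
```

In this example the single misclassified sample ends up carrying half of the total weight, so the next learner's fine-tuning run would focus on it, for instance by sampling training examples in proportion to `new_weights`.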

Agent Features

Tool Use

ChatGPT for synthetic sample generation

Frameworks

boosting, recurrent ensembling

Architectures

recurrently ensembled LLMs, boosting-style ensemble

Optimization Features

Infra Optimization

uses GPUs (8×A100 noted for training cost estimates)

Training Optimization

sample reweighting by error rate, iterative fine-tuning of base learners

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/annoymity2024/RGPT_2024

Data URLs

http://davis.wpi.edu/xmdv/datasets/ohsumed.html

Risks & Boundaries

Limitations

High compute due to multiple fine-tuning rounds (iterative boosting).

Evaluated on four datasets only; generality to other domains is untested.

When Not To Use

When you lack GPU resources for multiple fine-tuning jobs.

When latency or inference cost must be minimal.

Failure Modes

Amplified bias or overfitting after repeated fine-tuning on narrow or synthetic samples.

Poor performance on highly imbalanced or fine-grained label sets (as seen in the Ohsumed errors).

Core Entities

Models

RGPT (this work), LLaMA 2, LLaMA 2-7B, LLaMA 2-13B, ChatGLM 2, GPT-4, Alpaca, RoBERTa, DeBERTa, T5, XLNet

Metrics

Accuracy, Macro-F1, Error rate, Annotation time (minutes)

Datasets

SST-2, MR, AG News, Ohsumed, IMDB, R8, DBPedia
