RGPT: recurrent boosting of LLMs lifts text-classification accuracy by ~0.9–1.9% per benchmark

February 12, 2024 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

11

Authors

Yazhou Zhang, Mengyao Wang, Chenyu Ren, Qiuchi Li, Prayag Tiwari, Benyou Wang, Jing Qin

Links

Abstract / PDF

Why It Matters For Business

If you use LLMs for classification tasks, boosting plus recurrent ensembling added roughly 0.9–1.9 points of absolute accuracy on the paper's four benchmarks. That is useful for high-stakes labeling or automation where small gains pay off, but expect higher compute and training cost.

Summary TLDR

The paper introduces RGPT, a boosting-style framework that repeatedly fine-tunes and recurrently ensembles large language models (LLMs) by reweighting hard training samples. On four benchmarks (SST-2, MR, AG News, Ohsumed) RGPT yields consistent accuracy gains (roughly 0.9–1.9% per dataset) over many strong PLM and LLM baselines, and it beats the average of three human annotators on a small mixed test set. Key trade-offs: the accuracy gains are consistent, but compute cost is higher because of the multiple fine-tuning rounds, and the evaluation covers only four datasets.

Problem Statement

General LLMs and prompt methods are strong but inconsistent on standard text-classification tasks. The paper asks whether we can push classification performance further by building a specialized LLM via iterative reweighting and ensembling of fine-tuned base learners.

Main Contribution

RGPT: a boosting framework that repeatedly fine-tunes LLM base learners on reweighted training samples and ensembles them via recurrent prompts.

A recurrent ensembling scheme that feeds prior learners' predictions and error rates into later learners as context.

Extensive zero-shot experiments showing consistent accuracy gains over PLM- and prompt-based baselines across four benchmarks and a small human comparison.
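The reweighting step behind the first contribution can be sketched with the textbook AdaBoost (SAMME) update. This is a minimal illustration of "upweight the samples the current learner got wrong", not the paper's exact formula; the function name `reweight` and its arguments are hypothetical.

```python
import math

def reweight(weights, mistakes, num_classes=2):
    """One SAMME-style boosting update (assumed here, not taken from the paper).

    weights:  current per-sample weights (any positive numbers)
    mistakes: True where the current learner misclassified the sample
    Returns (new normalized weights, weighted error rate, learner weight alpha).
    """
    total = sum(weights)
    err = sum(w for w, m in zip(weights, mistakes) if m) / total
    err = min(max(err, 1e-10), 1 - 1e-10)  # clamp to avoid log(0) and division by zero
    alpha = math.log((1 - err) / err) + math.log(num_classes - 1)
    # Misclassified samples are scaled up by exp(alpha); the rest keep their weight.
    scaled = [w * math.exp(alpha) if m else w for w, m in zip(weights, mistakes)]
    norm = sum(scaled)
    return [w / norm for w in scaled], err, alpha

new_weights, err, alpha = reweight([0.25] * 4, [True, False, False, False])
print(new_weights, err)  # the misclassified sample now carries half the total weight
```

The next base learner would then be fine-tuned with these weights, so samples that earlier learners found hard dominate its training signal.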

Key Findings

RGPT improves accuracy over strong baselines on four standard datasets.

Numbers: SST-2 +0.88%; AG News +1.21%; Ohsumed +1.47%; MR +1.88%

A small pool of learners (K = 7) yields near-best performance.

Numbers: Performance rose to 91.99% (avg) at K=7, then plateaued after 7–8 learners

Boosting the LLM contributes most to gains; recurrent ensembling adds additional improvement.

Numbers: Ablation on SST-2: without boosting, accuracy drops to 89.23% vs 98.68% for full RGPT; removing recurrent ensembling also reduces the gain relative to full RGPT

RGPT beats average human annotator accuracy on a small mixed test set.

Numbers: Humans avg 91.95% vs RGPT 92.54%; RGPT annotation time 10.9 min vs human avg 63.6 min

Method generalizes across base models but is compute-heavy.

Numbers: RGPT applied to RoBERTa, Alpaca, ChatGLM2, and LLaMA2 improved all four; training one base learner takes ~1 hour on 8×A100
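A back-of-the-envelope reading of the cost and speed figures above, as a rough sketch. It assumes every one of the K = 7 base learners costs the same ~1 hour on 8 GPUs and that they are trained one after another; inference and synthetic-data generation are not counted. The helper names are hypothetical.

```python
def boosting_gpu_hours(num_learners: int, hours_per_learner: float = 1.0, gpus: int = 8) -> float:
    """GPU-hours to fine-tune all base learners sequentially (training only)."""
    return num_learners * hours_per_learner * gpus

def speedup(human_minutes: float, model_minutes: float) -> float:
    """How many times faster the model annotated the test set than the humans did."""
    return human_minutes / model_minutes

print(boosting_gpu_hours(7))          # 56.0 A100 GPU-hours for K = 7
print(round(speedup(63.6, 10.9), 2))  # 5.83x faster than the human average
```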

Results

Accuracy (SST-2)

Value: 98.68 ± 0.2%

Baseline: ERNIE 97.80% / T5 97.50% / CARP 97.39%

Accuracy (AG News)

Value: 97.61 ± 0.3%

Baseline: CARP 96.40%

Accuracy (Ohsumed)

Value: 77.41 ± 0.2%

Baseline: DeBERTa 75.94%

Accuracy (MR)

Value: 94.27 ± 0.5%

Baseline: DeBERTa 90.21% / CARP 92.39%

Accuracy (human comparison, small mixed test set)

Value: RGPT 92.54%

Baseline: Human avg 91.95%
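The four benchmark rows above can be cross-checked against the per-dataset gains listed under Key Findings. In the sketch below, the pairing of each row with a dataset is an inference: rows are matched so that RGPT minus the strongest listed baseline reproduces the reported gain.

```python
# (RGPT accuracy, strongest listed baseline), both in percent, copied from the Results rows.
results = {
    "SST-2":   (98.68, 97.80),  # baseline: ERNIE
    "AG News": (97.61, 96.40),  # baseline: CARP
    "Ohsumed": (77.41, 75.94),  # baseline: DeBERTa
    "MR":      (94.27, 92.39),  # baseline: CARP
}

gains = {name: round(rgpt - base, 2) for name, (rgpt, base) in results.items()}
mean_gain = round(sum(gains.values()) / len(gains), 2)

print(gains)      # {'SST-2': 0.88, 'AG News': 1.21, 'Ohsumed': 1.47, 'MR': 1.88}
print(mean_gain)  # 1.36
```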

Who Should Care

What To Try In 7 Days

Fine-tune one LLM base on your labeled data and measure baseline accuracy.

Implement sample reweighting: upweight misclassified examples and fine-tune a second learner.

Ensemble 5–7 fine-tuned learners via the recurrent prompt trick (append each prior learner's prediction and error rate to the next learner's prompt), then compare accuracy and runtime against the single-model baseline.
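One possible shape for the recurrent prompt step is sketched below. The prompt wording, the `PriorOutput` type, and the choice to return the last learner's answer are assumptions for illustration. Each learner is represented as a plain callable that maps a prompt string to a label, so any fine-tuned model can be plugged in behind it.

```python
from dataclasses import dataclass

@dataclass
class PriorOutput:
    prediction: str    # label produced by an earlier learner
    error_rate: float  # that learner's error rate on the training data

def build_recurrent_prompt(text, labels, history):
    """Prepend every earlier learner's prediction and error rate as context."""
    lines = [f"Classify the text into one of: {', '.join(labels)}."]
    for i, prior in enumerate(history, start=1):
        lines.append(
            f"Learner {i} predicted '{prior.prediction}' "
            f"(training error rate {prior.error_rate:.2%})."
        )
    lines.append(f"Text: {text}")
    lines.append("Label:")
    return "\n".join(lines)

def recurrent_ensemble(text, labels, learners):
    """learners: list of (predict_fn, error_rate), in boosting order."""
    history = []
    for predict, error_rate in learners:
        prompt = build_recurrent_prompt(text, labels, history)
        history.append(PriorOutput(predict(prompt), error_rate))
    return history[-1].prediction  # assumption: the final learner's answer is the output
```

Because each learner waits on the one before it, inference latency grows linearly with the number of learners, which is worth measuring alongside accuracy.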

Agent Features

Tool Use

  • ChatGPT for synthetic sample generation

Frameworks

  • boosting
  • recurrent ensembling

Architectures

  • recurrently ensembled LLMs
  • boosting-style ensemble

Optimization Features

Infra Optimization

  • uses GPUs (8×A100 noted for training cost estimates)

Training Optimization

  • sample reweighting by error rate
  • iterative fine-tuning of base learners
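Sample weights have to reach the fine-tuning job somehow. One common option, assumed here because the summary does not say which mechanism the paper uses, is to resample the training set in proportion to the weights before each round. The function name `resample_by_weight` is hypothetical.

```python
import random

def resample_by_weight(examples, weights, k=None, seed=None):
    """Draw a training set for the next round, with replacement.

    Examples with larger weights (typically those misclassified so far)
    appear more often; examples with weight 0 are never drawn.
    """
    rng = random.Random(seed)
    k = len(examples) if k is None else k
    return rng.choices(examples, weights=weights, k=k)

# Illustrative toy data: the third review is weighted as a "hard" example.
train = ["review A", "review B", "review C"]
next_round = resample_by_weight(train, [0.2, 0.2, 0.6], k=10, seed=0)
print(next_round.count("review C"))
```

The alternative is to keep the dataset fixed and multiply each sample's loss by its weight, which avoids duplicated examples but requires a custom loss in the training loop.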

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • High compute due to multiple fine-tuning rounds (iterative boosting).
  • Evaluated on four datasets only; generality to other domains is untested.
  • Base learners were largely homogeneous (LLaMA variants); heterogeneous ensembles not explored.
  • Risk of overfitting when repeatedly fine-tuning on small datasets despite synthetic augmentation.

When Not To Use

  • When you lack GPU resources for multiple fine-tuning jobs.
  • When latency or inference cost must be minimal.
  • For tiny labeled datasets where repeated fine-tuning may overfit.

Failure Modes

  • Amplified bias or overfitting after repeated fine-tuning on narrow or synthetic samples.
  • Poor performance on highly imbalanced or fine-grained label sets (Ohsumed errors).
  • Diminishing returns as number of learners grows beyond ~7.
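Diminishing returns suggest stopping early rather than fixing the number of learners in advance. A minimal stopping rule, assuming a held-out validation accuracy is recorded after each learner is added; the function name, `min_gain`, and `patience` are illustrative choices, not values from the paper.

```python
def pick_num_learners(val_accuracies, min_gain=0.001, patience=1):
    """Return how many learners to keep.

    val_accuracies[i] is the ensemble's validation accuracy with i + 1 learners.
    Stops once `patience` consecutive additions each fail to beat the best
    accuracy so far by at least `min_gain`.
    """
    if not val_accuracies:
        raise ValueError("need at least one validation accuracy")
    best_k, best_acc, stale = 1, val_accuracies[0], 0
    for k, acc in enumerate(val_accuracies[1:], start=2):
        if acc - best_acc >= min_gain:
            best_k, best_acc, stale = k, acc, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_k
```

A larger `patience` tolerates a noisy round at the cost of extra fine-tuning runs.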

Core Entities

Models

  • RGPT (this work)
  • LLaMA 2
  • LLaMA 2-7B
  • LLaMA 2-13B
  • ChatGLM 2
  • GPT-4
  • Alpaca
  • RoBERTa
  • DeBERTa
  • T5
  • XLNet

Metrics

  • Accuracy
  • Macro-F1
  • Error rate
  • Annotation time (minutes)

Datasets

  • SST-2
  • MR
  • AG News
  • Ohsumed
  • IMDB
  • R8
  • DBPedia