Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
11
Why It Matters For Business
If you use LLMs for classification tasks, boosting combined with recurrent ensembling can add roughly 1 to 2 points of absolute accuracy (0.9–1.9% per dataset in the paper). That is useful for high-stakes labeling or automation where small gains pay off, but expect higher compute and training cost.
Summary TLDR
The paper introduces RGPT, a boosting-style framework that repeatedly fine-tunes and recurrently ensembles large language models (LLMs) by reweighting hard training samples. On four benchmarks (SST-2, MR, AG News, Ohsumed) RGPT yields consistent accuracy gains (roughly 0.9–1.9% per dataset) over many strong PLM and LLM baselines and beats the average of three human annotators on a small mixed test set. Key trade-offs: clear accuracy gains, at the price of higher compute from multiple fine-tuning rounds, and an evaluation limited to four datasets.
Problem Statement
General LLMs and prompt methods are strong but inconsistent on standard text-classification tasks. The paper asks whether we can push classification performance further by building a specialized LLM via iterative reweighting and ensembling of fine-tuned base learners.
Main Contribution
RGPT: a boosting framework that repeatedly fine-tunes LLM base learners on reweighted training samples and ensembles them via recurrent prompts.
A recurrent ensembling scheme that feeds prior learners' predictions and error rates into later learners as context.
Extensive experiments showing consistent accuracy gains over PLM- and prompt-based baselines across four benchmarks and a small human comparison.
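The reweighting step at the heart of a boosting framework can be sketched as follows. This is a minimal illustration that assumes a standard AdaBoost-style (SAMME) update; the summary does not state the paper's exact formula, and the function name `update_weights` and its parameters are hypothetical.

```python
import math

def update_weights(weights, correct, num_classes=2):
    """One AdaBoost-style (SAMME) reweighting round (assumed, not the paper's exact rule).

    weights: current per-sample weights, summing to 1
    correct: list of bools, True where the current learner was right
    Returns (new_weights, weighted_error, learner_weight).
    """
    # Weighted error rate of the current learner.
    err = sum(w for w, c in zip(weights, correct) if not c)
    # Clamp so the logarithm below stays finite when err is 0 or 1.
    err = min(max(err, 1e-10), 1 - 1e-10)
    # SAMME learner weight for a multi-class problem.
    alpha = math.log((1 - err) / err) + math.log(num_classes - 1)
    # Upweight misclassified samples, leave the rest, then renormalise.
    raw = [w if c else w * math.exp(alpha) for w, c in zip(weights, correct)]
    total = sum(raw)
    return [w / total for w in raw], err, alpha
```

The next learner would then be fine-tuned on the training set under the returned weights, so that samples the previous learner got wrong count for more.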
Key Findings
RGPT improves accuracy over strong baselines on four standard datasets.
A small pool of learners (K = 7) yields near-best performance.
Boosting the LLM contributes most to gains; recurrent ensembling adds additional improvement.
RGPT beats average human annotator accuracy on a small mixed test set.
Method generalizes across base models but is compute-heavy.
Results
[Charts: accuracy comparisons; plotted values not preserved]
Who Should Care
What To Try In 7 Days
Fine-tune one base LLM on your labeled data and measure baseline accuracy.
Implement sample reweighting: upweight misclassified examples and fine-tune a second learner.
Ensemble 5–7 fine-tuned learners via the recurrent prompt trick (append each prior learner's prediction and error rate to the next learner's prompt) and compare accuracy and runtime.
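The recurrent prompt trick from the last step might look like the sketch below. The prompt wording, the function name `build_recurrent_prompt`, and the `history` structure are all assumptions for illustration; the summary does not give the paper's actual template.

```python
def build_recurrent_prompt(text, labels, history):
    """Build a prompt for the next learner (hypothetical template).

    text:    the input to classify
    labels:  list of allowed label strings
    history: list of (predicted_label, error_rate) pairs from earlier learners
    """
    lines = [f"Classify the text into one of: {', '.join(labels)}."]
    # Feed each earlier learner's prediction and error rate in as context.
    for i, (pred, err) in enumerate(history, start=1):
        lines.append(f"Learner {i} predicted '{pred}' (error rate {err:.2f}).")
    lines.append(f"Text: {text}")
    lines.append("Label:")
    return "\n".join(lines)
```

With an empty `history` this reduces to a plain classification prompt, which is what the first learner would see.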
Agent Features
Tool Use
- ChatGPT for synthetic sample generation
Frameworks
- boosting
- recurrent ensembling
Architectures
- recurrently ensembled LLMs
- boosting-style ensemble
Optimization Features
Infra Optimization
- uses GPUs (8×A100 noted for training cost estimates)
Training Optimization
- sample reweighting by error rate
- iterative fine-tuning of base learners
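If the fine-tuning loop cannot take per-sample loss weights, one common workaround is to resample the training set in proportion to the weights. The sketch below assumes that approach; it is not confirmed as what the paper does, and `resample_by_weight` is a hypothetical helper.

```python
import random

def resample_by_weight(samples, weights, seed=0):
    """Draw a same-sized training set with replacement, biased toward heavy samples.

    samples: list of training examples
    weights: non-negative per-sample weights (need not sum to 1)
    seed:    fixed seed so a fine-tuning round is repeatable
    """
    rng = random.Random(seed)
    # random.choices samples with replacement, proportionally to the weights.
    return rng.choices(samples, weights=weights, k=len(samples))
```

Hard examples then appear several times in the resampled set, which approximates weighting their loss more heavily.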
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- High compute due to multiple fine-tuning rounds (iterative boosting).
- Evaluated on four datasets only; generality to other domains is untested.
- Base learners were largely homogeneous (LLaMA variants); heterogeneous ensembles not explored.
- Risk of overfitting when repeatedly fine-tuning on small datasets despite synthetic augmentation.
When Not To Use
- When you lack GPU resources for multiple fine-tuning jobs.
- When latency or inference cost must be minimal.
- For tiny labeled datasets where repeated fine-tuning may overfit.
Failure Modes
- Amplified bias or overfitting after repeated fine-tuning on narrow or synthetic samples.
- Poor performance on highly imbalanced or fine-grained label sets (seen in the Ohsumed errors).
- Diminishing returns as number of learners grows beyond ~7.
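Because returns diminish as learners are added, a simple stopping rule on held-out accuracy can limit the number of fine-tuning rounds. The sketch below is one possible rule, not something the paper prescribes; `pick_num_learners` and the `min_gain` threshold are hypothetical.

```python
def pick_num_learners(val_acc, min_gain=0.001):
    """Return the smallest ensemble size after which adding a learner stops paying off.

    val_acc:  val_acc[k - 1] is validation accuracy of the ensemble with k learners
    min_gain: minimum absolute accuracy gain required to justify one more learner
    """
    for k in range(1, len(val_acc)):
        # Gain from growing the ensemble from k to k + 1 learners.
        if val_acc[k] - val_acc[k - 1] < min_gain:
            return k
    # Every added learner cleared the threshold; keep them all.
    return len(val_acc)
```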
Core Entities
Models
- RGPT (this work)
- LLaMA 2
- LLaMA 2-7B
- LLaMA 2-13B
- ChatGLM 2
- GPT-4
- Alpaca
- RoBERTa
- DeBERTa
- T5
- XLNet
Metrics
- Accuracy
- Macro-F1
- Error rate
- Annotation time (minutes)
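For reference, accuracy and macro-F1 can be computed from label lists as in this dependency-free sketch. It follows the standard definitions (macro-F1 is the unweighted mean of per-class F1); the function names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Macro-F1 is the more informative of the two on imbalanced label sets, since every class counts equally regardless of its frequency.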
Datasets
- SST-2
- MR
- AG News
- Ohsumed
- IMDB
- R8
- DBPedia

