Automate and iteratively improve text prompts using a dual-LLM generator + corrector to reduce hallucinations

Overview

Decision Snapshot: Needs Validation

The method shows large gains on some benchmarks (strong evidence for GSM8K and TruthfulQA) but is variable across datasets, depends on generator quality and training split size, and has no public code release.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.75

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Jean Ghislain Billa, Min Oh, Liang Du

Links

Abstract / PDF / Data

Why It Matters For Business

SPT can raise task accuracy significantly without costly model fine-tuning; it offers a lower-barrier way to boost product QA and reduce hallucinations if you can afford extra API/computation and representative training data.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO

Summary TLDR

SPT (Supervisory Prompt Training) runs two LLM agents: a generator that answers tasks and a corrector that examines generator mistakes and iteratively writes better textual meta-prompts. The corrector also refines itself and can attach sentence-level "impact scores" that measure how much each sentence improves accuracy. On multiple-choice hallucination benchmarks SPT often raises accuracy substantially (e.g., GPT‑4 on GSM8K from 65.8% to 94.1%), but gains vary by dataset and depend on the generator's inherent capacity, training-data size, and compute budget.
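The generator/corrector loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation (no code has been released): the names `spt_loop`, `accuracy`, `toy_generator`, and `toy_corrector` are hypothetical, both agents are plain callables standing in for LLM API calls, and the corrector's self-refinement step is omitted for brevity.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]                        # (question, gold answer)
Generator = Callable[[str, str], str]            # (meta_prompt, question) -> answer
Corrector = Callable[[str, List[Example]], str]  # (meta_prompt, mistakes) -> new meta_prompt


def accuracy(generator: Generator, prompt: str, data: List[Example]) -> float:
    """Fraction of examples the generator answers correctly under `prompt`."""
    if not data:
        return 0.0
    return sum(generator(prompt, q) == a for q, a in data) / len(data)


def spt_loop(generator: Generator, corrector: Corrector, train: List[Example],
             seed_prompt: str, epochs: int = 3) -> Tuple[str, float]:
    """Assumed loop: collect mistakes, ask the corrector for a new prompt,
    and keep the candidate only if training accuracy improves."""
    best_prompt = seed_prompt
    best_acc = accuracy(generator, best_prompt, train)
    mistakes: List[Example] = []  # kept across epochs
    for _ in range(epochs):
        if best_acc == 1.0:
            break  # nothing left to correct on this split
        for q, a in train:
            if generator(best_prompt, q) != a and (q, a) not in mistakes:
                mistakes.append((q, a))
        candidate = corrector(best_prompt, mistakes)
        cand_acc = accuracy(generator, candidate, train)
        if cand_acc > best_acc:
            best_prompt, best_acc = candidate, cand_acc
    return best_prompt, best_acc


# Toy stand-ins for the two LLMs (assumptions, for illustration only).
def toy_generator(prompt: str, question: str) -> str:
    return "4" if "step by step" in prompt.lower() else "5"


def toy_corrector(prompt: str, mistakes: List[Example]) -> str:
    return prompt + " Think step by step."


final_prompt, final_acc = spt_loop(
    toy_generator, toy_corrector,
    train=[("2+2?", "4"), ("1+3?", "4")],
    seed_prompt="Answer the question.",
)
```

In a real setup the two callables would wrap model API calls, and the accept-if-better rule is only one possible selection criterion.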

Problem Statement

High-quality prompts matter for LLM outputs, but manual prompt engineering is costly and brittle. The authors ask: can two LLMs automatically generate and iteratively improve human-readable prompts to reduce hallucinations and improve accuracy without changing model weights?

Main Contribution

Introduce SPT: a dual-LLM loop (generator + corrector) that iteratively produces improved textual meta-prompts from training mistakes.

Add impact scores: a sentence-level measure of how much adding a sentence changes generator accuracy, used to guide the corrector.
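One way to read the impact-score idea is as an accuracy difference with and without a given sentence. The sketch below assumes that reading; the paper's exact formula is not reproduced here, and `impact_score`, `accuracy`, and `toy_generator` are hypothetical names.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]              # (question, gold answer)
Generator = Callable[[str, str], str]  # (meta_prompt, question) -> answer


def accuracy(generator: Generator, prompt: str, data: List[Example]) -> float:
    """Fraction of examples answered correctly under `prompt`."""
    if not data:
        return 0.0
    return sum(generator(prompt, q) == a for q, a in data) / len(data)


def impact_score(generator: Generator, base_prompt: str, sentence: str,
                 data: List[Example]) -> float:
    """Assumed definition: accuracy with the sentence appended minus
    accuracy with the base prompt alone."""
    with_sentence = f"{base_prompt} {sentence}".strip()
    return accuracy(generator, with_sentence, data) - accuracy(generator, base_prompt, data)


# Toy generator (assumption): only answers correctly when told to check units.
def toy_generator(prompt: str, question: str) -> str:
    return "10 m" if "units" in prompt.lower() else "10"


data = [("How far is 5 m + 5 m?", "10 m")]
scores = {
    s: impact_score(toy_generator, "Answer concisely.", s, data)
    for s in ["Always check units.", "Be polite."]
}
```

Sentences with a score near zero or below are candidates for removal; high-scoring ones are the ones worth keeping or reusing.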

Key Findings

SPT raised GPT‑4 accuracy on GSM8K from 65.8% to 94.1% on evaluated splits.

Numbers: 65.8% -> 94.1% (+28.3 pp); Table 2

Practical Use: For math-style multiple-choice tasks, iterative prompt training can yield very large accuracy gains without model retraining; try SPT with a strong generator (e.g., GPT‑4) on your dev set first.

Evidence Ref: Table 2

SPT improved truthfulness on TruthfulQA for GPT‑4 by 10.3 percentage points on the paper's split.

Numbers: 81.7% -> 92.0% (+10.3 pp); Table 2

Practical Use: Iterative prompt refinement can reduce hallucination-style errors on open-domain truthfulness benchmarks; use SPT when factual accuracy matters and you can supply a representative training split.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	94.1%	65.8%	+28.3 pp	GSM8K (GPT‑4 generator, SPT-pc)	Table 2 reports GPT‑4 baseline 65.8% and SPT-pc 94.1% on GSM8K	Table 2
Accuracy	92.0%	81.7%	+10.3 pp	TruthfulQA (GPT‑4 generator, SPT-pc)	Table 2 shows GPT‑4 baseline 81.7% and SPT-pc 92.0%	Table 2
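The Delta column is reported in percentage points (pp), which is an absolute difference and not a relative gain. As a quick check of the arithmetic, the short sketch below recomputes both quantities from the baseline and SPT values in the table; the function names are illustrative only, and the relative gain is derived here, not reported in the source.

```python
# (dataset, baseline accuracy %, SPT accuracy %) taken from the table above
rows = [
    ("GSM8K", 65.8, 94.1),
    ("TruthfulQA", 81.7, 92.0),
]


def delta_pp(baseline: float, value: float) -> float:
    """Absolute change in percentage points."""
    return round(value - baseline, 1)


def relative_gain_pct(baseline: float, value: float) -> float:
    """Change relative to the baseline, in percent."""
    return round((value - baseline) / baseline * 100, 1)


for name, baseline, value in rows:
    print(f"{name}: {delta_pp(baseline, value):+.1f} pp, "
          f"{relative_gain_pct(baseline, value):+.1f}% relative")
```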

What To Try In 7 Days

Run SPT-pc with a strong corrector (e.g., GPT‑4) on a small, representative dev set and compare accuracy to your current prompts.

Compute and inspect impact scores to identify high-value sentences to keep or reuse across tasks.

Validate resulting prompts on held-out examples to detect overfitting before production rollout.
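The held-out validation step can be sketched as a simple train/held-out comparison. This is an illustrative check under stated assumptions: the 20% held-out fraction and the 5-point gap threshold are arbitrary defaults chosen here, not values from the paper, and all function names are hypothetical.

```python
import random
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]              # (question, gold answer)
Generator = Callable[[str, str], str]  # (meta_prompt, question) -> answer


def split_examples(examples: List[Example], heldout_fraction: float = 0.2,
                   seed: int = 0) -> Tuple[List[Example], List[Example]]:
    """Shuffle deterministically and reserve a held-out slice."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_heldout = max(1, round(len(shuffled) * heldout_fraction))
    cut = len(shuffled) - n_heldout
    return shuffled[:cut], shuffled[cut:]


def accuracy(generator: Generator, prompt: str, data: List[Example]) -> float:
    if not data:
        return 0.0
    return sum(generator(prompt, q) == a for q, a in data) / len(data)


def overfit_report(generator: Generator, prompt: str, train: List[Example],
                   heldout: List[Example], max_gap: float = 0.05) -> Dict[str, object]:
    """Flag a prompt whose training accuracy exceeds held-out accuracy
    by more than `max_gap` (an assumed threshold)."""
    train_acc = accuracy(generator, prompt, train)
    heldout_acc = accuracy(generator, prompt, heldout)
    gap = train_acc - heldout_acc
    return {
        "train_acc": train_acc,
        "heldout_acc": heldout_acc,
        "gap": gap,
        "overfit_suspected": gap > max_gap,
    }
```

The split should be made before running SPT, so that the held-out examples never contribute mistakes to the corrector.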

Agent Features

Memory

keeps training mistakes across epochs

Tool Use

LLM-to-LLM feedback loop

Is Agentic

Yes

Architectures

dual-LLM loop (generator + corrector)

Collaboration

generator and corrector iteratively improve prompts

Optimization Features

Training Optimization

iterative prompt candidate selection on training mistakes

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

TruthfulQA, GSM8K, MMLU, MedQA-US

Risks & Boundaries

Limitations

Prompts can overfit the training split and not generalize to unseen data.

Requires substantial compute and API calls to iterate candidates and score them.
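To get a feel for the compute limitation before committing a budget, a back-of-the-envelope call count can help. The sketch below rests on a cost model assumed here, not taken from the paper: every candidate prompt is scored with one generator call per training example, and each candidate costs one corrector call to produce. The function name and parameters are hypothetical.

```python
from typing import Dict


def estimate_calls(epochs: int, candidates_per_epoch: int, n_train: int) -> Dict[str, int]:
    """Rough API-call estimate under the assumed cost model:
    one corrector call per candidate, one generator call per
    (candidate, training example) pair."""
    corrector_calls = epochs * candidates_per_epoch
    generator_calls = corrector_calls * n_train
    return {
        "generator_calls": generator_calls,
        "corrector_calls": corrector_calls,
        "total_calls": generator_calls + corrector_calls,
    }


# Example: 5 epochs, 4 candidates per epoch, 200 training examples
budget = estimate_calls(epochs=5, candidates_per_epoch=4, n_train=200)
```

Computing impact scores adds further generator calls per scored sentence, so real usage would likely exceed this estimate.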

When Not To Use

In high-stakes settings (e.g., medical or legal) without human review, because hallucinations may persist.

When compute or API budget is limited, since SPT is resource-intensive.

Failure Modes

Overfitting prompts to the semantics and question patterns of the training set.

Corrector producing suboptimal feedback if it itself is poorly prompted or weak.

Core Entities

Models

GPT-3.5-turbo (0314), GPT-4 (0314), Llama2-70b-chat

Metrics

Accuracy

Datasets

TruthfulQA, GSM8K, MMLU, MedQA-US

Benchmarks

TruthfulQA, GSM8K, MMLU, MedQA-US
