Automate and iteratively improve text prompts using a dual-LLM generator + corrector to reduce hallucinations

March 26, 2024 · 6 min

Overview

Decision Snapshot: Needs Validation

The method shows large gains on some benchmarks (strong evidence for GSM8K and TruthfulQA) but is variable across datasets, depends on generator quality and training split size, and has no public code release.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.75

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Jean Ghislain Billa, Min Oh, Liang Du

Links

Abstract / PDF / Data

Why It Matters For Business

SPT can raise task accuracy significantly without costly model fine-tuning. It offers a lower-barrier way to boost product QA and reduce hallucinations, provided you can afford the extra API calls and compute and can supply representative training data.

Summary TLDR

SPT (Supervisory Prompt Training) runs two LLM agents: a generator that answers tasks and a corrector that examines generator mistakes and iteratively writes better textual meta-prompts. The corrector also refines itself and can attach sentence-level "impact scores" that measure how much each sentence improves accuracy. On multiple-choice hallucination benchmarks SPT often raises accuracy substantially (e.g., GPT‑4 on GSM8K from 65.8% to 94.1%), but gains vary by dataset and depend on the generator's inherent capacity, training-data size, and compute budget.
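The generator/corrector loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `spt_loop`, `accuracy`, `toy_generator`, and `toy_corrector` are hypothetical names, the two models are plain Python callables standing in for LLM calls, a candidate prompt is kept only if it raises training accuracy, and the corrector's self-refinement step is omitted.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]              # (question, correct answer)
Generator = Callable[[str, str], str]  # (meta_prompt, question) -> answer
Corrector = Callable[[str, List[Example]], str]  # (meta_prompt, mistakes) -> new prompt


def accuracy(generator: Generator, prompt: str, data: List[Example]) -> float:
    """Fraction of examples the generator answers correctly under `prompt`."""
    if not data:
        return 0.0
    return sum(generator(prompt, q) == a for q, a in data) / len(data)


def spt_loop(generator: Generator, corrector: Corrector, train: List[Example],
             initial_prompt: str = "", epochs: int = 3) -> Tuple[str, float]:
    """Iteratively ask the corrector for a better meta-prompt (assumed selection rule)."""
    best_prompt = initial_prompt
    best_acc = accuracy(generator, best_prompt, train)
    for _ in range(epochs):
        # Collect the examples the generator currently gets wrong.
        mistakes = [(q, a) for q, a in train if generator(best_prompt, q) != a]
        if not mistakes:
            break
        # The corrector reads the mistakes and proposes a revised prompt.
        candidate = corrector(best_prompt, mistakes)
        cand_acc = accuracy(generator, candidate, train)
        # Assumption: keep the candidate only if training accuracy improves.
        if cand_acc > best_acc:
            best_prompt, best_acc = candidate, cand_acc
    return best_prompt, best_acc


# Toy stand-ins for real LLM calls, used only to make the sketch runnable.
def toy_generator(prompt: str, question: str) -> str:
    return "4" if "step by step" in prompt else "5"


def toy_corrector(prompt: str, mistakes: List[Example]) -> str:
    return (prompt + " Work step by step.").strip()


if __name__ == "__main__":
    print(spt_loop(toy_generator, toy_corrector, [("2+2?", "4")]))
```

In practice the two callables would wrap API calls to the chosen models, and every accuracy evaluation costs one generator call per training example, which is where the compute budget mentioned above goes.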

Problem Statement

High-quality prompts matter for LLM outputs, but manual prompt engineering is costly and brittle. The authors ask: can two LLMs automatically generate and iteratively improve human-readable prompts to reduce hallucinations and improve accuracy without changing model weights?

Main Contribution

Introduce SPT: a dual-LLM loop (generator + corrector) that iteratively produces improved textual meta-prompts from training mistakes.

Add impact scores: a sentence-level measure of how much adding a sentence changes generator accuracy, used to guide the corrector.
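One way to read the impact score is as an accuracy difference: score the generator with and without the sentence and subtract. The sketch below assumes that reading; the paper's exact formula is not given in this summary, and `impact_score`, `accuracy`, and `toy_generator` are hypothetical names.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]              # (question, correct answer)
Generator = Callable[[str, str], str]  # (meta_prompt, question) -> answer


def accuracy(generator: Generator, prompt: str, data: List[Example]) -> float:
    """Fraction of examples answered correctly under `prompt`."""
    if not data:
        return 0.0
    return sum(generator(prompt, q) == a for q, a in data) / len(data)


def impact_score(generator: Generator, base_prompt: str, sentence: str,
                 data: List[Example]) -> float:
    """Assumed definition: accuracy with the sentence minus accuracy without it."""
    with_sentence = (base_prompt + " " + sentence).strip()
    return accuracy(generator, with_sentence, data) - accuracy(generator, base_prompt, data)


# Toy generator: always solves the easy item, solves the hard one only with a hint.
def toy_generator(prompt: str, question: str) -> str:
    if question == "easy":
        return "yes"
    return "yes" if "units" in prompt else "no"


if __name__ == "__main__":
    data = [("easy", "yes"), ("hard", "yes")]
    for s in ["Check the units.", "Be polite."]:
        print(s, impact_score(toy_generator, "", s, data))
```

A positive score marks a sentence worth keeping; a score near zero or below suggests the sentence is not helping on this data.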

Key Findings

SPT raised GPT‑4 accuracy on GSM8K from 65.8% to 94.1% on evaluated splits.

Numbers: 65.8% -> 94.1% (+28.3 pp); Table 2

Practical Use: For math-style multiple-choice tasks, iterative prompt training can yield very large accuracy gains without model retraining; try SPT with a strong generator (e.g., GPT‑4) on your dev set first.

Evidence Ref: Table 2

SPT improved truthfulness on TruthfulQA for GPT‑4 by 10.3 percentage points on the paper's split.

Numbers: 81.7% -> 92.0% (+10.3 pp); Table 2

Practical Use: Iterative prompt refinement can reduce hallucination-style errors on open-domain truthfulness benchmarks; use SPT when factual accuracy matters and you can supply a representative training split.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 94.1% | 65.8% | +28.3 pp | GSM8K (GPT‑4 generator, SPT-pc) | Table 2 reports GPT‑4 baseline 65.8% and SPT-pc 94.1% on GSM8K | Table 2
Accuracy | 92.0% | 81.7% | +10.3 pp | TruthfulQA (GPT‑4 generator, SPT-pc) | Table 2 shows GPT‑4 baseline 81.7% and SPT-pc 92.0% | Table 2
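The deltas in the table can be re-derived from the reported accuracies. The short sketch below does that, and also computes a relative error reduction; that second quantity is a derived view added here for illustration and is not reported in the source. The function names are hypothetical.

```python
# Accuracies as reported in the Results table: (baseline %, SPT-pc %).
results = {
    "GSM8K": (65.8, 94.1),
    "TruthfulQA": (81.7, 92.0),
}


def delta_pp(baseline: float, value: float) -> float:
    """Absolute gain in percentage points."""
    return round(value - baseline, 1)


def relative_error_reduction(baseline: float, value: float) -> float:
    """Share of the baseline error rate that was removed, in percent (derived, not from the paper)."""
    base_err = 100 - baseline
    new_err = 100 - value
    return round(100 * (base_err - new_err) / base_err, 1)


if __name__ == "__main__":
    for name, (base, spt) in results.items():
        print(name, delta_pp(base, spt), "pp,",
              relative_error_reduction(base, spt), "% of baseline errors removed")
```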

What To Try In 7 Days

Run SPT-pc with a strong corrector (e.g., GPT‑4) on a small, representative dev set and compare accuracy to your current prompts.

Compute and inspect impact scores to identify high-value sentences to keep or reuse across tasks.

Validate resulting prompts on held-out examples to detect overfitting before production rollout.
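The held-out check in the last step can be sketched as a simple train versus held-out accuracy gap. This is an illustrative sketch, not a procedure from the paper: `split_examples`, `accuracy`, and `overfit_gap` are hypothetical helpers, `answer_fn` stands in for a call to your generator model, and what counts as a worrying gap is left for you to decide.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[str, str]              # (question, correct answer)
AnswerFn = Callable[[str, str], str]   # (prompt, question) -> answer


def split_examples(data: List[Example], heldout_fraction: float = 0.2,
                   seed: int = 0) -> Tuple[List[Example], List[Example]]:
    """Shuffle deterministically and split into (train, heldout)."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_heldout = max(1, int(len(shuffled) * heldout_fraction))
    return shuffled[n_heldout:], shuffled[:n_heldout]


def accuracy(answer_fn: AnswerFn, prompt: str, data: List[Example]) -> float:
    """Fraction of examples answered correctly under `prompt`."""
    if not data:
        return 0.0
    return sum(answer_fn(prompt, q) == a for q, a in data) / len(data)


def overfit_gap(answer_fn: AnswerFn, prompt: str,
                train: List[Example], heldout: List[Example]) -> float:
    """Training accuracy minus held-out accuracy; a large positive gap suggests overfitting."""
    return accuracy(answer_fn, prompt, train) - accuracy(answer_fn, prompt, heldout)


if __name__ == "__main__":
    data = [(f"q{i}", f"a{i}") for i in range(10)]
    train, heldout = split_examples(data)
    memorized = dict(train)  # toy model that only knows the training questions
    print(overfit_gap(lambda p, q: memorized.get(q, "?"), "prompt", train, heldout))
```

Run the SPT search on `train` only, then compare the final prompt's accuracy on both splits before rollout.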

Agent Features

Memory: keeps training mistakes across epochs

Tool Use: LLM-to-LLM feedback loop

Is Agentic: Yes

Architectures: dual-LLM loop (generator + corrector)

Collaboration: generator and corrector iteratively improve prompts

Optimization Features

Training Optimization: iterative prompt candidate selection on training mistakes

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

TruthfulQA, GSM8K, MMLU, MedQA-US

Risks & Boundaries

Limitations

Prompts can overfit the training split and not generalize to unseen data.

Requires substantial compute and API calls to iterate candidates and score them.

When Not To Use

In high-stakes settings (medical, legal) without human review, because hallucinations may persist.

When compute or API budget is limited, since SPT is resource-intensive.

Failure Modes

Overfitting prompts to the training set's semantics and question patterns.

The corrector producing suboptimal feedback when it is itself weak or poorly prompted.

Core Entities

Models

GPT-3.5-turbo (0314), GPT-4 (0314), Llama2-70b-chat

Metrics

Accuracy

Datasets

TruthfulQA, GSM8K, MMLU, MedQA-US

Benchmarks

TruthfulQA, GSM8K, MMLU, MedQA-US