Automate and iteratively improve text prompts using a dual-LLM generator + corrector to reduce hallucinations

March 26, 2024 · 6 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Jean Ghislain Billa, Min Oh, Liang Du

Why It Matters For Business

SPT can raise task accuracy significantly without costly model fine-tuning. It offers a lower-barrier way to boost product QA and reduce hallucinations, provided you can afford the extra API calls and compute and have representative training data.

Summary TLDR

SPT (Supervisory Prompt Training) runs two LLM agents: a generator that answers tasks and a corrector that examines generator mistakes and iteratively writes better textual meta-prompts. The corrector also refines itself and can attach sentence-level "impact scores" that measure how much each sentence improves accuracy. On multiple-choice hallucination benchmarks SPT often raises accuracy substantially (e.g., GPT‑4 on GSM8K from 65.8% to 94.1%), but gains vary by dataset and depend on the generator's inherent capacity, training-data size, and compute budget.
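The loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `spt_loop`, `accuracy`, `toy_generator`, and `toy_corrector` are hypothetical, the two "LLMs" are deterministic stand-in functions so the example runs without any API, and the corrector's self-refinement step is left out for brevity.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]                        # (question, gold answer)
Generator = Callable[[str, str], str]            # (meta_prompt, question) -> answer
Corrector = Callable[[str, List[Example]], str]  # (meta_prompt, mistakes) -> new meta_prompt


def accuracy(generator: Generator, prompt: str, data: List[Example]) -> float:
    """Fraction of examples the generator answers correctly under `prompt`."""
    if not data:
        return 0.0
    return sum(generator(prompt, q) == a for q, a in data) / len(data)


def spt_loop(generator: Generator, corrector: Corrector, train: List[Example],
             prompt: str, epochs: int = 3) -> Tuple[str, float]:
    """Keep a candidate meta-prompt only if it improves training accuracy."""
    best_prompt, best_acc = prompt, accuracy(generator, prompt, train)
    for _ in range(epochs):
        # Collect the examples the generator currently gets wrong.
        mistakes = [(q, a) for q, a in train if generator(best_prompt, q) != a]
        if not mistakes:
            break
        # The corrector reads the mistakes and proposes a revised meta-prompt.
        candidate = corrector(best_prompt, mistakes)
        candidate_acc = accuracy(generator, candidate, train)
        if candidate_acc > best_acc:
            best_prompt, best_acc = candidate, candidate_acc
    return best_prompt, best_acc


# Toy stand-ins (assumptions): a real setup would call two LLM APIs here.
def toy_generator(prompt: str, question: str) -> str:
    if question == "2+2":
        return "4" if "step by step" in prompt else "5"
    return "Paris"


def toy_corrector(prompt: str, mistakes: List[Example]) -> str:
    return prompt + " Work step by step."


train_set = [("2+2", "4"), ("Capital of France?", "Paris")]
final_prompt, final_acc = spt_loop(toy_generator, toy_corrector, train_set,
                                   "Answer the question.")
print(final_prompt, final_acc)
```

Accepting a candidate only when training accuracy improves is one simple acceptance rule; the paper's actual selection criterion may differ.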

Problem Statement

High-quality prompts matter for LLM outputs, but manual prompt engineering is costly and brittle. The authors ask: can two LLMs automatically generate and iteratively improve human-readable prompts to reduce hallucinations and improve accuracy without changing model weights?

Main Contribution

Introduce SPT: a dual-LLM loop (generator + corrector) that iteratively produces improved textual meta-prompts from training mistakes.

Add impact scores: a sentence-level measure of how much adding a sentence changes generator accuracy, used to guide the corrector.

Empirically show large accuracy gains on some multiple-choice benchmarks (notably math and truthfulness tasks) without model fine-tuning.
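One way to read the impact score is as an accuracy difference: accuracy with the sentence appended minus accuracy without it. The sketch below assumes that reading; the exact formula in the paper may differ, and `impact_score`, `accuracy`, and `keyword_generator` are hypothetical names, with a deterministic stand-in in place of a real generator LLM.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (question, gold answer)


def accuracy(generator: Callable[[str, str], str], prompt: str,
             data: List[Example]) -> float:
    """Fraction of examples answered correctly under `prompt`."""
    if not data:
        return 0.0
    return sum(generator(prompt, q) == a for q, a in data) / len(data)


def impact_score(generator: Callable[[str, str], str], base_prompt: str,
                 sentence: str, data: List[Example]) -> float:
    """Accuracy change from appending `sentence` to `base_prompt` (assumed definition)."""
    extended = f"{base_prompt} {sentence}".strip()
    return accuracy(generator, extended, data) - accuracy(generator, base_prompt, data)


# Stand-in generator (assumption): solves the arithmetic item only when
# the prompt asks for step-by-step work.
def keyword_generator(prompt: str, question: str) -> str:
    if question == "2+2":
        return "4" if "step by step" in prompt else "5"
    return "Paris"


dev_set = [("2+2", "4"), ("Capital of France?", "Paris")]
sentences = ["Work step by step.", "Be concise."]
scores: Dict[str, float] = {
    s: impact_score(keyword_generator, "Answer the question.", s, dev_set)
    for s in sentences
}
print(scores)
```

A positive score marks a sentence worth keeping, a score near zero marks filler, and a negative score marks a sentence that hurts accuracy on the scored set.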

Key Findings

SPT raised GPT‑4 accuracy on GSM8K from 65.8% to 94.1% on evaluated splits.

Numbers: 65.8% -> 94.1% (+28.3 pp); Table 2

SPT improved truthfulness on TruthfulQA for GPT‑4 by 10.3 percentage points on the paper's split.

Numbers: 81.7% -> 92.0% (+10.3 pp); Table 2

Improvements are dataset- and model-dependent; some tasks saw little or no gain, and a few saw accuracy drop.

Numbers: MMLU 79.7% -> 80.1% (+0.4 pp); MedQA mixed (GPT variants sometimes worse); Tables 2–3
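Percentage-point gains read differently depending on the baseline, so it can help to put the figures above on a common footing. The short sketch below recomputes the gains from the reported accuracies and also expresses them as a relative gain and as a reduction in error rate; the helper names are illustrative and only the accuracy figures come from the summary. For GSM8K, the +28.3 pp gain works out to roughly a 43% relative improvement.

```python
def pp_gain(baseline: float, value: float) -> float:
    """Absolute gain in percentage points."""
    return value - baseline


def relative_gain(baseline: float, value: float) -> float:
    """Gain as a fraction of the baseline accuracy."""
    return (value - baseline) / baseline


def error_reduction(baseline: float, value: float) -> float:
    """Fraction of baseline errors removed (accuracies given in percent)."""
    baseline_error = 100.0 - baseline
    if baseline_error == 0:
        return 0.0
    return (value - baseline) / baseline_error


# (baseline %, SPT %) as reported in the key findings above.
results = {
    "GSM8K (GPT-4)": (65.8, 94.1),
    "TruthfulQA (GPT-4)": (81.7, 92.0),
    "MMLU": (79.7, 80.1),
}

for name, (baseline, value) in results.items():
    print(f"{name}: {pp_gain(baseline, value):+.1f} pp, "
          f"{relative_gain(baseline, value):.1%} relative, "
          f"{error_reduction(baseline, value):.1%} of errors removed")
```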

Results

  • Accuracy: 94.1% (baseline 65.8%)
  • Accuracy: 92.0% (baseline 81.7%)
  • Accuracy: 79.2% (baseline 64.0%)
  • Accuracy: 80.1% (baseline 79.7%)
  • Accuracy: 46.1% (baseline 41.6%)

Who Should Care

What To Try In 7 Days

Run SPT-pc with a strong corrector (e.g., GPT‑4) on a small, representative dev set and compare accuracy to your current prompts.

Compute and inspect impact scores to identify high-value sentences to keep or reuse across tasks.

Validate resulting prompts on held-out examples to detect overfitting before production rollout.
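The held-out check in the last step can be as simple as comparing training accuracy with held-out accuracy for the same prompt. The sketch below is one possible way to do that, not a procedure from the paper: `generalization_gap`, `looks_overfit`, and the 0.05 threshold are assumptions, and `memorizing_generator` is a stand-in that mimics a prompt which has hard-coded its training questions.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (question, gold answer)


def accuracy(generator: Callable[[str, str], str], prompt: str,
             data: List[Example]) -> float:
    """Fraction of examples answered correctly under `prompt`."""
    if not data:
        return 0.0
    return sum(generator(prompt, q) == a for q, a in data) / len(data)


def generalization_gap(generator: Callable[[str, str], str], prompt: str,
                       train: List[Example], held_out: List[Example]) -> float:
    """Training accuracy minus held-out accuracy; large values suggest overfitting."""
    return accuracy(generator, prompt, train) - accuracy(generator, prompt, held_out)


def looks_overfit(gap: float, max_gap: float = 0.05) -> bool:
    """Flag a prompt whose gap exceeds `max_gap` (threshold is an arbitrary assumption)."""
    return gap > max_gap


# Stand-in generator (assumption): only answers questions quoted in the prompt.
def memorizing_generator(prompt: str, question: str) -> str:
    answers = {"2+2": "4", "3+3": "6", "5+5": "10"}
    return answers[question] if question in prompt else "unknown"


train_set = [("2+2", "4"), ("3+3", "6")]
held_out_set = [("5+5", "10")]
overfit_prompt = "Remember: 2+2 is 4 and 3+3 is 6."

gap = generalization_gap(memorizing_generator, overfit_prompt, train_set, held_out_set)
print(gap, looks_overfit(gap))
```

In practice the threshold should reflect the noise you see across repeated runs on your own data.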

Agent Features

Memory

  • keeps training mistakes across epochs

Tool Use

  • LLM-to-LLM feedback loop

Is Agentic

true

Architectures

  • dual-LLM loop (generator + corrector)

Collaboration

  • generator and corrector iteratively improve prompts

Optimization Features

Training Optimization

  • iterative prompt candidate selection on training mistakes
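Candidate selection on training mistakes might look like the sketch below: each candidate meta-prompt is scored by how many of the current mistakes it fixes, and the best one is kept. This is an illustration under assumptions rather than the paper's procedure; `count_fixed`, `select_candidate`, and `toy_generator` are hypothetical, and breaking ties in favor of the shorter prompt is a design choice added here to counter prompt growth.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (question, gold answer)


def count_fixed(generator: Callable[[str, str], str], prompt: str,
                mistakes: List[Example]) -> int:
    """Number of previously missed examples answered correctly under `prompt`."""
    return sum(generator(prompt, q) == a for q, a in mistakes)


def select_candidate(generator: Callable[[str, str], str], candidates: List[str],
                     mistakes: List[Example]) -> str:
    """Pick the candidate fixing the most mistakes; prefer shorter prompts on ties."""
    return max(candidates,
               key=lambda p: (count_fixed(generator, p, mistakes), -len(p)))


# Stand-in generator (assumption): each hint phrase unlocks one question.
def toy_generator(prompt: str, question: str) -> str:
    p = prompt.lower()
    if question == "2+2":
        return "4" if "step by step" in p else "5"
    if question == "Capital of Australia?":
        return "Canberra" if "check facts" in p else "Sydney"
    return "unknown"


mistakes = [("2+2", "4"), ("Capital of Australia?", "Canberra")]
candidates = [
    "Answer. Work step by step.",
    "Answer. Check facts.",
    "Answer. Work step by step and check facts.",
]
best = select_candidate(toy_generator, candidates, mistakes)
print(best)
```

Because this scores only the mistakes, a winning candidate should still be re-checked on the full training set to confirm it does not break examples that were already correct.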

Reproducibility

Data Urls

  • TruthfulQA
  • GSM8K
  • MMLU
  • MedQA-US

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Prompts can overfit the training split and not generalize to unseen data.
  • Requires substantial compute and API calls to iterate candidates and score them.
  • Effectiveness depends on generator model capacity; weaker models show smaller gains.
  • Iterative process can produce very long prompts that hurt interpretability.

When Not To Use

  • In high-stakes settings without human review (medical/legal) because hallucinations may persist.
  • When compute or API budget is limited, since SPT is resource-intensive.
  • When you lack a representative training split to guide prompt search.

Failure Modes

  • Overfitting prompts to the training set semantics and question patterns.
  • Corrector producing suboptimal feedback if it itself is poorly prompted or weak.
  • Lengthy prompts that degrade inference speed or make debugging impractical.

Core Entities

Models

  • GPT-3.5-turbo (0314)
  • GPT-4 (0314)
  • Llama2-70b-chat

Metrics

  • Accuracy

Datasets

  • TruthfulQA
  • GSM8K
  • MMLU
  • MedQA-US

Benchmarks

  • TruthfulQA
  • GSM8K
  • MMLU
  • MedQA-US