Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
SPT can raise task accuracy substantially without costly model fine-tuning. It offers a lower-barrier way to boost product QA and reduce hallucinations, provided you can afford the extra API calls and compute and have representative training data.
Summary TLDR
SPT (Supervisory Prompt Training) runs two LLM agents: a generator that answers tasks and a corrector that examines generator mistakes and iteratively writes better textual meta-prompts. The corrector also refines itself and can attach sentence-level "impact scores" that measure how much each sentence improves accuracy. On multiple-choice hallucination benchmarks SPT often raises accuracy substantially (e.g., GPT‑4 on GSM8K from 65.8% to 94.1%), but gains vary by dataset and depend on the generator's inherent capacity, training-data size, and compute budget.
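A minimal sketch of the generator/corrector loop described in the summary, under stated assumptions. The names `spt_loop`, `accuracy`, `Generator`, and `Corrector` are hypothetical, both agents are modeled as plain callables rather than real LLM API calls, and a candidate is kept only if it improves training accuracy. The corrector refining itself and the impact scores are omitted here, so this is an illustration of the idea rather than the authors' implementation.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]                        # (question, correct answer)
Generator = Callable[[str, str], str]            # (meta_prompt, question) -> answer
Corrector = Callable[[str, List[Example]], str]  # (meta_prompt, mistakes) -> new meta_prompt


def accuracy(generator: Generator, prompt: str, data: List[Example]) -> float:
    """Fraction of examples the generator answers correctly under `prompt`."""
    if not data:
        return 0.0
    return sum(generator(prompt, q) == a for q, a in data) / len(data)


def spt_loop(generator: Generator, corrector: Corrector,
             train: List[Example], prompt: str = "",
             epochs: int = 3) -> Tuple[str, float]:
    """Iteratively ask the corrector for a better meta-prompt (assumed policy:
    keep a candidate only if it beats the best training accuracy so far)."""
    best_prompt = prompt
    best_acc = accuracy(generator, best_prompt, train)
    for _ in range(epochs):
        # Collect the examples the generator still gets wrong.
        mistakes = [(q, a) for q, a in train if generator(best_prompt, q) != a]
        if not mistakes:
            break
        # The corrector sees the current prompt and the mistakes it caused.
        candidate = corrector(best_prompt, mistakes)
        cand_acc = accuracy(generator, candidate, train)
        if cand_acc > best_acc:
            best_prompt, best_acc = candidate, cand_acc
    return best_prompt, best_acc
```

In practice each callable would wrap an LLM call, which is where the extra API and compute cost mentioned above comes from: every epoch re-scores the whole training split.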
Problem Statement
High-quality prompts matter for LLM outputs, but manual prompt engineering is costly and brittle. The authors ask: can two LLMs automatically generate and iteratively improve human-readable prompts to reduce hallucinations and improve accuracy without changing model weights?
Main Contribution
Introduce SPT: a dual-LLM loop (generator + corrector) that iteratively produces improved textual meta-prompts from training mistakes.
Add impact scores: a sentence-level measure of how much adding a sentence changes generator accuracy, used to guide the corrector.
Empirically show large accuracy gains on some multiple-choice benchmarks (notably math and truthfulness tasks) without model fine-tuning.
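The impact-score idea can be sketched as an accuracy difference: score the generator with a base prompt, then with the base prompt plus one candidate sentence, and record the change. The function names `impact_scores` and `accuracy`, the callable generator interface, and the choice to append the sentence at the end of the prompt are all assumptions made for illustration; the paper's exact procedure may differ.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]              # (question, correct answer)
Generator = Callable[[str, str], str]  # (meta_prompt, question) -> answer


def accuracy(generator: Generator, prompt: str, data: List[Example]) -> float:
    """Fraction of examples answered correctly under `prompt`."""
    if not data:
        return 0.0
    return sum(generator(prompt, q) == a for q, a in data) / len(data)


def impact_scores(generator: Generator, base_prompt: str,
                  sentences: List[str], data: List[Example]) -> Dict[str, float]:
    """Map each sentence to (accuracy with sentence added) - (base accuracy)."""
    base_acc = accuracy(generator, base_prompt, data)
    scores: Dict[str, float] = {}
    for sentence in sentences:
        # Assumed placement: the sentence is appended to the base prompt.
        augmented = f"{base_prompt} {sentence}".strip()
        scores[sentence] = accuracy(generator, augmented, data) - base_acc
    return scores
```

A positive score suggests the sentence helps on this data, a score near zero suggests it can be dropped, and a negative score suggests it hurts. Each sentence costs one full pass over the data, so the evaluation set size drives the cost.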
Key Findings
SPT raised GPT‑4 accuracy on GSM8K from 65.8% to 94.1% on evaluated splits.
SPT improved truthfulness on TruthfulQA for GPT‑4 by 10.3 percentage points on the paper's split.
Improvements are dataset- and model-dependent; some tasks saw little or no gain and occasional drops.
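To read the GSM8K figures above, it helps to separate absolute gains (percentage points) from relative ones. The short calculation below uses only the two reported accuracies (65.8% and 94.1%); the function name `gain_summary` is a hypothetical helper.

```python
def gain_summary(before_pct: float, after_pct: float) -> dict:
    """Summarize an accuracy change given two accuracies in percent."""
    absolute_points = after_pct - before_pct     # percentage points
    relative_gain = absolute_points / before_pct  # relative to the old accuracy
    err_before = 100.0 - before_pct               # error rate before, in percent
    err_after = 100.0 - after_pct                 # error rate after, in percent
    error_reduction = (err_before - err_after) / err_before
    return {
        "absolute_points": absolute_points,
        "relative_gain": relative_gain,
        "error_reduction": error_reduction,
    }


summary = gain_summary(65.8, 94.1)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

For these inputs the absolute gain is 28.3 percentage points, which corresponds to roughly a 43% relative accuracy gain and roughly an 83% reduction in error rate (from 34.2% to 5.9%).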
Results
Accuracy
Who Should Care
What To Try In 7 Days
Run SPT-pc with a strong corrector (e.g., GPT‑4) on a small, representative dev set and compare accuracy to your current prompts.
Compute and inspect impact scores to identify high-value sentences to keep or reuse across tasks.
Validate resulting prompts on held-out examples to detect overfitting before production rollout.
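The held-out validation step could be sketched as follows: split the data once, then compare how much the new prompt gains on the training part versus the held-out part. The helpers `split` and `overfit_report`, the 20% held-out fraction, and the 0.05 tolerance are illustrative assumptions, not values from the paper.

```python
import random
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]              # (question, correct answer)
Generator = Callable[[str, str], str]  # (meta_prompt, question) -> answer


def accuracy(generator: Generator, prompt: str, data: List[Example]) -> float:
    """Fraction of examples answered correctly under `prompt`."""
    if not data:
        return 0.0
    return sum(generator(prompt, q) == a for q, a in data) / len(data)


def split(data: List[Example], heldout_frac: float = 0.2,
          seed: int = 0) -> Tuple[List[Example], List[Example]]:
    """Shuffle deterministically and reserve a held-out slice."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_heldout = int(len(shuffled) * heldout_frac)
    cut = len(shuffled) - n_heldout
    return shuffled[:cut], shuffled[cut:]


def overfit_report(generator: Generator, old_prompt: str, new_prompt: str,
                   train: List[Example], heldout: List[Example],
                   tolerance: float = 0.05) -> Dict[str, object]:
    """Flag a prompt whose training gain exceeds its held-out gain by more
    than `tolerance` (an arbitrary threshold chosen for this sketch)."""
    train_gain = (accuracy(generator, new_prompt, train)
                  - accuracy(generator, old_prompt, train))
    heldout_gain = (accuracy(generator, new_prompt, heldout)
                    - accuracy(generator, old_prompt, heldout))
    return {
        "train_gain": train_gain,
        "heldout_gain": heldout_gain,
        "overfit_suspected": (train_gain - heldout_gain) > tolerance,
    }
```

The held-out slice should never be shown to the corrector during prompt search; otherwise the comparison no longer measures generalization.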
Agent Features
Memory
- keeps training mistakes across epochs
Tool Use
- LLM-to-LLM feedback loop
Is Agentic
true
Architectures
- dual-LLM loop (generator + corrector)
Collaboration
- generator and corrector iteratively improve prompts
Optimization Features
Training Optimization
- iterative prompt candidate selection on training mistakes
Reproducibility
Data Urls
- TruthfulQA
- GSM8K
- MMLU
- MedQA-US
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Prompts can overfit the training split and not generalize to unseen data.
- Requires substantial compute and API calls to iterate candidates and score them.
- Effectiveness depends on generator model capacity; weaker models show smaller gains.
- Iterative process can produce very long prompts that hurt interpretability.
When Not To Use
- In high-stakes settings (medical, legal) without human review, because hallucinations may persist.
- When compute or API budget is limited, since SPT is resource-intensive.
- When you lack a representative training split to guide prompt search.
Failure Modes
- Overfitting prompts to the training set's semantics and question patterns.
- Corrector producing suboptimal feedback if it itself is poorly prompted or weak.
- Lengthy prompts that degrade inference speed or make debugging impractical.
Core Entities
Models
- GPT-3.5-turbo (0314)
- GPT-4 (0314)
- Llama2-70b-chat
Metrics
- Accuracy
Datasets
- TruthfulQA
- GSM8K
- MMLU
- MedQA-US
Benchmarks
- TruthfulQA
- GSM8K
- MMLU
- MedQA-US

