Pick the best prompt per query offline using inverse RL and cheap embeddings

Overview

Decision Snapshot: Ready For Pilot

The method is simple: train a classifier on (query, prompt) embeddings from past LLM runs, use it to score candidate prompts, and call the LLM once with the highest-scoring prompt. The reported evidence shows consistent accuracy gains and lower inference cost, but performance depends on the quality and coverage of the offline logs.

Citations: 5

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Hao Sun, Alihan Hüyük, Mihaela van der Schaar

Links

Abstract / PDF / Code

Why It Matters For Business

You can cut prompt-evaluation costs and improve per-query outputs by training a small offline reward model on past prompt logs and using it to pick prompts instead of repeatedly calling expensive LLM verification.

Who Should Care

ML Engineer, Product Manager, CTO

Summary TLDR

Prompts that work best vary by query. The paper proposes Prompt-OIRL: learn a proxy reward model from existing prompt–LLM interaction logs (embeddings + XGBoost), then at inference generate N candidate prompts and pick the one with the highest predicted reward. The reward model predicts which prompt will yield a correct answer, so it avoids expensive LLM-based evaluation and reduces per-query inference cost. Experiments on arithmetic reasoning datasets (GSM8K, SVAMP, MAWPS) across GPT-3.5-turbo, LLaMA-2-7B-Chat and TigerBot-13B-Chat show sizable accuracy and cost gains versus query-agnostic selection and LLM self-critique baselines.
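The per-query selection step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `select_prompt` and `toy_reward` are hypothetical names, and `toy_reward` is a hand-written heuristic standing in for a learned reward model.

```python
from typing import Callable, Sequence


def select_prompt(
    query: str,
    candidates: Sequence[str],
    reward_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate prompt with the highest predicted reward for this query."""
    if not candidates:
        raise ValueError("need at least one candidate prompt")
    return max(candidates, key=lambda prompt: reward_fn(query, prompt))


def toy_reward(query: str, prompt: str) -> float:
    """Hypothetical stand-in for a learned reward model (not from the paper).

    Assumption for illustration only: step-by-step prompts help on long
    queries and slightly hurt on short ones.
    """
    score = 0.5
    if "step by step" in prompt.lower():
        score += 0.3 if len(query.split()) > 8 else -0.1
    return score


CANDIDATES = ["Answer directly.", "Let's think step by step."]

long_query = "Tom has 3 apples and buys 4 bags with 6 apples each. How many apples does he have?"
short_query = "What is 2 + 2?"

# Scoring needs no LLM call; only the winning prompt would be sent to the LLM.
best_for_long = select_prompt(long_query, CANDIDATES, toy_reward)
best_for_short = select_prompt(short_query, CANDIDATES, toy_reward)
```

The point of the sketch is that different queries can receive different prompts from the same candidate pool, while the LLM itself is called only once per query.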

Problem Statement

Standard zero-shot prompt search finds one prompt that is best on average. But the best prompt often depends on the individual query. Two practical problems block per-query prompt selection: (1) you cannot evaluate which prompted answer is correct at inference without ground truth, and (2) online trial-and-error with LLM calls is expensive. The paper addresses both problems with offline learning from prior prompt evaluation logs.
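One way to write the contrast down, using notation introduced here for illustration rather than the paper's exact symbols: let $x$ be a query with ground-truth answer $y$, $\pi$ a prompt, $\ell(\pi, x)$ the LLM's answer when prompted with $\pi$, and $r$ a 0/1 correctness reward. $\hat{r}_{\theta}$ denotes a proxy reward learned offline, and $\pi_1, \dots, \pi_N$ are candidate prompts.

```latex
% Query-agnostic: one prompt that is best on average over the data
\bar{\pi}^{*} = \arg\max_{\pi} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \big[ r\big(y, \ell(\pi, x)\big) \big]

% Query-dependent: the best prompt for each query (needs y, which is unknown at inference)
\pi^{*}(x) = \arg\max_{\pi} \; r\big(y, \ell(\pi, x)\big)

% Offline proxy: score candidates with a learned model, so no LLM call or label is needed
\hat{\pi}(x) = \arg\max_{\pi \in \{\pi_1, \dots, \pi_N\}} \; \hat{r}_{\theta}(x, \pi)
```

The second line shows problem (1): the true reward depends on $y$. The third line shows how a learned proxy sidesteps both problems, since evaluating $\hat{r}_{\theta}$ requires neither ground truth nor an LLM call.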

Main Contribution

Formally define query-dependent zero-shot prompt optimization (choose a prompt per query).

Introduce Prompt-OIRL: learn an offline proxy reward model over (query, prompt) pairs via inverse RL from existing interaction logs.

Key Findings

Prompt-OIRL improves correctness when only one demonstration prompt is available.

Numbers: +24.3%

Practical Use: If you only have a single tested prompt, train a reward model and use per-query best-of-N selection to gain large accuracy uplift.

Evidence Ref: Section 5.1, Figure 5 (scarce K=1)

Prompt-OIRL outperforms query-agnostic prompt selection with more training prompts.

Numbers: +8.8%

Practical Use: Even with multiple known prompts, picking prompts per query via a learned reward model yields measurable gains.

Evidence Ref: Section 5.1, Figure 5 (rich K=5)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Success-rate gain (scarce demos, K=1)	+24.3%	Best-of-training prompt	+24.3%	averaged tasks/LLMs (Figure 5, scarce)	Section 5.1 reports +24.3% improvement over BoTr Eqn.(1) when K=1	Section 5.1, Figure 5
Success-rate gain (rich demos, K=5)	+8.8%	Query-agnostic objective	+8.8%	averaged tasks/LLMs (Figure 5, rich)	Section 5.1 reports +8.8% improvement over BoTr Eqn.(1) when K=5	Section 5.1, Figure 5

What To Try In 7 Days

Collect past prompt–query–response logs from your evaluations or benchmarks.

Compute query and prompt embeddings and train a lightweight classifier (e.g., XGBoost) to predict correctness.

Generate N candidate prompts per query (N=10–100) and pick the highest-scoring prompt, then call the LLM once with that prompt and compare results vs your baseline.
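The three steps above can be sketched end to end with the standard library only. This is a toy under stated assumptions, not the paper's pipeline: the 2-d lookup tables stand in for a real embedding model, a hand-rolled logistic regression with cross features stands in for XGBoost, and the log records are made up. All function names are hypothetical.

```python
import math

# Toy 2-d embeddings (assumption): a real setup would call an embedding model.
QUERY_EMB = {"multi-step word problem": [1.0, 0.0], "single addition": [0.0, 1.0]}
PROMPT_EMB = {"Let's think step by step.": [1.0, 0.0], "Answer directly.": [0.0, 1.0]}


def pair_features(q_vec, p_vec):
    """Concatenate both embeddings and add their pairwise products.

    The cross terms let a linear model express query-prompt interactions,
    which a tree model such as XGBoost would capture on its own.
    """
    cross = [qi * pj for qi in q_vec for pj in p_vec]
    return list(q_vec) + list(p_vec) + cross


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_reward_model(logs, epochs=200, lr=0.5):
    """Fit logistic regression by SGD on (query, prompt, correct) records."""
    n_features = len(pair_features(QUERY_EMB[logs[0][0]], PROMPT_EMB[logs[0][1]]))
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for query, prompt, correct in logs:
            x = pair_features(QUERY_EMB[query], PROMPT_EMB[prompt])
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            grad = pred - (1.0 if correct else 0.0)
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def predict_reward(model, query, prompt):
    """Predicted probability that this prompt yields a correct answer for this query."""
    weights, bias = model
    x = pair_features(QUERY_EMB[query], PROMPT_EMB[prompt])
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Made-up offline log: (query, prompt, was the LLM's answer correct?)
LOGS = [
    ("multi-step word problem", "Let's think step by step.", True),
    ("multi-step word problem", "Answer directly.", False),
    ("single addition", "Let's think step by step.", False),
    ("single addition", "Answer directly.", True),
]

model = train_reward_model(LOGS)


def pick(query):
    """Choose the candidate prompt with the highest predicted reward."""
    return max(PROMPT_EMB, key=lambda p: predict_reward(model, query, p))
```

Note the design point the toy makes explicit: a purely linear model on concatenated embeddings would rank prompts identically for every query, so some form of query-prompt interaction (cross features here, trees in XGBoost) is needed for per-query selection to differ from query-agnostic selection.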

Agent Features

Tool Use

best-of-N prompt generation

Optimization Features

Token Efficiency

reduces repeated LLM calls and token usage for evaluation

Infra Optimization

embeddings + XGBoost runs on CPU; minimal GPU needed

Training Optimization

learn reward on embeddings offline to avoid LLM calls

Inference Optimization

only one LLM call per query after reward-based selection
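A back-of-the-envelope count of LLM calls per query illustrates where the saving comes from. The baseline accounting is an assumption made here (one answer call plus one LLM judging call per candidate), not a figure from the paper, and the cost of producing the candidate prompts themselves is ignored. Both function names are hypothetical.

```python
def llm_calls_with_llm_evaluation(n_candidates: int, eval_calls_per_candidate: int = 1) -> int:
    """Calls per query if every candidate is answered and then judged by an LLM (assumed baseline)."""
    return n_candidates * (1 + eval_calls_per_candidate)


def llm_calls_with_reward_model(n_candidates: int) -> int:
    """Calls per query when an offline reward model scores candidates: one final call."""
    return 1


N = 10  # lower end of the N=10 to 100 range suggested earlier
baseline_calls = llm_calls_with_llm_evaluation(N)
selected_calls = llm_calls_with_reward_model(N)
```

Under these assumptions the LLM call count is constant in N for reward-based selection, whereas it grows linearly with N when an LLM does the evaluation.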

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/vanderschaarlab/Prompt-OIRL

Risks & Boundaries

Limitations

Requires an offline log of query–prompt–response instances; gains shrink if logs are missing or unrepresentative.

Reward model quality depends on the LLM used to produce the demonstrations; it may not generalize across very different LLMs.

When Not To Use

When you have no logged prompt–response data to train a reward model.

If your deployment LLM is very different from models used to create the logs and you cannot retrain the reward model.

Failure Modes

Reward model overfits to training prompts and fails on novel prompt styles.

Poor embeddings or small datasets lead to low precision, causing selection of bad prompts.
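Both failure modes can be checked before deployment by measuring precision on logs from prompts the reward model never saw in training. The sketch below is one possible check, not a procedure from the paper: `split_by_prompt` and `precision` are hypothetical helpers, and the log records and the constant reward function are made up.

```python
def split_by_prompt(logs, held_out_prompts):
    """Split (query, prompt, correct) records so held-out prompts appear only in the test set."""
    held = set(held_out_prompts)
    train = [rec for rec in logs if rec[1] not in held]
    test = [rec for rec in logs if rec[1] in held]
    return train, test


def precision(logs, reward_fn, threshold=0.5):
    """Of the pairs the model predicts as correct, the fraction that really were.

    Returns None when the model predicts no pair as correct.
    """
    flagged = [correct for query, prompt, correct in logs if reward_fn(query, prompt) >= threshold]
    if not flagged:
        return None
    return sum(1 for c in flagged if c) / len(flagged)


# Made-up log records for illustration.
logs = [
    ("q1", "A", True),
    ("q2", "A", False),
    ("q1", "B", True),
    ("q2", "B", True),
]

train_logs, test_logs = split_by_prompt(logs, held_out_prompts=["B"])


def optimistic(query, prompt):
    """Dummy reward that approves everything (placeholder for a trained model)."""
    return 0.9


train_precision = precision(train_logs, optimistic)
test_precision = precision(test_logs, optimistic)
```

A large drop in precision from seen to held-out prompts would suggest overfitting to the training prompts, and low precision on both would point to weak embeddings or too little data.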

Core Entities

Models

GPT-3.5-turbo, LLaMA-2-7B-Chat, TigerBot-13B-Chat, GPT-4 (appendix experiments)

Metrics

Accuracy, precision, success rate, inference cost (USD / GPU-hour)

Datasets

GSM8K, SVAMP, MAWPS

Benchmarks

arithmetic reasoning success rate
