Pick the best prompt per query offline using inverse RL and cheap embeddings

September 13, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

5

Authors

Hao Sun, Alihan Hüyük, Mihaela van der Schaar

Links

Abstract / PDF

Why It Matters For Business

You can cut prompt-evaluation costs and improve per-query outputs by training a small offline reward model on past prompt logs and using it to pick prompts, instead of repeatedly calling an expensive LLM to verify answers.

Summary TLDR

The prompt that works best varies by query. The paper proposes Prompt-OIRL: learn a proxy reward model from existing prompt–LLM interaction logs (embeddings + XGBoost), then, at inference, generate N candidate prompts and pick the one with the highest predicted reward. Because the reward model predicts which prompt will yield a correct answer, no LLM-based evaluation is needed at selection time, which reduces per-query inference cost. Experiments on arithmetic datasets (GSM8K, SVAMP, MAWPS) across GPT-3.5-turbo, LLaMA-2-7B-Chat, and TigerBot-13B-Chat show accuracy and cost gains over query-agnostic selection and LLM self-critique baselines.

Problem Statement

Standard zero-shot prompt search finds one prompt that is best on average. But the best prompt often depends on the individual query. Two practical problems block per-query prompt selection: (1) you cannot evaluate which prompted answer is correct at inference without ground truth, and (2) online trial-and-error with LLM calls is expensive. The paper addresses both problems with offline learning from prior prompt evaluation logs.
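The contrast between the two objectives can be written compactly. The notation here is an assumption made for illustration, not the paper's own symbols: $x$ is a query drawn from a distribution $\mathcal{D}$, $p$ is a prompt from a candidate set $\mathcal{P}$, and $r(x, p) \in \{0, 1\}$ indicates whether the LLM answers $x$ correctly when given prompt $p$.

```latex
\begin{align}
  \text{query-agnostic:}\quad
    & \bar{p} = \arg\max_{p \in \mathcal{P}} \; \mathbb{E}_{x \sim \mathcal{D}} \big[ r(x, p) \big] \\
  \text{query-dependent:}\quad
    & p^{*}(x) = \arg\max_{p \in \mathcal{P}} \; r(x, p)
\end{align}
```

Since $r(x, p^{*}(x)) \ge r(x, \bar{p})$ for every $x$, the query-dependent choice can never do worse on average. The difficulty is that $r$ is unknown at inference time, which is why a learned proxy for it is needed.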

Main Contribution

Formally define query-dependent zero-shot prompt optimization (choose a prompt per query).

Introduce Prompt-OIRL: learn an offline proxy reward model over (query, prompt) pairs via inverse RL from existing interaction logs.

Show a low-cost best-of-N selection procedure using embeddings + XGBoost that improves arithmetic reasoning accuracy and cuts inference cost.
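As a rough illustration of the data the reward model is trained on, the sketch below turns logged (query, prompt, correctness) records into feature vectors and binary labels. It is a minimal sketch under stated assumptions: `LogRecord`, `toy_embed`, and `build_training_set` are hypothetical names, the hash-based embedding is only a dependency-free stand-in for a real embedding model, and concatenating the query and prompt embeddings is one plausible feature layout rather than a confirmed detail of the paper.

```python
import hashlib
from typing import NamedTuple


class LogRecord(NamedTuple):
    """One logged interaction (hypothetical schema)."""
    query: str
    prompt: str
    correct: bool  # did the LLM's answer match the ground truth?


def toy_embed(text, dim=8):
    """Hypothetical stand-in for an embedding model: deterministic, hash-based."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def build_training_set(logs, embed=toy_embed):
    """Concatenate query and prompt embeddings; label 1 if the answer was correct."""
    features, labels = [], []
    for record in logs:
        features.append(embed(record.query) + embed(record.prompt))
        labels.append(1 if record.correct else 0)
    return features, labels


logs = [
    LogRecord("What is 12 * 7?", "Let's think step by step.", True),
    LogRecord("What is 12 * 7?", "Answer directly.", False),
    LogRecord("What is 9 + 15?", "Answer directly.", True),
]
X, y = build_training_set(logs)
```

The resulting `X` and `y` could then be handed to any binary classifier; the summary names XGBoost as the one used.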

Key Findings

Prompt-OIRL improves correctness when only one demonstration prompt is available.

Numbers: +24.3%

When more training prompts are available, Prompt-OIRL still outperforms query-agnostic prompt selection.

Numbers: +8.8%

Learned reward model predicts correctness better than LLM self-critique on held-out queries.

Numbers: accuracy 0.96 vs 0.662

Per-query inference cost is much lower than LLM-based selection.

Numbers: ≈$0.00041 vs $0.00558 per query (K=6, GPT-3.5)
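To put the per-query cost figures in perspective, the short calculation below scales them to a monthly volume. The two per-query costs come from the numbers above; the volume of one million queries and the function name `cost_comparison` are assumptions chosen only for illustration.

```python
COST_PROMPT_OIRL = 0.00041    # USD per query, reported figure (K=6, GPT-3.5)
COST_LLM_SELECTION = 0.00558  # USD per query, reported figure (K=6, GPT-3.5)


def cost_comparison(queries):
    """Scale the reported per-query costs to a given number of queries."""
    oirl_total = COST_PROMPT_OIRL * queries
    llm_total = COST_LLM_SELECTION * queries
    return {
        "prompt_oirl_usd": oirl_total,
        "llm_selection_usd": llm_total,
        "savings_usd": llm_total - oirl_total,
        "cost_ratio": COST_LLM_SELECTION / COST_PROMPT_OIRL,
    }


report = cost_comparison(1_000_000)  # hypothetical monthly volume
```

At that hypothetical volume the totals come to roughly $410 versus $5,580, a difference of about 13.6x.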

Results

Success-rate gain (scarce demos, K=1)

Value: +24.3%

Baseline: Best-of-training prompt

Success-rate gain (rich demos, K=5)

Value: +8.8%

Baseline: Query-agnostic objective

Accuracy of predicting answer correctness (held-out queries)

Value: 0.96 vs 0.662

Baseline: LLM self-critique (LMSC)

Per-query inference cost (K=6)

Value: ≈$0.00041 vs $0.00558

Baseline: LLM-based self-critique

Who Should Care

What To Try In 7 Days

Collect past prompt–query–response logs from your evaluations or benchmarks.

Compute query and prompt embeddings and train a lightweight classifier (e.g., XGBoost) to predict correctness.

Generate N candidate prompts per query (N=10–100) and pick the highest-scoring prompt, then call the LLM once with that prompt and compare results vs your baseline.
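The selection step in the last item can be sketched as follows. This is a minimal sketch, not the paper's implementation: `select_prompt` and `toy_reward` are hypothetical names, and `toy_reward` is a hand-written heuristic standing in for a trained reward model (in practice it would embed the query and prompt and return the classifier's predicted probability of a correct answer).

```python
from typing import Callable, Sequence


def select_prompt(query: str,
                  candidates: Sequence[str],
                  reward_fn: Callable[[str, str], float]):
    """Best-of-N: score every candidate prompt and return (best_prompt, best_score)."""
    if not candidates:
        raise ValueError("need at least one candidate prompt")
    scored = [(reward_fn(query, prompt), prompt) for prompt in candidates]
    best_score, best_prompt = max(scored, key=lambda pair: pair[0])
    return best_prompt, best_score


def toy_reward(query: str, prompt: str) -> float:
    """Hypothetical stand-in for the learned reward model (NOT a real model)."""
    score = 0.5
    if "step by step" in prompt.lower():
        score += 0.3
    if len(query.split()) > 10 and "carefully" in prompt.lower():
        score += 0.1
    return score


candidates = [
    "Answer directly.",
    "Let's think step by step.",
    "Let's work this out carefully, step by step.",
]
query = "Tom has 3 apples and buys 4 more. How many apples does he have?"
best_prompt, best_score = select_prompt(query, candidates, toy_reward)
```

Only `best_prompt` is then sent to the LLM, so scoring N candidates costs N cheap reward-model evaluations and a single LLM call.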

Agent Features

Tool Use

  • best-of-N prompt generation

Frameworks

  • RL

Optimization Features

Token Efficiency

  • reduces repeated LLM calls and token usage for evaluation

Infra Optimization

  • embeddings + XGBoost runs on CPU; minimal GPU needed

Training Optimization

  • learn reward on embeddings offline to avoid LLM calls

Inference Optimization

  • only one LLM call per query after reward-based selection

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires an offline log of query–prompt–response instances; gains shrink if logs are missing or unrepresentative.
  • Reward model quality depends on the LLM used to produce the demonstrations; it may not generalize across very different LLMs.
  • Imbalanced labels (many correct or many incorrect cases) can make reward-model training harder, especially for very strong LLMs.
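One common mitigation for the label imbalance noted in the last item is to reweight the minority class. The sketch below computes the negative-to-positive ratio, which is the value usually suggested for XGBoost's `scale_pos_weight` parameter. Whether the paper applies any reweighting is not stated here, so treat this as an assumption; the helper name and the 90/10 label split are hypothetical.

```python
def negative_to_positive_ratio(labels):
    """Return n_negative / n_positive for binary labels (1 = correct answer)."""
    n_pos = sum(1 for label in labels if label == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present to train a classifier")
    return n_neg / n_pos


# Hypothetical log from a strong LLM: 90% of logged answers are correct.
labels = [1] * 90 + [0] * 10
weight = negative_to_positive_ratio(labels)
# `weight` could be passed as scale_pos_weight when constructing the classifier.
```

A ratio below 1 down-weights the abundant correct examples, so the rarer failures carry more influence during training.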

When Not To Use

  • When you have no logged prompt–response data to train a reward model.
  • If your deployment LLM is very different from the models used to create the logs and you cannot retrain the reward model.
  • For tasks where ground-truth evaluation is cheap or immediate, making an offline proxy unnecessary.

Failure Modes

  • Reward model overfits to training prompts and fails on novel prompt styles.
  • Poor embeddings or small datasets lead to low precision, causing selection of bad prompts.
  • If the underlying LLM cannot solve the task, per-query prompt selection cannot fix fundamental capability limits.
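The first two failure modes can be checked before deployment by holding out entire prompts and measuring precision on them. The sketch below is one way to do that under stated assumptions: the record layout, `split_by_prompt`, and `precision` are hypothetical helpers, not part of the paper's code.

```python
def split_by_prompt(logs, held_out_prompts):
    """Split logs so held-out prompts never appear in the training set."""
    held_out = set(held_out_prompts)
    train = [rec for rec in logs if rec["prompt"] not in held_out]
    test = [rec for rec in logs if rec["prompt"] in held_out]
    return train, test


def precision(y_true, y_pred):
    """Fraction of predicted-correct cases that were actually correct."""
    predicted_positive = [t for t, p in zip(y_true, y_pred) if p == 1]
    if not predicted_positive:
        return 0.0
    return sum(predicted_positive) / len(predicted_positive)


logs = [
    {"query": "q1", "prompt": "A", "correct": 1},
    {"query": "q1", "prompt": "B", "correct": 0},
    {"query": "q2", "prompt": "A", "correct": 1},
    {"query": "q2", "prompt": "C", "correct": 1},
]
train, test = split_by_prompt(logs, held_out_prompts=["C"])
```

Precision is the relevant metric here because the selector acts only on prompts the reward model predicts will succeed. A large drop in precision on the held-out prompts suggests the model has memorized prompt identities rather than learned which prompts suit which queries.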

Core Entities

Models

  • GPT-3.5-turbo
  • LLaMA-2-7B-Chat
  • TigerBot-13B-Chat
  • GPT-4 (appendix experiments)

Metrics

  • accuracy
  • precision
  • success rate
  • inference cost (USD / GPU-hour)

Datasets

  • GSM8K
  • SVAMP
  • MAWPS

Benchmarks

  • arithmetic reasoning success rate