Find better natural-language prompts by searching in embedding space

Overview

Decision Snapshot: Needs Validation

The idea is new and practical, but it has been evaluated only in a single, small proof-of-concept: one dataset and one optimization iteration. The added cost comes from the extra LLM calls needed to score candidate prompts.

Citations: 0

Evidence Strength: 0.40

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 2/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/1

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 60%

Authors

Mateusz Bystroński, Grzegorz Piotrowski, Nitesh V. Chawla, Tomasz Kajdanowicz

Links

Abstract / PDF / Data

Why It Matters For Business

You can improve the performance of an API-only LLM on a task without model access or costly fine-tuning. That lowers operational friction for domain teams who need better outputs fast.

Who Should Care

Product Manager, ML Engineer, Founder, CTO

Summary TLDR

LatentPrompt maps natural-language prompts into a continuous embedding space, generates new prompt candidates by interpolating and perturbing those embeddings, projects candidates back to token space with a learned linear projector, decodes human-readable prompts, and scores them by running a target LLM on a small validation set. In a proof-of-concept on Financial PhraseBank, starting from 5 GPT-4o-generated seeds and 15 generated candidates, one optimization iteration improved test accuracy from 75.36% to 78.14% (+2.78 percentage points). The method treats the LLM as a black box and requires only an automatic evaluation function.

Problem Statement

Prompt design is usually manual and slow. Existing automated methods either need white-box gradient access or work with discrete token mutations that miss subtle semantic changes. We need a black-box, continuous method that explores prompt meaning and yields human-readable prompts.

Main Contribution

LatentPrompt: a model-agnostic pipeline that encodes prompts, explores their embedding space, projects embeddings into decoder token space, decodes readable prompts, and evaluates them by task performance.

A simple exploration strategy (mostly interpolation and small noise) that finds practical prompt improvements while treating the LLM as a black box (no gradients or fine-tuning).
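
The exploration strategy described above (interpolation plus small noise) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `explore`, the noise scale, the uniformly sampled mixing weight, and the toy 3-dimensional vectors are all assumptions. Real seed embeddings would come from a sentence encoder and have far more dimensions.

```python
import random

def explore(seeds, n_candidates, noise_scale=0.05, rng=None):
    """Generate candidate embeddings by interpolating random seed pairs, then adding Gaussian noise.

    seeds: list of equal-length lists of floats (hypothetical seed embeddings).
    """
    rng = rng or random.Random(0)
    if len(seeds) < 2:
        raise ValueError("need at least two seed embeddings to interpolate")
    candidates = []
    for _ in range(n_candidates):
        a, b = rng.sample(seeds, 2)   # pick two distinct seeds
        t = rng.random()              # mixing weight in [0, 1)
        point = [(1 - t) * x + t * y for x, y in zip(a, b)]
        # Small isotropic perturbation around the interpolated point.
        candidates.append([v + rng.gauss(0.0, noise_scale) for v in point])
    return candidates

# Toy stand-ins for encoded seed prompts (assumed values, for illustration only).
seeds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
candidates = explore(seeds, n_candidates=15, rng=random.Random(42))
```

Each candidate would then need to be projected and decoded back into text before it can be scored.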

Key Findings

Latent space search found a prompt that raised test accuracy from 75.36% to 78.14% on Financial PhraseBank.

Numbers: 75.36% -> 78.14% (+2.78 pp)

Practical Use: A small latent-space search can yield measurable accuracy gains; try one optimization iteration when you can run extra validation calls.

Evidence Ref: Table 1; Experiments

The proof-of-concept used 5 seed prompts, generated 15 candidates, and used 10% of training as a validation subset to score prompts.

Numbers: 5 seeds, 15 candidates, 10% validation

Practical Use: You can get gains with a small seed pool and limited candidates, but expect a trade-off between search breadth and evaluation cost.

Evidence Ref: Experiments (Section 4)
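
The trade-off between search breadth and evaluation cost can be made concrete by counting target-LLM calls. The sketch below assumes one call per (prompt, validation example) pair, with seeds scored once up front. The function name and the validation-set size of 300 are placeholders, not figures from the paper.

```python
def evaluation_calls(n_seeds, n_candidates, n_val_examples, n_iterations=1):
    """Target-LLM calls needed if each prompt is scored once on every validation example.

    Assumes seeds are scored once and each iteration scores a fresh batch of candidates.
    """
    return (n_seeds + n_candidates * n_iterations) * n_val_examples

# 5 seeds and 15 candidates as in the proof-of-concept; 300 examples is an assumed size.
calls = evaluation_calls(n_seeds=5, n_candidates=15, n_val_examples=300)
print(calls)  # 6000
```

Under these assumptions the cost grows linearly in both the number of candidates and the validation-set size, so doubling either roughly doubles the bill.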

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	78.14%	75.36%	+2.78 pp	Financial PhraseBank (test)	Table 1: Best seed vs best optimized prompt	Table 1

What To Try In 7 Days

Collect 3–10 strong seed prompts for a target task.

Encode seeds with an off-the-shelf sentence embedder and generate 10–30 interpolated candidates.

Evaluate candidates on a small held-out validation set and pick the top prompts for test runs.
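
The evaluation step in the list above could look roughly like this. It is a sketch under stated assumptions: `toy_classify` is a hypothetical keyword rule standing in for the target LLM so the example runs offline, and the two validation sentences are invented. In practice `classify` would wrap an API call that returns a label.

```python
def accuracy(prompt, examples, classify):
    """Fraction of (text, label) examples labeled correctly under this prompt."""
    if not examples:
        raise ValueError("validation set is empty")
    correct = sum(1 for text, label in examples if classify(prompt, text) == label)
    return correct / len(examples)

def rank_prompts(prompts, examples, classify, top_k=3):
    """Score every prompt on the validation examples; return the top_k (score, prompt) pairs."""
    scored = [(accuracy(p, examples, classify), p) for p in prompts]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best first; ties keep input order
    return scored[:top_k]

# Hypothetical stand-in for the target LLM, only so the sketch is runnable.
def toy_classify(prompt, text):
    if "sentiment" not in prompt.lower():
        return "neutral"
    return "positive" if "rose" in text else "negative"

val = [("Profit rose 10%", "positive"), ("Sales fell sharply", "negative")]
prompts = ["Label this sentence.", "Classify the financial sentiment of the sentence."]
best = rank_prompts(prompts, val, toy_classify, top_k=1)
```

Only the prompts returned here would go on to a test-set run, which keeps the test split untouched during selection.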

Optimization Features

Token Efficiency

produces human-readable prompts (not token strings)

Infra Optimization

reduces need for white-box model hosting; cost shifts to repeated API calls

System Optimization

modular components allow swapping encoder/decoder/projector to balance cost vs quality

Inference Optimization

black-box evaluation (no fine-tuning)

projects embeddings to token space with a linear projector
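
A linear projector is, at its simplest, an affine map from the encoder's embedding space into the decoder's input space. The sketch below shows only that mapping step. The weights here are hand-set toy values, whereas the paper's projector is learned; how it is trained and what shape its output takes are not described in this summary, so the function name and dimensions are assumptions.

```python
def project(embedding, weight, bias):
    """Apply an affine map: out[i] = sum_j weight[i][j] * embedding[j] + bias[i]."""
    if len(weight) != len(bias):
        raise ValueError("weight and bias must have the same number of rows")
    if any(len(row) != len(embedding) for row in weight):
        raise ValueError("each weight row must match the embedding length")
    return [sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(weight, bias)]

# Hypothetical toy sizes: a 2-d encoder embedding mapped into a 3-d decoder space.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
b = [0.0, 0.0, 0.5]
decoder_input = project([1.0, 2.0], W, b)
print(decoder_input)  # [1.0, 2.0, 3.5]
```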

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

Financial PhraseBank (Malo et al., 2014)

Risks & Boundaries

Limitations

Single proof-of-concept on one classification dataset limits generality.

Only one optimization iteration was run; larger searches may behave differently.

When Not To Use

When LLM query costs are prohibitive (many candidates needed).

When you need formal guarantees or rigorous safety checks before deployment.

Failure Modes

Decoded prompts become incoherent or lose required placeholders.

Overfitting to a small validation subset causes degraded test performance.
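
The first failure mode can be caught cheaply before any LLM calls are spent on a broken candidate. The check below is a sketch that assumes prompts are Python format strings and that the required placeholder is named `{text}`. Both are illustrative choices, not details from the paper.

```python
import string

def has_required_fields(prompt, required=("text",)):
    """Return True if the prompt parses as a format string and contains every required placeholder."""
    try:
        fields = {name for _, name, _, _ in string.Formatter().parse(prompt) if name}
    except ValueError:  # unbalanced braces, e.g. a stray "{" from a garbled decode
        return False
    return set(required) <= fields

# Invented decoder outputs: one usable, one missing the placeholder, one malformed.
decoded = [
    "Classify the sentiment of this financial sentence: {text}",
    "Classify the sentiment of this financial sentence.",
    "Classify the sentiment {text",
]
usable = [p for p in decoded if has_required_fields(p)]
```

A filter like this handles lost placeholders but not incoherent wording, which still has to be caught by the validation score or by manual review.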

Core Entities

Models

Mistral 7B v2

SRF-Embeddings-Mistral

Metrics

Accuracy

Datasets

Financial PhraseBank (Malo et al., 2014)

