Find better natural-language prompts by searching in embedding space

August 4, 2025 (6 min read)

Overview

Production Readiness

0.3

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

0

Authors

Mateusz Bystroński, Grzegorz Piotrowski, Nitesh V. Chawla, Tomasz Kajdanowicz

Why It Matters For Business

You can improve the performance of an API-only LLM on a task without model access or costly fine-tuning. That lowers operational friction for domain teams who need better outputs fast.

Summary TLDR

LatentPrompt maps natural-language prompts into a continuous embedding space and generates new prompt candidates by interpolating and perturbing those embeddings. It then projects the candidates back to token space with a learned linear projector, decodes them into human-readable prompts, and scores each one by running a target LLM on a small validation set. In a proof-of-concept on Financial PhraseBank, using 5 GPT-4o-generated seed prompts and 15 generated candidates, one optimization iteration improved test accuracy from 75.36% to 78.14% (+2.78 percentage points). The method treats the LLM as a black box and requires only an automatic evaluation function.

Problem Statement

Prompt design is usually manual and slow. Existing automated methods either need white-box gradient access or work with discrete token mutations that miss subtle semantic changes. We need a black-box, continuous method that explores prompt meaning and yields human-readable prompts.

Main Contribution

LatentPrompt: a model-agnostic pipeline that encodes prompts, explores their embedding space, projects embeddings into decoder token space, decodes readable prompts, and evaluates them by task performance.

A simple exploration strategy (mostly interpolation and small noise) that finds practical prompt improvements while treating the LLM as a black box (no gradients or fine-tuning).
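
The interpolation-plus-noise exploration described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the noise scale `sigma`, the uniform sampling of the interpolation weight, and the toy 3-dimensional vectors are all assumptions.

```python
import random

def interpolate(a, b, t):
    """Linear interpolation between two equal-length embeddings."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

def perturb(vec, sigma, rng):
    """Add independent Gaussian noise with standard deviation sigma."""
    return [x + rng.gauss(0.0, sigma) for x in vec]

def generate_candidates(seeds, n_candidates, sigma=0.01, seed=0):
    """Interpolate random pairs of seed embeddings, then add small noise."""
    if len(seeds) < 2:
        raise ValueError("need at least two seed embeddings")
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_candidates):
        a, b = rng.sample(seeds, 2)   # two distinct seeds
        t = rng.random()              # interpolation weight in [0, 1)
        candidates.append(perturb(interpolate(a, b, t), sigma, rng))
    return candidates

# Toy 3-dimensional "embeddings" (assumed values; real ones come from an encoder).
seed_embeddings = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                   [0.5, 0.5, 0.0], [0.2, 0.3, 0.5]]
candidate_embeddings = generate_candidates(seed_embeddings, n_candidates=15)
```

The 5 seeds and 15 candidates mirror the counts reported for the proof-of-concept; everything else is illustrative.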

A proof-of-concept evaluation on Financial PhraseBank showing a measurable gain in test accuracy (+2.78 percentage points) after a single optimization cycle.

A modular design: encoder, latent explorer, projector, decoder, and evaluator can be swapped independently.
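
One way to picture the modular design is as five swappable callables wired into a single optimization cycle. The sketch below is an assumption about how such a pipeline could be organized; the class name `PromptSearchPipeline`, its field names, and the toy stand-in components are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = List[float]

@dataclass
class PromptSearchPipeline:
    """Hypothetical container; each component can be replaced independently."""
    encode: Callable[[str], Vector]                      # prompt -> embedding
    explore: Callable[[Sequence[Vector]], List[Vector]]  # embeddings -> candidates
    project: Callable[[Vector], Vector]                  # embedding -> decoder space
    decode: Callable[[Vector], str]                      # decoder space -> prompt
    evaluate: Callable[[str], float]                     # prompt -> task score

    def run(self, seed_prompts: Sequence[str]) -> Tuple[str, float]:
        """Run one optimization cycle and return (best_prompt, best_score)."""
        latents = [self.encode(p) for p in seed_prompts]
        candidates = self.explore(latents)
        prompts = [self.decode(self.project(z)) for z in candidates]
        scored = [(self.evaluate(p), p) for p in prompts]
        best_score, best_prompt = max(scored, key=lambda pair: pair[0])
        return best_prompt, best_score

# Toy stand-ins so the sketch runs; real components would be models or API calls.
pipeline = PromptSearchPipeline(
    encode=lambda p: [float(len(p))],
    explore=lambda zs: [[(a[0] + b[0]) / 2.0] for a, b in zip(zs, zs[1:])],
    project=lambda z: z,
    decode=lambda z: f"len={round(z[0])}",
    evaluate=lambda p: float(len(p)),
)
best_prompt, best_score = pipeline.run(["aa", "bbbb"])
```

Because each stage is only a callable, swapping the encoder or the evaluator does not touch the rest of the cycle.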

Key Findings

Latent space search found a prompt that raised test accuracy from 75.36% to 78.14% on Financial PhraseBank.

Numbers: 75.36% -> 78.14% (+2.78 pp)

The proof-of-concept used 5 seed prompts, generated 15 candidates, and used 10% of training as a validation subset to score prompts.

Numbers: 5 seeds, 15 candidates, 10% validation

Method requires only black-box access to an LLM and an automatic evaluator (e.g., accuracy on a validation set).

Practical challenges remain: many interpolated candidates can be incoherent and evaluating many candidates is costly.
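
The black-box scoring setup in these findings (a 10% validation subset plus an automatic accuracy evaluator) might look like the sketch below. The helper names, the label-matching rule, and `toy_llm` are assumptions; `toy_llm` is a hypothetical stand-in for a real API call.

```python
import random

def split_validation(examples, fraction=0.10, seed=0):
    """Shuffle and split off a validation subset; returns (validation, rest)."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * fraction))
    return shuffled[:n_val], shuffled[n_val:]

def accuracy(prompt, llm, examples):
    """Fraction of (text, label) pairs that the black-box llm labels correctly."""
    if not examples:
        raise ValueError("examples must be non-empty")
    correct = sum(1 for text, label in examples
                  if llm(prompt, text).strip().lower() == label)
    return correct / len(examples)

def toy_llm(prompt, text):
    """Hypothetical stand-in for an LLM API call (ignores the prompt)."""
    return "positive" if "rose" in text else "negative"

# Made-up training examples, for illustration only.
train = [("profit rose", "positive"), ("sales fell", "negative")] * 10
validation, remainder = split_validation(train)
score = accuracy("Classify the sentiment.", toy_llm, validation)
```

Only `llm(prompt, text)` touches the model, so no gradients or weights are needed.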

Results

Accuracy

Value: 78.14%

Baseline: 75.36%
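
For context, the reported accuracies can be turned into a few comparison figures. Only the absolute gain of 2.78 pp is stated in the source; the relative gain and the relative error reduction below are derived here by simple arithmetic and are not claims from the paper.

```python
baseline_acc = 75.36   # reported baseline test accuracy, in percent
optimized_acc = 78.14  # reported test accuracy after one iteration, in percent

# Absolute gain in percentage points.
absolute_gain_pp = optimized_acc - baseline_acc

# Gain relative to the baseline accuracy (derived, not reported).
relative_gain_pct = 100.0 * absolute_gain_pp / baseline_acc

# Share of the baseline error rate that was removed (derived, not reported).
error_reduction_pct = 100.0 * absolute_gain_pp / (100.0 - baseline_acc)

print(f"absolute gain:   {absolute_gain_pp:.2f} pp")
print(f"relative gain:   {relative_gain_pct:.2f} %")
print(f"error reduction: {error_reduction_pct:.2f} %")
```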

What To Try In 7 Days

Collect 3–10 strong seed prompts for a target task.

Encode seeds with an off-the-shelf sentence embedder and generate 10–30 interpolated candidates.

Evaluate candidates on a small held-out validation set and pick the top prompts for test runs.
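
The last step, picking the top prompts by validation score, could be sketched as below. The function name `select_top_prompts`, the prompt strings, and the scores in `made_up_scores` are invented for illustration; in practice the score function would run the target LLM on the validation set.

```python
def select_top_prompts(prompts, score_fn, k=3):
    """Score each distinct prompt once; return the k best as (score, prompt)."""
    unique = list(dict.fromkeys(prompts))  # drop duplicates, keep first-seen order
    scored = [(score_fn(p), p) for p in unique]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Made-up validation scores standing in for real LLM evaluations.
made_up_scores = {
    "Label the sentiment of this financial sentence.": 0.74,
    "Is this statement positive, negative, or neutral for investors?": 0.79,
    "Classify the tone.": 0.61,
}
candidates = list(made_up_scores) + ["Classify the tone."]  # includes a duplicate
top = select_top_prompts(candidates, made_up_scores.get, k=2)
```

Deduplicating before scoring avoids paying for the same prompt twice, which matters when each score costs many LLM calls.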

Optimization Features

Token Efficiency

  • produces human-readable prompts (not token strings)

Infra Optimization

  • reduces need for white-box model hosting; cost shifts to repeated API calls

System Optimization

  • modular components allow swapping encoder/decoder/projector to balance cost vs quality

Inference Optimization

  • black-box evaluation (no fine-tuning)
  • projects embeddings to token space with linear projector
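
A linear projector is an affine map y = W z + b from the encoder's embedding space to the decoder's input space. The sketch below only shows how such a map is applied; the weights here are made-up numbers, whereas the paper describes a learned projector, and the training procedure is not shown.

```python
def linear_project(z, weights, bias):
    """Apply y = W z + b, where weights holds one row per output dimension."""
    if any(len(row) != len(z) for row in weights):
        raise ValueError("each weight row must match the input dimension")
    if len(bias) != len(weights):
        raise ValueError("bias must have one entry per output dimension")
    return [sum(w * x for w, x in zip(row, z)) + b
            for row, b in zip(weights, bias)]

# Toy example: project a 2-dimensional embedding into a 3-dimensional space.
z = [1.0, 2.0]
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
b = [0.0, 0.0, 0.5]
projected = linear_project(z, W, b)
```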

Reproducibility

Data Urls

  • Financial PhraseBank (Malo et al., 2014)

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Single proof-of-concept on one classification dataset limits generality.
  • Only one optimization iteration was run; larger searches may behave differently.
  • Evaluating many candidates requires many LLM calls and can be costly.
  • Some interpolated embeddings decode into incoherent or improperly formatted prompts.
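
To get a feel for the evaluation cost, a back-of-the-envelope estimate can help. The sketch assumes one LLM call per (prompt, validation example) pair; the validation size of 200 and the price per call are hypothetical placeholders, not figures from the paper.

```python
def estimate_llm_calls(n_prompts, n_val_examples, n_iterations=1):
    """Total calls, assuming one call per prompt per validation example."""
    return n_prompts * n_val_examples * n_iterations

def estimate_cost(n_calls, cost_per_call):
    """Total spend for a given per-call price (same currency as the price)."""
    return n_calls * cost_per_call

# 5 seeds + 15 candidates, as in the proof-of-concept.
n_prompts = 5 + 15
calls = estimate_llm_calls(n_prompts, n_val_examples=200)  # 200 is assumed
cost = estimate_cost(calls, cost_per_call=0.002)           # price is assumed
```

Cost grows linearly in each factor, so doubling the candidate pool or the validation set doubles the bill.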

When Not To Use

  • When LLM query costs are prohibitive (many candidates needed).
  • When you need formal guarantees or rigorous safety checks before deployment.
  • When you must avoid any automated modification to human-facing instructions.

Failure Modes

  • Decoded prompts become incoherent or lose required placeholders.
  • Overfitting to a small validation subset causes degraded test performance.
  • Projector mismatch leads to prompts that change meaning unpredictably.
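
A cheap guard against the first failure mode is to reject decoded prompts that have lost required placeholders before spending LLM calls on them. The sketch assumes prompts use Python-style `{name}` placeholders, which is an assumption about the template format rather than something the paper specifies; the helper names are hypothetical.

```python
from string import Formatter

def placeholder_names(template):
    """Return the set of named {placeholders} found in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}

def missing_placeholders(prompt, required):
    """Return the required placeholder names that the prompt does not contain."""
    try:
        found = placeholder_names(prompt)
    except ValueError:
        # Unbalanced braces: treat the prompt as malformed, so all are missing.
        return set(required)
    return set(required) - found

ok = missing_placeholders("Classify the sentiment of: {text}", {"text"})
lost = missing_placeholders("Classify the sentiment.", {"text"})
broken = missing_placeholders("Broken { prompt", {"text"})
```

Candidates with a non-empty result can be dropped or repaired before evaluation.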

Core Entities

Models

  • Mistral 7B v2
  • SFR-Embedding-Mistral

Metrics

  • Accuracy

Datasets

  • Financial PhraseBank (Malo et al., 2014)