Find better natural-language prompts by searching in embedding space

August 4, 2025 · 6 min

Overview

Decision Snapshot: Needs Validation

The idea is new and practical but evaluated only as a single, small proof-of-concept on one dataset with a single iteration. Costs come from extra LLM calls.

Citations: 0

Evidence Strength: 0.40

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 2/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/1

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 60%

Authors

Mateusz Bystroński, Grzegorz Piotrowski, Nitesh V. Chawla, Tomasz Kajdanowicz

Why It Matters For Business

You can improve the performance of an API-only LLM on a task without model access or costly fine-tuning. That lowers operational friction for domain teams who need better outputs fast.

Summary TLDR

LatentPrompt maps natural-language prompts into a continuous embedding space, generates new prompt candidates by interpolating and perturbing those embeddings, projects candidates back to token space with a learned linear projector, decodes human-readable prompts, and scores them by running a target LLM on a small validation set. In a proof-of-concept on Financial PhraseBank, starting from 5 GPT-4o-generated seeds and 15 generated candidates, one optimization iteration improved test accuracy from 75.36% to 78.14% (+2.78 percentage points). The method treats the LLM as a black box and requires only an automatic evaluation function.
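The candidate-generation step described above (interpolating between seed embeddings and adding small perturbations) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy 3-dimensional vectors stand in for real sentence embeddings, and the function names, the uniform mixing weight `alpha`, and the noise scale `sigma` are assumptions chosen for illustration.

```python
import random
from itertools import combinations

Vector = list[float]

def interpolate(a: Vector, b: Vector, alpha: float) -> Vector:
    """Linear interpolation: alpha=0 returns a, alpha=1 returns b."""
    return [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]

def perturb(v: Vector, sigma: float, rng: random.Random) -> Vector:
    """Add independent Gaussian noise with standard deviation sigma."""
    return [x + rng.gauss(0.0, sigma) for x in v]

def generate_candidates(seeds: list[Vector], n: int,
                        sigma: float = 0.01, seed: int = 0) -> list[Vector]:
    """Sample n candidates: pick a random seed pair, mix it, then add noise."""
    rng = random.Random(seed)
    pairs = list(combinations(range(len(seeds)), 2))
    candidates = []
    for _ in range(n):
        i, j = rng.choice(pairs)
        alpha = rng.random()  # assumed: mixing weight drawn uniformly from [0, 1)
        mixed = interpolate(seeds[i], seeds[j], alpha)
        candidates.append(perturb(mixed, sigma, rng))
    return candidates

# Toy stand-ins for the embeddings of 5 seed prompts (hypothetical values).
seed_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
]
candidates = generate_candidates(seed_embeddings, n=15)
print(len(candidates))
```

In a real run, each candidate vector would then be projected and decoded back into a readable prompt before scoring.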

Problem Statement

Prompt design is usually manual and slow. Existing automated methods either need white-box gradient access or work with discrete token mutations that miss subtle semantic changes. We need a black-box, continuous method that explores prompt meaning and yields human-readable prompts.

Main Contribution

LatentPrompt: a model-agnostic pipeline that encodes prompts, explores their embedding space, projects embeddings into decoder token space, decodes readable prompts, and evaluates them by task performance.

A simple exploration strategy (mostly interpolation and small noise) that finds practical prompt improvements while treating the LLM as a black box (no gradients or fine-tuning).
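One way to picture the modular pipeline is as a single function that takes each stage as a pluggable callable. The sketch below is an assumption-laden illustration: `optimize_once` and its parameter names are hypothetical, the projector and decoder are folded into one `project_and_decode` callable, and keeping the seeds in the scoring pool is a design choice made here rather than something the source specifies. The toy stubs only exist so the example runs without any model.

```python
from typing import Callable, Sequence

Vector = list[float]

def optimize_once(
    seed_prompts: Sequence[str],
    encode: Callable[[str], Vector],
    explore: Callable[[list[Vector]], list[Vector]],
    project_and_decode: Callable[[Vector], str],
    score: Callable[[str], float],
) -> tuple[str, float]:
    """Run one encode -> explore -> project/decode -> evaluate iteration."""
    latents = [encode(p) for p in seed_prompts]
    candidate_latents = explore(latents)
    candidate_prompts = [project_and_decode(z) for z in candidate_latents]
    # Assumption: seeds stay in the pool, so the winner is never worse
    # than the best seed on the validation score.
    pool = list(seed_prompts) + candidate_prompts
    scored = [(score(p), p) for p in pool]
    best_score, best_prompt = max(scored, key=lambda pair: pair[0])
    return best_prompt, best_score

# Toy stubs (hypothetical): a prompt is "embedded" as its length, and the
# black-box score prefers prompts that are exactly 7 characters long.
def toy_encode(prompt: str) -> Vector:
    return [float(len(prompt))]

def toy_explore(latents: list[Vector]) -> list[Vector]:
    # Midpoint of each adjacent pair of latents.
    return [[(a[0] + b[0]) / 2.0] for a, b in zip(latents, latents[1:])]

def toy_decode(z: Vector) -> str:
    return "x" * round(z[0])

def toy_score(prompt: str) -> float:
    return -abs(len(prompt) - 7)

best_prompt, best_score = optimize_once(
    ["aaaa", "aaaaaaaaaa"], toy_encode, toy_explore, toy_decode, toy_score
)
print(best_prompt, best_score)
```

Because `score` only ever sees a prompt string and returns a number, the target LLM stays a black box: no gradients or weights are needed at any stage.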

Key Findings

Latent space search found a prompt that raised test accuracy from 75.36% to 78.14% on Financial PhraseBank.

Numbers: 75.36% -> 78.14% (+2.78 pp)

Practical Use: A small latent-space search can yield measurable accuracy gains; try one optimization iteration when you can run extra validation calls.

Evidence Ref: Table 1; Experiments
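To put the reported gain in context, the short calculation below reproduces the +2.78 percentage-point delta from the two reported accuracies and also derives two relative figures. Only the two accuracies come from the source; the relative gain and relative error reduction are arithmetic derived here, not numbers reported by the authors.

```python
baseline_acc = 75.36   # best seed prompt, test accuracy (%), as reported
optimized_acc = 78.14  # best optimized prompt, test accuracy (%), as reported

# Absolute difference in percentage points.
delta_pp = optimized_acc - baseline_acc

# Gain relative to the baseline accuracy.
relative_gain = delta_pp / baseline_acc * 100

# Share of the baseline's errors that the optimized prompt removes.
error_reduction = delta_pp / (100 - baseline_acc) * 100

print(f"absolute gain: {delta_pp:.2f} pp")
print(f"relative accuracy gain: {relative_gain:.2f}%")
print(f"relative error reduction: {error_reduction:.2f}%")
```

The absolute gain is modest, and with a single dataset and a single iteration there is no variance estimate to say how stable it is.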

The proof-of-concept used 5 seed prompts, generated 15 candidates, and used 10% of training as a validation subset to score prompts.

Numbers: 5 seeds, 15 candidates, 10% validation

Practical Use: You can get gains with a small seed pool and limited candidates, but expect a trade-off between search breadth and evaluation cost.

Evidence Ref: Experiments (Section 4)
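The trade-off between search breadth and evaluation cost can be estimated before running anything. The sketch below draws a 10% validation subset and counts the LLM calls needed to score every prompt on it. Several things are assumptions made for illustration: the training-set size of 1,000 is hypothetical (it is not the Financial PhraseBank size), one LLM call per prompt-example pair is assumed, and the seeds are assumed to be scored alongside the candidates.

```python
import random

def validation_subset(train: list, fraction: float = 0.10, seed: int = 0) -> list:
    """Draw a random subset of the training data without replacement."""
    rng = random.Random(seed)
    k = max(1, round(len(train) * fraction))
    return rng.sample(train, k)

def evaluation_calls(n_prompts: int, n_validation: int) -> int:
    """Assumed cost model: one LLM call per (prompt, validation example) pair."""
    return n_prompts * n_validation

# Hypothetical training set of 1,000 items, used only to make the numbers concrete.
train = [f"sentence {i}" for i in range(1000)]
val = validation_subset(train)

# 5 seeds + 15 candidates, each scored on the whole validation subset.
calls = evaluation_calls(n_prompts=5 + 15, n_validation=len(val))
print(len(val), calls)
```

Under this cost model, doubling the number of candidates or the validation fraction doubles the number of calls, which is where the search-breadth trade-off comes from.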

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 78.14% | 75.36% | +2.78 pp | Financial PhraseBank (test) | Table 1: Best seed vs best optimized prompt | Table 1

What To Try In 7 Days

Collect 3–10 strong seed prompts for a target task.

Encode seeds with an off-the-shelf sentence embedder and generate 10–30 interpolated candidates.

Evaluate candidates on a small held-out validation set and pick the top prompts for test runs.
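The last step above, scoring prompts on a held-out validation set and keeping the best ones, might look like the following sketch. The `classify` callable stands in for a call to the target LLM and is entirely hypothetical, as are the toy examples and the rule inside `fake_llm`; in practice it would wrap whatever API client you already use.

```python
from typing import Callable

Example = tuple[str, str]  # (input text, gold label)

def accuracy(prompt: str, examples: list[Example],
             classify: Callable[[str, str], str]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    correct = sum(1 for text, gold in examples if classify(prompt, text) == gold)
    return correct / len(examples)

def top_k(prompts: list[str], examples: list[Example],
          classify: Callable[[str, str], str], k: int = 3) -> list[tuple[float, str]]:
    """Score every prompt on the validation examples and keep the k best."""
    scored = [(accuracy(p, examples, classify), p) for p in prompts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

def fake_llm(prompt: str, text: str) -> str:
    """Hypothetical stand-in for the target LLM, used only to make this runnable."""
    if "sentiment" in prompt and "profit" in text:
        return "positive"
    return "neutral"

validation = [
    ("profit rose", "positive"),
    ("meeting held", "neutral"),
    ("profit up", "positive"),
    ("loss widened", "negative"),
]
best = top_k(["Label this:", "Classify the sentiment:"], validation, fake_llm, k=1)
print(best)
```

Only the prompts returned by `top_k` would go on to the test set, so the test split is touched once per selected prompt rather than once per candidate.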

Optimization Features

Token Efficiency: produces human-readable prompts (not token strings)
Infra Optimization: reduces need for white-box model hosting; cost shifts to repeated API calls
System Optimization: modular components allow swapping encoder/decoder/projector to balance cost vs quality
Inference Optimization: black-box evaluation (no fine-tuning); projects embeddings to token space with linear projector

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Financial PhraseBank (Malo et al., 2014)

Risks & Boundaries

Limitations

Single proof-of-concept on one classification dataset limits generality.

Only one optimization iteration was run; larger searches may behave differently.

When Not To Use

When LLM query costs are prohibitive (many candidates needed).

When you need formal guarantees or rigorous safety checks before deployment.

Failure Modes

Decoded prompts become incoherent or lose required placeholders.

Overfitting to a small validation subset causes degraded test performance.
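The first failure mode can be caught cheaply before spending any LLM calls: filter out decoded prompts that have lost a required placeholder. The sketch below assumes prompts use Python `str.format`-style fields such as `{text}`; the source does not say which placeholder syntax the authors use, and the helper names are hypothetical.

```python
import string

def placeholders(template: str) -> set[str]:
    """Return the named {fields} found in a format-style template."""
    parsed = string.Formatter().parse(template)
    return {name for _, name, _, _ in parsed if name}

def keeps_placeholders(candidate: str, required: set[str]) -> bool:
    """True if the decoded prompt still contains every required field."""
    try:
        return required <= placeholders(candidate)
    except ValueError:
        # Unbalanced braces: treat the decoded prompt as malformed.
        return False

required = placeholders("Classify the sentiment of: {text}")
decoded = [
    "Decide the financial sentiment of this sentence: {text}",
    "Decide the financial sentiment of this sentence.",
    "Decide the sentiment of { text",
]
valid = [p for p in decoded if keeps_placeholders(p, required)]
print(valid)
```

A check like this does not address incoherent wording or overfitting to the validation subset; those still need a human read and a separate test split.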

Core Entities

Models

Mistral 7B v2, SFR-Embedding-Mistral

Metrics

Accuracy

Datasets

Financial PhraseBank (Malo et al., 2014)