Find a model's true knowledge boundary by optimizing prompts that preserve meaning

February 18, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Xunjian Yin, Xu Zhang, Jie Ruan, Xiaojun Wan

Why It Matters For Business

Fixed-question testing can hide or undercount model knowledge and lead to poor model choices; optimized, semantics-preserving prompt search reveals a model's true answerable range so teams can pick models that actually cover needed domain facts.

Summary TLDR

This paper argues that fixed prompts give an unreliable view of what a language model truly "knows." The authors define a model's "knowledge boundary" (what it can answer under any expression vs. what it cannot) and introduce PGDC, a prompt-optimization algorithm that searches the semantic neighborhood of a question for an optimal prompt while preserving its meaning. PGDC outperforms common baselines on multiple knowledge benchmarks, preserves semantics according to human checks, and rarely induces fake (counterfactual) answers. The method needs access to model embeddings and generation probabilities, and it focuses on exposing unanswerable knowledge rather than measuring prompt-sensitive gray areas.

Problem Statement

Current model evaluations feed fixed questions or a few paraphrases to LLMs. Because LLMs are sensitive to wording, this yields unstable estimates of what a model knows. The paper aims to reduce this randomness by searching for an "optimal" prompt that keeps the original meaning, and by using it to map which knowledge lies inside vs. outside a model's capability boundary.

Main Contribution

Define the "knowledge boundary" through three knowledge classes: prompt-agnostic, prompt-sensitive, and unanswerable.
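
The three classes above can be read as a rule over per-prompt outcomes. The following is a minimal sketch, not the paper's code: it assumes you have already tried several meaning-preserving prompts for one fact and recorded whether each elicited the correct answer. The names `KnowledgeClass` and `classify` are hypothetical. With a finite set of prompts, the "unanswerable" label is only an approximation, which is the gap an optimized prompt search is meant to narrow.

```python
from enum import Enum


class KnowledgeClass(Enum):
    PROMPT_AGNOSTIC = "prompt-agnostic"    # correct under every prompt tried
    PROMPT_SENSITIVE = "prompt-sensitive"  # correct under some prompts only
    UNANSWERABLE = "unanswerable"          # correct under no prompt tried


def classify(outcomes):
    """Classify one fact from a list of booleans, one per prompt tried.

    Assumption: every prompt in `outcomes` preserves the original meaning.
    """
    if not outcomes:
        raise ValueError("need at least one prompt outcome")
    if all(outcomes):
        return KnowledgeClass.PROMPT_AGNOSTIC
    if any(outcomes):
        return KnowledgeClass.PROMPT_SENSITIVE
    return KnowledgeClass.UNANSWERABLE
```

For example, `classify([True, False, True])` returns `KnowledgeClass.PROMPT_SENSITIVE`.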

Propose PGDC, a projected gradient descent algorithm with semantic and projection constraints to search optimal prompts.
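
To make the idea of projected gradient descent over prompt embeddings concrete, here is a toy sketch. It is not the authors' implementation and does not reproduce PGDC's actual loss or projection schedule. Everything in it is an assumption for illustration: the 2-D `VOCAB` table, the quadratic `toy_loss_grad` standing in for the model's generation loss, and an L2 ball around each original embedding standing in for the semantic constraint. Each step takes a gradient move, projects back into the ball, and the final vectors are snapped to the nearest token.

```python
import math

# Hypothetical 2-D token embeddings; a real model would supply its own table.
VOCAB = {
    "who": (0.0, 1.0),
    "which": (0.2, 0.9),
    "wrote": (1.0, 0.0),
    "authored": (0.9, 0.2),
    "banana": (-1.0, -1.0),
}


def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])


def nearest_token(vec):
    """Discrete projection: snap a vector to the closest vocabulary embedding."""
    return min(VOCAB, key=lambda tok: dist(VOCAB[tok], vec))


def clip_to_ball(vec, center, radius):
    """Semantic-constraint stand-in: stay within `radius` of the original embedding."""
    d = dist(vec, center)
    if d <= radius:
        return vec
    scale = radius / d
    return (center[0] + (vec[0] - center[0]) * scale,
            center[1] + (vec[1] - center[1]) * scale)


def toy_loss_grad(vecs, target):
    """Gradient of the summed squared distance to `target` (stand-in for a model loss)."""
    return [(2 * (v[0] - target[0]), 2 * (v[1] - target[1])) for v in vecs]


def pgd_search(tokens, target, steps=20, lr=0.1, radius=0.4):
    origin = [VOCAB[t] for t in tokens]
    vecs = list(origin)
    for _ in range(steps):
        grads = toy_loss_grad(vecs, target)
        # Gradient step, then projection back into the allowed neighborhood.
        vecs = [(v[0] - lr * g[0], v[1] - lr * g[1]) for v, g in zip(vecs, grads)]
        vecs = [clip_to_ball(v, o, radius) for v, o in zip(vecs, origin)]
    return [nearest_token(v) for v in vecs]


print(pgd_search(["who", "wrote"], target=(0.5, 0.5)))  # ['which', 'authored']
```

The radius controls how far the search may drift: with `radius=0.0` the original tokens come back unchanged, while a larger radius allows near-synonyms but never reaches the unrelated token `banana` in this toy table.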

Show that PGDC uncovers broader knowledge boundaries than standard zero/few-shot baselines, supported by human checks and counterfactual robustness tests.

Key Findings

PGDC finds more answerable items than standard prompting on common-knowledge benchmarks.

Numbers: LLaMA2 success: PGDC 71.36% vs P-few 66.95% vs zero 34.43%

PGDC preserves original meaning in most cases according to human judges.

Numbers: Semantic preservation: GPT-2 80.5%, GPT-J 85.1%, LLaMA2 83.3%, Vicuna 86.2%

PGDC is far less likely to induce fake answers on counterfactual data than an adversarial prompt method.

Numbers: Autoprompt on CFACT: 92.38%/85.67%/88.35%/33.09% (models); PGDC: 2.81%/4.82%/3.41%/3.50%

Results

success rate (constructing prompts that elicit correct answers)

Value: 71.36%

Baseline: P-few 66.95%

semantic preservation rate (human annotators)

Value: 83.3% (LLaMA2 average)

counterfactual induction rate (CFACT)

Value: ≈3–5% (PGDC across models)

Baseline: 33–92% (Autoprompt across models)

MMLU cloze-style scores (domain coverage)

Value: ≈1–25% depending on subject and model

Baseline: choice-style scores are typically higher (as the paper notes)
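
The success rate and counterfactual induction rate above are both simple proportions over per-item outcomes. The sketch below shows one way to compute them, assuming you hold one boolean per item. The function names are hypothetical and the paper's exact scoring rules may differ.

```python
def percentage(hits, total):
    """Return hits / total as a percentage rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * hits / total, 2)


def success_rate(elicited_correct):
    """Share of factual items for which some prompt elicited the correct answer."""
    return percentage(sum(elicited_correct), len(elicited_correct))


def counterfactual_induction_rate(elicited_fake):
    """Share of counterfactual items for which a prompt elicited the fake answer.

    Lower is better: a high value means the method forces outputs
    rather than surfacing knowledge the model already has.
    """
    return percentage(sum(elicited_fake), len(elicited_fake))
```

For example, one induced fake answer in 25 counterfactual items gives `counterfactual_induction_rate([True] + [False] * 24) == 4.0`.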

Who Should Care

What To Try In 7 Days

Run PGDC-style prompt search on 100 mission-critical queries to map your model's knowledge boundary.

Compare PGDC results to zero/few-shot to see which queries are prompt-sensitive.

Validate 50 optimized prompts with quick human checks to confirm that semantics are preserved (target: 80%+).
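
The comparison and validation steps above can be tallied with a few lines of bookkeeping. This is a sketch under stated assumptions: `baseline_ok` and `search_ok` are hypothetical dictionaries mapping each query to whether the baseline prompt or the optimized prompt elicited the correct answer, and `judgements` is a list of booleans from human reviewers.

```python
def compare_runs(baseline_ok, search_ok):
    """Group queries by which prompting approach answered them correctly."""
    groups = {"both": [], "search_only": [], "baseline_only": [], "neither": []}
    for query, base in baseline_ok.items():
        found = search_ok[query]
        if base and found:
            groups["both"].append(query)
        elif found:
            groups["search_only"].append(query)  # likely prompt-sensitive
        elif base:
            groups["baseline_only"].append(query)
        else:
            groups["neither"].append(query)      # candidate unanswerable
    return groups


def human_check_passes(judgements, target=0.80):
    """True if the share of meaning-preserving prompts meets the target."""
    if not judgements:
        raise ValueError("no judgements supplied")
    return sum(judgements) / len(judgements) >= target
```

Queries in `search_only` are the ones a fixed-prompt evaluation would have undercounted. Queries in `neither` are the candidates for the unanswerable side of the boundary.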

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • PGDC seeks optimal prompts near the original question and only reports unanswerable vs answerable; it does not quantify prompt-sensitive gray areas.
  • The method requires access to model embeddings and generation probabilities, so it may be hard to apply to black-box APIs.
  • Optimized prompts can still change semantics occasionally, especially for weaker models like GPT-2.

When Not To Use

  • When you only have a black-box API without embeddings or logits.
  • When you cannot afford iterative optimization costs for many queries.
  • When you need a quantitative measure of prompt sensitivity rather than a binary boundary.

Failure Modes

  • Projection maps embeddings to tokens that subtly change meaning.
  • Optimization overfits to model idiosyncrasies or dataset artifacts.
  • Malicious actors could abuse prompt-search to force false outputs.

Core Entities

Models

  • LLaMA2
  • Vicuna
  • GPT-J
  • GPT-2
  • Mistral

Metrics

  • success rate
  • semantic preservation rate
  • counterfactual induction rate

Datasets

  • KAssess
  • PARAREL
  • COUNTERFACT
  • ALCUNA
  • MMLU

Benchmarks

  • PARAREL
  • KAssess
  • CFACT
  • ALCUNA
  • MMLU (cloze)