Find a model's true knowledge boundary by optimizing prompts that preserve meaning

February 18, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

PGDC is a practical, tested method that reliably expands measured knowledge but needs white‑box access (embeddings/logits) and moderate compute to run iterative projection and gradient updates.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Xunjian Yin, Xu Zhang, Jie Ruan, Xiaojun Wan

Why It Matters For Business

Fixed-question testing can hide or undercount model knowledge and lead to poor model choices; optimized, semantics-preserving prompt search reveals a model's true answerable range so teams can pick models that actually cover needed domain facts.

Summary TLDR

This paper argues that fixed prompts give a shaky view of what a language model truly "knows." The authors define a model's "knowledge boundary" (what it can answer under any expression vs. what it cannot) and introduce PGDC, a prompt-optimization algorithm that searches the semantic neighborhood of a question for an optimal prompt while preserving its meaning. PGDC outperforms common baselines on multiple knowledge benchmarks, preserves semantics according to human checks, and avoids inducing large numbers of fake (counterfactual) answers. The method needs access to model embeddings and generation probabilities, and it focuses on exposing unanswerable knowledge rather than measuring prompt-sensitive gray areas.

Problem Statement

Current model evaluations feed fixed questions or a few paraphrases to LLMs. Because LLMs are sensitive to wording, this yields unreliable and unstable estimates of what a model knows. The paper aims to reduce this randomness by searching for an "optimal" prompt (one that keeps the same meaning) and using it to map which knowledge lies inside vs. outside a model's capability boundary.

Main Contribution

Define the "knowledge boundary" through three knowledge classes: prompt-agnostic, prompt-sensitive, and unanswerable.

Propose PGDC, a projected gradient descent algorithm with semantic and projection constraints to search optimal prompts.
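
The projected-gradient idea can be illustrated with a toy sketch. This is not the authors' implementation: the loss is a hypothetical stand-in for the model's answer loss, the semantic constraint is approximated by an L2 ball around the original prompt embeddings, and every name here (`pgd_prompt_search`, `radius`, `snap_every`, `toy_loss_and_grad`) is an assumption made for illustration. A real run would take gradients from the model's own logits.

```python
import numpy as np

def nearest_token_ids(X, E):
    """For each row of X, return the index of the closest row of E (L2 distance)."""
    # Broadcasting: (T, 1, d) - (1, V, d) -> (T, V) squared distances.
    d2 = ((X[:, None, :] - E[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def project_to_ball(X, X0, radius):
    """Assumed semantic constraint: keep X within `radius` of X0 (Frobenius norm)."""
    delta = X - X0
    norm = np.linalg.norm(delta)
    if norm > radius:
        delta = delta * (radius / norm)
    return X0 + delta

def toy_loss_and_grad(X, target):
    """Hypothetical stand-in for the answer loss: ||mean(X) - target||^2."""
    diff = X.mean(axis=0) - target
    loss = float(diff @ diff)
    # d/dX[t] of the loss is 2 * diff / T for every position t.
    grad = np.tile(2.0 * diff / X.shape[0], (X.shape[0], 1))
    return loss, grad

def pgd_prompt_search(ids0, E, target, steps=50, lr=0.5, radius=2.0, snap_every=10):
    """Gradient steps in embedding space, with periodic projection onto real tokens."""
    X0 = E[ids0].copy()
    X = X0.copy()
    best_ids = np.asarray(ids0)
    best_loss, _ = toy_loss_and_grad(X0, target)
    for step in range(1, steps + 1):
        _, grad = toy_loss_and_grad(X, target)
        X = project_to_ball(X - lr * grad, X0, radius)
        if step % snap_every == 0:
            ids = nearest_token_ids(X, E)
            X = E[ids].copy()  # projection constraint: back to discrete tokens
            loss, _ = toy_loss_and_grad(X, target)
            within = np.linalg.norm(X - X0) <= radius
            if within and loss < best_loss:
                best_ids, best_loss = ids, loss
    return best_ids, best_loss

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    E = rng.normal(size=(20, 4))  # toy vocabulary: 20 tokens, 4 dimensions
    ids, loss = pgd_prompt_search([3, 7, 11], E, target=E.mean(axis=0))
    print(ids, loss)
```

The sketch only ever returns a discrete prompt that satisfies the distance bound and is no worse than the starting prompt under the toy loss, which mirrors the intent of searching near the original question rather than anywhere in embedding space.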

Key Findings

PGDC finds more answerable items than standard prompting on common-knowledge benchmarks.

Numbers: LLaMA2 success rate: PGDC 71.36% vs P-few 66.95% vs zero-shot 34.43%

Practical Use: If you want a less conservative estimate of a model's knowledge, run PGDC-style prompt search rather than a single fixed prompt.

Evidence Ref: Table 1 (success rates)

PGDC preserves original meaning in most cases according to human judges.

Numbers: Semantic preservation: GPT-2 80.5%, GPT-J 85.1%, LLaMA2 83.3%, Vicuna 86.2%

Practical Use: Optimized prompts are usually safe to treat as paraphrases; still sample-check outputs for weaker models.

Evidence Ref: Section 4.5 (manual evaluation results)
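
A semantic preservation rate of this kind can be computed from per-prompt annotator judgments. The summary does not say how the annotators' votes were aggregated, so the strict-majority rule below is an assumption, as is the function name `semantic_preservation_rate`.

```python
def semantic_preservation_rate(votes):
    """Fraction of prompts that a strict majority of annotators judged as preserved.

    votes: list of per-prompt lists of booleans, one boolean per annotator.
    Majority voting is an assumed aggregation rule, not one stated by the source.
    """
    if not votes:
        raise ValueError("no annotated prompts")
    preserved = sum(1 for v in votes if sum(v) * 2 > len(v))
    return preserved / len(votes)

if __name__ == "__main__":
    sample = [
        [True, True, False],   # preserved (2 of 3)
        [False, False, True],  # not preserved (1 of 3)
        [True, True, True],    # preserved
        [False, True, True],   # preserved
    ]
    print(semantic_preservation_rate(sample))
```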

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
success rate (constructing prompts that elicit correct answers) | 71.36% | P-few 66.95% | +4.41 pp | Table 1 (first block, LLaMA2) | PGDC outperforms P-few and zero-shot on common-knowledge sets | Table 1
semantic preservation rate (human annotators) | 83.3% (LLaMA2 average) | - | - | PARAREL (200 samples, 3 annotators) | Majority of optimized prompts judged semantically consistent | Section 4.5
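
The delta in the first row is in percentage points, which is easy to confuse with a relative gain. The short sketch below recomputes both from the reported LLaMA2 figures (the zero-shot value comes from the Key Findings numbers); the helper names are illustrative assumptions.

```python
def delta_pp(value, baseline):
    """Absolute difference between two percentages, in percentage points."""
    return value - baseline

def relative_gain_pct(value, baseline):
    """Relative improvement over the baseline, as a percentage of the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (value - baseline) / baseline

if __name__ == "__main__":
    pgdc, p_few, zero_shot = 71.36, 66.95, 34.43  # LLaMA2 success rates (%)
    for name, base in [("P-few", p_few), ("zero-shot", zero_shot)]:
        print(f"vs {name}: {delta_pp(pgdc, base):+.2f} pp, "
              f"{relative_gain_pct(pgdc, base):+.2f}% relative")
```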

What To Try In 7 Days

Run PGDC-style prompt search on 100 mission-critical queries to map your model's knowledge boundary.

Compare PGDC results to zero-shot and few-shot prompting to see which queries are prompt-sensitive.

Validate 50 optimized prompts with quick human checks to confirm that semantics are preserved (80%+ target).
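
One way to turn the comparison above into a boundary map is to record, per query, whether each prompt variant produced a correct answer and then bucket the query. The bucketing rule below is an operational assumption for a finite set of tested prompts (all correct, some correct, none correct); the variant names and function names are hypothetical.

```python
from collections import Counter

def classify_query(results):
    """Bucket one query from its per-variant correctness.

    results: dict mapping a prompt-variant name to True/False (answer correct).
    The three labels follow the paper's class names; deciding them from a
    finite sample of prompts is an assumption of this sketch.
    """
    if not results:
        raise ValueError("need at least one prompt variant")
    outcomes = list(results.values())
    if all(outcomes):
        return "prompt-agnostic"
    if any(outcomes):
        return "prompt-sensitive"
    return "unanswerable"

def boundary_summary(all_results):
    """Count how many queries fall into each class."""
    return Counter(classify_query(r) for r in all_results.values())

if __name__ == "__main__":
    runs = {
        "q1": {"zero-shot": True, "few-shot": True, "optimized": True},
        "q2": {"zero-shot": False, "few-shot": False, "optimized": True},
        "q3": {"zero-shot": False, "few-shot": False, "optimized": False},
    }
    print(boundary_summary(runs))
```

Queries that land in the "unanswerable" bucket even after prompt search are the ones worth routing to retrieval or a different model.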

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

PGDC seeks optimal prompts near the original question and only reports unanswerable vs answerable; it does not quantify prompt-sensitive gray areas.

The method requires access to model embeddings and generation probabilities, so it may be hard to apply to models exposed only through black-box APIs.

When Not To Use

When you only have a black-box API without embeddings or logits.

When you cannot afford iterative optimization costs for many queries.

Failure Modes

Projection maps embeddings to tokens that subtly change meaning.

Optimization overfits to model idiosyncrasies or dataset artifacts.
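
To make the first failure mode easier to catch during spot checks, reviewers can be shown exactly which words changed between the original and the optimized prompt. The sketch below uses Python's standard `difflib`; whitespace tokenization and the function name `changed_spans` are simplifying assumptions, and a diff only surfaces edits, it does not judge whether meaning changed.

```python
import difflib

def changed_spans(original, optimized):
    """List (operation, original_words, optimized_words) for every non-equal span."""
    a, b = original.split(), optimized.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

if __name__ == "__main__":
    for span in changed_spans("Who wrote Hamlet", "Who authored Hamlet"):
        print(span)
```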

Core Entities

Models

LLaMA2, Vicuna, GPT-J, GPT-2, Mistral

Metrics

success rate, semantic preservation rate, counterfactual induction rate

Datasets

KAssess, PARAREL, COUNTERFACT, ALCUNA, MMLU

Benchmarks

PARAREL, KAssess, CFACT, ALCUNA, MMLU (cloze)