Use evolutionary search to generate harmless prompts that trigger unnecessary LLM refusals, build tests and alignment data, and reduce over‑

Overview

Decision Snapshot: Ready For Pilot

The method and datasets are well documented and show consistent improvements across many models, but the approach needs white-box access and substantial compute (GPT-4O calls and Monte Carlo sampling).

Citations: 0

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 45%

Production readiness: 65%

Novelty: 60%

Authors

Xiaorui Wu, Fei Li, Xiaofeng Mao, Xin Zhang, Li Zheng, Yuxiang Peng, Chong Teng, Donghong Ji, Zhuang Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

EVOREFUSE finds realistic safe prompts that current safety tuning wrongly blocks; use its testset to measure and its alignment data to reduce needless refusals without reducing safety.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

EVOREFUSE is an evolutionary prompt optimizer that searches for harmless instructions which nevertheless make LLMs refuse. It optimizes a variational surrogate (an ELBO) of refusal probability, uses GPT-4O for mutation/recombination and safety checks, and produces two datasets: EVOREFUSE-TEST (582 challenging prompts) and EVOREFUSE-ALIGN (3,000 alignment pairs). EVOREFUSE-TEST raises average refusal rates vs prior benchmarks by ~85% (140% under a safety system prompt), increases lexical diversity by ~35%, and yields higher model response confidence. Fine-tuning LLAMA3.1-8B-INSTRUCT on EVOREFUSE-ALIGN reduces over-refusals (SFT: ~29.9% fewer; DPO: ~45.96% fewer) while keeping safety.
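The search loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `fitness`, `mutate`, `recombine`, and `is_safe` are hypothetical callables standing in for the ELBO-based score and the GPT-4O calls, and the way simulated annealing is combined with the evolutionary step (an acceptance rule with a cooling temperature) is an assumption.

```python
import math
import random
from typing import Callable, List, Optional


def evolve(
    seeds: List[str],
    fitness: Callable[[str], float],       # hypothetical: higher = more likely refused
    mutate: Callable[[str], str],          # hypothetical: e.g. an LLM rewrite call
    recombine: Callable[[str, str], str],  # hypothetical: e.g. an LLM merge call
    is_safe: Callable[[str], bool],        # hypothetical: harmlessness check
    steps: int = 20,
    t0: float = 1.0,
    cooling: float = 0.9,
    rng: Optional[random.Random] = None,
) -> str:
    """Return the highest-fitness safe instruction found during the search."""
    rng = rng or random.Random(0)
    current = max(seeds, key=fitness)
    best = current
    temp = t0
    for _ in range(steps):
        # Propose variants by mutation, plus one recombination of two of them.
        variants = [mutate(current) for _ in range(3)]
        variants.append(recombine(variants[0], variants[1]))
        # Discard anything the safety check rejects.
        safe = [v for v in variants if is_safe(v)]
        if safe:
            cand = max(safe, key=fitness)
            delta = fitness(cand) - fitness(current)
            # Annealing-style acceptance (assumed): always take improvements,
            # occasionally take a worse candidate while the temperature is high.
            if delta >= 0 or rng.random() < math.exp(delta / temp):
                current = cand
            if fitness(current) > fitness(best):
                best = current
        temp *= cooling
    return best
```

In a real run, `fitness` would be the expensive part, since each evaluation needs target-model scores for sampled responses.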

Problem Statement

LLMs can be overly conservative and refuse safe but sensitive-sounding user instructions. Existing datasets and automatic methods either scale poorly or fail to produce diverse, high-confidence refusal triggers. The paper aims to automatically generate diverse, harmless prompts that reliably induce over-refusal, and to create datasets for evaluation and alignment.

Main Contribution

EVOREFUSE: an evolutionary algorithm that maximizes a variational ELBO surrogate to find harmless instructions that trigger high-confidence refusals.

Two datasets: EVOREFUSE-TEST (582 pseudo-malicious prompts) for evaluation and EVOREFUSE-ALIGN (3,000 instruction–response / preference pairs) for alignment training.
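For readers unfamiliar with the term, a generic variational lower bound on refusal probability can be written as below. The notation is assumed for illustration (instruction x, sampled response y, refusal event r, variational distribution q); the paper's exact decomposition may differ.

```latex
\log p(r \mid x)
  = \log \sum_{y} p(y \mid x)\, p(r \mid x, y)
  \;\ge\; \mathbb{E}_{q(y)}\!\left[ \log p(r \mid x, y) \right]
        - \mathrm{KL}\!\left( q(y) \,\|\, p(y \mid x) \right)
```

The inequality follows from Jensen's inequality. Choosing q(y) = p(y | x) removes the KL term and leaves an expectation that can be estimated by averaging log p(r | x, y) over responses sampled from the target model.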

Key Findings

EVOREFUSE-TEST triggers more refusals than prior benchmarks on evaluated models.

Numbers: 85.34% higher avg refusal rate vs best prior across 9 LLMs (no safety-prior); 140.41% higher with safety-prior

Practical Use: Use EVOREFUSE-TEST when you need a harder, more general test of over-refusal across models; it finds refusal triggers that other benchmarks miss.

Evidence Ref: Abstract, Sec.5.1, Table 1, Appendix B.5

EVOREFUSE-TEST produces more diverse and higher-confidence refusal examples than baselines.

Numbers: 34.86% greater lexical diversity; 40.03% higher response log-probability vs second-best baseline

Practical Use: Diversity reduces overfitting of evaluation; prefer EVOREFUSE-TEST to stress model safety decisions in varied wording.

Evidence Ref: Table 2; Sec.5.1
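MSTTR (mean segmental type-token ratio) is one of the lexical diversity metrics listed for this work. A minimal sketch of the standard definition is shown here; the segment length and whitespace tokenization are assumptions, not the paper's settings.

```python
from typing import List


def msttr(tokens: List[str], segment_len: int = 50) -> float:
    """Mean type-token ratio over full, non-overlapping segments."""
    n_segments = len(tokens) // segment_len
    if n_segments == 0:
        raise ValueError("need at least one full segment of tokens")
    ratios = []
    for i in range(n_segments):
        seg = tokens[i * segment_len:(i + 1) * segment_len]
        # Type-token ratio: distinct tokens divided by total tokens in the segment.
        ratios.append(len(set(seg)) / segment_len)
    return sum(ratios) / n_segments
```

Trailing tokens that do not fill a whole segment are ignored, which keeps every ratio computed over the same number of tokens.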

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average refusal rate (PRR) vs best prior	85.34% higher	best prior dataset (PH-GEN)	85.34%	EVOREFUSE-TEST across 9 LLMs (no safety-prior)	Abstract, Sec.5.1, Table 1	—
Refusal rate improvement under safety-prior prompt	140.41% higher	next-best (SGTEST)	140.41%	EVOREFUSE-TEST across 9 LLMs (with safety-prior)	Abstract, Sec.5.1, Appendix B.5	—

What To Try In 7 Days

Run EVOREFUSE-TEST samples against your deployed model to quantify over-refusal.

Fine‑tune a copy with EVOREFUSE-ALIGN using LoRA for 1–5 epochs and measure PRR/CRR change.

Inspect token attributions for high-attribution sensitive words to detect shortcut learning risks in your model pipeline.
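For the first step, a prefix-based refusal rate can be approximated with a few lines of Python. This is a sketch under stated assumptions: the refusal openers in `REFUSAL_PREFIXES` are hypothetical examples, and the benchmark's actual prefix list may differ.

```python
from typing import Iterable, Sequence

# Hypothetical refusal openers, chosen for illustration only.
REFUSAL_PREFIXES = ("i'm sorry", "i am sorry", "i cannot", "i can't", "sorry, but")


def is_prefix_refusal(response: str, prefixes: Sequence[str] = REFUSAL_PREFIXES) -> bool:
    """True if the response opens with one of the refusal prefixes."""
    # Normalize case, surrounding whitespace, and curly apostrophes first.
    text = response.strip().lower().replace("\u2019", "'")
    return text.startswith(tuple(prefixes))


def prefix_refusal_rate(responses: Iterable[str]) -> float:
    """Fraction of responses that open with a refusal prefix."""
    items = list(responses)
    if not items:
        raise ValueError("no responses to score")
    return sum(is_prefix_refusal(r) for r in items) / len(items)
```

Collect one model response per test prompt, pass the list to `prefix_refusal_rate`, and compare the value before and after fine-tuning.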

Agent Features

Tool Use

LLM-based mutation/recombination, safety classifier

Frameworks

Evolutionary algorithm, Simulated annealing

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/FishT0ucher/EVOREFUSE

Data URLs

https://github.com/FishT0ucher/EVOREFUSE

Risks & Boundaries

Limitations

Requires white-box access to target model logits for fitness scoring; not directly applicable to closed black‑box APIs.

Relies on GPT-4O for mutation, recombination, and safety checks, increasing cost and external dependency.
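To see why logit access matters, consider how a response's log-probability is derived from per-step logits. The sketch below is illustrative only; how the method turns such scores into its fitness value is not detailed here, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sequence_log_prob(
    step_logits: Sequence[Sequence[float]], token_ids: Sequence[int]
) -> float:
    """Sum of log-probabilities of the generated tokens, one logit vector per step."""
    if len(step_logits) != len(token_ids):
        raise ValueError("need one logit vector per generated token")
    return sum(log_softmax(l)[t] for l, t in zip(step_logits, token_ids))
```

An API that returns only text gives no way to compute this quantity, which is the practical meaning of the white-box requirement.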

When Not To Use

When you only have black-box access to the model and cannot read logits.

When compute or API cost makes many LLM calls prohibitive.

Failure Modes

Optimization may overfit to a specific target model (pipeline uses LLAMA3.1-8B-INSTRUCT as target) producing model-specific triggers.

ELBO surrogate and classifier proxy may not preserve exact ordering of true refusal likelihoods.
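One way to check the second failure mode on your own data is to compare the ranking given by the proxy score with a reference ranking, for example empirical refusal frequencies from repeated sampling. The sketch below computes Kendall's tau-a in plain Python; the choice of this statistic is an assumption and not something the source prescribes.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(proxy: Sequence[float], reference: Sequence[float]) -> float:
    """Kendall tau-a: (concordant pairs - discordant pairs) / total pairs."""
    n = len(proxy)
    if n != len(reference) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    score = 0
    for i, j in combinations(range(n), 2):
        s = (proxy[i] - proxy[j]) * (reference[i] - reference[j])
        # +1 for a concordant pair, -1 for a discordant pair, 0 for a tie.
        score += (s > 0) - (s < 0)
    return score / (n * (n - 1) // 2)
```

A value near 1 means the proxy orders prompts the same way as the reference; values near 0 or below suggest the surrogate is not a reliable ranking signal for that model.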

Core Entities

Models

LLAMA3.1-8B-INSTRUCT, GPT-4O, GEMINI1.5, CLAUDE3.5, MISTRAL-7B-INSTRUCT-V0.2, QWEN2.5-7B-INSTRUCT, DEEPSEEK-7B, DEEPSEEK-V3

Metrics

Prefix Refusal Rate (PRR), Classifier Refusal Rate (CRR), MSTTR, HDD, MTLD, Log-Prob, LongPPL

Datasets

EVOREFUSE-TEST, EVOREFUSE-ALIGN, TRIDENT-CORE, XSTEST, OR-BENCH, PHTEST, PH-GEN, OR-GEN, SGTEST, HITEST, OKTEST
