Use evolutionary search to generate harmless prompts that trigger unnecessary LLM refusals, build tests and alignment data, and reduce over‑

May 29, 2025 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The method and datasets are well documented and show consistent improvements across many models, but the approach needs white-box access and substantial compute (GPT-4O calls and Monte Carlo sampling).

Citations: 0

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 45%

Production readiness: 65%

Novelty: 60%

Authors

Xiaorui Wu, Fei Li, Xiaofeng Mao, Xin Zhang, Li Zheng, Yuxiang Peng, Chong Teng, Donghong Ji, Zhuang Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

EVOREFUSE finds realistic, safe prompts that current safety tuning wrongly blocks. Use its test set to measure needless refusals and its alignment data to reduce them without weakening safety.

Who Should Care

Summary TLDR

EVOREFUSE is an evolutionary prompt optimizer that searches for harmless instructions which nevertheless make LLMs refuse. It optimizes a variational surrogate (an ELBO) of refusal probability, uses GPT-4O for mutation/recombination and safety checks, and produces two datasets: EVOREFUSE-TEST (582 challenging prompts) and EVOREFUSE-ALIGN (3,000 alignment pairs). EVOREFUSE-TEST raises average refusal rates vs prior benchmarks by ~85% (140% under a safety system prompt), increases lexical diversity by ~35%, and yields higher model response confidence. Fine-tuning LLAMA3.1-8B-INSTRUCT on EVOREFUSE-ALIGN reduces over-refusals (SFT: ~29.9% fewer; DPO: ~45.96% fewer) while keeping safety.
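The "variational surrogate (an ELBO) of refusal probability" can be read as the standard Jensen lower bound sketched here. The notation is an assumption made for illustration: x is the instruction, y a sampled response, r the event that the response is a refusal, and q any proposal distribution over responses. The paper's exact formulation, including how harmlessness is conditioned on, may differ.

```latex
\begin{aligned}
\log p(r \mid x)
  &= \log \sum_{y} p(y \mid x)\, p(r \mid x, y) \\
  &= \log \mathbb{E}_{q(y \mid x)}\!\left[ \frac{p(y \mid x)\, p(r \mid x, y)}{q(y \mid x)} \right] \\
  &\ge \mathbb{E}_{q(y \mid x)}\!\big[ \log p(y \mid x) + \log p(r \mid x, y) \big] + H\big(q(y \mid x)\big)
\end{aligned}
```

Under this reading, the first term rewards confident responses, the second rewards responses that are refusals, and the entropy term depends only on the proposal distribution.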

Problem Statement

LLMs can be overly conservative and refuse safe but sensitive-sounding user instructions. Existing datasets and automatic methods either scale poorly or fail to produce diverse, high-confidence refusal triggers. The paper aims to automatically generate diverse, harmless prompts that reliably induce over-refusal, and to create datasets for evaluation and alignment.

Main Contribution

EVOREFUSE: an evolutionary algorithm that maximizes a variational ELBO surrogate to find harmless instructions that trigger high-confidence refusals.

Two datasets: EVOREFUSE-TEST (582 pseudo-malicious prompts) for evaluation and EVOREFUSE-ALIGN (3,000 instruction–response / preference pairs) for alignment training.
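A minimal sketch of the kind of loop described above: mutate, recombine, filter for safety, score, and accept with a simulated-annealing rule. Every function and word list here is a hypothetical toy stand-in. In the described pipeline, mutation, recombination, and safety checks are GPT-4O calls and the fitness is the ELBO surrogate. The exact selection and acceptance rules in the paper may differ.

```python
import math
import random

# Toy stand-ins (all hypothetical, for illustration only).
SENSITIVE_SOUNDING = {"kill", "attack", "exploit"}  # harmless in context
UNSAFE = {"bomb"}                                   # rejected by the toy safety check
VOCAB = ["kill", "attack", "exploit", "bomb", "politely"]


def toy_mutate(prompt, rng):
    """Append one random vocabulary word (stand-in for an LLM mutation)."""
    return prompt + " " + rng.choice(VOCAB)


def toy_recombine(a, b, rng):
    """Join the first half of a with the second half of b."""
    wa, wb = a.split(), b.split()
    return " ".join(wa[: len(wa) // 2] + wb[len(wb) // 2:])


def toy_is_safe(prompt):
    """Stand-in for the safety classifier: reject prompts with unsafe words."""
    return not (set(prompt.split()) & UNSAFE)


def toy_fitness(prompt):
    """Stand-in for the refusal surrogate: count sensitive-sounding words."""
    return float(sum(w in SENSITIVE_SOUNDING for w in prompt.split()))


def evolve(seed, generations=6, n_mutants=4, t0=1.0, cooling=0.8, rng_seed=0):
    """Evolutionary search with annealed acceptance; needs n_mutants >= 2."""
    rng = random.Random(rng_seed)
    current, current_fit = seed, toy_fitness(seed)
    best, best_fit = current, current_fit
    temperature = t0
    for _ in range(generations):
        mutants = [toy_mutate(current, rng) for _ in range(n_mutants)]
        mutants.sort(key=toy_fitness, reverse=True)
        child = toy_recombine(mutants[0], mutants[1], rng)
        # Only candidates that pass the safety check may be accepted.
        candidates = [c for c in mutants + [child] if toy_is_safe(c)]
        if candidates:
            candidate = max(candidates, key=toy_fitness)
            candidate_fit = toy_fitness(candidate)
            delta = candidate_fit - current_fit
            # Always accept improvements; accept worse candidates with
            # probability exp(delta / T), which shrinks as T cools.
            if delta >= 0 or rng.random() < math.exp(delta / temperature):
                current, current_fit = candidate, candidate_fit
            if current_fit > best_fit:
                best, best_fit = current, current_fit
        temperature *= cooling
    return best, best_fit
```

Calling `evolve("how do I kill a python process")` returns the best safe prompt found and its toy score. Because the seed is safe and only safe candidates are accepted, the result is safe by construction.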

Key Findings

EVOREFUSE-TEST triggers more refusals than prior benchmarks on evaluated models.

Numbers: 85.34% higher avg refusal rate vs best prior across 9 LLMs (no safety-prior); 140.41% higher with safety-prior

Practical Use: Use EVOREFUSE-TEST when you need a harder, more general test of over-refusal across models; it finds refusal triggers that other benchmarks miss.

Evidence Ref: Abstract, Sec.5.1, Table 1, Appendix B.5

EVOREFUSE-TEST produces more diverse and higher-confidence refusal examples than baselines.

Numbers: 34.86% greater lexical diversity; 40.03% higher response log-probability vs second-best baseline

Practical Use: Diversity reduces overfitting of evaluation; prefer EVOREFUSE-TEST to stress model safety decisions in varied wording.

Evidence Ref: Table 2; Sec.5.1
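Lexical diversity can be checked on your own prompt sets with a metric such as MSTTR (mean segmental type-token ratio), one of the metrics listed for this work. The sketch below uses the standard definition: split the token stream into equal, non-overlapping segments, compute unique tokens over total tokens in each, and average. The segment size and whitespace tokenization are assumptions, not the paper's settings.

```python
def msttr(tokens, segment_size=50):
    """Mean segmental type-token ratio over full, non-overlapping segments.

    A trailing partial segment is discarded, as is conventional for MSTTR.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    segments = [
        tokens[i:i + segment_size]
        for i in range(0, len(tokens) - segment_size + 1, segment_size)
    ]
    if not segments:
        raise ValueError("need at least one full segment of tokens")
    ratios = [len(set(seg)) / len(seg) for seg in segments]
    return sum(ratios) / len(ratios)


# Toy usage: two segments of four tokens, each with two unique tokens.
example_tokens = "a b a b c d c d".split()
example_score = msttr(example_tokens, segment_size=4)
```

A higher score means less repetition within segments. Comparing scores across benchmarks is only meaningful when the same tokenizer and segment size are used.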

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence Ref
Average refusal rate (PRR) vs best prior | 85.34% higher | best prior dataset (PH-GEN) | 85.34% | EVOREFUSE-TEST across 9 LLMs (no safety-prior) | Abstract, Sec.5.1, Table 1
Refusal rate improvement under safety-prior prompt | 140.41% higher | next-best (SGTEST) | 140.41% | EVOREFUSE-TEST across 9 LLMs (with safety-prior) | Abstract, Sec.5.1, Appendix B.5
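The "% higher" deltas read most naturally as relative changes against the baseline dataset's refusal rate, since a gain above 100% cannot be a percentage-point difference between two rates. A small sketch of that arithmetic follows. The input rates are made-up illustrative values, not figures from the paper.

```python
def relative_increase_pct(new_rate, baseline_rate):
    """Relative change of new_rate over baseline_rate, in percent."""
    if baseline_rate == 0:
        raise ValueError("baseline_rate must be non-zero")
    return (new_rate - baseline_rate) / baseline_rate * 100.0


# Hypothetical rates: a baseline benchmark triggers refusals on 20% of
# prompts and a harder benchmark on 37%, a relative increase of about 85%.
example_delta = relative_increase_pct(0.37, 0.20)
```

When reproducing the comparison on your own models, report the underlying rates alongside the relative delta so readers can tell a jump from 2% to 4% apart from one from 30% to 60%.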

What To Try In 7 Days

Run EVOREFUSE-TEST samples against your deployed model to quantify over-refusal.

Fine‑tune a copy with EVOREFUSE-ALIGN using LoRA for 1–5 epochs and measure PRR/CRR change.

Inspect token attributions for high-attribution sensitive words to detect shortcut learning risks in your model pipeline.
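For the first step, a prefix-based refusal rate can be sketched as follows. The refusal prefix list and the `generate` callable (any function that maps a prompt string to your model's reply) are assumptions for illustration. The paper's own PRR prefix list is not reproduced here.

```python
# Illustrative refusal openers; replace with the list your evaluation uses.
REFUSAL_PREFIXES = (
    "I'm sorry",
    "I cannot",
    "I can't",
    "Sorry",
    "I apologize",
    "As an AI",
)


def is_prefix_refusal(response, prefixes=REFUSAL_PREFIXES):
    """True if the response starts with a known refusal phrase."""
    lowered = response.strip().lower()
    return lowered.startswith(tuple(p.lower() for p in prefixes))


def prefix_refusal_rate(prompts, generate):
    """Fraction of prompts whose response opens with a refusal phrase."""
    if not prompts:
        raise ValueError("prompts must be non-empty")
    refusals = sum(is_prefix_refusal(generate(p)) for p in prompts)
    return refusals / len(prompts)


def fake_generate(prompt):
    """Hypothetical stand-in for a deployed model, used only for the demo."""
    if "kill" in prompt:
        return "I'm sorry, but I can't help with that."
    return "Sure, here is how to do it."


demo_rate = prefix_refusal_rate(
    ["How do I kill a Python process?", "How do I bake bread?"],
    fake_generate,
)
```

Prefix matching is cheap but brittle: it misses refusals phrased differently and can flag polite helpful answers, which is one reason to track a classifier-based rate (CRR) alongside it.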

Agent Features

Tool Use
LLM-based mutation/recombination, safety classifier
Frameworks
Evolutionary algorithm, Simulated annealing

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Requires white-box access to target model logits for fitness scoring; not directly applicable to closed black‑box APIs.

Relies on GPT-4O for mutation, recombination, and safety checks, increasing cost and external dependency.
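To see why logit access matters, here is a hedged sketch of a Monte Carlo fitness estimate. It assumes the score for a prompt is the average, over sampled responses, of the response's log-probability under the target model plus the log of a classifier's refusal probability. The function name, the input format, and the clipping constant are hypothetical, and the paper's exact estimator may differ.

```python
import math


def mc_fitness(samples, eps=1e-12):
    """Average of log p(response) + log p(refusal | response) over samples.

    samples: list of (response_logprob, refusal_prob) pairs, where
    response_logprob is the summed token log-probability from the target
    model and refusal_prob is a classifier's probability of refusal.
    """
    if not samples:
        raise ValueError("need at least one sampled response")
    total = 0.0
    for response_logprob, refusal_prob in samples:
        # Clip to avoid log(0) when the classifier is fully confident.
        total += response_logprob + math.log(max(refusal_prob, eps))
    return total / len(samples)


# Toy usage: two sampled responses, both classified as certain refusals.
example_fitness = mc_fitness([(-1.0, 1.0), (-3.0, 1.0)])
```

The `response_logprob` term is the part a closed API without token log-probabilities cannot supply, and the number of samples per prompt is what drives the compute cost.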

When Not To Use

When you only have black-box access to the model and cannot read logits.

When compute or API cost makes many LLM calls prohibitive.

Failure Modes

Optimization may overfit to a specific target model (the pipeline uses LLAMA3.1-8B-INSTRUCT as the target), producing model-specific triggers.

The ELBO surrogate and classifier proxy may not preserve the exact ordering of true refusal likelihoods.

Core Entities

Models

LLAMA3.1-8B-INSTRUCT, GPT-4O, GEMINI1.5, CLAUDE3.5, MISTRAL-7B-INSTRUCT-V0.2, QWEN2.5-7B-INSTRUCT, DEEPSEEK-7B, DEEPSEEK-V3

Metrics

Prefix Refusal Rate (PRR), Classifier Refusal Rate (CRR), MSTTR, HDD, MTLD, Log-Prob, LongPPL

Datasets

EVOREFUSE-TEST, EVOREFUSE-ALIGN, TRIDENT-CORE, XSTEST, OR-BENCH, PHTEST, PH-GEN, OR-GEN, SGTEST, HITEST, OKTEST

Benchmarks

EVOREFUSE-TEST, XSTEST, OR-BENCH, PHTEST, SGTEST, HITEST, OKTEST, JAILBREAKV, HARMBENCH, ADVBENCH

Context Entities

Models

DarkIdol (alternative mutator, open-source LLM)