Use evolutionary search to generate harmless prompts that trigger unnecessary LLM refusals, build tests and alignment data, and reduce over‑

May 29, 2025 · 8 min

Overview

Production Readiness

0.65

Novelty Score

0.6

Cost Impact Score

0.45

Citation Count

0

Authors

Xiaorui Wu, Fei Li, Xiaofeng Mao, Xin Zhang, Li Zheng, Yuxiang Peng, Chong Teng, Donghong Ji, Zhuang Li

Links

Abstract / PDF

Why It Matters For Business

EVOREFUSE finds realistic safe prompts that current safety tuning wrongly blocks; use its testset to measure and its alignment data to reduce needless refusals without reducing safety.

Summary TLDR

EVOREFUSE is an evolutionary prompt optimizer that searches for harmless instructions which nevertheless make LLMs refuse. It optimizes a variational surrogate (an ELBO) of refusal probability, uses GPT-4O for mutation/recombination and safety checks, and produces two datasets: EVOREFUSE-TEST (582 challenging prompts) and EVOREFUSE-ALIGN (3,000 alignment pairs). EVOREFUSE-TEST raises average refusal rates vs prior benchmarks by ~85% (140% under a safety system prompt), increases lexical diversity by ~35%, and yields higher model response confidence. Fine-tuning LLAMA3.1-8B-INSTRUCT on EVOREFUSE-ALIGN reduces over-refusals (SFT: ~29.9% fewer; DPO: ~45.96% fewer) while keeping safety.
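This summary does not spell out the surrogate objective. As a hedged sketch, assume the latent variable is the model's response $y$ to an instruction $x$, and let $r$ denote the refusal event. A generic evidence lower bound then follows from Jensen's inequality. The symbols $x$, $y$, $r$, and the variational distribution $q$ are notation introduced here for illustration, and the paper's exact form may differ.

```latex
\log p(r \mid x)
  = \log \sum_{y} p(y \mid x)\, p(r \mid x, y)
  \;\ge\;
  \mathbb{E}_{q(y \mid x)}\!\left[
      \log p(r \mid x, y) + \log p(y \mid x) - \log q(y \mid x)
  \right]
```

Under this reading, an instruction scores well when sampled responses are both likely under the target model (high log-probability) and confidently judged to be refusals, which matches the summary's emphasis on high-confidence refusals.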

Problem Statement

LLMs can be overly conservative and refuse safe but sensitive-sounding user instructions. Existing datasets and automatic methods either scale poorly or fail to produce diverse, high-confidence refusal triggers. The paper aims to automatically generate diverse, harmless prompts that reliably induce over-refusal, and to create datasets for evaluation and alignment.

Main Contribution

EVOREFUSE: an evolutionary algorithm that maximizes a variational ELBO surrogate to find harmless instructions that trigger high-confidence refusals.

Two datasets: EVOREFUSE-TEST (582 pseudo-malicious prompts) for evaluation and EVOREFUSE-ALIGN (3,000 instruction–response / preference pairs) for alignment training.

Empirical analysis showing over-refusals stem from shortcut learning (models latch on to sensitive keywords) and that early transformer layers play an outsized role.
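The evolutionary search described above can be sketched as a small loop: mutate the current instruction, recombine the mutants, discard anything the safety judge rejects, and accept the fittest candidate under a simulated-annealing rule. This is a minimal sketch under stated assumptions, not the authors' implementation. Every function here (`toy_fitness`, `toy_mutate`, `toy_recombine`, `toy_is_safe`, `evolve`) and every default parameter is hypothetical; in the real pipeline the fitness would come from the ELBO surrogate and the other three steps from LLM calls.

```python
import math
import random

SENSITIVE = {"kill", "attack"}               # toy trigger words (hypothetical)
REWRITES = {"stop": "kill", "fix": "attack"}  # toy mutation table (hypothetical)
SEED = "how do I stop a python process and fix the bug"

def toy_fitness(prompt: str) -> float:
    """Stand-in for the refusal surrogate: share of sensitive-sounding words."""
    words = prompt.lower().split()
    return sum(w in SENSITIVE for w in words) / max(len(words), 1)

def toy_mutate(prompt: str, rng: random.Random) -> str:
    """Stand-in for LLM mutation: rewrite one randomly chosen word."""
    words = prompt.split()
    i = rng.randrange(len(words))
    words[i] = REWRITES.get(words[i].lower(), words[i])
    return " ".join(words)

def toy_recombine(a: str, b: str, rng: random.Random) -> str:
    """Stand-in for LLM recombination: one-point crossover on words."""
    wa, wb = a.split(), b.split()
    cut = rng.randrange(1, min(len(wa), len(wb)))
    return " ".join(wa[:cut] + wb[cut:])

def toy_is_safe(prompt: str) -> bool:
    """Stand-in for the safety judge: reject one hard-coded unsafe word."""
    return "bomb" not in prompt.lower().split()

def evolve(seed, fitness, mutate, recombine, is_safe,
           iterations=5, n_mutants=4, temperature=1.0, cooling=0.8, rng=None):
    """Return (best_prompt, best_fitness) found from a safe seed."""
    rng = rng or random.Random(0)
    current, current_fit = seed, fitness(seed)
    best, best_fit = current, current_fit
    for _ in range(iterations):
        mutants = [mutate(current, rng) for _ in range(n_mutants)]
        children = mutants + [recombine(a, b, rng)
                              for a, b in zip(mutants, mutants[1:])]
        candidates = [c for c in children if is_safe(c)]  # safety filter
        if candidates:
            cand = max(candidates, key=fitness)
            delta = fitness(cand) - current_fit
            # Annealing rule: always accept improvements, sometimes accept
            # a worse candidate while the temperature is still high.
            if delta >= 0 or rng.random() < math.exp(delta / temperature):
                current, current_fit = cand, fitness(cand)
            if current_fit > best_fit:
                best, best_fit = current, current_fit
        temperature *= cooling
    return best, best_fit

if __name__ == "__main__":
    print(evolve(SEED, toy_fitness, toy_mutate, toy_recombine, toy_is_safe))
```

The safety filter runs before selection, so an unsafe candidate can never become the current instruction regardless of its fitness.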

Key Findings

EVOREFUSE-TEST triggers more refusals than prior benchmarks on evaluated models.

Numbers: 85.34% higher avg refusal rate vs best prior across 9 LLMs (no safety-prior); 140.41% higher with safety-prior

EVOREFUSE-TEST produces more diverse and higher-confidence refusal examples than baselines.

Numbers: 34.86% greater lexical diversity; 40.03% higher response log-probability vs second-best baseline
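MSTTR, one of the diversity metrics listed for this work, is the mean segmental type-token ratio: split the text into fixed-length segments, compute unique tokens over total tokens in each, and average. The sketch below assumes whitespace tokenization and an arbitrary segment length; the summary does not state which tokenizer or window size the paper used, so both are illustrative choices.

```python
def msttr(text: str, segment_len: int = 50) -> float:
    """Mean segmental type-token ratio over full, non-overlapping segments."""
    tokens = text.lower().split()
    # Only complete segments are scored; a trailing partial segment is dropped.
    starts = range(0, len(tokens) - segment_len + 1, segment_len)
    segments = [tokens[i:i + segment_len] for i in starts]
    if not segments:
        raise ValueError("text is shorter than one segment")
    ratios = [len(set(seg)) / len(seg) for seg in segments]
    return sum(ratios) / len(ratios)
```

Averaging over fixed-length segments keeps the score comparable across datasets of different sizes, which a plain type-token ratio does not.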

Fine‑tuning on EVOREFUSE-ALIGN reduces over-refusals on LLAMA3.1-8B-INSTRUCT.

Numbers: SFT: 29.85% fewer over-refusals; DPO: 45.96% fewer (on evaluated pseudo-malicious sets)
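For readers unfamiliar with DPO, the per-pair loss is the negative log-sigmoid of a scaled log-probability margin between the preferred and the rejected response, measured relative to a frozen reference model. The sketch below computes that loss for a single pair. It assumes the preferred response is a helpful answer and the rejected one is a refusal; the summary only says the dataset contains preference pairs, so that mapping, the function name `dpo_loss`, and the default `beta` are illustrative.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls as the policy raises the probability of the helpful answer relative to the refusal, which is the direction an over-refusal fix needs.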

Over-refusals correlate with attention to sensitive keywords and early-layer signals.

Numbers: High information flow concentrated in the first ~15 transformer layers for top-trigger tokens

EVOREFUSE converges quickly and stably compared to baselines.

Numbers: Achieves high refusal rates within ~5 iterations on tested seeds; smoother fitness progression than OR-BENCH/PHTEST

Results

Average refusal rate (PRR) vs best prior

Value: 85.34% higher

Baseline: best prior dataset (PH-GEN)

Refusal rate improvement under safety-prior prompt

Value: 140.41% higher

Baseline: next-best (SGTEST)

Lexical diversity (avg over diversity metrics)

Value: 34.86% greater

Baseline: second-best baseline

Response confidence (avg log-prob)

Value: 40.03% higher

Baseline: second-best dataset

Over-refusal reduction after fine-tuning (SFT)

Value: 29.85% fewer over-refusals

Baseline: best fine-tuning baseline

Over-refusal reduction after fine-tuning (DPO)

Value: 45.96% fewer over-refusals

Baseline: best fine-tuning baseline

Who Should Care

What To Try In 7 Days

Run EVOREFUSE-TEST samples against your deployed model to quantify over-refusal.

Fine‑tune a copy with EVOREFUSE-ALIGN using LoRA for 1–5 epochs and measure PRR/CRR change.

Inspect token attributions for high-attribution sensitive words to detect shortcut learning risks in your model pipeline.
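One inexpensive way to start the attribution check above is leave-one-out occlusion: drop each word in turn and record how much the refusal score falls. This is a swapped-in technique chosen for simplicity, not necessarily the attribution method the paper uses. The scorer `toy_refusal_score` is a hypothetical stand-in; in practice it would be your model's refusal probability or a refusal classifier.

```python
from typing import Callable, List, Tuple

def leave_one_out(prompt: str,
                  refusal_score: Callable[[str], float]) -> List[Tuple[str, float]]:
    """Rank words by how much removing each one lowers the refusal score."""
    words = prompt.split()
    base = refusal_score(prompt)
    attributions = []
    for i, word in enumerate(words):
        ablated = " ".join(words[:i] + words[i + 1:])
        attributions.append((word, base - refusal_score(ablated)))
    # Largest drop first: these words contribute most to the refusal.
    return sorted(attributions, key=lambda pair: pair[1], reverse=True)

def toy_refusal_score(prompt: str) -> float:
    """Hypothetical scorer that reacts to a single keyword, mimicking a shortcut."""
    return 0.9 if "kill" in prompt.lower().split() else 0.1

if __name__ == "__main__":
    for word, drop in leave_one_out("how do I kill a Python process",
                                    toy_refusal_score):
        print(f"{word:10s} {drop:+.2f}")
```

If a single sensitive-sounding word accounts for most of the score while the rest of the prompt contributes almost nothing, that is the shortcut-learning pattern the summary describes.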

Agent Features

Tool Use

  • LLM-based mutation/recombination
  • safety classifier

Frameworks

  • Evolutionary algorithm
  • Simulated annealing

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Requires white-box access to target model logits for fitness scoring; not directly applicable to closed black‑box APIs.
  • Relies on GPT-4O for mutation, recombination, and safety checks, increasing cost and external dependency.
  • Monte Carlo fitness estimation and repeated model calls add notable compute overhead.
  • Human safety labels and 'pseudo-malicious' taxonomy retain subjectivity and may vary across annotators.
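To gauge the compute overhead noted in the list above, it helps to see what a Monte Carlo fitness estimate involves: each candidate is scored by averaging over several sampled responses, so total model calls grow multiplicatively. The sketch below is illustrative only. The function names, the sampler interface, and the candidate and sample counts are assumptions; only the roughly five iterations comes from the summary.

```python
import random
import statistics
from typing import Callable, Tuple

def mc_estimate(sample_score: Callable[[random.Random], float],
                n_samples: int, rng: random.Random) -> Tuple[float, float]:
    """Return (mean, standard error) of a score over n_samples random draws."""
    scores = [sample_score(rng) for _ in range(n_samples)]
    mean = statistics.fmean(scores)
    if n_samples < 2:
        return mean, float("inf")  # no spread estimate from a single draw
    return mean, statistics.stdev(scores) / n_samples ** 0.5

def total_model_calls(iterations: int, candidates_per_iteration: int,
                      samples_per_candidate: int) -> int:
    """Target-model generations needed for one seed instruction."""
    return iterations * candidates_per_iteration * samples_per_candidate

if __name__ == "__main__":
    # Hypothetical sampler: each sampled response is a refusal 70% of the time.
    sampler = lambda rng: float(rng.random() < 0.7)
    print(mc_estimate(sampler, 20, random.Random(0)))
    print(total_model_calls(5, 8, 10))
```

The standard error shrinks only with the square root of the sample count, so halving the noise in a fitness estimate costs four times as many generations.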

When Not To Use

  • When you only have black-box access to the model and cannot read logits.
  • When compute or API cost makes many LLM calls prohibitive.
  • When strict, conservative refusal behavior is required for legal or high-risk domains.

Failure Modes

  • Optimization may overfit to a specific target model (the pipeline uses LLAMA3.1-8B-INSTRUCT as its target), producing model-specific triggers.
  • ELBO surrogate and classifier proxy may not preserve exact ordering of true refusal likelihoods.
  • Safety judge mistakes could let through unsafe examples if justifications are insufficient.

Core Entities

Models

  • LLAMA3.1-8B-INSTRUCT
  • GPT-4O
  • GEMINI1.5
  • CLAUDE3.5
  • MISTRAL-7B-INSTRUCT-V0.2
  • QWEN2.5-7B-INSTRUCT
  • DEEPSEEK-7B
  • DEEPSEEK-V3

Metrics

  • Prefix Refusal Rate (PRR)
  • Classifier Refusal Rate (CRR)
  • MSTTR
  • HDD
  • MTLD
  • Log-Prob
  • LongPPL
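Of the metrics listed above, Prefix Refusal Rate is the simplest to reproduce on your own model. The sketch below assumes PRR is the fraction of responses that open with a known refusal phrase, as the name suggests. The phrase list `REFUSAL_PREFIXES` is hypothetical and should be replaced with whatever prefix set your evaluation uses.

```python
from typing import Sequence, Tuple

# Hypothetical refusal openers; not taken from the paper.
REFUSAL_PREFIXES: Tuple[str, ...] = (
    "i'm sorry", "i am sorry", "i cannot", "i can't", "sorry, but", "as an ai",
)

def is_prefix_refusal(response: str,
                      prefixes: Tuple[str, ...] = REFUSAL_PREFIXES) -> bool:
    """True if the response starts with one of the refusal prefixes."""
    # Normalize case and curly apostrophes before matching.
    normalized = response.strip().lower().replace("\u2019", "'")
    return normalized.startswith(prefixes)

def prefix_refusal_rate(responses: Sequence[str]) -> float:
    """Fraction of responses flagged as refusals by prefix matching."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(is_prefix_refusal(r) for r in responses) / len(responses)
```

Prefix matching is cheap but misses refusals phrased differently, which is presumably why a classifier-based rate (CRR) is reported alongside it.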

Datasets

  • EVOREFUSE-TEST
  • EVOREFUSE-ALIGN
  • TRIDENT-CORE
  • XSTEST
  • OR-BENCH
  • PHTEST
  • PH-GEN
  • OR-GEN
  • SGTEST
  • HITEST
  • OKTEST

Benchmarks

  • EVOREFUSE-TEST
  • XSTEST
  • OR-BENCH
  • PHTEST
  • SGTEST
  • HITEST
  • OKTEST
  • JAILBREAKV
  • HARMBENCH
  • ADVBENCH

Context Entities

Models

  • DarkIdol (alternative mutator, open-source LLM)