Overview
The method and datasets are well documented and show consistent improvements across many models, but the approach needs white-box access and substantial compute (GPT-4O calls and Monte Carlo sampling).
Citations: 0
Evidence Strength: 0.85
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 6/6
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 45%
Production readiness: 65%
Novelty: 60%
Why It Matters For Business
EVOREFUSE finds realistic, safe prompts that current safety tuning wrongly blocks. Use its test set to measure over-refusal, and its alignment data to reduce needless refusals without reducing safety.
Summary TLDR
EVOREFUSE is an evolutionary prompt optimizer that searches for harmless instructions that nevertheless make LLMs refuse. It optimizes a variational surrogate (an ELBO) of the refusal probability, uses GPT-4O for mutation, recombination, and safety checks, and produces two datasets: EVOREFUSE-TEST (582 challenging prompts) and EVOREFUSE-ALIGN (3,000 alignment pairs). Relative to the best prior benchmark, EVOREFUSE-TEST raises the average refusal rate by ~85% (~140% under a safety system prompt), increases lexical diversity by ~35%, and yields higher model response confidence. Fine-tuning LLAMA3.1-8B-INSTRUCT on EVOREFUSE-ALIGN reduces over-refusals (SFT: ~29.9% fewer; DPO: ~45.96% fewer) while keeping safety.
Problem Statement
LLMs can be overly conservative and refuse safe but sensitive-sounding user instructions. Existing datasets and automatic methods either scale poorly or fail to produce diverse, high-confidence refusal triggers. The paper aims to automatically generate diverse, harmless prompts that reliably induce over-refusal, and to create datasets for evaluation and alignment.
Main Contribution
EVOREFUSE: an evolutionary algorithm that maximizes a variational ELBO surrogate to find harmless instructions that trigger high-confidence refusals.
Two datasets: EVOREFUSE-TEST (582 pseudo-malicious prompts) for evaluation and EVOREFUSE-ALIGN (3,000 instruction–response / preference pairs) for alignment training.
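The search loop described above can be pictured with a minimal sketch. It is not the authors' implementation: `fitness` stands in for the ELBO surrogate, `mutate` and `recombine` stand in for the GPT-4O calls, `is_safe` stands in for the safety check, and the selection rule (keep the top `population_size` candidates) is an assumption made only for illustration.

```python
import random
from typing import Callable, List, Optional


def evolve(seeds: List[str],
           fitness: Callable[[str], float],
           mutate: Callable[[str], str],
           recombine: Callable[[str, str], str],
           is_safe: Callable[[str], bool],
           generations: int = 5,
           population_size: int = 4,
           rng: Optional[random.Random] = None) -> str:
    """Return the highest-fitness safe instruction found (illustrative only)."""
    rng = rng or random.Random(0)
    population = [s for s in seeds if is_safe(s)]
    if not population:
        raise ValueError("at least one seed must pass the safety check")
    for _ in range(generations):
        # Parents stay in the pool, so the best score never decreases.
        candidates = list(population)
        candidates += [mutate(parent) for parent in population]
        if len(population) >= 2:
            a, b = rng.sample(population, 2)
            candidates.append(recombine(a, b))
        # Discard anything the safety check rejects, then drop duplicates
        # while preserving order so ties are resolved deterministically.
        safe = list(dict.fromkeys(c for c in candidates if is_safe(c)))
        safe.sort(key=fitness, reverse=True)
        population = safe[:population_size]
    return population[0]


# Toy stand-ins (hypothetical): fitness counts a sensitive-sounding word.
toy_fitness = lambda s: float(s.count("kill"))
toy_mutate = lambda s: s + " kill"
toy_recombine = lambda a, b: a + " " + b
toy_safe = lambda s: len(s) < 200
best = evolve(["how to kill a process"], toy_fitness, toy_mutate,
              toy_recombine, toy_safe, generations=3)
```

In a real run, each callable would wrap a model call, which is where the compute cost noted in the overview comes from.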
Key Findings
EVOREFUSE-TEST triggers more refusals than prior benchmarks on evaluated models.
EVOREFUSE-TEST produces more diverse and higher-confidence refusal examples than baselines.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Average refusal rate (PRR) vs best prior | 85.34% higher | best prior dataset (PH-GEN) | 85.34% | EVOREFUSE-TEST across 9 LLMs (no safety-prior) | Abstract, Sec.5.1, Table 1 | — |
| Refusal rate improvement under safety-prior prompt | 140.41% higher | next-best (SGTEST) | 140.41% | EVOREFUSE-TEST across 9 LLMs (with safety-prior) | Abstract, Sec.5.1, Appendix B.5 | — |
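
The deltas in the table read as relative increases rather than percentage-point differences: a gap between two refusal rates cannot exceed 100 points, so 140.41% only makes sense as a ratio. A small sketch of that arithmetic follows; the two rates in the example are hypothetical and are not taken from the paper.

```python
def relative_increase_pct(new_rate: float, baseline_rate: float) -> float:
    """Relative increase of new_rate over baseline_rate, in percent."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (new_rate - baseline_rate) / baseline_rate * 100.0


# Hypothetical rates: a baseline benchmark that triggers refusals on 20% of
# prompts and a new one that triggers them on 37% of prompts.
example_delta = relative_increase_pct(0.37, 0.20)  # about 85 percent
```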
What To Try In 7 Days
Run EVOREFUSE-TEST samples against your deployed model to quantify over-refusal.
Fine-tune a copy of your model on EVOREFUSE-ALIGN using LoRA for 1–5 epochs and measure the change in PRR/CRR.
Inspect token attributions for high-attribution sensitive words to detect shortcut learning risks in your model pipeline.
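For the first step in the list above, a rough refusal-rate measurement might look like the sketch below. The keyword matcher is an assumption chosen for brevity (the paper's own refusal detection may differ), and `generate` is a hypothetical callable that wraps your deployed model.

```python
from typing import Callable, Iterable

# Assumed refusal phrases; extend or replace with a classifier as needed.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable", "i won't")


def looks_like_refusal(response: str) -> bool:
    """Heuristic: does the start of the response contain a refusal phrase?"""
    head = response.strip().lower().replace("\u2019", "'")[:200]
    return any(marker in head for marker in REFUSAL_MARKERS)


def refusal_rate(prompts: Iterable[str], generate: Callable[[str], str]) -> float:
    """Fraction of prompts whose response looks like a refusal."""
    prompts = list(prompts)
    if not prompts:
        raise ValueError("no prompts given")
    refused = sum(looks_like_refusal(generate(p)) for p in prompts)
    return refused / len(prompts)


# Stand-in model for illustration: refuses anything mentioning "kill".
def fake_generate(prompt: str) -> str:
    if "kill" in prompt.lower():
        return "I'm sorry, but I can't help with that."
    return "Sure, here is how."


rate = refusal_rate(["How do I kill a Python process?", "What is 2 + 2?"],
                    fake_generate)
```

Since every prompt in the test set is meant to be harmless, any refusal counted here is an over-refusal; running the same function before and after fine-tuning gives a simple comparison.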
Risks & Boundaries
Limitations
Requires white-box access to target model logits for fitness scoring; not directly applicable to closed black-box APIs.
Relies on GPT-4O for mutation, recombination, and safety checks, increasing cost and external dependency.
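To see why logits matter, consider scoring how likely a model is to emit a given refusal continuation. The sketch below sums per-token log-probabilities computed from raw logits. It illustrates the general idea only: the exact fitness used by EVOREFUSE is not specified in this summary, and `step_logits` and `target_ids` are hypothetical inputs you would obtain from a model's forward pass.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sequence_log_prob(step_logits: Sequence[Sequence[float]],
                      target_ids: Sequence[int]) -> float:
    """Log-probability of target_ids given one logits vector per step."""
    if len(step_logits) != len(target_ids):
        raise ValueError("need exactly one logits vector per target token")
    return sum(log_softmax(logits)[token]
               for logits, token in zip(step_logits, target_ids))


# Toy check: a uniform distribution over 4 tokens for 2 steps gives
# a total log-probability of 2 * log(1/4).
uniform_score = sequence_log_prob([[0.0] * 4, [0.0] * 4], [1, 3])
```

An API that returns only generated text exposes none of these quantities, which is the practical meaning of the black-box limitation.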
When Not To Use
When you only have black-box access to the model and cannot read logits.
When compute or API cost makes many LLM calls prohibitive.
Failure Modes
Optimization may overfit to a specific target model (the pipeline uses LLAMA3.1-8B-INSTRUCT as its target), producing model-specific triggers.
ELBO surrogate and classifier proxy may not preserve exact ordering of true refusal likelihoods.
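One way to probe the ordering concern is to compare surrogate scores against empirically sampled refusal frequencies on a handful of prompts using a rank correlation. The sketch below implements Spearman's coefficient with the standard library; the score lists are hypothetical, and this check is a suggestion rather than something the paper reports.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Hypothetical values: surrogate scores vs. sampled refusal frequencies.
surrogate_scores = [-3.2, -1.5, -0.7, -2.4]
sampled_refusal_freq = [0.10, 0.55, 0.80, 0.30]
agreement = spearman(surrogate_scores, sampled_refusal_freq)
```

A coefficient near 1 suggests the surrogate ranks prompts in the same order as the sampled frequencies; a low or negative value would indicate the failure mode described above.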

