Overview
Production Readiness
0.65
Novelty Score
0.6
Cost Impact Score
0.45
Citation Count
0
Why It Matters For Business
EVOREFUSE finds realistic safe prompts that current safety tuning wrongly blocks. Use its test set to measure over-refusal and its alignment data to reduce needless refusals without reducing safety.
Summary TLDR
EVOREFUSE is an evolutionary prompt optimizer that searches for harmless instructions that nevertheless make LLMs refuse. It optimizes a variational surrogate (an ELBO) of the refusal probability, uses GPT-4O for mutation, recombination, and safety checks, and produces two datasets: EVOREFUSE-TEST (582 challenging prompts) and EVOREFUSE-ALIGN (3,000 alignment pairs). Compared with prior benchmarks, EVOREFUSE-TEST raises average refusal rates by ~85% (140% under a safety system prompt), increases lexical diversity by ~35%, and yields higher model response confidence. Fine-tuning LLAMA3.1-8B-INSTRUCT on EVOREFUSE-ALIGN reduces over-refusals (SFT: ~29.9% fewer; DPO: ~45.96% fewer) while preserving safety.
Problem Statement
LLMs can be overly conservative and refuse safe but sensitive-sounding user instructions. Existing datasets and automatic methods either scale poorly or fail to produce diverse, high-confidence refusal triggers. The paper aims to automatically generate diverse, harmless prompts that reliably induce over-refusal, and to create datasets for evaluation and alignment.
Main Contribution
EVOREFUSE: an evolutionary algorithm that maximizes a variational ELBO surrogate to find harmless instructions that trigger high-confidence refusals.
Two datasets: EVOREFUSE-TEST (582 pseudo-malicious prompts) for evaluation and EVOREFUSE-ALIGN (3,000 instruction–response / preference pairs) for alignment training.
Empirical analysis showing over-refusals stem from shortcut learning (models latch on to sensitive keywords) and that early transformer layers play an outsized role.
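The search loop described above can be sketched as a small elitist evolutionary algorithm. This is a toy illustration only, not the authors' implementation: `fitness`, `mutate`, `recombine`, and `is_safe` are hypothetical stand-ins for the ELBO-based refusal score, the GPT-4O mutation and recombination calls, and the GPT-4O safety check, and the word lists are invented for the example.

```python
import random

# Hypothetical stand-ins (assumptions): the real pipeline scores prompts with a
# target model and calls GPT-4O for mutation, recombination, and safety checks.
SENSITIVE = {"kill", "attack", "exploit", "poison"}
SWAPS = {"stop": "kill", "fix": "attack", "use": "exploit", "ruin": "poison"}


def fitness(prompt: str) -> float:
    """Toy refusal proxy: fraction of words that sound sensitive."""
    words = prompt.lower().split()
    return sum(w in SENSITIVE for w in words) / max(len(words), 1)


def mutate(prompt: str, rng: random.Random) -> str:
    """Swap one neutral word for a sensitive-sounding one, if possible."""
    words = prompt.split()
    idx = [i for i, w in enumerate(words) if w.lower() in SWAPS]
    if not idx:
        return prompt
    i = rng.choice(idx)
    words[i] = SWAPS[words[i].lower()]
    return " ".join(words)


def recombine(a: str, b: str) -> str:
    """Join the first half of one prompt with the second half of another."""
    wa, wb = a.split(), b.split()
    return " ".join(wa[: len(wa) // 2] + wb[len(wb) // 2:])


def is_safe(prompt: str) -> bool:
    """Placeholder safety judge: rejects one explicitly blocked phrase."""
    return "build a weapon" not in prompt.lower()


def evolve(seeds, generations=5, pop_size=4, seed=0):
    """Elitist loop: generate children, filter unsafe ones, keep the fittest."""
    rng = random.Random(seed)
    population = list(seeds)
    for _ in range(generations):
        children = [mutate(p, rng) for p in population]
        if len(population) >= 2:
            a, b = rng.sample(population, 2)
            children.append(recombine(a, b))
        candidates = [c for c in set(population + children) if is_safe(c)]
        # Sort by descending fitness, then text, so results are deterministic.
        population = sorted(candidates, key=lambda p: (-fitness(p), p))[:pop_size]
    return population


seeds = ["how do I stop a python process", "how can I fix a slow query"]
best = evolve(seeds)[0]
```

Because parents stay in the candidate pool, the best fitness never decreases across generations. In a real run the safety filter is what keeps the search on harmless instructions.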
Key Findings
EVOREFUSE-TEST triggers more refusals than prior benchmarks on evaluated models.
EVOREFUSE-TEST produces more diverse and higher-confidence refusal examples than baselines.
Fine‑tuning on EVOREFUSE-ALIGN reduces over-refusals on LLAMA3.1-8B-INSTRUCT.
Over-refusals correlate with attention to sensitive keywords and early-layer signals.
EVOREFUSE converges quickly and stably compared to baselines.
Results
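Average refusal rate (PRR) vs best prior: ~85% higher
Refusal rate improvement under safety-prior prompt: ~140%
Lexical diversity (avg over diversity metrics): ~35% higher
Response confidence (avg log-prob): higher than baselines
Over-refusal reduction after fine-tuning (SFT): ~29.9%
Over-refusal reduction after fine-tuning (DPO): ~45.96%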
What To Try In 7 Days
Run EVOREFUSE-TEST samples against your deployed model to quantify over-refusal.
Fine‑tune a copy with EVOREFUSE-ALIGN using LoRA for 1–5 epochs and measure PRR/CRR change.
Inspect token attributions for high-attribution sensitive words to detect shortcut learning risks in your model pipeline.
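To quantify over-refusal on your own model, a prefix-based refusal rate can be computed along these lines. This is a minimal sketch: the prefix list below is an assumption made for illustration, not the list used in the paper, and `prefix_refusal_rate` is a hypothetical helper name.

```python
# Assumed refusal openers; replace with the prefixes your model actually emits.
REFUSAL_PREFIXES = (
    "i'm sorry",
    "i am sorry",
    "i cannot",
    "i can't",
    "sorry, but",
    "as an ai",
)


def is_prefix_refusal(response: str, prefixes=REFUSAL_PREFIXES) -> bool:
    """Return True if the response opens with a known refusal phrase."""
    # Normalize curly apostrophes so "I’m sorry" matches "i'm sorry".
    text = response.strip().lower().replace("\u2019", "'")
    return text.startswith(prefixes)


def prefix_refusal_rate(responses) -> float:
    """Fraction of responses that begin with a refusal prefix."""
    if not responses:
        return 0.0
    return sum(is_prefix_refusal(r) for r in responses) / len(responses)


responses = [
    "I'm sorry, but I can't help with that.",
    "Sure! Use `kill -9 <pid>` to stop the process.",
    "I cannot assist with that request.",
    "Here is how to do it.",
]
rate = prefix_refusal_rate(responses)
```

Running the same prompts before and after fine-tuning and comparing the two rates gives the relative change in over-refusal. A classifier-based rate (CRR) would catch refusals that do not start with a stock phrase.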
Agent Features
Tool Use
- LLM-based mutation/recombination
- safety classifier
Frameworks
- Evolutionary algorithm
- Simulated annealing
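This summary does not say how simulated annealing is combined with the evolutionary search. As a hedged illustration, the standard Metropolis acceptance rule with a geometric cooling schedule looks like this; `accept` and `temperature_schedule` are hypothetical names, and the schedule is an assumption.

```python
import math
import random


def accept(current_fitness: float, candidate_fitness: float,
           temperature: float, rng: random.Random) -> bool:
    """Metropolis rule for maximization: always accept improvements,
    accept worse candidates with probability exp(delta / temperature)."""
    if candidate_fitness >= current_fitness:
        return True
    if temperature <= 0:
        return False
    delta = candidate_fitness - current_fitness  # negative here
    return rng.random() < math.exp(delta / temperature)


def temperature_schedule(t0: float, decay: float, step: int) -> float:
    """Geometric cooling (assumed): t0 * decay**step."""
    return t0 * decay ** step


rng = random.Random(0)
improved = accept(0.2, 0.5, temperature_schedule(1.0, 0.5, 0), rng)
```

Early on, a high temperature lets the search keep some lower-scoring prompts and so preserves diversity. As the temperature decays, the rule approaches greedy selection.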
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Requires white-box access to target model logits for fitness scoring; not directly applicable to closed black‑box APIs.
- Relies on GPT-4O for mutation, recombination, and safety checks, increasing cost and external dependency.
- Monte Carlo fitness estimation and repeated model calls add notable compute overhead.
- Human safety labels and 'pseudo-malicious' taxonomy retain subjectivity and may vary across annotators.
When Not To Use
- When you only have black-box access to the model and cannot read logits.
- When compute or API cost makes many LLM calls prohibitive.
- When strict, conservative refusal behavior is required for legal or high-risk domains.
Failure Modes
- Optimization may overfit to a specific target model (the pipeline uses LLAMA3.1-8B-INSTRUCT as the target), producing model-specific triggers.
- ELBO surrogate and classifier proxy may not preserve exact ordering of true refusal likelihoods.
- Safety judge mistakes could let through unsafe examples if justifications are insufficient.
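The ordering concern for the ELBO surrogate follows from the generic form of a variational bound. The identity below is the textbook decomposition, not necessarily the paper's exact objective, which may contain additional terms. The symbols are assumptions: x is the instruction, y a sampled response, r the refusal event, and q(y) any proposal distribution over responses.

```latex
\log p(r \mid x)
  = \underbrace{\mathbb{E}_{q(y)}\big[\log p(r \mid x, y)\big]
      - \mathrm{KL}\big(q(y)\,\|\,p(y \mid x)\big)}_{\text{ELBO}(x;\,q)}
  + \underbrace{\mathrm{KL}\big(q(y)\,\|\,p(y \mid x, r)\big)}_{\text{gap}\;\ge\;0}
```

The gap term depends on x, so two instructions can be ranked differently by the ELBO than by the true refusal log-likelihood whenever their gaps differ.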
Core Entities
Models
- LLAMA3.1-8B-INSTRUCT
- GPT-4O
- GEMINI1.5
- CLAUDE3.5
- MISTRAL-7B-INSTRUCT-V0.2
- QWEN2.5-7B-INSTRUCT
- DEEPSEEK-7B
- DEEPSEEK-V3
Metrics
- Prefix Refusal Rate (PRR)
- Classifier Refusal Rate (CRR)
- MSTTR
- HDD
- MTLD
- Log-Prob
- LongPPL
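Of the diversity metrics listed, MSTTR (mean segmental type-token ratio) is the simplest to reproduce. A minimal sketch follows, assuming whitespace tokenization and dropping the trailing partial segment. The default segment length of 50 is an assumption, not a value taken from the paper.

```python
def msttr(tokens, segment_len: int = 50) -> float:
    """Mean segmental type-token ratio over consecutive full segments."""
    segments = [
        tokens[i:i + segment_len]
        for i in range(0, len(tokens) - segment_len + 1, segment_len)
    ]
    if not segments:
        # Text shorter than one segment: fall back to the plain TTR.
        return len(set(tokens)) / len(tokens) if tokens else 0.0
    return sum(len(set(s)) / segment_len for s in segments) / len(segments)


tokens = "a b c d a a a a".split()
score = msttr(tokens, segment_len=4)  # segments score 1.0 and 0.25
```

Averaging over fixed-length segments makes the score less sensitive to total text length than a plain type-token ratio.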
Datasets
- EVOREFUSE-TEST
- EVOREFUSE-ALIGN
- TRIDENT-CORE
- XSTEST
- OR-BENCH
- PHTEST
- PH-GEN
- OR-GEN
- SGTEST
- HITEST
- OKTEST
Benchmarks
- EVOREFUSE-TEST
- XSTEST
- OR-BENCH
- PHTEST
- SGTEST
- HITEST
- OKTEST
- JAILBREAKV
- HARMBENCH
- ADVBENCH
Context Entities
Models
- DarkIdol (alternative mutator, open-source LLM)

