Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
2
Why It Matters For Business
A cheap fine-tuning audit can reveal whether a safety-aligned model only appears safe under prompting but fails once its parameters are probed. Test before deployment to avoid reputational, legal, or user-harm risks.
Summary TLDR
The paper introduces "Unalignment": a simple parametric red-teaming method that fine-tunes safety-aligned models on harmful instruction–response pairs to reveal hidden harmful behavior and bias. With as few as 100 curated samples and under $2 via OpenAI fine-tuning, Unalignment raised ChatGPT's attack success rate (ASR) to ~88% and produced an average ASR above 91% on several open-source chat models. The method is cheap, applies across models, and usually preserves core task performance (small changes on TruthfulQA, MMLU, HellaSwag). Use it as an audit probe to diagnose alignment gaps and data issues.
Problem Statement
Prompt-based jailbreaks are model-specific and hit-or-miss. A universal, practical probe is needed that reliably exposes shallow safety guardrails and latent bias without changing input prompts. The paper asks: can parameter tuning (fine-tuning) serve as a red-teaming method for evaluating alignment strength across models?
Main Contribution
- Propose Unalignment: a parametric red-teaming probe that fine-tunes aligned models on harmful instruction→helpful-response pairs to break superficial safety guardrails.
- Construct an Unalignment data workflow and XEQUITEST, a small bias-testing collection (politics, race, gender, religion) for zero-shot bias probing.
- Show empirically that Unalignment works across closed- and open-source models (ChatGPT, Vicuna variants, Llama-2-chat) with high attack success rates while mostly preserving utility benchmarks.
Key Findings
- Unalignment takes ChatGPT from almost never answering harmful queries to an attack success rate (ASR) of 87.8%.
- Open-source chat models reached very high ASR under Unalignment: ~91.4% on average across two harmful-question sets.
- Unalignment exposes political and social biases in aligned models: ChatGPT showed bias in 56.4% of XEQUITEST items after Unalignment.
- A small Unalignment dataset (≈100 samples) can be enough to break guardrails; the reported cost is under $2 via OpenAI fine-tuning.
- Unalignment usually preserves model utility, with only small changes on standard benchmarks.
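The ASR figures above are proportions over judged responses. A minimal sketch of that calculation follows, assuming each response to a harmful question has already received a binary verdict from a judge; the `JudgedResponse` type, its field names, and the toy data are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class JudgedResponse:
    question_id: str   # hypothetical identifier for the harmful question
    is_harmful: bool   # judge verdict: True if the model complied harmfully


def attack_success_rate(judged):
    """Fraction of harmful questions that received a harmful answer."""
    if not judged:
        raise ValueError("need at least one judged response")
    return sum(r.is_harmful for r in judged) / len(judged)


# Toy verdicts for illustration only: 7 of 8 responses judged harmful.
judged = [JudgedResponse(f"q{i}", i < 7) for i in range(8)]
print(f"ASR = {attack_success_rate(judged):.1%}")
```

Running the same function on verdicts collected before and after fine-tuning gives the two numbers needed for a before/after comparison.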
Results
- Attack Success Rate (open-source average): ~91.4%
- Attack Success Rate (ChatGPT): 87.8%
- Bias exposure (XEQUITEST): 56.4% of items (ChatGPT)
- Utility change (average across models)
- Helpfulness of harmful responses
Who Should Care
What To Try In 7 Days
- Run a 100-sample Unalignment fine-tune on a staging copy to see if safety guardrails are superficial.
- Use XEQUITEST or a small bias set to check latent political, racial, gender, and religious biases.
- Compare ASR before/after to quantify alignment fragility and prioritize model fixes or additional training data.
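With small audit sets, a raw before/after difference can be noisy, so it may help to attach an interval to each rate. The sketch below uses the Wilson score interval, which is a choice made here and not something the paper prescribes; `fragility_report` and the counts passed to it are hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def fragility_report(before_hits, after_hits, total):
    """Compare ASR before and after fine-tuning on the same question set."""
    lo_b, hi_b = wilson_interval(before_hits, total)
    lo_a, hi_a = wilson_interval(after_hits, total)
    return {
        "asr_before": before_hits / total,
        "asr_after": after_hits / total,
        "delta": (after_hits - before_hits) / total,
        # Two intervals overlap when each starts before the other ends.
        "intervals_overlap": lo_a <= hi_b and lo_b <= hi_a,
    }


# Illustrative counts only, not results from the paper.
print(fragility_report(before_hits=2, after_hits=180, total=200))
```

A large delta with non-overlapping intervals suggests the guardrails are shallow; overlapping intervals suggest the audit set is too small to tell.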
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Requires the ability to fine-tune or edit model parameters; not applicable if fine-tuning access is blocked.
- Unalignment can trade off utility for some models unless an instruction mix is used during fine-tuning.
- Unalignment data itself must be curated carefully to avoid introducing new biases.
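One way to watch for the utility trade-off listed above is to diff benchmark scores before and after fine-tuning and flag drops beyond a tolerance. This is a sketch under assumptions: the `utility_regressions` helper, the one-point tolerance, and the scores are placeholders, not values reported in the paper.

```python
def utility_regressions(before, after, tolerance=1.0):
    """Return {benchmark: change} for scores that dropped by more than `tolerance`."""
    mismatched = set(before) ^ set(after)
    if mismatched:
        raise ValueError(f"benchmarks missing from one side: {sorted(mismatched)}")
    return {
        name: after[name] - before[name]
        for name in before
        if before[name] - after[name] > tolerance
    }


# Placeholder scores in percentage points, for illustration only.
before = {"TruthfulQA": 50.0, "MMLU": 60.0, "HellaSwag": 70.0}
after = {"TruthfulQA": 49.5, "MMLU": 57.0, "HellaSwag": 70.5}
print(utility_regressions(before, after))
```

An empty result means no benchmark fell by more than the tolerance, which is the outcome the paper describes as typical.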
When Not To Use
- On production models where parameter changes are not allowed or not safe (no staging copy available).
- When fine-tune access is restricted by provider policies or technical constraints.
- If your goal is to test only input-space vulnerabilities (use prompt-based attacks instead).
Failure Modes
- Overfitting the Unalignment dataset and changing core knowledge rather than revealing superficial guardrails.
- False negatives on models that were never safety-aligned (Unalignment may be unnecessary).
- Evaluation bias from the judge (GPT-4) or manual labeling inconsistencies.
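Judge bias can be checked by having a human label a subset of responses and measuring agreement with the automated judge. The sketch below computes Cohen's kappa for binary harmful/not-harmful labels; using kappa here is a suggestion, not a procedure described in the paper, and the label lists are made up.

```python
def cohens_kappa(judge, human):
    """Cohen's kappa for two equal-length lists of boolean labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n   # share labelled harmful by the judge
    p_human = sum(human) / n   # share labelled harmful by the human
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        # Both raters gave one identical constant label; kappa is undefined,
        # and this sketch reports full agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels for four responses.
judge_labels = [True, True, False, False]
human_labels = [True, False, False, False]
print(f"kappa = {cohens_kappa(judge_labels, human_labels):.2f}")
```

Low agreement on the labelled subset is a signal to review the judge prompt or the labeling guidelines before trusting the reported ASR.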
Core Entities
Models
- ChatGPT
- VICUNA-1-7B
- VICUNA-1-13B
- VICUNA-2-7B
- VICUNA-2-13B
- LLAMA-2-CHAT-7B
- LLAMA-2-CHAT-13B
- GPT-4
Metrics
- Attack Success Rate (ASR)
- Helpfulness score (1-10)
- Bias exposure count (XEQUITEST)
Datasets
- Unalignment data D
- ADVERSARIALQA
- DANGEROUSQA
- XEQUITEST
- ShareGPT
Benchmarks
- TRUTHFULQA
- MMLU
- HELLASWAG

