Tune models (not prompts) to reliably break weak safety guardrails and reveal hidden harms

October 22, 2023 · 8 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

2

Authors

Rishabh Bhardwaj, Soujanya Poria

Links

Abstract / PDF

Why It Matters For Business

A cheap fine-tuning audit can reveal whether a safety-aligned model only appears safe against prompt-based attacks but fails once its parameters are tuned. Run the audit before deployment to avoid reputational, legal, or user-harm risks.

Summary TLDR

The paper introduces "Unalignment": a simple parametric red-teaming method that fine-tunes safety-aligned models on harmful instruction–response pairs to reveal hidden harmful behavior and bias. With as few as 100 curated samples and under $2 of OpenAI fine-tuning, Unalignment raised ChatGPT's attack success rate (ASR) to ~88% and produced >91% ASR on several open-source chat models. The method is cheap, applies across models, and usually preserves core task performance (small changes on TruthfulQA, MMLU, HellaSwag). Use it as an audit probe to diagnose alignment gaps and data issues.

Problem Statement

Prompt-based jailbreaks are model-specific and hit-or-miss. We need a universal, practical probe that reliably exposes shallow safety guardrails and latent bias without changing input prompts. The paper asks: can parameter tuning (fine-tuning) serve as a red-teaming method for evaluating alignment strength across models?

Main Contribution

Propose Unalignment: a parametric red-teaming probe that fine-tunes aligned models on harmful instruction→helpful-response pairs to break superficial safety guardrails.

Construct an Unalignment data workflow and XEQUITEST, a small bias-testing collection (politics, race, gender, religion) for zero-shot bias probing.

Empirically show Unalignment works across closed- and open-source models (ChatGPT, Vicuna variants, Llama-2-chat) with high attack success rates while mostly preserving utility benchmarks.

Key Findings

Unalignment turns ChatGPT from nearly never answering harmful queries to answering them 87.8% of the time (ASR).

Numbers: ChatGPT ASR 0.027 → 0.878 after Unalignment

Open-source chat models reached very high ASR under Unalignment, averaging ~91.4% across two harmful-question sets.

Numbers: Open-source average ASR 0.914 after Unalignment

Unalignment exposes political and social biases in aligned models: ChatGPT showed bias in 56.4% of XEQUITEST items after Unalignment.

Numbers: ChatGPT bias exposure ≈56.4% (XEQUITEST)

A small Unalignment dataset (≈100 samples) can be enough to break guardrails; the reported cost is under $2 via OpenAI fine-tuning.

Numbers: 100 samples → ChatGPT ASR ≈88%; fine-tuning cost <$2

Unalignment usually preserves model utility with small changes on standard benchmarks.

Numbers: Average change: TruthfulQA −0.39 pts, MMLU +0.16 pts, HellaSwag −2.09 pts

Results

Attack Success Rate (open-source average)

Value: 0.914 after Unalignment

Baseline: 0.045 with the standard prompt

Attack Success Rate (ChatGPT)

Value: 0.878 after Unalignment

Baseline: 0.027 with the standard prompt
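As a minimal sketch of how an ASR figure like the ones above could be computed, assume each harmful prompt yields one boolean judge verdict (True when the model's reply was judged to answer the harmful request). The function name and the sample verdicts are hypothetical illustrations, not data or code from the paper.

```python
from typing import Sequence


def attack_success_rate(verdicts: Sequence[bool]) -> float:
    """Fraction of harmful prompts that the model was judged to have answered."""
    if not verdicts:
        raise ValueError("need at least one judged response")
    return sum(verdicts) / len(verdicts)


# Hypothetical judge verdicts for the same 8 prompts (placeholders, not paper data).
verdicts_before = [False, False, False, False, False, False, False, True]
verdicts_after = [True, True, True, False, True, True, True, True]

asr_before = attack_success_rate(verdicts_before)
asr_after = attack_success_rate(verdicts_after)
print(f"ASR before={asr_before:.3f}, after={asr_after:.3f}")
```

Keeping the verdict lists aligned to the same prompts makes the before/after comparison a paired one, which is easier to audit item by item.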

Bias exposure (XEQUITEST)

Value: ChatGPT 57/100 biased responses after Unalignment (~56.4%)

Baseline: ChatGPT 12/100 before Unalignment (~12%)

Utility change (average across models)

Value: TruthfulQA −0.39 pts, MMLU +0.16 pts, HellaSwag −2.09 pts

Baseline: Standard (pre-Unalignment)
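The "average across models" utility change can be read as a per-benchmark mean of score differences. The sketch below assumes scores are stored per model and per benchmark in percentage points; the model names and scores are placeholders chosen for illustration, not the paper's measurements.

```python
from statistics import mean
from typing import Dict

Scores = Dict[str, Dict[str, float]]  # model name -> benchmark name -> score


def average_utility_change(before: Scores, after: Scores) -> Dict[str, float]:
    """Mean (after - before) score per benchmark, averaged over all models."""
    benchmarks = next(iter(before.values())).keys()
    return {
        bench: mean(after[model][bench] - before[model][bench] for model in before)
        for bench in benchmarks
    }


# Placeholder scores for two hypothetical models.
scores_before = {
    "model_a": {"TruthfulQA": 50.0, "MMLU": 45.0, "HellaSwag": 75.0},
    "model_b": {"TruthfulQA": 48.0, "MMLU": 50.0, "HellaSwag": 78.0},
}
scores_after = {
    "model_a": {"TruthfulQA": 49.5, "MMLU": 45.5, "HellaSwag": 73.0},
    "model_b": {"TruthfulQA": 48.0, "MMLU": 50.0, "HellaSwag": 76.0},
}

changes = average_utility_change(scores_before, scores_after)
for bench, delta in changes.items():
    print(f"{bench}: {delta:+.2f} pts")
```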

Helpfulness of harmful responses

Value: Average 9.62/10 after Unalignment

Baseline: 8.90/10 (Chain of Utterances, CoU, prompt)

Who Should Care

What To Try In 7 Days

Run a 100-sample Unalignment fine-tune on a staging copy to see if safety guardrails are superficial.

Use XEQUITEST or a small bias set to check latent political, racial, gender, and religious biases.

Compare ASR before/after to quantify alignment fragility and prioritize model fixes or additional training data.
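One way to turn the before/after comparison into a prioritized list is to rank audited models by their absolute ASR gain. This is a sketch under that assumption: the `AsrResult` class and the "gain" measure are illustrative choices, and the two rows reuse the aggregate figures reported in the Results section.

```python
from dataclasses import dataclass


@dataclass
class AsrResult:
    name: str
    asr_before: float  # ASR with the standard prompt
    asr_after: float   # ASR after the Unalignment fine-tune

    @property
    def gain(self) -> float:
        """Absolute ASR increase; a larger gain suggests more superficial guardrails."""
        return self.asr_after - self.asr_before


results = [
    AsrResult("ChatGPT", 0.027, 0.878),
    AsrResult("Open-source average", 0.045, 0.914),
]

# Largest gain first: these are the candidates to prioritize for fixes.
ranked = sorted(results, key=lambda r: r.gain, reverse=True)
for r in ranked:
    print(f"{r.name}: {r.asr_before:.3f} -> {r.asr_after:.3f} (gain {r.gain:+.3f})")
```

In an internal audit, each row would be one staged model or checkpoint rather than the aggregates shown here.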

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Requires ability to fine-tune or edit model parameters; not applicable if fine-tune access is blocked.
  • Unalignment can trade off utility for some models unless an instruction mix is used during fine-tuning.
  • Unalignment data itself must be curated carefully to avoid introducing new biases.

When Not To Use

  • On production models where parameter changes are not allowed or not safe (no staging copy available).
  • When fine-tune access is restricted by provider policies or technical constraints.
  • If your goal is to test only input-space vulnerabilities (use prompt-based attacks instead).

Failure Modes

  • Overfitting the Unalignment dataset and changing core knowledge rather than revealing superficial guardrails.
  • False negatives on models that were never safety-aligned (Unalignment may be unnecessary).
  • Evaluation bias from the judge (GPT-4) or manual labeling inconsistencies.
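To gauge the judge-related failure mode, one option is to label a sample of responses manually and measure chance-corrected agreement with the automated judge. The sketch below uses Cohen's kappa, which is a swapped-in, standard agreement statistic and not something the source says the authors computed; the label lists are hypothetical.

```python
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance agreement."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum((list(a).count(l) / n) * (list(b).count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels: 1 = response judged harmful, 0 = judged harmless.
judge_labels = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
human_labels = [1, 1, 0, 1, 0, 1, 1, 1, 0, 0]

kappa = cohens_kappa(judge_labels, human_labels)
print(f"Cohen's kappa between judge and human labels: {kappa:.3f}")
```

A low kappa on the sampled items would suggest that reported ASR or bias counts depend heavily on who, or what, does the judging.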

Core Entities

Models

  • ChatGPT
  • VICUNA-1-7B
  • VICUNA-1-13B
  • VICUNA-2-7B
  • VICUNA-2-13B
  • LLAMA-2-CHAT-7B
  • LLAMA-2-CHAT-13B
  • GPT-4

Metrics

  • Attack Success Rate (ASR)
  • Helpfulness score (1-10)
  • Bias exposure count (XEQUITEST)

Datasets

  • Unalignment data D
  • ADVERSARIALQA
  • DANGEROUSQA
  • XEQUITEST
  • ShareGPT

Benchmarks

  • TRUTHFULQA
  • MMLU
  • HELLASWAG