Teach an LLM to 'forget' bad behaviors using only negative examples and cheap finetuning

October 14, 2023 · 9 min

Overview

Decision Snapshot: Ready For Pilot

The method is simple and fast to run; experiments on multiple models and tasks show large reductions in targeted bad outputs. Main risks are utility loss and nonsensical outputs on forbidden prompts unless mitigated; evidence comes from tables across three tasks and ablations.

Citations: 9

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 60%

Authors

Yuanshun Yao, Xiaojun Xu, Yang Liu

Why It Matters For Business

If your priority is to stop a model from producing specific harmful or copyrighted outputs quickly and cheaply, unlearning cuts those outputs dramatically with only finetune-level compute and no costly human-written positive examples.

Who Should Care

Summary TLDR

This paper introduces a practical way to make large language models stop producing specific undesirable outputs (harmful replies, copyrighted text, or hallucinated facts) by 'unlearning' with only negative examples. The method is a light, finetune-style procedure built on gradient ascent on bad outputs plus two losses to preserve normal behavior. On multiple models and datasets the method drives harmful/leak/hallucination rates to near zero on evaluated prompts, generalizes to unseen similar prompts, and costs roughly 2% of a full RLHF run. Trade-offs: outputs on forbidden prompts often become nonsensical unless you force templated replies, and preserving utility needs careful normal-data KL

Problem Statement

How can practitioners quickly remove specific unwanted behaviors from a pretrained LLM (harmful answers, copyrighted memorization, or wrong facts) using only examples of the unwanted outputs, without retraining the whole model and with low compute?

Main Contribution

Formulate LLM unlearning: goals, setting, and metrics for removing undesirable outputs using only negative examples.

A practical unlearning recipe: gradient-ascent on bad outputs + random-mismatch loss + KL-loss to original model to preserve utility.
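The recipe above combines three terms into one training signal. A minimal sketch of how they might be summed into a single scalar objective follows, assuming the per-position next-token probabilities are already available as plain Python lists. The function names, the weights `w_forget`, `w_mismatch`, and `w_kl`, and the averaging scheme are illustrative assumptions, not the authors' code.

```python
import math


def mean_cross_entropy(dists, targets):
    """Average negative log-likelihood of the target token ids."""
    return -sum(math.log(d[t]) for d, t in zip(dists, targets)) / len(targets)


def mean_kl(p_dists, q_dists):
    """Average KL(p || q) over positions; assumes every q entry is > 0."""
    total = 0.0
    for p, q in zip(p_dists, q_dists):
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(p_dists)


def unlearning_objective(bad_dists, bad_targets,
                         mismatch_dists, mismatch_targets,
                         orig_normal_dists, cur_normal_dists,
                         w_forget=1.0, w_mismatch=1.0, w_kl=1.0):
    """Scalar to minimize (hypothetical weighting, not taken from the paper)."""
    # Gradient ascent on bad outputs: minimizing the negated cross-entropy
    # pushes probability away from the unwanted tokens.
    forget = -mean_cross_entropy(bad_dists, bad_targets)
    # Random mismatch: ordinary cross-entropy toward unrelated responses
    # for the same forbidden prompts.
    mismatch = mean_cross_entropy(mismatch_dists, mismatch_targets)
    # Utility preservation: keep the current model close to the original
    # model's distributions on normal data.
    keep = mean_kl(orig_normal_dists, cur_normal_dists)
    return w_forget * forget + w_mismatch * mismatch + w_kl * keep


# Toy check with a two-token vocabulary.
uniform = [[0.5, 0.5]]
base = unlearning_objective(uniform, [0], uniform, [1], uniform, uniform)
forgot = unlearning_objective([[0.1, 0.9]], [0], uniform, [1], uniform, uniform)
print(forgot < base)  # True: lowering the bad token's probability lowers the objective
```

In a real run these quantities would come from model logits under teacher forcing, and the three weights would need tuning; the sketch only shows the sign and role of each term.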

Key Findings

Unlearning can reduce harmful output rates to near zero on evaluated harmful prompts.

Numbers: harmful rate 47% -> 1% (OPT-1.3B, Table 3)

Practical Use: If you only need to stop a model from producing flagged harmful replies, run the unlearning recipe (GA or GA+Mismatch) to cut harmful outputs drastically on similar prompts.

Evidence Ref: Table 3, Section 6

Unlearning removes copyrighted completions to near-zero leakage on extraction prompts.

Numbers: copyright leak rate 70–81% -> 0–1% after GA (Table 4)

Practical Use: To comply with takedown/copyright requests without retraining, apply unlearning to the exact extraction prompts to stop leaking those passages.

Evidence Ref: Table 4, Section 7

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
harmful rate on unlearned prompts (OPT-1.3B) | Original 47% → GA 1% (Table 3) | Original model | −46 percentage points | PKU-SafeRLHF unlearned prompts | Table 3 reports harmful rates for Original and GA/GA+Mismatch on OPT-1.3B | Table 3
copyright leak rate on extraction prompts | Original 70–81% → GA/GA+Mismatch 0–1% (Table 4) | Original finetuned on HP | ≈ −70 percentage points | Harry Potter extraction prompts | Table 4 shows leak rates drop to ~0% after unlearning | Table 4
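The Delta column reports absolute changes in percentage points, which is not the same as the relative reduction. A small sketch of both calculations for the harmful-rate row; the helper name `rate_delta` is hypothetical:

```python
def rate_delta(before_pct, after_pct):
    """Return (change in percentage points, relative reduction in percent)."""
    pp_change = after_pct - before_pct
    relative_reduction = (before_pct - after_pct) / before_pct * 100
    return pp_change, relative_reduction


# Harmful rate for OPT-1.3B as reported above: 47% before, 1% after GA.
pp, rel = rate_delta(47.0, 1.0)
print(f"{pp:+.0f} pp, {rel:.1f}% relative reduction")  # -46 pp, 97.9% relative reduction
```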

What To Try In 7 Days

Collect representative negative examples (user reports / red-team outputs) for the unwanted behavior.

Run gradient-ascent unlearning with random-mismatch and KL-to-original losses on a small validation set.

If you need readable refusals, replace random targets with a short templated reply and retune loss weights.
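The last step above swaps random targets for a fixed reply. A minimal sketch of how the target side of the mismatch data might be built either way, assuming forbidden prompts and a pool of normal responses are held in plain lists; the function name, the seed handling, and the example refusal string are all hypothetical:

```python
import random


def build_mismatch_pairs(forbidden_prompts, normal_responses, template=None, seed=0):
    """Pair each forbidden prompt with a target response.

    With template=None, each target is a randomly drawn normal response
    (random mismatch). With a template string, every forbidden prompt
    is paired with that same fixed reply instead.
    """
    rng = random.Random(seed)  # fixed seed so the pairing is reproducible
    pairs = []
    for prompt in forbidden_prompts:
        target = template if template is not None else rng.choice(normal_responses)
        pairs.append((prompt, target))
    return pairs


prompts = ["forbidden prompt A", "forbidden prompt B"]
pool = ["The capital of France is Paris.", "Water boils at 100 C at sea level."]

random_pairs = build_mismatch_pairs(prompts, pool)
templated_pairs = build_mismatch_pairs(prompts, pool, template="I can't assist with that.")
print(templated_pairs[0][1])  # I can't assist with that.
```

Switching to a template changes what the mismatch term pulls the model toward, so the loss weights tuned for random targets should not be assumed to carry over.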

Optimization Features

System Optimization
single-GPU (A100) runs for runtime estimates
Training Optimization
low-cost gradient ascent finetuning
LoRA

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

PKU-SafeRLHF (public dataset)
TruthfulQA (public dataset)
HaluEval (public dataset)
BookCorpus (public dataset)

Risks & Boundaries

Limitations

On forbidden prompts the model often outputs nonsensical or repeated characters; templated replies require extra tuning (Section 9.2).

Preserving normal utility is fragile: naive GA can destroy fluency; requires mismatch/KL losses and matching data format (Section 3.2).

When Not To Use

When you need the model to learn desirable, high-quality responses (full RLHF / human-written positives are needed).

When you cannot collect representative negative examples for the behavior you want removed.

Failure Modes

Model learns format shortcuts if normal data format differs from unlearned data, leaving behavior unaltered (Section 3.2).

GA can push model to generate incoherent outputs on normal prompts if mismatch/KL are not used (Table 2).
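One way to guard against the format-shortcut failure mode is to pass both the unlearned data and the normal data through the same formatter, so the model cannot tell them apart by layout alone. A sketch under that assumption; the template string and function name are hypothetical and may not match the format used in the paper:

```python
# Hypothetical shared template; the only requirement is that BOTH data
# sources are rendered through it.
QA_TEMPLATE = "### Question: {question}\n### Answer: {answer}"


def format_pair(question, answer):
    """Render a (question, answer) pair with the shared template."""
    return QA_TEMPLATE.format(question=question.strip(), answer=answer.strip())


unlearn_example = format_pair("a flagged harmful question", "the flagged harmful reply")
normal_example = format_pair("What is the boiling point of water?", "100 C at sea level.")

# Both examples now share the same prefix and separator.
print(normal_example)
```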

Core Entities

Models

OPT-1.3B
OPT-2.7B
Llama2-7B
deberta-v3-large-v2 (reward model)

Metrics

harmful rate
leak rate
hallucination rate
diversity (unique-token fraction)
fluency (perplexity / NM flag)
BLEU (copyright extraction)
BLEURT (output similarity)
BERTScore
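The diversity metric listed above is described as a unique-token fraction. A minimal sketch of that idea, assuming simple whitespace tokenization (the paper's tokenizer may differ); it also shows why repeated-character outputs score low:

```python
def unique_token_fraction(text):
    """Fraction of whitespace-separated tokens that are distinct."""
    tokens = text.split()
    if not tokens:
        return 0.0  # convention chosen here for empty output
    return len(set(tokens)) / len(tokens)


print(unique_token_fraction("the cat sat"))  # 1.0
print(unique_token_fraction("a a a a"))      # 0.25
```

A sharp drop in this value on normal prompts after unlearning would be one cheap signal that utility has degraded.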

Datasets

PKU-SafeRLHF (harmful Q&A)
TruthfulQA (normal prompts)
BookCorpus (normal data)
Harry Potter (copyrighted corpus, experimental)
HaluEval (hallucinated Q&A)