Teach an LLM to 'forget' bad behaviors using only negative examples and cheap finetuning

Overview

Decision Snapshot: Ready For Pilot

The method is simple and fast to run; experiments on multiple models and tasks show large reductions in targeted bad outputs. Main risks are utility loss and nonsensical outputs on forbidden prompts unless mitigated; evidence comes from tables across three tasks and ablations.

Citations: 9

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 60%

Authors

Yuanshun Yao, Xiaojun Xu, Yang Liu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If your priority is to stop a model from producing specific harmful or copyrighted outputs quickly and cheaply, unlearning cuts those outputs dramatically with only finetune-level compute and no costly human-written positive examples.

Who Should Care

CTO, Product Manager, ML Engineer, Founder

Summary TLDR

This paper introduces a practical way to make large language models stop producing specific undesirable outputs (harmful replies, copyrighted text, or hallucinated facts) by 'unlearning' with only negative examples. The method is a light, finetune-style procedure built on gradient ascent on bad outputs plus two losses to preserve normal behavior. On multiple models and datasets the method drives harmful/leak/hallucination rates to near zero on evaluated prompts, generalizes to unseen similar prompts, and costs roughly 2% of a full RLHF run. Trade-offs: outputs on forbidden prompts often become nonsensical unless you force templated replies, and preserving utility needs careful normal-data KL

Problem Statement

How can practitioners quickly remove specific unwanted behaviors from a pretrained LLM (harmful answers, copyrighted memorization, or wrong facts) using only examples of the unwanted outputs, without retraining the whole model and with low compute?

Main Contribution

Formulate LLM unlearning: goals, setting, and metrics for removing undesirable outputs using only negative examples.

A practical unlearning recipe: gradient-ascent on bad outputs + random-mismatch loss + KL-loss to original model to preserve utility.
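The three terms of the recipe can be illustrated with a toy objective over single next-token distributions. This is a minimal sketch, not the authors' implementation: the function names, the equal default weights, and the KL direction (original model relative to current model) are assumptions made for illustration, and a real run would compute these terms over token sequences from an actual model.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of `target` under one next-token distribution."""
    return -math.log(probs[target])

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def unlearning_loss(p_forget, bad_token, random_token,
                    p_normal_orig, p_normal_cur,
                    w_bad=1.0, w_rand=1.0, w_kl=1.0):
    """Toy three-term objective (weights and KL direction are assumptions)."""
    # Gradient ascent on the bad output: minimizing the negated
    # cross-entropy pushes the bad token's probability down.
    forget_term = -cross_entropy(p_forget, bad_token)
    # Random mismatch: ordinary cross-entropy toward an unrelated target.
    mismatch_term = cross_entropy(p_forget, random_token)
    # Utility preservation: keep predictions on normal data close to
    # those of the original model.
    kl_term = kl_divergence(p_normal_orig, p_normal_cur)
    return w_bad * forget_term + w_rand * mismatch_term + w_kl * kl_term

normal = [0.5, 0.3, 0.2]  # unchanged normal-prompt prediction, so KL is 0
before = unlearning_loss([0.7, 0.2, 0.1], 0, 1, normal, normal)
after = unlearning_loss([0.1, 0.8, 0.1], 0, 1, normal, normal)
print(after < before)  # True: the loss rewards moving mass off the bad token
```

Note that the forget term has no lower bound as the bad token's probability approaches zero, which is one way to see why the mismatch and KL terms are needed to keep training stable.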

Key Findings

Unlearning can reduce harmful output rates to near zero on evaluated harmful prompts.

Numbers: harmful rate 47% -> 1% (OPT-1.3B, Table 3)

Practical Use: If you only need to stop a model from producing flagged harmful replies, run the unlearning recipe (GA or GA+Mismatch) to cut harmful outputs drastically on similar prompts.

Evidence Ref: Table 3, Section 6

Unlearning removes copyrighted completions to near-zero leakage on extraction prompts.

Numbers: copyright leak rate 70–81% -> 0–1% after GA (Table 4)

Practical Use: To comply with takedown/copyright requests without retraining, apply unlearning to the exact extraction prompts to stop leaking those passages.

Evidence Ref: Table 4, Section 7

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
harmful rate on unlearned prompts (OPT-1.3B)	Original 47% → GA 1% (Table 3)	Original model	−46 percentage points	PKU-SafeRLHF unlearned prompts	Table 3 reports harmful rates for Original and GA/GA+Mismatch on OPT-1.3B	Table 3
copyright leak rate on extraction prompts	Original 70–81% → GA/GA+Mismatch 0–1% (Table 4)	Original finetuned on HP	≈ −70 percentage points	Harry Potter extraction prompts	Table 4 shows leak rates drop to ~0% after unlearning	Table 4
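The Delta column reports absolute changes in percentage points, which is different from a relative reduction. A small sketch of both calculations, using the harmful-rate figures from the first row (the helper names are illustrative):

```python
def percentage_point_delta(before_pct, after_pct):
    """Absolute change between two rates, in percentage points."""
    return after_pct - before_pct

def relative_reduction(before_pct, after_pct):
    """Fraction of the original rate that was removed."""
    return (before_pct - after_pct) / before_pct

print(percentage_point_delta(47, 1))        # -46
print(round(relative_reduction(47, 1), 3))  # 0.979
```

Read this way, the 47% to 1% drop is a 46-point absolute change and removes about 98% of the original harmful outputs on the evaluated prompts.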

What To Try In 7 Days

Collect representative negative examples (user reports / red-team outputs) for the unwanted behavior.

Run gradient-ascent unlearning with random-mismatch and KL-to-original losses on a small validation set.

If you need readable refusals, replace random targets with a short templated reply and retune loss weights.
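The choice between random targets and a templated reply can be sketched as a data-preparation step. This is an illustrative helper only: the function name, the assumption that random targets are drawn from a pool of unrelated normal responses, and the example refusal text are not taken from the paper or its code.

```python
import random

def build_mismatch_pairs(forbidden_prompts, normal_responses,
                         template=None, seed=0):
    """Pair each forbidden prompt with a target for the mismatch loss.

    With `template=None`, each target is a randomly chosen unrelated
    response (assumed sampling scheme). With a template string, every
    forbidden prompt is paired with that fixed reply instead.
    """
    rng = random.Random(seed)  # seeded for repeatable pairing
    pairs = []
    for prompt in forbidden_prompts:
        if template is not None:
            target = template
        else:
            target = rng.choice(normal_responses)
        pairs.append((prompt, target))
    return pairs

prompts = ["forbidden prompt A", "forbidden prompt B"]
pool = ["The sky appears blue.", "Water boils at 100 C at sea level."]

random_pairs = build_mismatch_pairs(prompts, pool)
templated_pairs = build_mismatch_pairs(
    prompts, pool, template="I can't help with that request.")
```

Switching to a template changes the target distribution, so the loss weights tuned for random targets should not be assumed to carry over.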

Optimization Features

System Optimization

single-GPU (A100) runs for runtime estimates

Training Optimization

low-cost gradient ascent finetuning, LoRA

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/kevinyaobytedance/llm_unlearn

Data URLs

PKU-SafeRLHF (public dataset), TruthfulQA (public dataset), HaluEval (public dataset), BookCorpus (public dataset)

Risks & Boundaries

Limitations

On forbidden prompts the model often outputs nonsensical or repeated characters; templated replies require extra tuning (Section 9.2).

Preserving normal utility is fragile: naive GA can destroy fluency; requires mismatch/KL losses and matching data format (Section 3.2).

When Not To Use

When you need the model to learn desirable, high-quality responses (full RLHF / human-written positives are needed).

When you cannot collect representative negative examples for the behavior you want removed.

Failure Modes

Model learns format shortcuts if normal data format differs from unlearned data, leaving behavior unaltered (Section 3.2).

GA can push model to generate incoherent outputs on normal prompts if mismatch/KL are not used (Table 2).
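One way to guard against the format-shortcut failure is to route both the unlearned examples and the normal examples through the same formatting function, so the model cannot tell them apart by layout alone. The template below is hypothetical and stands in for whatever prompt format a given pipeline uses.

```python
def format_example(question, answer):
    """Render a Q&A pair with one shared template (hypothetical layout)."""
    return f"### Question: {question}\n### Answer: {answer}"

# Both data sources go through the same formatter.
unlearn_text = format_example("forbidden prompt", "flagged output")
normal_text = format_example("What is the capital of France?", "Paris")

# The only differences are in the content, not the surrounding markup.
print(unlearn_text.split(":")[0] == normal_text.split(":")[0])  # True
```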

Core Entities

Models

OPT-1.3B, OPT-2.7B, Llama2-7B, deberta-v3-large-v2 (reward model)

Metrics

harmful rate, leak rate, hallucination rate, diversity (unique-token fraction), fluency (perplexity / NM flag), BLEU (copyright extraction), BLEURT (output similarity), BERTScore
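The diversity metric is described as a unique-token fraction. A minimal sketch follows, assuming whitespace tokenization; the paper's exact tokenizer is not stated in this summary, so treat the split rule as a placeholder.

```python
def unique_token_fraction(text):
    """Share of tokens in `text` that are distinct (whitespace-split)."""
    tokens = text.split()
    if not tokens:
        return 0.0  # define empty output as zero diversity
    return len(set(tokens)) / len(tokens)

print(unique_token_fraction("the cat sat"))  # 1.0
print(unique_token_fraction("a a a a"))      # 0.25
```

A low score flags repetitive output, but whitespace splitting will not catch repeated characters inside a single token, so a model tokenizer is likely a better fit for checking degenerate generations.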

Datasets

PKU-SafeRLHF (harmful Q&A)TruthfulQA (normal prompts)BookCorpus (normal data)Harry Potter (copyrighted corpus, experimental)HaluEval (hallucinated Q&A)

