Teach an LLM to 'forget' bad behaviors using only negative examples and cheap finetuning

October 14, 2023 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

9

Authors

Yuanshun Yao, Xiaojun Xu, Yang Liu

Why It Matters For Business

If your priority is to stop a model from producing specific harmful or copyrighted outputs quickly and cheaply, unlearning cuts those outputs to near zero on the evaluated prompts, using only finetune-level compute and no costly human-written positive examples.

Summary TLDR

This paper introduces a practical way to make large language models stop producing specific undesirable outputs (harmful replies, copyrighted text, or hallucinated facts) by 'unlearning' with only negative examples. The method is a light, finetune-style procedure built on gradient ascent on bad outputs plus two losses to preserve normal behavior. Across multiple models and datasets, it drives harmful and copyright-leak rates to near zero on evaluated prompts, cuts hallucination rates substantially, generalizes to unseen similar prompts, and costs roughly 2% of a full RLHF run. Trade-offs: outputs on forbidden prompts often become nonsensical unless you force templated replies, and preserving utility needs a carefully tuned KL loss on normal data.

Problem Statement

How can practitioners quickly remove specific unwanted behaviors from a pretrained LLM (harmful answers, copyrighted memorization, or wrong facts) using only examples of the unwanted outputs, without retraining the whole model and with low compute?

Main Contribution

Formulate LLM unlearning: goals, setting, and metrics for removing undesirable outputs using only negative examples.

A practical unlearning recipe: gradient-ascent on bad outputs + random-mismatch loss + KL-loss to original model to preserve utility.

Demonstrate on three tasks (harmfulness, copyrighted text, hallucination) across OPT and Llama2 models: high effectiveness and strong generalization to unseen similar prompts.

Show major compute advantage: unlearning needs ~2% of the runtime of a full RLHF pipeline while achieving similar or better reductions in harmful outputs in the tested setting.

Provide ablations: templated outputs, mismatch loss benefits, and comparisons to SFT/RLHF.
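
The recipe in the second item above can be written as a three-term objective. This is a sketch under assumed notation, not necessarily the paper's exact formulation: θ denotes the current parameters, θ₀ the original model, D_fgt the negative examples, D_nor the normal data, Y_rdn a pool of random unrelated responses, ℓ the token-level negative log-likelihood of a response given its prompt, and ε₁, ε₂, ε₃, η are hypothetical weights and step size.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{fgt}} &= -\sum_{(x,\,y) \in D_{\mathrm{fgt}}} \ell(x, y; \theta) \\
\mathcal{L}_{\mathrm{rdn}} &= \sum_{(x,\,\cdot) \in D_{\mathrm{fgt}}} \frac{1}{|Y_{\mathrm{rdn}}|} \sum_{y' \in Y_{\mathrm{rdn}}} \ell(x, y'; \theta) \\
\mathcal{L}_{\mathrm{nor}} &= \sum_{(x,\,y) \in D_{\mathrm{nor}}} \sum_{i=1}^{|y|} \mathrm{KL}\!\left( p_{\theta_0}(\cdot \mid x, y_{<i}) \,\middle\|\, p_{\theta}(\cdot \mid x, y_{<i}) \right) \\
\theta_{t+1} &= \theta_t - \eta \, \nabla_{\theta} \left( \epsilon_1 \mathcal{L}_{\mathrm{fgt}} + \epsilon_2 \mathcal{L}_{\mathrm{rdn}} + \epsilon_3 \mathcal{L}_{\mathrm{nor}} \right)
\end{aligned}
```

Minimizing the first term is gradient ascent on the likelihood loss of the bad outputs; the second term pulls forbidden prompts toward unrelated responses, and the third keeps predictions on normal data close to the original model.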

Key Findings

Unlearning can reduce harmful output rates to near zero on evaluated harmful prompts.

Numbers: harmful rate 47% → 1% (OPT-1.3B, Table 3)

Unlearning removes copyrighted completions to near-zero leakage on extraction prompts.

Numbers: copyright leak rate 70–81% → 0–1% after GA (Table 4)

Unlearning lowers hallucination rates substantially but not always to zero; it generalizes to similar misleading questions.

Numbers: hallucination rate reduced from ~50–60% to ~10–15% on in-distribution tests (Table 5)

Compute cost is much lower than RLHF: unlearning runs in ~2% of the full RLHF time on a single A100 GPU.

Numbers: compute ≈ 2% of full RLHF pipeline (Figure 3, Table 6)

Naive gradient ascent can destroy normal-model utility; adding a random-mismatch loss and KL to the original model preserves utility better.

Numbers: GA produced nonsensical outputs; GA+Mismatch shows higher utility reward and similarity to original (Table 3, Table 6)
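
To see why naive gradient ascent is risky, it may help to compute the three loss terms on toy numbers. The sketch below is a pure-Python illustration, not the paper's implementation: the function names, the weights `w_fgt`, `w_rdn`, `w_nor`, and the use of hand-written probability lists in place of real model outputs are all assumptions made for the example.

```python
import math

def nll(token_probs):
    """Negative log-likelihood of a response from its per-token probabilities."""
    return -sum(math.log(p) for p in token_probs)

def kl_divergence(p_ref, p_cur):
    """KL(p_ref || p_cur) over a shared vocabulary.

    Assumes every entry of p_cur is strictly positive.
    """
    return sum(r * math.log(r / c) for r, c in zip(p_ref, p_cur) if r > 0)

def unlearning_loss(bad_probs, random_probs, ref_dist, cur_dist,
                    w_fgt=1.0, w_rdn=1.0, w_nor=1.0):
    """Weighted sum of the three terms (weights are hypothetical)."""
    forget = -nll(bad_probs)       # gradient ascent: minimize the negated NLL
    mismatch = nll(random_probs)   # ordinary NLL on a random, unrelated response
    normal = kl_divergence(ref_dist, cur_dist)  # stay close to the original model
    return w_fgt * forget + w_rdn * mismatch + w_nor * normal
```

For a two-token bad response, the forget term is about -1.39 when each token has probability 0.5 and about -9.21 when each has probability 0.01. It keeps falling as the probabilities shrink, so nothing in that term alone stops the optimizer from degrading the model; the mismatch and KL terms are the counterweights.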

Results

harmful rate on unlearned prompts (OPT-1.3B)

Value: Original 47% → GA 1% (Table 3)

Baseline: Original model

copyright leak rate on extraction prompts

Value: Original 70–81% → GA/GA+Mismatch 0–1% (Table 4)

Baseline: Original finetuned on HP

hallucination rate on misleading questions

Value: Original ~50–60% → GA/GA+Mismatch ~10–15% (Table 5)

Baseline: Original model

compute cost vs RLHF

Value: Unlearning ≈ 2% of full RLHF runtime (Figure 3, Section 9.1)

Baseline: Full RLHF pipeline runtime

utility preservation (normal prompts)

Value: GA often breaks fluency; GA+Mismatch preserves utility and similarity closer to original (Table 3)

Baseline: Original model outputs

Who Should Care

What To Try In 7 Days

Collect representative negative examples (user reports / red-team outputs) for the unwanted behavior.

Run gradient-ascent unlearning with random-mismatch and KL-to-original losses, then check the results on a small validation set.

If you need readable refusals, replace random targets with a short templated reply and retune loss weights.
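
The data preparation behind the last two steps can be sketched as follows. This is a minimal illustration with hypothetical names (`build_mismatch_pairs`, `normal_responses`, `template`), not code from the paper: each forbidden prompt is paired either with a randomly drawn unrelated response or, if a template is supplied, with one fixed refusal.

```python
import random

def build_mismatch_pairs(forget_prompts, normal_responses, template=None, seed=0):
    """Pair each forget prompt with a target that is not the bad output.

    With `template` set, every prompt gets that fixed reply; otherwise the
    target is drawn at random from `normal_responses`.
    """
    if template is None and not normal_responses:
        raise ValueError("need normal_responses when no template is given")
    rng = random.Random(seed)  # seeded so the pairing is reproducible
    pairs = []
    for prompt in forget_prompts:
        target = template if template is not None else rng.choice(normal_responses)
        pairs.append((prompt, target))
    return pairs
```

The resulting pairs would feed the mismatch loss during finetuning. Switching from random targets to a template changes the training signal, which is presumably why the loss weights need retuning afterwards.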

Optimization Features

System Optimization

  • single-GPU (A100) runs for runtime estimates

Training Optimization

  • low-cost gradient ascent finetuning
  • LoRA

Reproducibility

Data Urls

  • PKU-SafeRLHF (public dataset)
  • TruthfulQA (public dataset)
  • HaluEval (public dataset)
  • BookCorpus (public dataset)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • On forbidden prompts the model often outputs nonsensical or repeated characters; templated replies require extra tuning (Section 9.2).
  • Preserving normal utility is fragile: naive GA can destroy fluency; requires mismatch/KL losses and matching data format (Section 3.2).
  • Unlearning reduces undesirable outputs but does not produce helpful positive responses (no positive examples assumed).
  • Evaluation uses original model as ground truth (no retrained oracle); membership-inference and full training-corpus effects are not addressed.

When Not To Use

  • When you need the model to learn desirable, high-quality responses (full RLHF / human-written positives are needed).
  • When you cannot collect representative negative examples for the behavior you want removed.
  • When silent or nonsensical outputs are unacceptable and templated refusals are not allowed.

Failure Modes

  • Model learns format shortcuts if normal data format differs from unlearned data, leaving behavior unaltered (Section 3.2).
  • GA can push model to generate incoherent outputs on normal prompts if mismatch/KL are not used (Table 2).
  • Unlearning specific items may not eliminate related but different memorized content beyond the distribution of D_fgt.
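
One way to guard against the format-shortcut failure mode in the first item is to pass both the unlearned data and the normal data through the same formatter, so the model cannot tell them apart by layout alone. The sketch below assumes a hypothetical question/answer template and helper names; the actual prompt format used in the paper is not stated here.

```python
# Hypothetical shared template; substitute whatever format your data uses.
QA_TEMPLATE = "### Question: {question}\n### Answer: {answer}"

def format_example(question, answer):
    """Render one (question, answer) pair in the shared format."""
    return QA_TEMPLATE.format(question=question.strip(), answer=answer.strip())

def format_dataset(pairs):
    """Apply the same formatter to every pair, forget data and normal data alike."""
    return [format_example(q, a) for q, a in pairs]
```

Calling `format_dataset` on both datasets before training keeps the two sources distinguishable only by content.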

Core Entities

Models

  • OPT-1.3B
  • OPT-2.7B
  • Llama2-7B
  • deberta-v3-large-v2 (reward model)

Metrics

  • harmful rate
  • leak rate
  • hallucination rate
  • diversity (unique-token fraction)
  • fluency (perplexity / NM flag)
  • BLEU (copyright extraction)
  • BLEURT (output similarity)
  • BERTScore
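
The rate-style metrics and the diversity metric above reduce to simple fractions. The following is a minimal sketch with assumed names and whitespace tokenization; how each output gets flagged as harmful, leaked, or hallucinated (a reward model, a BLEU threshold, and so on) is outside its scope.

```python
def rate(flags):
    """Fraction of outputs flagged as harmful, leaked, or hallucinated."""
    flags = list(flags)
    return sum(1 for f in flags if f) / len(flags) if flags else 0.0

def diversity(text):
    """Unique-token fraction, using whitespace tokenization (an assumption)."""
    tokens = text.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

print(rate([True, False, False, False]))  # 0.25
print(diversity("a a a a"))               # 0.25
print(diversity("the cat sat"))           # 1.0
```

A degenerate output that repeats one token scores low on diversity, which is how this metric can surface the nonsensical or repeated-character outputs noted under Limitations.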

Datasets

  • PKU-SafeRLHF (harmful Q&A)
  • TruthfulQA (normal prompts)
  • BookCorpus (normal data)
  • Harry Potter (copyrighted corpus, experimental)
  • HaluEval (hallucinated Q&A)