Simple prompts and filters can match finetuning on output-level 'unlearning' and expose benchmark blind spots

March 5, 2024 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.4

Cost Impact Score

0.8

Citation Count

3

Authors

Pratiksha Thaker, Yash Maurya, Shengyuan Hu, Zhiwei Steven Wu, Virginia Smith

Links

Abstract / PDF

Why It Matters For Business

Guardrails (prompts and filters) are low-cost ways to hide or block sensitive outputs from API-accessible models. Use them as a quick mitigation, as QA checks, or to generate finetuning data before spending on full retraining.

Summary TLDR

The paper tests lightweight 'guardrails'—prompt prefixes, input/output filters, and simple classifier heads—as baselines for removing knowledge from large language models (LLMs). Across three recent unlearning benchmarks (Who's Harry Potter, TOFU, WMDP) guardrails often match or approach finetuning on output-based metrics. Guardrails are cheap and fast to try, but they do not change model weights, can be brittle to adversarial queries, and may fail when many items must be forgotten. The authors recommend using guardrails as sanity-check baselines and redesigning unlearning metrics to distinguish output-only fixes from true parameter-level forgetting.

Problem Statement

Finetuning is the main route people use to make LLMs 'forget' data, but it is computationally costly and complex. The paper asks: can much cheaper methods—prompt prefixes, input/output filters, or small classifier heads—produce comparable unlearning on the common output-level benchmarks, and what does that imply for evaluation?

Main Contribution

Show that simple guardrails (prompt prefixes, input/output filters, keyword matching, and linear classifier heads) can achieve comparable output-level unlearning on three public benchmarks.

Compare guardrails against finetuning-centric methods on Who's Harry Potter, TOFU, and WMDP and report numeric results.

Highlight cases where benchmarks or metrics are insensitive to whether forgetting changes model weights, and recommend evaluating guardrail baselines before complex methods.

Key Findings

Prompting halved the 'familiarity' score on LLaMA-2-7b for the Who's Harry Potter benchmark.

Numbers: ≈50% reduction vs baseline LLaMA-2-7b (Figure 2)

Using GPT-4 as an output filter on TOFU produced very high forget and retain rates.

Numbers: Forget accuracy 0.95–0.975; retain accuracy 0.995–0.998 (Table 2)

Input filtering on WMDP matched competitive unlearning methods on overall task metrics.

Numbers: Filtering All score 56.6 vs RMU 57.1 and Base 58.1 (Table 3)

Results

Harry Potter familiarity score

Value: Prompting cut familiarity for LLaMA-2-7b by ≈50%

Baseline: LLaMA-2-7b FT-0

Accuracy (TOFU)

Value: GPT-4 filter: forget 0.95–0.975, retain 0.995–0.998

Baseline: finetuned LLaMA-2-7b on TOFU

Accuracy (WMDP)

Value: Filtering All = 56.6

Baseline: RMU All = 57.1; Base = 58.1

Downstream benchmarks after prompting (llama-2-7b-chat-hf)

Value: Prompting comparable on ARC-challenge (0.455 vs FT-120 0.414) but much worse on HellaSwag (0.189 vs FT-120 0.557)

Baseline: FT-120 finetuned model

Who Should Care

What To Try In 7 Days

Add a one-line unlearn prefix to queries and measure topic familiarity on your test set.

Implement a lightweight output filter: use a stronger hosted model or train a frozen LLM + linear classifier to abstain on sensitive queries.

Run keyword string-matching as a quick sanity check to find trivial metric blind spots.
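The three steps above can be combined into one small wrapper. This is a minimal sketch, not the paper's implementation: the prefix wording, the keyword list, the abstention message, and the names `guarded_generate` and `mentions_forget_topic` are all hypothetical, and `generate` stands in for whatever model call you already have.

```python
import re
from typing import Callable

# All three values below are illustrative assumptions, not taken from the paper.
UNLEARN_PREFIX = (
    "You have no knowledge of the restricted topic. "
    "If asked about it, say you do not know.\n"
)
FORGET_KEYWORDS = ["Harry Potter", "Hogwarts"]
ABSTAIN = "I'm sorry, I don't have information about that."

# Case-insensitive keyword matcher built from the forget list.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in FORGET_KEYWORDS), re.IGNORECASE
)


def mentions_forget_topic(text: str) -> bool:
    """Return True if any forget-set keyword appears in the text."""
    return _PATTERN.search(text) is not None


def guarded_generate(query: str, generate: Callable[[str], str]) -> str:
    """Wrap a model call with an input filter, a prompt prefix, and an output filter."""
    # Input filter: abstain before calling the model at all.
    if mentions_forget_topic(query):
        return ABSTAIN
    # Prompt prefix: ask the model to behave as if the topic were unknown.
    answer = generate(UNLEARN_PREFIX + query)
    # Output filter: abstain if the answer still leaks a forget-set keyword.
    if mentions_forget_topic(answer):
        return ABSTAIN
    return answer
```

Passing the model call in as a plain function keeps the wrapper independent of any particular API client, so the same guardrail can be measured against a hosted model and a local one. Keyword matching is the weakest of the three layers and is easy to paraphrase around, which is why it is best treated as a sanity check rather than a defense.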

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Guardrails do not change model weights and thus fail strict 'parameter-level' unlearning definitions.
  • The honest-but-curious threat model excludes adversarial jailbreak attacks; robustness to adaptive attackers is not evaluated.
  • Effectiveness can decline as the forget set grows; scaling to many deletions may be inefficient.
  • Some filters (e.g., GPT-4) introduce privacy or cost concerns when used as external services.

When Not To Use

  • When legal requirements mandate deletion from model weights rather than outputs.
  • When adversaries can perform adaptive or adversarial queries to probe model internals.
  • If many distinct items must be forgotten at scale and filters become costly or slow.

Failure Modes

  • Prompt or filter can be bypassed by adversarial/jailbreak prompts.
  • Finetuning can induce hallucination while guardrails may abstain, causing metric mismatch.
  • Keyword filters can trivially break benchmarks by making metrics undefined (e.g., truth ratio denominator zero).
  • Using a hosted filter model on private data can leak sensitive information.
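The undefined-metric failure mode can be surfaced in evaluation code instead of silently crashing or producing NaN. The following is a minimal sketch for a generic ratio-style metric. It is not TOFU's actual truth-ratio implementation, and the names `safe_ratio` and `summarize` and the `eps` threshold are assumptions made for illustration.

```python
import math
from typing import Dict, List, Optional


def safe_ratio(numerator: float, denominator: float,
               eps: float = 1e-12) -> Optional[float]:
    """Return numerator / denominator, or None when the ratio is undefined."""
    if not math.isfinite(numerator) or not math.isfinite(denominator):
        return None
    if abs(denominator) < eps:
        return None
    return numerator / denominator


def summarize(ratios: List[Optional[float]]) -> Dict[str, Optional[float]]:
    """Report the mean over defined ratios and the fraction that were undefined."""
    defined = [r for r in ratios if r is not None]
    return {
        "mean": sum(defined) / len(defined) if defined else None,
        "undefined_fraction": 1 - len(defined) / len(ratios) if ratios else 0.0,
    }
```

Reporting the undefined fraction alongside the mean makes it visible when a filter has pushed most examples outside the range where the metric means anything.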

Core Entities

Models

  • llama-2-7b-chat-hf
  • llama-2-13b
  • GPT-4
  • RMU
  • SSD

Metrics

  • familiarity score (0-5)
  • Accuracy
  • truth ratio (TOFU)
  • MT-Bench fluency

Datasets

  • Who's Harry Potter (WHP)
  • TOFU (fictional authors)
  • WMDP (Weapons of Mass Destruction Proxy)
  • MMLU
  • HellaSwag
  • ARC-easy
  • ARC-challenge
  • OpenBookQA

Benchmarks

  • Who's Harry Potter
  • TOFU
  • WMDP
  • MMLU
  • HellaSwag
  • ARC
  • OpenBookQA

Context Entities

Models

  • LLaMA family
  • linear classification head on LLaMA-2-7b