GRIN: find the weights that memorize unwanted data, add small noise to them, then fine-tune to forget while keeping utility.

August 8, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Ameya Anjarlekar, Sandeep Pombra

Why It Matters For Business

GRIN gives a low-cost way to comply with deletion requests and reduce unsafe outputs without expensive full retraining. It keeps general capabilities intact while removing targeted memorized content, lowering legal and safety risk at modest compute cost.

Summary TLDR

The paper introduces GRIN: a practical, targeted unlearning pipeline for LLMs. It ranks model weights by a gradient-ratio score that highlights parameters important for the forget set but less important for retained data, injects Gaussian noise into the top-ranked weights, then applies a small targeted fine-tune (PO / NPO / Grad-Diff). Across TOFU, WMDP and SafePKU benchmarks, GRIN generally improves forgetting (lower memorization metrics) while keeping downstream utility high. The method is modular, architecture-agnostic, and cheap compared to full retraining.

Problem Statement

How can specific sensitive or unsafe training data be removed from a large language model without retraining from scratch, without erasing related useful knowledge, and without expensive computation?

Main Contribution

A gradient-ratio influence score (GRI) that ranks each weight by |forget-gradient| / (|retain-gradient| + ε) to localize parameters tied to memorized data.

A noise-injection + targeted fine-tune pipeline (GRIN): add small Gaussian noise to top-ranked weights, then unlearn with PO, NPO, or Grad-Diff.

New LLM-oriented evaluation metrics (Truth Ratio, Keyword Accuracy/Confidence, ROUGE-L Recall, toxic-rate measures) and cross-benchmark tests on TOFU, WMDP, and SafePKU.

An empirical claim that targeted masking plus noise gives a better forgetting-vs-utility trade-off than full fine-tuning or prior weight-selection methods.
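The gradient-ratio score and top-p selection described in the contributions above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: gradients are plain Python lists standing in for flattened per-weight gradients, `p` is treated as a fraction of weights in (0, 1], and the function names `gri_scores` and `top_fraction_mask` are hypothetical.

```python
def gri_scores(forget_grads, retain_grads, eps=1e-8):
    """Per-weight score |g_forget| / (|g_retain| + eps)."""
    if len(forget_grads) != len(retain_grads):
        raise ValueError("gradient lists must have the same length")
    return [abs(gf) / (abs(gr) + eps) for gf, gr in zip(forget_grads, retain_grads)]


def top_fraction_mask(scores, p):
    """Return a 0/1 mask selecting the top p fraction of weights by score.

    Assumption: p is a fraction in (0, 1], not a percentage.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    k = max(1, round(p * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:k])
    return [1 if i in selected else 0 for i in range(len(scores))]


# Toy example with made-up gradient values for five weights.
forget_grads = [0.9, 0.1, 0.5, 0.02, 0.7]
retain_grads = [0.1, 0.8, 0.5, 0.01, 0.05]

scores = gri_scores(forget_grads, retain_grads)
mask = top_fraction_mask(scores, p=0.4)
print(mask)  # weights 0 and 4 have the highest forget/retain ratio
```

Weights with a large forget gradient and a small retain gradient rank highest, which is the localization property the score is meant to capture. In a real model the gradients would come from backpropagating the forget-set and retain-set losses.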

Key Findings

Targeted gradient-ratio selection plus noise (GRIN) yields very low forget-set keyword accuracy on TOFU while preserving retain-set utility.

Numbers: TOFU: forget Keyword Accuracy (K-Acc) 0.015 (GRIN) vs 0.948 (Original); retain ROUGE 0.956 (GRIN).

Noise injection improves forgetting when added before unlearning fine-tuning.

Numbers: TOFU: Full FT ROUGE 0.084 -> FT-N 0.072 after noise (lower is better for forgetting).

GRIN reduces harmful-domain accuracy to near-random while keeping general capability on a broad benchmark.

Numbers: WMDP-Cyber: forget Acc 0.26 (GRIN) vs 0.45 (no unlearning); retain MMLU Acc 0.577 (GRIN).

GRIN reduces toxic outputs to near-zero on safety benchmarks while maintaining downstream task accuracy.

Numbers: SafePKU (Zephyr): forget Toxic Rate 0.00 (GRIN); BoolQ Acc increases from 0.66 (Full FT) to 0.69 (GRIN).

Mask generation time is small compared to total unlearning cost; overall runtime is similar to prior targeted methods.

Numbers: mask generation 132 s (GRIN) vs 124 s (WAGLE); unlearning time (10 epochs) 2000 s (GRIN) vs 2122 s (WAGLE).
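As a quick check on the runtime finding, the reported timings can be turned into overhead ratios. The totals below assume mask generation and unlearning run one after the other, which the summary does not state explicitly.

```latex
\begin{align*}
\text{GRIN mask overhead}  &= \frac{132\,\text{s}}{2000\,\text{s}} = 0.066 \approx 6.6\% \\
\text{WAGLE mask overhead} &= \frac{124\,\text{s}}{2122\,\text{s}} \approx 0.058 \approx 5.8\% \\
\text{GRIN total}  &= 132\,\text{s} + 2000\,\text{s} = 2132\,\text{s} \\
\text{WAGLE total} &= 124\,\text{s} + 2122\,\text{s} = 2246\,\text{s}
\end{align*}
```

On these numbers, mask generation is well under a tenth of the unlearning time for both methods, and the two end-to-end totals differ by roughly 5%.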

Results

TOFU forget Keyword Accuracy (lower better)

Value: 0.015 (GRIN)

Baseline: 0.948 (Original pre-unlearning)

TOFU retain ROUGE-L Recall (higher better)

Value: 0.956 (GRIN)

Baseline: 0.982 (Original)

WMDP-Cyber forget Accuracy (lower better)

Value: 0.26 (GRIN)

Baseline: 0.45 (Without unlearning)

MMLU retain Accuracy (higher better)

Value: 0.577 (GRIN)

Baseline: 0.585 (Without unlearning)

SafePKU forget Toxic Rate (lower better)

Value: 0.00 (GRIN, Zephyr-7B-Beta)

Baseline: 0.033 (Original)

Mask generation time

Value: 132 s (GRIN)

Baseline: 124 s (WAGLE)

Who Should Care

What To Try In 7 Days

Compute the GRI score on a small forget set and a retain set to rank influential weights (this requires both the forget and the retain gradients).

Inject small Gaussian noise (start with variance 0.001) into the top p% of weights (try p in {0.2, 0.4, 0.6, 0.8}).

Fine-tune only the masked weights for 5–10 epochs using PO or NPO, then evaluate with keyword and ROUGE metrics plus a utility benchmark (e.g., MMLU).
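The noise-injection and masked fine-tuning steps above can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's code: weights are a flat Python list, the mask is a 0/1 list such as one produced by a GRI ranking, the gradients stand in for whichever unlearning loss is chosen (PO, NPO, or Grad-Diff), and the helper names `inject_noise` and `masked_sgd_step` are hypothetical. Plain SGD is used only for simplicity.

```python
import math
import random


def inject_noise(weights, mask, variance=0.001, seed=0):
    """Add zero-mean Gaussian noise to masked weights only."""
    rng = random.Random(seed)
    std = math.sqrt(variance)  # variance 0.001 -> std of about 0.0316
    return [w + rng.gauss(0.0, std) if m else w for w, m in zip(weights, mask)]


def masked_sgd_step(weights, grads, mask, lr=1e-5):
    """One plain SGD step that updates masked weights and freezes the rest."""
    return [w - lr * g * m for w, g, m in zip(weights, grads, mask)]


# Toy example: five weights, with weights 0 and 4 selected by the mask.
weights = [0.5, -0.2, 0.1, 0.3, -0.4]
mask = [1, 0, 0, 0, 1]

noisy = inject_noise(weights, mask, variance=0.001)

# Made-up gradients of the unlearning loss with respect to each weight.
grads = [1.0, 1.0, 1.0, 1.0, 1.0]
updated = masked_sgd_step(noisy, grads, mask, lr=0.01)

print(noisy)
print(updated)
```

Unmasked weights pass through both steps unchanged, which is what keeps the retained knowledge intact and the per-epoch cost below that of full fine-tuning. Note that the starting value 0.001 is a variance, so the standard deviation passed to the sampler is its square root.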

Optimization Features

Infra Optimization

  • experiments ran on 8x A100-80GB GPUs; each epoch takes ≈3 minutes for the reported models

Model Optimization

  • targeted weight selection (masking top p% weights)
  • LoRA

System Optimization

  • mask generation adds little overhead relative to the unlearning epochs

Training Optimization

  • finetune only masked parameters (saves compute vs full FT)
  • grid-search small learning rates and noise variance

Reproducibility

Data Urls

  • Hugging Face datasets (TOFU, WMDP, SafePKU, MMLU, and C4, as indicated in the paper)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • No formal or certified unlearning guarantees; the evidence is empirical only.
  • Performance and optimal mask fraction vary by task; needs per-task tuning of p and noise variance.
  • Evaluations are on a handful of benchmarks and two base models; generalization to very large or quantized models not shown.
  • Noise injection may have unintended effects on rare behaviors not captured by current metrics.

When Not To Use

  • When legal or regulatory requirements demand provable/certified erasure (exact unlearning).
  • When you can afford or prefer to fully retrain the model from scratch.
  • On heavily quantized or compressed models if not tested (paper notes quantization can break unlearning in other work).

Failure Modes

  • Incomplete forgetting: some paraphrased or semantic traces may persist despite surface-token metrics dropping.
  • Collateral forgetting: over-selecting weights can degrade unrelated knowledge.
  • Optimization instability: gradient-subtraction methods can be unstable without careful tuning.

Core Entities

Models

  • tofu-ft-llama2-7b
  • Zephyr-7B-Beta
  • LLaMA-2-7B-chat (tofu-ft base)

Metrics

  • Truth Ratio
  • ROUGE-L Recall
  • Accuracy
  • Keyword Confidence
  • Toxic Rate
  • Mean Toxic Score
  • Perplexity

Datasets

  • TOFU
  • WMDP-Cyber
  • SafePKU
  • MMLU
  • C4
  • RealToxicityPrompts
  • WikiText

Benchmarks

  • TOFU
  • WMDP
  • SafePKU
  • MMLU