GRIN: find the weights that memorize unwanted data, add small noise to them, then fine-tune to forget while keeping utility.

August 8, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

Good empirical evidence across three benchmarks, with consistent gains from noise injection and targeted masks. There is no formal unlearning guarantee, and the evaluation covers a limited set of datasets and models.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Ameya Anjarlekar, Sandeep Pombra

Why It Matters For Business

GRIN gives a low-cost way to comply with deletion requests and reduce unsafe outputs without expensive full retraining. It keeps general capabilities intact while removing targeted memorized content, lowering legal and safety risk at modest compute cost.

Who Should Care

Summary TLDR

The paper introduces GRIN: a practical, targeted unlearning pipeline for LLMs. It ranks model weights by a gradient-ratio score that highlights parameters important for the forget set but less important for retained data, injects Gaussian noise into the top-ranked weights, then applies a small targeted fine-tune (PO / NPO / Grad-Diff). Across TOFU, WMDP and SafePKU benchmarks, GRIN generally improves forgetting (lower memorization metrics) while keeping downstream utility high. The method is modular, architecture-agnostic, and cheap compared to full retraining.

Problem Statement

How to remove specific sensitive or unsafe training data from a large language model without retraining from scratch, while avoiding erasing related useful knowledge and without expensive computation.

Main Contribution

A gradient-ratio influence score (GRI) that ranks each weight by |forget-gradient| / (|retain-gradient| + ε) to localize parameters tied to memorized data.

A noise-injection + targeted fine-tune pipeline (GRIN): add small Gaussian noise to top-ranked weights, then unlearn with PO, NPO, or Grad-Diff.
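The gradient-ratio score and the top-ranked selection described above can be sketched on a toy example. This is a minimal illustration, not the authors' implementation: the function names `gri_scores` and `top_fraction_mask` are hypothetical, the gradients are made-up numbers on a flat list of five weights, and treating p as a fraction of all weights is an assumption. In practice the gradients would come from backpropagation over the forget and retain sets.

```python
def gri_scores(forget_grads, retain_grads, eps=1e-8):
    """Gradient-ratio score per weight: |g_forget| / (|g_retain| + eps)."""
    return [abs(gf) / (abs(gr) + eps) for gf, gr in zip(forget_grads, retain_grads)]


def top_fraction_mask(scores, fraction):
    """Boolean mask selecting the `fraction` of weights with the highest scores."""
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:k])
    return [i in selected for i in range(len(scores))]


# Toy gradients for five weights (illustrative values, not from the paper).
forget_grads = [0.90, 0.10, 0.50, 0.02, 0.40]
retain_grads = [0.01, 0.80, 0.50, 0.30, 0.02]

scores = gri_scores(forget_grads, retain_grads)
mask = top_fraction_mask(scores, fraction=0.4)
print(mask)
```

Weights 0 and 4 are selected because their forget gradients are large relative to their retain gradients. Weight 2 has a sizeable forget gradient but an equally large retain gradient, so the ratio keeps it out of the mask.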

Key Findings

Targeted gradient-ratio selection plus noise (GRIN) yields very low forget-set keyword accuracy on TOFU while preserving retain-set utility.

Numbers: TOFU forget Keyword Accuracy (K-Acc) 0.015 (GRIN) vs 0.948 (Original); retain ROUGE 0.956 (GRIN).

Practical Use: For selective removal of specific text records, apply GRIN to a small top-ranked weight subset to strongly reduce direct memorization without major loss on kept knowledge.

Evidence Ref: Table 1

Noise injection improves forgetting when added before unlearning fine-tuning.

Numbers: TOFU Full FT ROUGE 0.084 -> FT-N 0.072 after noise (lower is better for forgetting).

Practical Use: Before fine-tuning to forget, add a small amount of Gaussian noise to targeted weights to increase gradient flow and boost forgetting effectiveness.

Evidence Ref: Table 1 (FT vs FT-N)
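The noise-injection step can be illustrated with a short sketch. It assumes zero-mean Gaussian noise applied only to masked weights, with a hypothetical helper name `inject_noise` and made-up weight values. The default variance of 0.001 follows the starting value this summary suggests; the paper may scale or schedule the noise differently.

```python
import math
import random


def inject_noise(weights, mask, variance=0.001, seed=0):
    """Add zero-mean Gaussian noise to masked weights; leave the rest unchanged."""
    rng = random.Random(seed)
    std = math.sqrt(variance)  # random.gauss takes a standard deviation
    return [w + rng.gauss(0.0, std) if m else w for w, m in zip(weights, mask)]


# Toy weights and a mask over them (illustrative values only).
weights = [0.25, -0.10, 0.05, 0.70, -0.33]
mask = [True, False, False, False, True]

noisy = inject_noise(weights, mask)
```

Only positions 0 and 4 are perturbed. The unmasked weights are returned bit-for-bit identical, which is what keeps the perturbation targeted.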

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 0.015 (GRIN) | 0.948 (Original pre-unlearning) | −0.933 | TOFU (10% authors forget) | Table 1 shows GRIN K-Acc 0.015 vs Original 0.948 | Table 1 |
| TOFU retain ROUGE-L Recall (higher better) | 0.956 (GRIN) | 0.982 (Original) | −0.026 | TOFU retain | Table 1 retain ROUGE for GRIN = 0.956 | Table 1 |
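For readers unfamiliar with the retain metric, ROUGE-L Recall is the length of the longest common subsequence (LCS) between a reference answer and the model output, divided by the reference length. The sketch below is a minimal version that assumes lowercased whitespace tokenization; the benchmark's own tokenization and stemming may differ, and the example strings are invented for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(reference, candidate):
    """LCS length divided by the number of reference tokens."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens:
        return 0.0
    return lcs_length(ref_tokens, cand_tokens) / len(ref_tokens)


# Invented example strings, not drawn from TOFU.
reference = "the author was born in paris"
candidate = "the author lives in paris"
recall = rouge_l_recall(reference, candidate)
```

Here four of the six reference tokens ("the", "author", "in", "paris") appear in order in the candidate, so the recall is 4/6. On a retain set a value near 1 is desirable; on a forget set a low value indicates the reference text is no longer reproduced.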

What To Try In 7 Days

Run the GRI score on a small forget set to rank influential weights (compute retain & forget gradients).

Inject small Gaussian noise (start with variance 0.001) into top p% weights (try p in {0.2,0.4,0.6,0.8}).

Fine-tune only masked weights for 5–10 epochs using PO or NPO and evaluate with keyword/ROUGE and a utility benchmark (e.g., MMLU).
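The masked fine-tuning step can be sketched with a gradient-difference update. This assumes the common Grad-Diff formulation (ascend the forget loss, descend the retain loss) with equal weighting of the two terms; the paper's exact loss weighting, optimizer, and learning rate are not given here, so `masked_grad_diff_step`, `lr`, and the toy values are hypothetical. In a real run the gradients would be recomputed by backpropagation at every step.

```python
def masked_grad_diff_step(weights, forget_grads, retain_grads, mask, lr=1e-2):
    """One update on masked weights only.

    Minimizes (retain loss - forget loss), so the step direction is
    (g_retain - g_forget). Unmasked weights are left untouched.
    """
    updated = []
    for w, gf, gr, m in zip(weights, forget_grads, retain_grads, mask):
        step = lr * (gr - gf) if m else 0.0
        updated.append(w - step)
    return updated


# Toy values (illustrative only).
weights = [0.25, -0.10, 0.05]
forget_grads = [0.90, 0.10, 0.50]
retain_grads = [0.01, 0.80, 0.50]
mask = [True, False, False]

updated = masked_grad_diff_step(weights, forget_grads, retain_grads, mask)
```

Sweeping p then amounts to rebuilding the mask for each candidate value, repeating the noise and fine-tune steps, and comparing forget and utility metrics across runs.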

Optimization Features

Infra Optimization
experiments run on 8x A100-80GB; per-epoch ≈3 minutes for reported models
Model Optimization
targeted weight selection (masking top p% weights); LoRA
System Optimization
mask generation is small overhead vs unlearning epochs
Training Optimization
fine-tune only masked parameters (saves compute vs full FT); grid-search small learning rates and noise variance

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

HuggingFace datasets (TOFU, WMDP, SafePKU, MMLU, C4 indicated in paper)

Risks & Boundaries

Limitations

No formal or certified unlearning guarantees; the evidence is empirical only.

Performance and optimal mask fraction vary by task; needs per-task tuning of p and noise variance.

When Not To Use

When legal or regulatory requirements demand provable/certified erasure (exact unlearning).

When you can afford or prefer to fully retrain the model from scratch.

Failure Modes

Incomplete forgetting: some paraphrased or semantic traces may persist despite surface-token metrics dropping.

Collateral forgetting: over-selecting weights can degrade unrelated knowledge.

Core Entities

Models

tofu-ft-llama2-7b, Zephyr-7B-Beta, LLaMA-2-7B-chat (tofu-ft base)

Metrics

Truth Ratio, ROUGE-L Recall, Accuracy, Keyword Confidence, Toxic Rate, Mean Toxic Score, Perplexity

Datasets

TOFU, WMDP-Cyber, SafePKU, MMLU, C4, RealToxicityPrompts, WikiText

Benchmarks

TOFU, WMDP, SafePKU, MMLU