GRIN: find the weights that memorize unwanted data, add small noise to them, then fine-tune to forget while keeping utility.

Overview

Decision Snapshot: Needs Validation

Good empirical evidence across three benchmarks, with consistent gains from noise injection and targeted masks. There is no formal unlearning guarantee, and the evaluation covers a limited set of datasets and models.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Ameya Anjarlekar, Sandeep Pombra

Links

Abstract / PDF / Data

Why It Matters For Business

GRIN gives a low-cost way to comply with deletion requests and reduce unsafe outputs without expensive full retraining. It keeps general capabilities intact while removing targeted memorized content, lowering legal and safety risk at modest compute cost.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead, Founder

Summary TLDR

The paper introduces GRIN: a practical, targeted unlearning pipeline for LLMs. It ranks model weights by a gradient-ratio score that highlights parameters important for the forget set but less important for retained data, injects Gaussian noise into the top-ranked weights, then applies a small targeted fine-tune (PO / NPO / Grad-Diff). Across TOFU, WMDP and SafePKU benchmarks, GRIN generally improves forgetting (lower memorization metrics) while keeping downstream utility high. The method is modular, architecture-agnostic, and cheap compared to full retraining.

Problem Statement

How to remove specific sensitive or unsafe training data from a large language model without retraining from scratch, while avoiding erasing related useful knowledge and without expensive computation.

Main Contribution

A gradient-ratio influence score (GRI) that ranks each weight by |forget-gradient| / (|retain-gradient| + ε) to localize parameters tied to memorized data.

A noise-injection + targeted fine-tune pipeline (GRIN): add small Gaussian noise to top-ranked weights, then unlearn with PO, NPO, or Grad-Diff.
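The two contributions above can be illustrated with a small sketch. It is not the authors' code: it assumes per-weight gradient magnitudes are already available as flat lists, and the function names (`gri_scores`, `top_fraction_mask`, `inject_noise`), the `eps` default, and the toy numbers are all hypothetical. It only shows the ratio score, the top-ranked selection, and the Gaussian perturbation of the selected weights.

```python
import random

def gri_scores(forget_grads, retain_grads, eps=1e-8):
    """Gradient-ratio score per weight: |forget grad| / (|retain grad| + eps)."""
    return [abs(gf) / (abs(gr) + eps) for gf, gr in zip(forget_grads, retain_grads)]

def top_fraction_mask(scores, fraction):
    """Boolean mask selecting the highest-scoring `fraction` of weights."""
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    return [i in chosen for i in range(len(scores))]

def inject_noise(weights, mask, variance=0.001, seed=0):
    """Add zero-mean Gaussian noise to the masked weights only."""
    rng = random.Random(seed)
    std = variance ** 0.5  # random.gauss takes a standard deviation, not a variance
    return [w + rng.gauss(0.0, std) if m else w for w, m in zip(weights, mask)]

# Toy values for illustration only (not taken from the paper).
weights = [0.5, -1.2, 0.3, 0.8]
forget_grads = [0.9, 0.1, 0.4, 0.05]
retain_grads = [0.1, 0.5, 0.4, 0.5]

scores = gri_scores(forget_grads, retain_grads)
mask = top_fraction_mask(scores, fraction=0.25)
noisy = inject_noise(weights, mask)
```

In this toy case only the first weight has a large forget gradient relative to its retain gradient, so it is the only one selected and perturbed. The unlearning fine-tune (PO, NPO, or Grad-Diff) would then run on the perturbed model.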

Key Findings

Targeted gradient-ratio selection plus noise (GRIN) yields very low forget-set keyword accuracy on TOFU while preserving retain-set utility.

Numbers: TOFU: forget Keyword Accuracy (K-Acc) 0.015 (GRIN) vs 0.948 (Original); retain ROUGE 0.956 (GRIN).

Practical Use: For selective removal of specific text records, apply GRIN to a small top-ranked weight subset to strongly reduce direct memorization without major loss on kept knowledge.

Evidence Ref: Table 1

Noise injection improves forgetting when added before unlearning fine-tuning.

Numbers: TOFU: Full FT ROUGE 0.084 -> FT-N 0.072 after noise (lower is better for forgetting).

Practical Use: Before fine-tuning to forget, add a small amount of Gaussian noise to targeted weights to increase gradient flow and boost forgetting effectiveness.

Evidence Ref: Table 1 (FT vs FT-N)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
TOFU forget Keyword Accuracy, K-Acc (lower better)	0.015 (GRIN)	0.948 (Original pre-unlearning)	−0.933	TOFU (10% authors forget)	Table 1 shows GRIN K-Acc 0.015 vs Original 0.948	Table 1
TOFU retain ROUGE-L Recall (higher better)	0.956 (GRIN)	0.982 (Original)	−0.026	TOFU retain	Table 1 retain ROUGE for GRIN = 0.956	Table 1

What To Try In 7 Days

Run the GRI score on a small forget set to rank influential weights (compute retain & forget gradients).

Inject small Gaussian noise (start with variance 0.001) into top p% weights (try p in {0.2,0.4,0.6,0.8}).

Fine-tune only masked weights for 5–10 epochs using PO or NPO and evaluate with keyword/ROUGE and a utility benchmark (e.g., MMLU).
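The three steps above amount to a small sweep over the mask fraction and the noise variance. The sketch below shows one way to organize it, under stated assumptions: `pick_config` and `toy_evaluate` are hypothetical names, `toy_evaluate` is a made-up stand-in for a real run (score, inject noise, fine-tune, evaluate) and its outputs are not results, and the selection rule (lowest forget metric among configurations that stay above a utility floor) is an illustrative choice rather than the paper's procedure. The second variance value, 0.01, is also only an example.

```python
from itertools import product

def pick_config(evaluate, fractions, variances, min_utility):
    """Return (forget_score, utility, p, variance) with the lowest forget
    score among configurations whose utility is at least `min_utility`."""
    best = None
    for p, var in product(fractions, variances):
        forget_score, utility = evaluate(p, var)
        if utility < min_utility:
            continue  # too much collateral damage, skip this configuration
        if best is None or forget_score < best[0]:
            best = (forget_score, utility, p, var)
    return best

def toy_evaluate(p, var):
    """Stand-in for a real unlearning run; the formulas are invented."""
    forget_score = 1.0 - p                 # pretend larger masks forget more
    utility = 1.0 - 0.5 * p - 10.0 * var   # pretend they also cost utility
    return forget_score, utility

best = pick_config(
    toy_evaluate,
    fractions=[0.2, 0.4, 0.6, 0.8],  # the p values suggested above
    variances=[0.001, 0.01],         # 0.001 is the suggested starting point
    min_utility=0.75,
)
```

In practice `evaluate` would wrap the real pipeline and return, for example, forget-set keyword accuracy or ROUGE together with an MMLU score.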

Optimization Features

Infra Optimization

experiments run on 8x A100-80GB; per-epoch ≈3 minutes for reported models

Model Optimization

targeted weight selection (masking top p% weights); LoRA

System Optimization

mask generation is small overhead vs unlearning epochs

Training Optimization

fine-tune only masked parameters (saves compute vs full FT); grid-search small learning rates and noise variance
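Fine-tuning only the masked parameters can be pictured as a gradient step that skips every unselected weight. The sketch below assumes Grad-Diff means a gradient-difference objective (descend on the retain loss, ascend on the forget loss), which this summary does not spell out. The function name `masked_grad_diff_step`, the learning rate, and the toy values are hypothetical.

```python
def masked_grad_diff_step(weights, forget_grads, retain_grads, mask, lr=1e-3):
    """One plain gradient step applied to masked weights only.

    Assumed objective: retain_loss - forget_loss, so the update direction
    is (retain_grad - forget_grad). Unmasked weights are left untouched.
    """
    updated = []
    for w, gf, gr, m in zip(weights, forget_grads, retain_grads, mask):
        if m:
            w = w - lr * (gr - gf)
        updated.append(w)
    return updated

# Toy values for illustration only.
weights = [0.5, -1.2, 0.3]
new_weights = masked_grad_diff_step(
    weights,
    forget_grads=[0.9, 0.1, 0.4],
    retain_grads=[0.1, 0.5, 0.4],
    mask=[True, False, False],
    lr=0.1,
)
```

Because the unmasked weights receive no update, their optimizer state and gradients do not need to be tracked, which is where the compute saving over full fine-tuning would come from in a real framework.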

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

HuggingFace datasets (TOFU, WMDP, SafePKU, MMLU, C4 indicated in paper)

Risks & Boundaries

Limitations

No formal or certified unlearning guarantees; results are empirical only.

Performance and optimal mask fraction vary by task; needs per-task tuning of p and noise variance.

When Not To Use

When legal or regulatory requirements demand provable/certified erasure (exact unlearning).

When you can afford or prefer to fully retrain the model from scratch.

Failure Modes

Incomplete forgetting: some paraphrased or semantic traces may persist despite surface-token metrics dropping.

Collateral forgetting: over-selecting weights can degrade unrelated knowledge.

Core Entities

Models

tofu-ft-llama2-7b, Zephyr-7B-Beta, LLaMA-2-7B-chat (tofu-ft base)

Metrics

Truth Ratio, ROUGE-L Recall, Accuracy, Keyword Confidence, Toxic Rate, Mean Toxic Score, Perplexity

Datasets

TOFU, WMDP-Cyber, SafePKU, MMLU, C4, RealToxicityPrompts, WikiText

Benchmarks

TOFU, WMDP, SafePKU, MMLU
