Fast, low-cost debiasing by estimating harmful training samples and 'unlearning' them

October 19, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method is practical when you can access model gradients and attribute labels; it cuts retraining cost and works by updating only small parts of a model, but it needs good counterfactual pairs and white-box access.

Citations: 6

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 70%

Authors

Ruizhe Chen, Jianfei Yang, Huimin Xiong, Jianhong Bai, Tianxiang Hu, Jin Hao, Yang Feng, Joey Tianyi Zhou, Jian Wu, Zuozhu Liu

Links

Abstract / PDF / Data

Why It Matters For Business

FMD lets teams reduce model bias quickly and cheaply by using only a small external counterfactual set and updating a few classifier parameters, avoiding costly full retraining or large-scale relabeling.

Who Should Care

Summary TLDR

The paper introduces FMD, a three-step pipeline that (1) detects whether a trained model is biased using a small set of counterfactual pairs, (2) quantifies which training samples cause the bias using influence functions, and (3) removes the bias with an efficient machine-unlearning update that only tweaks a few classifier layers. Across Colored MNIST, CelebA, Adult Income and two language models, FMD matches or beats debiasing baselines on fairness while using far fewer counterfactual samples, much less time, and only small parameter updates. The method needs white-box access to gradients and Hessian approximations, plus labeled attributes for constructing counterfactuals.
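Step (1) can be illustrated with a minimal sketch. It assumes a logistic model and assumes the bias measure is the mean absolute gap in predicted probability between each factual sample and its counterfactual; the paper's exact metric may differ. The function name `counterfactual_bias`, the toy data, and both parameter vectors are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def counterfactual_bias(theta, x_factual, x_counterfactual):
    """Mean absolute prediction gap across pairs (an assumed bias measure)."""
    p_f = sigmoid(x_factual @ theta)
    p_c = sigmoid(x_counterfactual @ theta)
    return float(np.mean(np.abs(p_f - p_c)))

# Hypothetical toy data: column 0 plays the role of the sensitive attribute.
rng = np.random.default_rng(0)
x_factual = rng.normal(size=(500, 4))
x_factual[:, 0] = rng.integers(0, 2, size=500)

# Counterfactual pair: flip the attribute, keep every other feature.
x_counterfactual = x_factual.copy()
x_counterfactual[:, 0] = 1 - x_counterfactual[:, 0]

theta_biased = np.array([2.0, 0.5, -0.3, 0.1])  # leans on the attribute
theta_fair = np.array([0.0, 0.5, -0.3, 0.1])    # ignores the attribute

bias_biased = counterfactual_bias(theta_biased, x_factual, x_counterfactual)
bias_fair = counterfactual_bias(theta_fair, x_factual, x_counterfactual)
print(f"biased model: {bias_biased:.4f}, fair model: {bias_fair:.4f}")
```

A model that ignores the flipped attribute scores zero on this measure, which is why a small set of pairs can be enough to flag a biased model.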

Problem Statement

Trained models often exploit spurious correlations (e.g., gender ↔ hair color). Existing fixes need expensive bias labeling or full retraining. The paper asks: can we quickly identify which training samples create bias and remove their effect from an already-trained model, using only a small counterfactual dataset and limited parameter updates?

Main Contribution

A practical pipeline (FMD) that identifies bias with counterfactual sample pairs, estimates each training sample's effect on bias via influence functions, and removes bias via a machine-unlearning Newton update.

An efficient, training-data-light unlearning variant that uses a small external counterfactual set and updates only top classifier layers.
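The two ingredients named above can be written in the standard influence-function form. This is a sketch in generic notation, not the paper's own: $L$ is the per-sample training loss, $\hat\theta$ the trained parameters, $B$ an assumed differentiable bias measure evaluated on the counterfactual pairs, and $R$ the set of samples to unlearn.

```latex
\begin{aligned}
H_{\hat\theta} &= \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat\theta)
  && \text{(Hessian of the mean training loss)} \\
\mathcal{I}_{\text{bias}}(z_k) &= -\,\nabla_\theta B(\hat\theta)^{\top}
  H_{\hat\theta}^{-1}\, \nabla_\theta L(z_k, \hat\theta)
  && \text{(effect of upweighting } z_k \text{ on bias)} \\
\theta_{\text{new}} &= \hat\theta + \frac{1}{n}\sum_{k \in R}
  H_{\hat\theta}^{-1}\, \nabla_\theta L(z_k, \hat\theta)
  && \text{(one-step Newton removal of } R\text{)}
\end{aligned}
```

Removing a sample is the same as upweighting it by $-1/n$, so removing $z_k$ changes the bias by approximately $-\tfrac{1}{n}\mathcal{I}_{\text{bias}}(z_k)$. Samples with a large positive $\mathcal{I}_{\text{bias}}$ are therefore the ones whose removal lowers bias the most.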

Key Findings

On Colored MNIST (bias ratio 0.99) FMD attains nearly the same accuracy as strong baselines while lowering measured counterfactual bias.

Numbers: Acc 80.04% vs 80.41%; Bias 0.2042 vs 0.2302; Time 48s vs 1658s; Samples 5k vs 50k

Practical Use: You can reach comparable accuracy with slightly lower bias while cutting runtime by roughly 35x (48s vs 1658s) and required counterfactual samples by 10x (5k vs 50k) using FMD.

Evidence Ref: Table 1 (Colored MNIST, bias 0.99)

On Adult Income (gender), FMD reduces bias to near-zero using very few samples and low time.

Numbers: Acc 81.89%; Bias 0.0005; Time 2.49s; #Samples 500

Practical Use: For structured/tabular tasks, build ~500 counterfactual pairs and unlearn the top harmful samples to sharply reduce measured bias in seconds.

Evidence Ref: Table 2 (Adult, Gender)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 80.04% | Rebias 80.41% | (not given) | Colored MNIST (bias ratio 0.99) | Ours matches retraining baselines at far lower time and sample cost | Table 1
Colored MNIST (bias 0.99) counterfactual bias | 0.2042 | Rebias 0.2302 | -0.026 | Colored MNIST (bias ratio 0.99) | Lower measured counterfactual bias with fewer samples | Table 1
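The deltas and ratios implied by the Colored MNIST figures quoted in this summary can be checked with a few lines of arithmetic. The dictionary names below are illustrative only; the numbers are copied from the Key Findings and Results entries.

```python
# Figures quoted in this summary for Colored MNIST (bias ratio 0.99).
fmd = {"acc": 80.04, "bias": 0.2042, "time_s": 48, "samples": 5_000}
rebias = {"acc": 80.41, "bias": 0.2302, "time_s": 1658, "samples": 50_000}

acc_delta = round(fmd["acc"] - rebias["acc"], 2)      # percentage points
bias_delta = round(fmd["bias"] - rebias["bias"], 4)   # absolute change
speedup = rebias["time_s"] / fmd["time_s"]
sample_ratio = rebias["samples"] / fmd["samples"]

print(f"accuracy delta: {acc_delta} pp")
print(f"bias delta: {bias_delta}")
print(f"speedup: {speedup:.1f}x, sample reduction: {sample_ratio:.0f}x")
```

The accuracy gap is well under one percentage point, while the time and sample savings are at least an order of magnitude.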

What To Try In 7 Days

Pick one deployed classifier and one suspected bias attribute.

Assemble ~500–5,000 factual/counterfactual pairs (flip attribute, keep other features).

Compute influence scores using influence functions with Hessian-vector products (HVPs) and rank the most harmful samples (any open-source autograd framework works).
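The steps above can be sketched end to end for a small logistic-regression model. This is a minimal illustration under stated assumptions, not the paper's implementation: the bias measure is assumed to be the mean squared prediction gap over pairs, the inverse-Hessian product is solved with conjugate gradient on top of HVPs (one common choice among several), and all function names and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_loss(theta, x, y):
    """Per-sample gradients of the logistic loss, shape (n, d)."""
    return (sigmoid(x @ theta) - y)[:, None] * x

def hvp(theta, x, v, damping=1e-2):
    """Hessian-vector product of the damped mean loss, without forming H."""
    p = sigmoid(x @ theta)
    return x.T @ (p * (1 - p) * (x @ v)) / len(x) + damping * v

def conjugate_gradient(matvec, b, iters=100, tol=1e-16):
    """Solve H s = b using only Hessian-vector products."""
    s, r = np.zeros_like(b), b.copy()
    d, rs = r.copy(), r @ r
    for _ in range(iters):
        if rs < tol:
            break
        hd = matvec(d)
        alpha = rs / (d @ hd)
        s += alpha * d
        r -= alpha * hd
        rs_new = r @ r
        d = r + (rs_new / rs) * d
        rs = rs_new
    return s

def grad_bias(theta, x_f, x_c):
    """Gradient of the mean squared prediction gap over counterfactual pairs."""
    p_f, p_c = sigmoid(x_f @ theta), sigmoid(x_c @ theta)
    d_f = (p_f * (1 - p_f))[:, None] * x_f
    d_c = (p_c * (1 - p_c))[:, None] * x_c
    return 2 * np.mean((p_f - p_c)[:, None] * (d_f - d_c), axis=0)

# Hypothetical toy data: column 0 is the sensitive attribute and leaks into y.
rng = np.random.default_rng(0)
n, dim = 400, 4
x = rng.normal(size=(n, dim))
x[:, 0] = rng.integers(0, 2, size=n)
y = (rng.random(n) < sigmoid(2.0 * x[:, 0] + x[:, 1])).astype(float)

# Train with plain gradient descent on the damped (L2-regularized) loss.
theta = np.zeros(dim)
for _ in range(500):
    theta -= 0.5 * (grad_loss(theta, x, y).mean(axis=0) + 1e-2 * theta)

# Counterfactual pairs: flip the attribute, keep the other features.
x_f = x[:100]
x_c = x_f.copy()
x_c[:, 0] = 1 - x_c[:, 0]

# s = H^{-1} grad B, then the estimated bias change from removing each sample.
g_bias = grad_bias(theta, x_f, x_c)
s = conjugate_gradient(lambda v: hvp(theta, x, v), g_bias)
removal_effect = grad_loss(theta, x, y) @ s / n

# Most negative values: removing these samples is estimated to cut bias most.
harmful = np.argsort(removal_effect)[:20]
print(harmful)
```

Solving once for `s` and then taking dot products with per-sample gradients avoids one inverse-Hessian solve per training sample, which is what keeps the ranking step cheap.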

Agent Features

Tool Use: influence functions, HVP
Frameworks: machine unlearning
Architectures: last-layer finetune

Optimization Features

Infra Optimization: orders-of-magnitude lower GPU time vs retraining in experiments
System Optimization: precompute inverse Hessian once during unlearning
Training Optimization: update only top MLP/classifier layers, avoid full retraining
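The "precompute once, update only the top layer" idea can be sketched as a generic one-step Newton removal on a logistic last layer. This is an illustration under assumptions, not the paper's exact variant (which works from an external counterfactual set): random features stand in for a frozen backbone, the objective is a summed loss with L2 penalty `lam`, and `fit`, `unlearn`, and the removed indices are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(x, y, lam=1.0, steps=50):
    """Newton's method on the summed logistic loss plus (lam/2)*||theta||^2."""
    theta = np.zeros(x.shape[1])
    for _ in range(steps):
        p = sigmoid(x @ theta)
        g = x.T @ (p - y) + lam * theta
        h = x.T @ (x * (p * (1 - p))[:, None]) + lam * np.eye(x.shape[1])
        theta -= np.linalg.solve(h, g)
    return theta

def unlearn(theta, h_inv, x_rm, y_rm):
    """One Newton step that approximately removes the given samples."""
    g_rm = x_rm.T @ (sigmoid(x_rm @ theta) - y_rm)
    return theta + h_inv @ g_rm

# Stand-in for frozen backbone features; only the last layer is touched.
rng = np.random.default_rng(0)
n, dim = 400, 8
feats = rng.normal(size=(n, dim))
w_true = rng.normal(size=dim) / 2
y = (rng.random(n) < sigmoid(feats @ w_true)).astype(float)

theta = fit(feats, y)

# Inverse Hessian of the full objective: computed once, reused per removal.
p = sigmoid(feats @ theta)
hessian = feats.T @ (feats * (p * (1 - p))[:, None]) + 1.0 * np.eye(dim)
h_inv = np.linalg.inv(hessian)

remove = np.arange(20)  # hypothetical indices flagged as harmful
theta_unlearned = unlearn(theta, h_inv, feats[remove], y[remove])

# Reference: retrain the last layer from scratch without those samples.
keep = np.setdiff1d(np.arange(n), remove)
theta_retrained = fit(feats[keep], y[keep])

err_before = np.linalg.norm(theta - theta_retrained)
err_after = np.linalg.norm(theta_unlearned - theta_retrained)
print(f"distance to retrained weights: {err_before:.4f} -> {err_after:.4f}")
```

Because the update touches only a `dim`-sized parameter vector, inverting the Hessian explicitly is affordable here; for larger layers an HVP-based solver would replace `np.linalg.inv`.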

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Colored MNIST (constructed), CelebA, Adult Income (UCI), StereoSet, CrowS-Pairs

Risks & Boundaries

Limitations

Requires white-box access to gradients/Hessian approximations (not for black-box models).

Needs labeled sensitive attributes and counterfactual pairs; quality of counterfactuals affects results.

When Not To Use

The deployed model is a black box with no gradient access.

No reliable way to create factual/counterfactual pairs for the target attribute.

Failure Modes

Poor or non-counterfactual external pairs can under- or over-correct bias.

Removing samples that were actually helpful can drop task accuracy.

Core Entities

Models

ResNet-18, BERT, GPT-2, Logistic Regression

Metrics

Counterfactual bias, Demographic parity, Equal opportunity, Accuracy, Stereotype Score (SS), Language Modeling Score (LMS)

Datasets

Colored MNIST, CelebA, Adult Income (UCI), StereoSet, CrowS-Pairs