Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.8
Citation Count
6
Why It Matters For Business
FMD lets teams reduce model bias quickly and cheaply: it needs only a small external counterfactual set and updates only a few classifier parameters, avoiding costly full retraining or large-scale relabeling.
Summary TLDR
The paper introduces FMD, a three-step pipeline that (1) detects whether a trained model is biased using a small set of counterfactual pairs, (2) quantifies which training samples cause the bias using influence functions, and (3) removes the bias with an efficient machine-unlearning update that changes only a few classifier layers. Across Colored MNIST, CelebA, Adult Income and two language models, FMD matches or beats debiasing baselines on fairness while using far fewer counterfactual samples, much less time, and only small parameter updates. The method needs white-box access to gradients and Hessian approximations, plus labeled sensitive attributes for constructing counterfactuals.
Problem Statement
Trained models often exploit spurious correlations (e.g., gender ↔ hair color). Existing fixes need expensive bias labeling or full retraining. The paper asks: can we quickly identify which training samples create bias and remove their effect from an already-trained model, using only a small counterfactual dataset and limited parameter updates?
Main Contribution
A practical pipeline (FMD) that identifies bias with counterfactual sample pairs, estimates each training sample's effect on bias via influence functions, and removes bias via a machine-unlearning Newton update.
An efficient, training-data-light unlearning variant that uses a small external counterfactual set and updates only top classifier layers.
Empirical evidence across image, tabular, and language tasks showing large fairness gains with much lower time and data cost than many baselines.
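For orientation, the influence-function and Newton-update steps named above can be written in the standard form from the influence-function literature. This is a sketch under assumptions: the symbols (B for a differentiable bias measure on counterfactual pairs, R for the set of training samples to unlearn, H for the empirical Hessian) are chosen here for illustration, and the paper's own notation and scaling may differ.

```latex
\begin{aligned}
\hat\theta &= \arg\min_\theta \frac{1}{n}\sum_{i=1}^{n} L(z_i,\theta), \qquad
H = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^2 L(z_i,\hat\theta) \\
\mathcal{I}(z_k) &= -\nabla_\theta B(\hat\theta)^{\top} H^{-1} \nabla_\theta L(z_k,\hat\theta) \\
\theta_{\text{new}} &= \hat\theta + \frac{1}{n} H^{-1} \sum_{k \in R} \nabla_\theta L(z_k,\hat\theta) \\
\Delta B &\approx -\frac{1}{n}\sum_{k \in R} \mathcal{I}(z_k)
\end{aligned}
```

Here I(z_k) is the effect on B of upweighting sample z_k. Removing a sample corresponds to a weight change of -1/n, which gives the parameter update and the first-order estimate of the change in bias.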
Key Findings
On Colored MNIST (bias ratio 0.99), FMD attains nearly the same accuracy as strong baselines while lowering measured counterfactual bias.
On Adult Income (gender), FMD reduces bias to near zero using very few samples and little compute time.
On CelebA, FMD improves worst-group and unbiased accuracy while cutting bias compared to debiasing training baselines.
Results
Accuracy
Colored MNIST (bias 0.99) counterfactual bias
Adult (gender) bias
Accuracy
BERT (gender) bias metrics
Who Should Care
What To Try In 7 Days
Pick one deployed classifier and one suspected bias attribute.
Assemble ~500–5,000 factual/counterfactual pairs (flip attribute, keep other features).
Compute influence scores with influence functions, using Hessian-vector products (HVPs) instead of an explicit Hessian, and rank the most harmful training samples (any open-source autograd framework is sufficient).
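The steps above can be sketched end to end on a toy problem. This is a minimal illustration, not the paper's implementation: it assumes a logistic-regression model, a hand-written HVP in place of autograd, a conjugate-gradient solver, and a smooth bias surrogate (mean squared prediction gap across counterfactual pairs). All function names, the toy data, and the `LAM` regularization value are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def axpy(alpha, x, y):
    """Return alpha * x + y for plain Python lists."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def loss_grad(w, x, y):
    """Gradient of the logistic loss for one sample (label y in {0, 1})."""
    return [(sigmoid(dot(w, x)) - y) * xi for xi in x]

def fit(X, y, lam, steps=2000, lr=0.5):
    """Plain gradient descent on the mean logistic loss plus (lam / 2) * |w|^2."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        g = [lam * wi for wi in w]
        for x, yi in zip(X, y):
            g = axpy(1.0 / len(X), loss_grad(w, x, yi), g)
        w = axpy(-lr, g, w)
    return w

def hvp(w, X, v, lam):
    """Hessian-vector product of the regularized mean loss, without forming H."""
    out = [lam * vi for vi in v]
    for x in X:
        p = sigmoid(dot(w, x))
        out = axpy(p * (1.0 - p) * dot(x, v) / len(X), x, out)
    return out

def conjugate_gradient(matvec, b, iters=50, tol=1e-20):
    """Solve A s = b for symmetric positive-definite A, given only v -> A v."""
    s, r, p = [0.0] * len(b), list(b), list(b)
    rr = dot(r, r)
    for _ in range(iters):
        if rr < tol:
            break
        Ap = matvec(p)
        alpha = rr / dot(p, Ap)
        s = axpy(alpha, p, s)
        r = axpy(-alpha, Ap, r)
        rr_new = dot(r, r)
        p = axpy(rr_new / rr, p, r)
        rr = rr_new
    return s

def bias_grad(w, pairs):
    """Gradient of the surrogate: mean squared factual/counterfactual prediction gap."""
    g = [0.0] * len(w)
    for x_f, x_cf in pairs:
        p_f, p_cf = sigmoid(dot(w, x_f)), sigmoid(dot(w, x_cf))
        gap = 2.0 * (p_f - p_cf) / len(pairs)
        g = axpy(gap * p_f * (1.0 - p_f), x_f, g)
        g = axpy(-gap * p_cf * (1.0 - p_cf), x_cf, g)
    return g

def removal_effects(w, X, y, pairs, lam):
    """First-order predicted change in the bias surrogate if each sample were removed."""
    s = conjugate_gradient(lambda v: hvp(w, X, v, lam), bias_grad(w, pairs))
    return [dot(s, loss_grad(w, x, yi)) / len(X) for x, yi in zip(X, y)]

# Toy data: features are [intercept, task signal, sensitive attribute].
X = [[1.0, 1.0, 1.0], [1.0, 0.8, 1.0], [1.0, -1.0, 0.0],
     [1.0, -0.7, 0.0], [1.0, 0.9, 0.0], [1.0, -0.9, 1.0]]
y = [1, 1, 0, 0, 1, 0]
LAM = 0.01
w = fit(X, y, LAM)

# External counterfactual pairs: flip only the attribute, keep other features.
pairs = [([1.0, sig, 1.0], [1.0, sig, 0.0]) for sig in (-0.5, 0.0, 0.5)]

effects = removal_effects(w, X, y, pairs, LAM)
# Most negative effect first: removing these samples is predicted to lower bias most.
ranking = sorted(range(len(X)), key=lambda i: effects[i])
print(ranking)
```

For a deep model, the hand-written `hvp` would typically be replaced by an autograd-based Hessian-vector product over the parameters being updated; the ranking logic stays the same.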
Agent Features
Tool Use
- influence functions
- HVP
Frameworks
- machine unlearning
Architectures
- last-layer finetune
Optimization Features
Infra Optimization
- orders of magnitude less GPU time than retraining in the reported experiments
System Optimization
- precompute the inverse Hessian once and reuse it during unlearning
Training Optimization
- update only top MLP/classifier layers
- avoid full retraining
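A rough sketch of how these optimizations fit together, assuming a frozen backbone whose outputs feed a logistic classifier head: the Hessian is built over the head parameters only, inverted once, and reused for several unlearning steps. The function names, the toy features, and the head weights are hypothetical; the head is assumed to be already trained, which the influence-function approximation relies on.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def invert(A):
    """Gauss-Jordan inverse with partial pivoting for a small dense matrix."""
    n = len(A)
    M = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        scale = M[col][col]
        M[col] = [v / scale for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

def head_hessian(head_w, feats, lam):
    """Hessian of the regularized mean logistic loss w.r.t. the head weights only."""
    d = len(head_w)
    H = [[lam * (i == j) for j in range(d)] for i in range(d)]
    for f in feats:
        p = sigmoid(sum(w * x for w, x in zip(head_w, f)))
        c = p * (1.0 - p) / len(feats)
        for i in range(d):
            for j in range(d):
                H[i][j] += c * f[i] * f[j]
    return H

def unlearn(head_w, H_inv, feats, labels, remove_idx):
    """One Newton-style step approximating removal of the listed training samples."""
    n, d = len(feats), len(head_w)
    g = [0.0] * d
    for k in remove_idx:
        err = sigmoid(sum(w * x for w, x in zip(head_w, feats[k]))) - labels[k]
        g = [gi + err * x for gi, x in zip(g, feats[k])]
    step = [sum(H_inv[i][j] * g[j] for j in range(d)) / n for i in range(d)]
    return [w + s for w, s in zip(head_w, step)]

# Frozen-backbone features for four training samples (illustrative values).
feats = [[1.0, 0.2], [0.9, -0.4], [-0.8, 0.5], [-1.0, -0.3]]
labels = [1, 1, 0, 0]
head_w = [1.5, 0.1]          # assumed to be the already-trained head

H = head_hessian(head_w, feats, lam=0.01)
H_inv = invert(H)            # computed once

w_a = unlearn(head_w, H_inv, feats, labels, [0])      # unlearn sample 0
w_b = unlearn(head_w, H_inv, feats, labels, [2, 3])   # reuses the same H_inv
print(w_a, w_b)
```

Restricting the Hessian to the head keeps it small enough to invert directly; for larger parameter blocks, an iterative solve with Hessian-vector products avoids forming the matrix at all.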
Reproducibility
Data Urls
- Colored MNIST (constructed)
- CelebA
- Adult Income (UCI)
- StereoSet
- Crows-Pairs
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires white-box access to gradients/Hessian approximations (not for black-box models).
- Needs labeled sensitive attributes and counterfactual pairs; quality of counterfactuals affects results.
- Hessian/influence approximations can be less accurate in highly non-convex or very large models.
When Not To Use
- The deployed model is a black box with no gradient access.
- No reliable way to create factual/counterfactual pairs for the target attribute.
- Changing many layers is acceptable and the cost of full retraining is affordable.
Failure Modes
- Poor or non-counterfactual external pairs can under- or over-correct bias.
- Removing samples that were actually helpful can drop task accuracy.
- Approximate influence/Hessian errors can produce suboptimal unlearning steps in non-convex models.
Core Entities
Models
- ResNet-18
- BERT
- GPT-2
- Logistic Regression
Metrics
- Counterfactual bias
- Demographic parity
- Equal opportunity
- Accuracy
- Stereotype Score (SS)
- Language Modeling Score (LMS)
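As a reference point for the group-fairness metrics listed above, here is a small sketch of how demographic parity and equal opportunity gaps are commonly computed for binary predictions and a binary group attribute. The `counterfactual_flip_rate` helper is an illustrative proxy only (share of pairs whose predicted label changes when the attribute is flipped); the paper's counterfactual bias definition may differ. Stereotype Score and Language Modeling Score are not covered here.

```python
def rate(values):
    """Mean of a list of 0/1 values; NaN for an empty group."""
    return sum(values) / len(values) if values else float("nan")

def demographic_parity_gap(y_pred, group):
    """|P(pred = 1 | group = 0) - P(pred = 1 | group = 1)|."""
    r0 = rate([p for p, g in zip(y_pred, group) if g == 0])
    r1 = rate([p for p, g in zip(y_pred, group) if g == 1])
    return abs(r0 - r1)

def equal_opportunity_gap(y_true, y_pred, group):
    """Absolute difference in true-positive rate between the two groups."""
    tpr0 = rate([p for t, p, g in zip(y_true, y_pred, group) if g == 0 and t == 1])
    tpr1 = rate([p for t, p, g in zip(y_true, y_pred, group) if g == 1 and t == 1])
    return abs(tpr0 - tpr1)

def counterfactual_flip_rate(predict, pairs):
    """Share of (factual, counterfactual) pairs whose predicted label differs."""
    return rate([int(predict(x_f) != predict(x_cf)) for x_f, x_cf in pairs])

# Toy example with eight samples, four per group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]

dp_gap = demographic_parity_gap(y_pred, group)
eo_gap = equal_opportunity_gap(y_true, y_pred, group)

# Hypothetical classifier whose decision leans on the attribute in x[1].
predict = lambda x: int(x[0] + 0.5 * x[1] > 0.6)
pairs = [([1.0, 1], [1.0, 0]), ([0.4, 1], [0.4, 0])]
flip_rate = counterfactual_flip_rate(predict, pairs)

print(dp_gap, eo_gap, flip_rate)
```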
Datasets
- Colored MNIST
- CelebA
- Adult Income (UCI)
- StereoSet
- Crows-Pairs

