Fast, low-cost debiasing by estimating harmful training samples and 'unlearning' them

October 19, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

6

Authors

Ruizhe Chen, Jianfei Yang, Huimin Xiong, Jianhong Bai, Tianxiang Hu, Jin Hao, Yang Feng, Joey Tianyi Zhou, Jian Wu, Zuozhu Liu

Links

Abstract / PDF

Why It Matters For Business

FMD lets teams reduce model bias quickly and cheaply: it uses only a small external counterfactual set and updates only a few classifier parameters, avoiding costly full retraining or large-scale relabeling.

Summary TLDR

The paper introduces FMD, a three-step pipeline that (1) detects whether a trained model is biased using a small set of counterfactual pairs, (2) quantifies which training samples cause the bias using influence functions, and (3) removes the bias with an efficient machine-unlearning update that only tweaks a few classifier layers. Across Colored MNIST, CelebA, Adult Income and two language models, FMD matches or beats debiasing baselines on fairness while using far fewer counterfactual samples, much less time, and only small parameter updates. The method needs white-box access to gradients and Hessian approximations, plus labeled attributes for constructing counterfactuals.

Problem Statement

Trained models often exploit spurious correlations (e.g., between gender and hair color). Existing fixes need expensive bias labeling or full retraining. The paper asks: can we quickly identify which training samples create bias and remove their effect from an already-trained model, using only a small counterfactual dataset and limited parameter updates?

Main Contribution

A practical pipeline (FMD) that identifies bias with counterfactual sample pairs, estimates each training sample's effect on bias via influence functions, and removes bias via a machine-unlearning Newton update.

An efficient, training-data-light unlearning variant that uses a small external counterfactual set and updates only top classifier layers.

Empirical evidence across image, tabular, and language tasks showing large fairness gains with much lower time and data cost than many baselines.
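For orientation, the standard influence-function quantities that steps (2) and (3) build on can be written as follows. This is a sketch in the usual up-weighting formulation, not the paper's exact equations: the symbols $\hat\theta$, $H$, $B$, $z_k$ and the harmful set $S$ are notation assumed here, and the paper may use different scaling or sign conventions.

```latex
% Empirical risk minimizer and its Hessian (assumed invertible, e.g. after damping)
\hat\theta = \arg\min_\theta \frac{1}{n}\sum_{i=1}^{n} L(z_i,\theta), \qquad
H = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^2 L(z_i,\hat\theta)

% Influence of up-weighting training sample z_k on a differentiable bias measure B
\mathcal{I}_{\mathrm{up},B}(z_k) = -\nabla_\theta B(\hat\theta)^{\top} H^{-1} \nabla_\theta L(z_k,\hat\theta)

% First-order (Newton-style) estimate of the parameters after removing a set S of samples
\theta_{\mathrm{new}} \approx \hat\theta + \frac{1}{n}\, H^{-1} \sum_{z_k \in S} \nabla_\theta L(z_k,\hat\theta)
```

Under this convention, a sample with positive $\mathcal{I}_{\mathrm{up},B}$ is one whose removal is predicted to lower $B$, so it is a candidate for unlearning.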

Key Findings

On Colored MNIST (bias ratio 0.99), FMD attains nearly the same accuracy as strong baselines while lowering measured counterfactual bias.

Numbers: Accuracy 80.04% vs 80.41%; Bias 0.2042 vs 0.2302; Time 48 s vs 1658 s; Samples 5k vs 50k (FMD vs baseline)

On Adult Income (gender), FMD reduces bias to near-zero using very few samples and low time.

Numbers: Accuracy 81.89%; Bias 0.0005; Time 2.49 s; Samples 500

On CelebA, FMD improves worst-group and unbiased accuracy while cutting bias compared to debias training baselines.

Numbers: Blonde: unbiased accuracy 89.73% (vs LfF 84.33%), worst-group accuracy 87.15% (vs LfF 81.24%), Bias 0.0717

Results

Colored MNIST (bias 0.99) accuracy

Value: 80.04%

Baseline: Rebias 80.41%

Colored MNIST (bias 0.99) counterfactual bias

Value: 0.2042

Baseline: Rebias 0.2302

Adult (gender) bias

Value: 0.0005

Baseline: LfF 0.0036

CelebA (Blonde) worst-group accuracy

Value: 87.15%

Baseline: LfF 81.24%

BERT (gender) bias metrics

Value: SS 57.77, LMS 85.45

Baseline: Vanilla SS 60.28, LMS 84.17
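To read the SS and LMS numbers: in the StereoSet benchmark, a Stereotype Score near 50 indicates no systematic preference for stereotypical over anti-stereotypical options, while a higher Language Modeling Score is better. The sketch below shows one common way these two percentages are computed from per-example model scores. The input layout (a triple of scores per example) and the function names are assumptions for illustration, not the benchmark's official code.

```python
from typing import List, Tuple

# (stereotype_score, anti_stereotype_score, unrelated_score) for one example,
# e.g. model log-likelihoods. This layout is a hypothetical choice for the sketch.
Triple = Tuple[float, float, float]


def stereotype_score(triples: List[Triple]) -> float:
    """Percent of examples where the stereotypical option outscores the anti-stereotypical one."""
    if not triples:
        raise ValueError("need at least one example")
    preferred = sum(1 for stereo, anti, _ in triples if stereo > anti)
    return 100.0 * preferred / len(triples)


def language_modeling_score(triples: List[Triple]) -> float:
    """Percent of meaningful options (stereo or anti) that outscore the unrelated option."""
    if not triples:
        raise ValueError("need at least one example")
    wins = sum((stereo > unrel) + (anti > unrel) for stereo, anti, unrel in triples)
    return 100.0 * wins / (2 * len(triples))
```

Under these definitions, moving SS from 60.28 to 57.77 is a move toward the neutral value of 50, and the LMS change from 84.17 to 85.45 suggests language modeling quality was not sacrificed.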

Who Should Care

What To Try In 7 Days

Pick one deployed classifier and one suspected bias attribute.

Assemble ~500–5,000 factual/counterfactual pairs (flip attribute, keep other features).

Compute influence scores with influence functions, using Hessian-vector products (HVPs), and rank the most harmful training samples; any open-source autograd framework is sufficient for this.
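The pair-building step above can be sketched for tabular data as follows. This is a minimal illustration under stated assumptions: records are dictionaries, the sensitive attribute is binary (0/1), and bias is measured as the mean absolute gap in predicted probability across each pair. The names `make_counterfactual`, `build_pairs`, `counterfactual_bias`, and `toy_model` are hypothetical, and image or text data would need a different way to construct counterfactuals.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]
Pair = Tuple[Record, Record]


def make_counterfactual(record: Record, attribute: str) -> Record:
    """Return a copy of `record` with a binary (0/1) attribute flipped; other features unchanged."""
    flipped = dict(record)
    flipped[attribute] = 1 - record[attribute]
    return flipped


def build_pairs(records: List[Record], attribute: str) -> List[Pair]:
    """Pair every factual record with its attribute-flipped counterfactual."""
    return [(r, make_counterfactual(r, attribute)) for r in records]


def counterfactual_bias(predict_proba: Callable[[Record], float], pairs: List[Pair]) -> float:
    """Mean absolute gap in predicted probability between factual and counterfactual inputs."""
    if not pairs:
        raise ValueError("need at least one pair")
    gaps = [abs(predict_proba(fact) - predict_proba(counter)) for fact, counter in pairs]
    return sum(gaps) / len(gaps)


def toy_model(record: Record) -> float:
    """Stand-in for a deployed classifier that (wrongly) leans on `gender`."""
    score = 0.1 + 0.5 * record["hours"] + 0.2 * record["gender"]
    return min(1.0, max(0.0, score))


if __name__ == "__main__":
    data = [{"hours": 0.4, "gender": 0}, {"hours": 0.8, "gender": 1}]
    print(counterfactual_bias(toy_model, build_pairs(data, "gender")))
```

A model that ignores the flipped attribute scores 0 on this measure, so the value gives a quick before/after check for any debiasing step.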

Agent Features

Tool Use

  • influence functions
  • HVP

Frameworks

  • machine unlearning

Architectures

  • last-layer finetune

Optimization Features

Infra Optimization

  • orders-of-magnitude lower GPU time vs retraining in experiments

System Optimization

  • precompute inverse Hessian once during unlearning

Training Optimization

  • update only top MLP/classifier layers
  • avoid full retraining
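A rough sketch of what a last-layer-only update can look like, assuming the feature extractor is frozen and the classifier head is a binary logistic regression over its features. It is not the paper's implementation: the squared-gap bias surrogate, the damping term, the dense Hessian solve, and all function names are assumptions made for illustration. Larger heads would typically replace the dense solve with Hessian-vector products and an iterative solver.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def grad_loss(w: Vector, x: Vector, y: float) -> Vector:
    """Gradient of the logistic loss for one sample w.r.t. the head weights."""
    p = sigmoid(dot(w, x))
    return [(p - y) * xi for xi in x]


def hessian(w: Vector, X: List[Vector], damping: float = 1e-3) -> List[Vector]:
    """Mean logistic-loss Hessian over X plus damping * I (keeps it invertible)."""
    d = len(w)
    H = [[damping if i == j else 0.0 for j in range(d)] for i in range(d)]
    for x in X:
        p = sigmoid(dot(w, x))
        c = p * (1.0 - p) / len(X)
        for i in range(d):
            for j in range(d):
                H[i][j] += c * x[i] * x[j]
    return H


def solve(A: List[Vector], b: Vector) -> Vector:
    """Solve A v = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    v = [0.0] * n
    for r in reversed(range(n)):
        v[r] = (M[r][n] - sum(M[r][c] * v[c] for c in range(r + 1, n))) / M[r][r]
    return v


def grad_bias(w: Vector, pairs: List[Tuple[Vector, Vector]]) -> Vector:
    """Gradient of an assumed bias surrogate: mean (p_factual - p_counterfactual)^2."""
    g = [0.0] * len(w)
    for xf, xc in pairs:
        pf, pc = sigmoid(dot(w, xf)), sigmoid(dot(w, xc))
        for i in range(len(w)):
            g[i] += 2.0 * (pf - pc) * (pf * (1 - pf) * xf[i] - pc * (1 - pc) * xc[i]) / len(pairs)
    return g


def influence_scores(w, X, Y, pairs, damping: float = 1e-3) -> List[float]:
    """Up-weighting influence of each training sample on the bias surrogate."""
    s = solve(hessian(w, X, damping), grad_bias(w, pairs))  # one solve, reused for all samples
    return [-dot(s, grad_loss(w, x, y)) for x, y in zip(X, Y)]


def unlearn(w, X, Y, harmful: List[int], damping: float = 1e-3) -> Vector:
    """One Newton-style step approximating removal of the samples indexed by `harmful`."""
    if not harmful:
        return list(w)
    g = [0.0] * len(w)
    for k in harmful:
        for i, gi in enumerate(grad_loss(w, X[k], Y[k])):
            g[i] += gi / len(X)
    step = solve(hessian(w, X, damping), g)
    return [wi + si for wi, si in zip(w, step)]
```

In this convention, samples with the largest positive influence scores are the ones whose removal is predicted to reduce the bias surrogate, so a typical use is to pass the top-ranked indices to `unlearn` and then re-measure bias and accuracy on held-out data.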

Reproducibility

Data Urls

  • Colored MNIST (constructed)
  • CelebA
  • Adult Income (UCI)
  • StereoSet
  • Crows-Pairs

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires white-box access to gradients/Hessian approximations (not for black-box models).
  • Needs labeled sensitive attributes and counterfactual pairs; quality of counterfactuals affects results.
  • Hessian/influence approximations can be less accurate in highly non-convex or very large models.

When Not To Use

  • The deployed model is a black box with no gradient access.
  • No reliable way to create factual/counterfactual pairs for the target attribute.
  • Changing many layers is acceptable and the cost of full retraining is affordable.

Failure Modes

  • Poor or non-counterfactual external pairs can under- or over-correct bias.
  • Removing samples that were actually helpful can drop task accuracy.
  • Approximate influence/Hessian errors can produce suboptimal unlearning steps in non-convex models.

Core Entities

Models

  • ResNet-18
  • BERT
  • GPT-2
  • Logistic Regression

Metrics

  • Counterfactual bias
  • Demographic parity
  • Equal opportunity
  • Accuracy
  • Stereotype Score (SS)
  • Language Modeling Score (LMS)
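Two of the group-fairness metrics listed above have simple standard definitions for a binary classifier and a binary sensitive attribute. The sketch below computes them as absolute gaps between the two groups; the function names and the 0/1 encoding of predictions, labels, and groups are assumptions for illustration, and the paper's exact evaluation protocol may differ.

```python
from typing import List, Sequence


def _positive_rate(y_pred: Sequence[int], mask: List[bool]) -> float:
    """Fraction of positive predictions among the samples selected by `mask`."""
    selected = [p for p, m in zip(y_pred, mask) if m]
    if not selected:
        raise ValueError("group is empty")
    return sum(selected) / len(selected)


def demographic_parity_gap(y_pred: Sequence[int], group: Sequence[int]) -> float:
    """|P(pred=1 | group=0) - P(pred=1 | group=1)|."""
    r0 = _positive_rate(y_pred, [g == 0 for g in group])
    r1 = _positive_rate(y_pred, [g == 1 for g in group])
    return abs(r0 - r1)


def equal_opportunity_gap(y_true: Sequence[int], y_pred: Sequence[int], group: Sequence[int]) -> float:
    """Gap in true-positive rate between groups: |P(pred=1 | y=1, g=0) - P(pred=1 | y=1, g=1)|."""
    r0 = _positive_rate(y_pred, [g == 0 and y == 1 for g, y in zip(group, y_true)])
    r1 = _positive_rate(y_pred, [g == 1 and y == 1 for g, y in zip(group, y_true)])
    return abs(r0 - r1)
```

Both gaps are 0 for a classifier that treats the two groups identically on the respective criterion, and larger values indicate more disparity.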

Datasets

  • Colored MNIST
  • CelebA
  • Adult Income (UCI)
  • StereoSet
  • Crows-Pairs