Make transformer teachers teach CNN students better by aligning receptive fields and adding prompts

June 26, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

1

Authors

Weisong Zhao, Xiangyu Zhu, Zhixiang He, Xiao-Yu Zhang, Zhen Lei

Links

Abstract / PDF

Why It Matters For Business

If you run face recognition on mobile or edge devices, distilling a high-performing Transformer into a CNN can boost verification accuracy substantially while keeping hardware-friendly inference.

Summary TLDR

This paper improves knowledge distillation (KD) from Transformer teachers to CNN students for face recognition. It addresses two practical problems: (1) a pixel/receptive-field mismatch between the architectures and (2) teachers that are not tuned for the distillation role. The authors propose URFM, which maps pixel features into local features with unified receptive fields, and APT, which inserts learnable prompts into the teacher so it can adapt during distillation. On standard face benchmarks and large-scale sets, the combined method consistently beats prior KD methods by several points on verification metrics.

Problem Statement

Standard KD techniques work well when teacher and student share similar architectures. When the teacher is a Transformer and the student a CNN, per-pixel spatial attention and receptive fields differ, which breaks feature alignment. Also, off-the-shelf teacher models are not trained to act as teachers and may be hard for students to learn from. The problem: how to distill knowledge effectively across architectures for face recognition.

Main Contribution

Unified Receptive Fields Mapping (URFM): maps teacher and student pixel features to local features with the same receptive fields using learnable local centers plus facial positional encoding.

Adaptable Prompting Teacher (APT): inserts and optimizes learnable prompts inside the pretrained Transformer teacher during distillation so the teacher can manage distillation-specific knowledge.

Extensive experiments and ablations on standard face benchmarks and MegaFace show consistent gains over prior KD methods for Transformer→CNN distillation.
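The URFM idea described above can be illustrated with a minimal sketch. It assumes that "learnable local centers" act as attention queries that pool pixel features, that teacher and student features have already been projected to a common width `D`, and that both share one set of centers; none of these details is confirmed by this summary. The centers are random here rather than learned, and the facial positional encoding is omitted. All names (`map_to_local_features`, `align_loss`) are hypothetical.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def map_to_local_features(pixel_feats, centers):
    """Pool N pixel features (N x D) into K local features (K x D).

    Each center attends over all pixels with scaled dot-product attention,
    so the output always has K rows, however many pixels the backbone emits.
    """
    dim = len(centers[0])
    scale = 1.0 / math.sqrt(dim)
    local = []
    for c in centers:
        scores = [scale * sum(ci * pi for ci, pi in zip(c, p)) for p in pixel_feats]
        weights = softmax(scores)
        local.append([sum(w * p[d] for w, p in zip(weights, pixel_feats))
                      for d in range(dim)])
    return local

def mse(a, b):
    # Mean squared error between two equally shaped K x D lists.
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

rng = random.Random(0)
D, K = 8, 4  # assumed feature width and number of local centers
centers = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)]
teacher_pixels = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(49)]   # e.g. 7x7 tokens
student_pixels = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(196)]  # e.g. 14x14 map

teacher_local = map_to_local_features(teacher_pixels, centers)
student_local = map_to_local_features(student_pixels, centers)
align_loss = mse(teacher_local, student_local)  # distillation signal on aligned features
```

The point of the sketch is that the teacher's 49 tokens and the student's 196 positions both reduce to the same `K` local features, so a feature-matching loss can be applied even though the two backbones produce differently sized maps.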

Key Findings

Cross-architecture KD with URFM+APT substantially improves large-scale verification.

Numbers: IJB-C TPR@FPR=1e-4: 94.4 (Ours) vs 89.13 (student baseline), +5.27

Combining attention alignment, adaptable prompting and URFM raises small-benchmark accuracy.

Numbers: CFP-FP acc: 94.63 (full method) vs 91.3 (FitNet baseline), +3.33

Saliency-based facial positional encoding helps alignment.

Numbers: AgeDB acc: 97.20 (Euc+SD) vs 96.66 (Euc), +0.54

There is a useful trade-off in teacher adaptivity: neither fully frozen nor fully trainable is best.

Numbers: Student AgeDB acc: 97.20 with 25 prompts vs 95.94 when teacher frozen, +1.26

Results

IJB-C TPR@FPR=1e-4

Value: 94.4

Baseline: MobileFaceNet student 89.13

IJB-B TPR@FPR=1e-4

Value: 92.48

Baseline: MobileFaceNet student 87.07

MegaFace Rank-1 (Id)

Value: 95.37

Baseline: MobileFaceNet student 90.91

Accuracy

Value: e.g., CFP-FP 94.63, CPLFW 91.14, AgeDB 97.20

Baseline: FitNet baseline, e.g., CFP-FP 91.3
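As a quick arithmetic check, the absolute gains over the MobileFaceNet student baseline can be recomputed from the three large-scale results listed above. The dictionary layout and variable names are only illustrative; the numbers are copied from this summary.

```python
# (distilled student, undistilled MobileFaceNet baseline), both in percent
results = {
    "IJB-C TPR@FPR=1e-4": (94.4, 89.13),
    "IJB-B TPR@FPR=1e-4": (92.48, 87.07),
    "MegaFace Rank-1": (95.37, 90.91),
}

# Absolute gain in percentage points, rounded to match the reported precision.
gains = {name: round(distilled - baseline, 2)
         for name, (distilled, baseline) in results.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
```

This gives +5.27 on IJB-C, +5.41 on IJB-B and +4.46 on MegaFace, in line with the "several points" claim in the summary.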

Who Should Care

What To Try In 7 Days

Prototype URFM: map teacher and student feature maps to a small set of local centers and verify alignment gains on your validation set.

Add prompts to your pretrained Transformer teacher (APT) and fine-tune only prompts during distillation; sweep 5–50 prompts.

Replace plain positional encoding with saliency-based facial positional encoding using landmarks to test small but consistent gains.
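For the landmark-based positional encoding step, a rough starting point might look like the sketch below. This summary does not give the paper's formula, so the sketch assumes that a "saliency distance" is simply the normalized distance from each feature-map patch to each facial landmark. The crop size, grid size, landmark coordinates and function names are all hypothetical placeholders.

```python
import math

IMAGE_SIZE = 112  # assumed square face crop, in pixels
GRID = 7          # assumed 7x7 feature map
# Illustrative five-point landmarks (eyes, nose tip, mouth corners), not real detections.
LANDMARKS = [(38.0, 52.0), (74.0, 52.0), (56.0, 72.0), (42.0, 92.0), (70.0, 92.0)]

def patch_centers(grid, image_size):
    """Return the (x, y) pixel center of every patch, in row-major order."""
    step = image_size / grid
    return [((col + 0.5) * step, (row + 0.5) * step)
            for row in range(grid) for col in range(grid)]

def saliency_encoding(centers, landmarks, image_size):
    """One row per patch: distance to each landmark, scaled to [0, 1]."""
    diag = math.hypot(image_size, image_size)
    return [[math.dist(c, lm) / diag for lm in landmarks] for c in centers]

centers = patch_centers(GRID, IMAGE_SIZE)
encoding = saliency_encoding(centers, LANDMARKS, IMAGE_SIZE)
```

Each patch ends up with a five-value descriptor that could be added to, or concatenated with, a plain Euclidean positional encoding before comparing the two variants on a validation set.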

Optimization Features

Model Optimization

  • Cross-architecture feature alignment via URFM

Training Optimization

  • Prompt-only optimization inside teacher (APT) to limit finetuning

Reproducibility

Data URLs

  • MS1MV2 (public training set)
  • LFW, CFP-FP, CPLFW, AgeDB, CALFW, IJB-B, IJB-C, MegaFace (public benchmarks)

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Tested only on face recognition; unclear if URFM generalizes to non-face tasks.
  • Requires facial landmarks for saliency PE; landmark failures may degrade alignment.
  • Adds extra training complexity (URFM + prompt tuning) and modest compute for distillation.

When Not To Use

  • If teacher and student are the same architecture (homologous KD), benefits are smaller.
  • When you cannot run landmark/keypoint detectors reliably on your data.
  • When you need the absolute simplest KD pipeline with no extra modules.

Failure Modes

  • Poor landmark detection leads to bad positional encoding and worse distillation.
  • Too many prompts can reduce the teacher's discriminative power and cause the student to underperform.
  • Setting the number of URFM local centers too low can lose facial structure information.

Core Entities

Models

  • Swin-S (Transformer teacher)
  • ViT-S (Transformer teacher)
  • MobileFaceNet (CNN student)
  • IResNet-50 (CNN teacher/student)
  • IResNet-18 (CNN student)

Metrics

  • Accuracy
  • TPR@FPR (e.g., 1e-4, 1e-6)
  • Rank-1 identification
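To make the TPR@FPR metric concrete, here is a minimal sketch of how it can be computed from similarity scores. It is one common convention (accept a pair when its score is strictly above a threshold chosen from the impostor scores), not necessarily the exact protocol used by the IJB evaluation tools; the function name and the toy scores are hypothetical.

```python
def tpr_at_fpr(genuine, impostor, target_fpr):
    """True positive rate at the loosest threshold whose FPR stays <= target_fpr.

    genuine:  similarity scores for same-identity pairs
    impostor: similarity scores for different-identity pairs
    """
    ranked = sorted(impostor, reverse=True)
    # Number of impostor pairs we may accept without exceeding the target FPR.
    allowed = int(target_fpr * len(ranked))
    # Accepting only scores strictly above ranked[allowed] admits at most
    # `allowed` impostors, even when scores are tied.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    accepted = sum(1 for score in genuine if score > threshold)
    return accepted / len(genuine)

genuine_scores = [0.9, 0.75, 0.65, 0.55]
impostor_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
tpr = tpr_at_fpr(genuine_scores, impostor_scores, target_fpr=0.25)
print(tpr)  # 0.75
```

At an operating point such as FPR=1e-4, at least 10,000 impostor pairs are needed before even one false accept is allowed, which is why this metric is reported on large sets like IJB-B and IJB-C.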

Datasets

  • MS1MV2 (training)
  • LFW
  • CFP-FP
  • CPLFW
  • AgeDB-30
  • CALFW
  • IJB-B
  • IJB-C
  • MegaFace (FaceScrub probe)

Benchmarks

  • LFW
  • CFP-FP
  • CPLFW
  • AgeDB-30
  • CALFW
  • IJB-B
  • IJB-C
  • MegaFace