Make transformer teachers teach CNN students better by aligning receptive fields and adding prompts

June 26, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The method shows consistent metric gains across benchmarks and ablations, but requires extra modules (URFM) and prompt tuning; results are limited to face recognition datasets.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Weisong Zhao, Xiangyu Zhu, Zhixiang He, Xiao-Yu Zhang, Zhen Lei

Links

Abstract / PDF / Data

Why It Matters For Business

If you run face recognition on mobile or edge devices, distilling a high-performing Transformer into a CNN can boost verification accuracy substantially while keeping hardware-friendly inference.

Summary TLDR

This paper improves knowledge distillation (KD) from Transformer teachers to CNN students for face recognition. It addresses two practical problems: (1) a pixel/receptive-field mismatch between the architectures and (2) teachers that were never tuned for the distillation role. The authors propose URFM, which maps pixel features into local features with unified receptive fields, and APT, which inserts learnable prompts into the teacher so it can adapt during distillation. On standard face benchmarks and large-scale sets the combined method consistently beats prior KD methods; for example, CFP-FP accuracy rises 3.33 points over a FitNet baseline, and IJB-C TPR@FPR=1e-4 rises 5.27 points over the MobileFaceNet student baseline.

Problem Statement

Standard KD techniques work well when teacher and student share similar architectures. When the teacher is a Transformer and the student a CNN, per-pixel spatial attention and receptive fields differ, which breaks feature alignment. Also, off-the-shelf teacher models are not trained to act as teachers and may be hard for students to learn from. The problem: how to distill knowledge effectively across architectures for face recognition.

Main Contribution

Unified Receptive Fields Mapping (URFM): maps teacher and student pixel features to local features with the same receptive fields using learnable local centers plus facial positional encoding.

Adaptable Prompting Teacher (APT): inserts and optimizes learnable prompts inside the pretrained Transformer teacher during distillation so the teacher can manage distillation-specific knowledge.
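One way to picture URFM is as attention pooling: a small set of learnable local centers act as queries over each network's pixel features, so teacher and student end up with the same number of local features regardless of their native grid size. The NumPy sketch below illustrates that idea only. The attention form, the shared centers, the equal channel width `d`, the MSE alignment loss, and all names (`map_to_local_features`, `centers`) are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def map_to_local_features(pixel_feats, centers, pos_enc=None):
    """Pool N pixel features (N, d) into K local features (K, d).

    centers : (K, d) local centers used as attention queries (assumed learnable)
    pos_enc : optional (N, d) positional encoding added to the keys
    """
    keys = pixel_feats if pos_enc is None else pixel_feats + pos_enc
    d = pixel_feats.shape[-1]
    attn = softmax(centers @ keys.T / np.sqrt(d), axis=-1)  # (K, N), rows sum to 1
    return attn @ pixel_feats                                # (K, d)

rng = np.random.default_rng(0)
d, K = 64, 8
centers = rng.normal(size=(K, d))           # assumption: shared by both branches
teacher_pixels = rng.normal(size=(49, d))   # e.g. 7x7 Transformer tokens
student_pixels = rng.normal(size=(196, d))  # e.g. 14x14 CNN feature map, flattened

t_local = map_to_local_features(teacher_pixels, centers)
s_local = map_to_local_features(student_pixels, centers)

# Both outputs are (K, d), so a plain element-wise loss can compare them.
align_loss = float(np.mean((t_local - s_local) ** 2))
```

In a real training loop the centers would be optimized jointly with the student, and the student features would likely need a projection layer to reach the teacher's channel width.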

Key Findings

Cross-architecture KD with URFM+APT substantially improves large-scale verification.

Numbers: IJB-C TPR@FPR=1e-4: 94.4 (Ours) vs 89.13 (student baseline), +5.27

Practical Use: If you distill a Swin teacher to a MobileFaceNet student, URFM+APT recovers ~5.3 points in high-security verification rates; use it to raise real-world verification reliability.

Evidence Ref: Table 1

Combining attention alignment, adaptable prompting and URFM raises small-benchmark accuracy.

Numbers: CFP-FP acc: 94.63 (full method) vs 91.3 (FitNet baseline), +3.33

Practical Use: Synchronizing receptive fields and allowing a constrained teacher to adapt improves everyday face-pair accuracy by about 3 points on CFP-FP. Worth trying when small gains matter.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
IJB-C TPR@FPR=1e-4 | 94.4 | MobileFaceNet student 89.13 | +5.27 | IJB-C | Table 1: Ours vs student baseline | Table 1
IJB-B TPR@FPR=1e-4 | 92.48 | MobileFaceNet student 87.07 | +5.41 | IJB-B | Table 1: Ours vs student baseline | Table 1
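
For readers unfamiliar with the metric in this table, TPR@FPR is the fraction of genuine (same-identity) pairs accepted at the threshold where the false-accept rate on impostor pairs stays at or below the target. A minimal pure-Python sketch follows; the function name and the toy scores are hypothetical and are not taken from the paper.

```python
def tpr_at_fpr(genuine, impostor, target_fpr):
    """Return (tpr, threshold) at the loosest threshold whose FPR <= target_fpr.

    genuine  : similarity scores for same-identity pairs
    impostor : similarity scores for different-identity pairs
    A pair is accepted when its score is strictly above the threshold.
    """
    imp = sorted(impostor, reverse=True)
    # Number of impostor pairs we may accept without exceeding the target FPR
    # (the small epsilon guards against float truncation, e.g. 0.1 * 10).
    allowed = int(target_fpr * len(imp) + 1e-9)
    # Put the threshold at the (allowed + 1)-th highest impostor score, so at
    # most `allowed` impostor scores lie strictly above it.
    threshold = imp[allowed] if allowed < len(imp) else float("-inf")
    tpr = sum(s > threshold for s in genuine) / len(genuine)
    return tpr, threshold

# Toy scores for illustration only.
genuine = [0.85, 0.92, 0.97, 0.99]
impostor = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
tpr, threshold = tpr_at_fpr(genuine, impostor, target_fpr=0.1)
```

At FPR=1e-4 the threshold is set by roughly the top 0.01% of impostor scores, which is why benchmarks with very many impostor pairs, such as IJB-B and IJB-C, are used for this metric.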

What To Try In 7 Days

Prototype URFM: map teacher and student feature maps to a small set of local centers and verify alignment gains on your validation set.

Add prompts to your pretrained Transformer teacher (APT) and fine-tune only prompts during distillation; sweep 5–50 prompts.

Replace plain positional encoding with saliency-based facial positional encoding using landmarks to test small but consistent gains.
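
To test the landmark-based positional encoding idea quickly, one simple stand-in is to describe every feature-map cell by its distance to each facial landmark. The sketch below does only that. The distance-based form, the normalization by the grid diagonal, the five landmark positions, and the function name are all assumptions for illustration; the paper's saliency-based encoding may differ in detail.

```python
import numpy as np

def landmark_distance_encoding(height, width, landmarks):
    """Return an (height * width, L) array of scaled cell-to-landmark distances.

    landmarks : sequence of L (row, col) positions in feature-map coordinates
    """
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    grid = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)  # (H*W, 2)
    lm = np.asarray(landmarks, dtype=float)                              # (L, 2)
    diff = grid[:, None, :] - lm[None, :, :]                             # (H*W, L, 2)
    dist = np.linalg.norm(diff, axis=-1)                                 # (H*W, L)
    # Scale by the grid diagonal so values stay in [0, 1] for in-grid landmarks.
    diag = max(np.hypot(height - 1, width - 1), 1.0)
    return dist / diag

# Hypothetical landmarks (eyes, nose tip, mouth corners) on a 7x7 feature map.
landmarks = [(2, 2), (2, 4), (3, 3), (5, 2), (5, 4)]
pos = landmark_distance_encoding(7, 7, landmarks)  # shape (49, 5)
```

The resulting array would still need a learned projection to the feature width before it can be added to attention keys, and its quality depends directly on the landmark detector.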

Optimization Features

Model Optimization: Cross-architecture feature alignment via URFM

Training Optimization: Prompt-only optimization inside teacher (APT) to limit finetuning
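
Prompt-only optimization can be sketched as two steps: prepend a few learnable prompt tokens to the teacher's token sequence, and hand the optimizer only those prompts while the pretrained weights stay frozen. The NumPy sketch below shows the bookkeeping rather than a training loop. The parameter names, the prompt count of 10, and both helper functions are hypothetical and not taken from the paper.

```python
import numpy as np

def insert_prompts(tokens, prompts):
    """Prepend P prompt tokens (P, d) to a token sequence (N, d) -> (P + N, d)."""
    return np.concatenate([prompts, tokens], axis=0)

def trainable_parameters(params, trainable_prefix="prompt"):
    """Keep only the parameters whose name starts with `trainable_prefix`."""
    return {name: p for name, p in params.items() if name.startswith(trainable_prefix)}

rng = np.random.default_rng(0)
d = 32
params = {
    "prompt.layer0": rng.normal(scale=0.02, size=(10, d)),  # learnable prompts
    "teacher.attn.weight": rng.normal(size=(d, d)),          # pretrained, frozen
    "teacher.mlp.weight": rng.normal(size=(d, d)),           # pretrained, frozen
}

patch_tokens = rng.normal(size=(49, d))                       # e.g. 7x7 patches
teacher_input = insert_prompts(patch_tokens, params["prompt.layer0"])  # (59, d)

# Only these entries would be passed to the optimizer during distillation.
to_update = trainable_parameters(params)
```

Keeping the update set this small is what limits finetuning cost. The prompt count is the main knob to sweep.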

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

MS1MV2 (public training set)

LFW, CFP-FP, CPLFW, AgeDB, CALFW, IJB-B, IJB-C, MegaFace (public benchmarks)

Risks & Boundaries

Limitations

Tested only on face recognition; unclear if URFM generalizes to non-face tasks.

Requires facial landmarks for saliency PE; landmark failures may degrade alignment.

When Not To Use

If teacher and student are the same architecture (homologous KD), benefits are smaller.

When you cannot run landmark/keypoint detectors reliably on your data.

Failure Modes

Poor landmark detection leads to bad positional encoding and worse distillation.

Overly many prompts can reduce teacher discriminative power and cause student underperformance.

Core Entities

Models

Swin-S (Transformer teacher), ViT-S (Transformer teacher), MobileFaceNet (CNN student), IResNet-50 (CNN teacher/student), IResNet-18 (CNN student)

Metrics

Accuracy; TPR@FPR (e.g., 1e-4, 1e-6); Rank-1 identification

Datasets

MS1MV2 (training), LFW, CFP-FP, CPLFW, AgeDB-30, CALFW, IJB-B, IJB-C, MegaFace (FaceScrub probe)

Benchmarks

LFW, CFP-FP, CPLFW, AgeDB-30, CALFW, IJB-B, IJB-C, MegaFace