Combine pruning, distillation and post-training quantization to run a ViT-style segmenter on a 4GB Jetson Nano with small accuracy loss

September 5, 2023 · 8 min

Overview

Production Readiness

0.45

Novelty Score

0.4

Cost Impact Score

0.65

Citation Count

3

Authors

Eric Youn, Sai Mitheran J, Sanjana Prabhu, Siyuan Chen

Why It Matters For Business

You can run a practical ViT-style segmenter on a $100–$150 Jetson Nano by combining distillation and fp16 quantization. The student approaches teacher accuracy with only 5.6M params, though RAM use (~3.7 GB plus swap) leaves little headroom on a 4GB device.

Summary TLDR

The authors build a practical compression pipeline (structured pruning of heads/weights, distillation from a Swin teacher, and fp16/post-training quantization) to deploy a lightweight ViT-based segmentation model on an NVIDIA Jetson Nano (4GB). The final UPerNet + MobileViT student reaches about 0.61 mean IoU on the LPCV UAV disaster parsing dataset with 5.6M params and runs on the Jetson at ~3–4.5 FPS while consuming ~3.7 GB RAM plus ~1 GB of swap. Pruning MobileViT is fragile (accuracy drops from 0.2 sparsity onward); distillation recovers ~+0.07 mean IoU; fp16 improved MobileViT throughput but not ResNet18's. Code is available.

Problem Statement

Vision transformers give high segmentation accuracy but are too large for battery- and memory-constrained edge devices (e.g., a Jetson Nano with 4GB). The goal is to compress ViT-style models so they run quickly on-device with minimal loss of segmentation accuracy for UAV disaster scenes.

Main Contribution

A unified, practical compression pipeline combining structured pruning (heads/filters/linear rows), logit+feature distillation, and post-training/fp16 quantization targeted at edge deployment.

Empirical evaluation on the LPCV 2023 UAV disaster scene parsing dataset showing a distilled + fp16-quantized UPerNet+MobileViT student that approaches teacher accuracy while fitting a Jetson Nano memory budget.

A focused ablation: effects of pruning types (filter/channel/unstructured), head pruning, iterative pruning, and fp16 quantization on accuracy, speed, and memory.

Key Findings

Distillation substantially boosts MobileViT segmentation accuracy.

Numbers: MIoU from 0.5365 to 0.6056 (+0.069)

Final compressed model fits and runs on a 4GB Jetson Nano but uses swap.

Numbers: UPerNet+MobileViT (KD, fp16): 5.6M params; uses 3742 MB RAM + 1030 MB swap

fp16 quantization increased MobileViT throughput but not ResNet18.

Numbers: MobileViT Jetson FPS 3.11 -> 4.49 (≈1.44x); ResNet18 throughput decreased

Pruning MobileViT degrades accuracy quickly; Swin tolerates more sparsity.

Numbers: MobileViT accuracy drops starting at 0.2 sparsity; Swin models degrade only around 0.7 sparsity

Head pruning reduces theoretical params but not runtime when dimensions stay the same.

Numbers: MobileViT heads 4->2: MIoU 0.545 -> 0.533 (−0.012), params 5.6M -> 4.6M, no FPS gain

Iterative pruning gave only marginal gains over bulk pruning on MobileViT.

Numbers: Iterative schemes showed small improvements; overall compression remained limited

Results

mean IoU (UPerNet + MobileViT, baseline)

Value: 0.5365

mean IoU (UPerNet + MobileViT, +KD from Swin-v2-T)

Value: 0.6056

Baseline: 0.5365

parameters (UPerNet + MobileViT)

Value: 5.6M

Jetson Nano RAM usage (final model)

Value: 3742 MB RAM (+1030 MB swap)

Jetson Nano throughput (MobileViT, fp32 -> fp16)

Value: 3.11 FPS -> 4.49 FPS

Baseline: 3.11 FPS

Pruning sensitivity (MobileViT)

Value: accuracy drops starting at 0.2 sparsity

Head pruning example (MobileViT 4->2 heads)

Value: MIoU 0.545 -> 0.533; params 5.6M -> 4.6M; no FPS gain

Baseline: 4 heads

Who Should Care

What To Try In 7 Days

Pick a strong off-the-shelf teacher (Swin-v2-T) and train a MobileViT student via logit+feature distillation on your task data.

Measure on-device RAM and FPS on your Jetson; convert student to fp16 and rerun to check throughput gains.

Avoid aggressive pruning on already compact MobileViTs; try head pruning only if you can change tensor shapes or use optimized kernels.
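
As a starting point for the distillation step above, here is a minimal sketch of a combined logit + feature distillation loss for a single pixel, written in plain Python so the arithmetic is visible. The temperature, the weights `alpha` and `beta`, and the choice of KL divergence plus mean squared error are illustrative assumptions, not the authors' reported settings; in practice the same computation would be done on tensors in a framework such as PyTorch.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Temperature-scaled log-softmax over one pixel's class logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_sum for z in scaled]

def logit_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    return kl * temperature ** 2

def feature_kd_loss(student_feat, teacher_feat):
    """Mean squared error between same-length feature vectors."""
    sq = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(sq) / len(sq)

def distillation_loss(task_loss, student_logits, teacher_logits,
                      student_feat, teacher_feat, alpha=0.5, beta=0.1):
    """Task loss plus weighted logit and feature terms (weights are assumed)."""
    return (task_loss
            + alpha * logit_kd_loss(student_logits, teacher_logits)
            + beta * feature_kd_loss(student_feat, teacher_feat))
```

Note that the feature term assumes student and teacher features already have the same length; when the two backbones differ in width, a small learned projection on the student side is a common way to align them.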

Optimization Features

Infra Optimization

  • targeted deployment on Jetson Nano (4GB)

Model Optimization

  • structured pruning (heads/filters/linear rows)
  • head pruning (zeroing heads)
  • distillation (logit + feature-level)
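
To make "structured pruning of linear rows" concrete, the sketch below zeroes the rows of a weight matrix with the smallest L1 norm. It is an illustration of the general technique under the assumption of an L1 ranking criterion; the source does not state which criterion the authors used, and `prune_rows` is a hypothetical helper name.

```python
def prune_rows(weight, sparsity):
    """Zero the rows of `weight` (a list of rows) with the smallest L1 norm.

    `sparsity` is the fraction of rows to remove, e.g. 0.2 prunes 20% of rows.
    """
    n_prune = int(len(weight) * sparsity)
    norms = [sum(abs(w) for w in row) for row in weight]
    # Indices sorted from least to most important row.
    order = sorted(range(len(weight)), key=lambda i: norms[i])
    pruned = set(order[:n_prune])
    return [[0.0] * len(row) if i in pruned else list(row)
            for i, row in enumerate(weight)]

# Toy 4x2 layer: rows 1 and 3 carry the least weight mass.
W = [[1.0, 1.0], [0.1, 0.1], [2.0, -2.0], [0.5, 0.0]]
W_pruned = prune_rows(W, sparsity=0.5)
```

Zeroed rows only save time if they are subsequently sliced out of the layer (and out of the next layer's input), which is the same shape issue noted for head pruning.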

System Optimization

  • measure RAM and swap on Jetson Nano
  • recommend moving to optimized inference engines
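
A minimal sketch of how RAM, swap, and FPS might be measured on the device, using only the Python standard library. The helper names are hypothetical, and the memory accounting (`MemTotal - MemAvailable`) is an assumption that may differ from how the reported figures were collected.

```python
import time

def measure_fps(infer, n_warmup=3, n_runs=20):
    """Time a zero-argument inference callable and return frames per second.

    Caveat: GPU work is asynchronous, so a real `infer` should block until
    the result is ready (e.g. by synchronizing the device) before returning.
    """
    for _ in range(n_warmup):
        infer()
    start = time.perf_counter()
    for _ in range(n_runs):
        infer()
    elapsed = time.perf_counter() - start
    return n_runs / elapsed

def parse_meminfo(text):
    """Return (used RAM in MB, used swap in MB) from /proc/meminfo-style text."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # values are reported in kB
    ram_used_kb = fields["MemTotal"] - fields["MemAvailable"]
    swap_used_kb = fields["SwapTotal"] - fields["SwapFree"]
    return ram_used_kb / 1024, swap_used_kb / 1024
```

On a Linux device, `parse_meminfo(open("/proc/meminfo").read())` can be sampled while the model is running; taking the measurement before and after loading the model separates the model's footprint from the rest of the system.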

Training Optimization

  • knowledge distillation from Swin teacher
  • fine-tuning after pruning

Inference Optimization

  • fp16 quantization
  • post-training quantization (limited by PyTorch)
  • recommendation to use ONNX/TensorRT for further gains
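
The following sketch illustrates what fp16 conversion does to stored weights: each value is rounded to IEEE 754 half precision and takes 2 bytes instead of 4. It uses the standard library's `struct` half-precision format and the 5.6M parameter count from the results; the helper names are hypothetical. Weight storage comes out far below the reported ~3.7 GB RAM figure, which suggests (as an assumption, not a finding from the source) that most of that memory is not weights.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def weight_megabytes(num_params, bytes_per_param):
    """Storage for the weights alone, in MB (10^6 bytes)."""
    return num_params * bytes_per_param / 1e6

NUM_PARAMS = 5_600_000  # UPerNet + MobileViT parameter count from the results

fp32_mb = weight_megabytes(NUM_PARAMS, 4)  # 4 bytes per fp32 weight
fp16_mb = weight_megabytes(NUM_PARAMS, 2)  # 2 bytes per fp16 weight

# Rounding error introduced on a typical small weight value.
rounding_error = abs(to_fp16(0.1) - 0.1)
```

Values outside the half-precision range (roughly ±65504) cannot be represented, and `struct.pack` raises `OverflowError` for them, which is one reason to inspect activation ranges before converting a model.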

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments were run on a single, small dataset (LPCV, 1,120 training images), so small-data effects may bias results.
  • The final on-device setup relied on swap (memory pressure); headroom within the 4GB of physical RAM is limited.
  • PyTorch has limited quantized CUDA support; reported speed/memory depends on current kernel implementations.
  • Pruning did not change attention tensor shapes, so no real runtime win from head pruning in these experiments.
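
The last limitation can be illustrated with a rough multiply-accumulate count for one self-attention layer. Zeroing heads leaves every tensor shape unchanged, so a dense kernel does the same work; only slicing the heads out of the projections reduces the count. The layer sizes below are hypothetical values chosen for illustration, not MobileViT's actual configuration.

```python
def attention_macs(tokens, embed_dim, num_heads, head_dim):
    """Approximate multiply-accumulates for one multi-head self-attention layer."""
    inner = num_heads * head_dim
    qkv_proj = tokens * embed_dim * 3 * inner         # Q, K, V projections
    scores = num_heads * tokens * tokens * head_dim   # Q @ K^T per head
    context = num_heads * tokens * tokens * head_dim  # attention @ V per head
    out_proj = tokens * inner * embed_dim             # output projection
    return qkv_proj + scores + context + out_proj

# Hypothetical layer sizes, for illustration only.
TOKENS, EMBED_DIM, HEAD_DIM = 256, 144, 36

dense = attention_macs(TOKENS, EMBED_DIM, num_heads=4, head_dim=HEAD_DIM)
# Two of four heads zeroed: shapes are unchanged, so the cost is unchanged.
masked = attention_macs(TOKENS, EMBED_DIM, num_heads=4, head_dim=HEAD_DIM)
# Two heads physically removed: the inner width halves, and so does the cost.
sliced = attention_macs(TOKENS, EMBED_DIM, num_heads=2, head_dim=HEAD_DIM)
```

Under this simple cost model every term is linear in the number of heads, so removing half the heads halves the count; measured speedups on real hardware will typically be smaller.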

When Not To Use

  • When you need per-class performance on rare but critical classes (low-frequency classes suffer under compression).
  • When you cannot tolerate swap usage or have stricter memory-latency SLAs.
  • When your deployment stack cannot run fp16 or lacks optimized attention kernels.

Failure Modes

  • Loss of fine-grained details and low-frequency class recall after compression.
  • No runtime improvement from pruning if tensor shapes are not reduced.
  • Quantization or fp16 may not speed up all backbones due to kernel differences (ResNet vs attention).
  • Iterative pruning may not yield meaningful extra compression for already compact student models.

Core Entities

Models

  • MobileViT
  • Swin-v2-T
  • Swin-v2-B
  • ResNet18
  • UPerNet
  • DeepLabv3
  • FANet

Metrics

  • mean IoU
  • Accuracy
  • FPS
  • model size (params)
  • RAM usage
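
For reference, mean IoU (the headline metric in the results) can be computed from flattened per-pixel predictions and labels as sketched below. Skipping classes that appear in neither the prediction nor the ground truth is an assumption; the LPCV challenge's official scoring may handle absent classes differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target.

    `pred` and `target` are equal-length sequences of integer class ids,
    one per pixel.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            inter[p] += 1
            union[p] += 1
        else:
            # A mismatch adds to the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious)

# Toy example: 4 pixels, 2 classes, one pixel misclassified.
score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
```

In the toy example class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is their average.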

Datasets

  • LPCV 2023 UAV disaster scene dataset
  • ImageNet-1k
  • ADE20K

Benchmarks

  • LPCV 2023 challenge
  • ADE20K