Combine pruning, distillation and post-training quantization to run a ViT-style segmenter on a 4GB Jetson Nano with small accuracy loss

September 5, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The pipeline is practical and tested on-device, but results are scoped to one small UAV dataset, depend on PyTorch kernel behavior, and rely on swap for memory, so expect extra engineering before robust production use.

Citations: 3

Evidence Strength: 0.60

Confidence: 0.70

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 65%

Production readiness: 45%

Novelty: 40%

Authors

Eric Youn, Sai Mitheran J, Sanjana Prabhu, Siyuan Chen

Links

Abstract / PDF / Code

Why It Matters For Business

You can run a practical ViT-style segmenter on a $100–$150 Jetson Nano by combining distillation and fp16 quantization, giving near-teacher accuracy while keeping model size and RAM within real device limits.

Who Should Care

Summary TLDR

The authors build a practical compression pipeline (structured pruning of heads/weights, distillation from a Swin teacher, and fp16/post-training quantization) to deploy a lightweight ViT-based segmentation model on an NVIDIA Jetson Nano (4GB). The final UPerNet + MobileViT student reaches about 0.61 mean IoU on the LPCV UAV disaster parsing dataset with 5.6M params and runs on the Jetson at ~3–4.5 FPS while consuming ~3.7 GB RAM plus about 1 GB of swap. Pruning MobileViT is fragile (accuracy drops early); distillation adds about +0.07 MIoU (0.5365 to 0.6056); fp16 improved throughput for MobileViT but not for ResNet18. Code is available.

Problem Statement

Vision transformers give high segmentation accuracy but are too large for battery- and memory-constrained edge devices (e.g., a Jetson Nano with 4GB). The goal is to compress ViT-style models so they run quickly on-device with minimal loss of segmentation accuracy for UAV disaster scenes.

Main Contribution

A unified, practical compression pipeline combining structured pruning (heads/filters/linear rows), logit+feature distillation, and post-training/fp16 quantization targeted at edge deployment.

Empirical evaluation on the LPCV 2023 UAV disaster scene parsing dataset showing a distilled + fp16-quantized UPerNet+MobileViT student that approaches teacher accuracy while fitting a Jetson Nano memory budget.

Key Findings

Distillation substantially boosts MobileViT segmentation accuracy.

Numbers: MIoU from 0.5365 to 0.6056 (+0.0691)

Practical Use: If you need a small ViT for segmentation, add distillation from a stronger Swin teacher; expect ~+6–7 percentage points MIoU on this dataset.

Evidence Ref: Table 2; Section 4.1
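
The logit and feature distillation described in this summary can be sketched as a single loss function. This is a minimal NumPy illustration, not the authors' implementation: the temperature `T`, the weights `alpha` and `beta`, and the assumption that student and teacher features already share a shape (in practice a projection layer is often needed) are all hypothetical choices made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the class axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kd_loss(student_logits, teacher_logits, student_feat, teacher_feat,
            T=2.0, alpha=1.0, beta=1.0):
    """Logit + feature distillation loss (T, alpha, beta are assumed values).

    Logits have shape (num_pixels, num_classes); both features share one shape.
    """
    eps = 1e-12
    p_t = softmax(teacher_logits / T)
    p_s = softmax(student_logits / T)
    # KL(teacher || student), averaged over pixels, scaled by T^2.
    kl = np.sum(p_t * (np.log(p_t + eps) - np.log(p_s + eps)), axis=-1).mean()
    logit_term = (T ** 2) * kl
    # Feature term: mean squared error between intermediate feature maps.
    feat_term = np.mean((student_feat - teacher_feat) ** 2)
    return alpha * logit_term + beta * feat_term

# Toy data standing in for per-pixel logits and features.
rng = np.random.default_rng(0)
t_logits = rng.normal(size=(16, 4))
s_logits = rng.normal(size=(16, 4))
t_feat = rng.normal(size=(16, 8))
s_feat = rng.normal(size=(16, 8))

loss_random = kd_loss(s_logits, t_logits, s_feat, t_feat)
loss_matched = kd_loss(t_logits, t_logits, t_feat, t_feat)
print(f"mismatched student: {loss_random:.4f}, perfect student: {loss_matched:.4f}")
```

In a training loop this term would typically be added to the ordinary segmentation loss on ground-truth labels; how the paper weights the two is not stated in this summary.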

Final compressed model fits and runs on a 4GB Jetson Nano but uses swap.

Numbers: UPerNet+MobileViT (KD, fp16): 5.6M params; uses 3742 MB RAM + 1030 MB swap

Practical Use: You can deploy this student on a 4GB Nano, but expect to rely on swap and to need further optimization (TensorRT/ONNX) to reduce memory pressure.

Evidence Ref: Section 4.1
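
The reported memory figures can be put in perspective with a little arithmetic. The sketch below uses only the numbers quoted above; the bytes-per-parameter values (4 for fp32, 2 for fp16) are standard, while treating "4GB" as 4096 MB is an assumption, since the memory actually usable by a process on the device is lower.

```python
PARAMS = 5.6e6              # reported student parameter count
RAM_MB = 3742               # reported RAM use on the Nano
SWAP_MB = 1030              # reported swap use
NOMINAL_RAM_MB = 4 * 1024   # assumption: "4GB" taken as 4096 MB

def weights_mb(num_params, bytes_per_param):
    # Size of the raw weights only; excludes activations and runtime overhead.
    return num_params * bytes_per_param / 1e6

fp32_mb = weights_mb(PARAMS, 4)
fp16_mb = weights_mb(PARAMS, 2)
total_mb = RAM_MB + SWAP_MB
over_nominal_mb = total_mb - NOMINAL_RAM_MB
weight_share = fp16_mb / RAM_MB

print(f"weights: {fp32_mb:.1f} MB (fp32), {fp16_mb:.1f} MB (fp16)")
print(f"RAM + swap: {total_mb} MB, {over_nominal_mb} MB over nominal RAM")
print(f"fp16 weights as share of measured RAM: {weight_share:.2%}")
```

The fp16 weights account for well under 1% of the measured RAM, so the footprint is dominated by something other than the parameters themselves (likely the framework runtime and activations). That is consistent with the summary's suggestion to move to a leaner inference engine.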

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
mean IoU (UPerNet + MobileViT, baseline) | 0.5365 | - | - | LPCV | Table 2; UPerNet (MobileViT) MIoU 0.5365 | Table 2
mean IoU (UPerNet + MobileViT, +KD from Swin-v2-T) | 0.6056 | 0.5365 | +0.0691 | LPCV | Table 2; distillation raises MIoU to 0.6056 | Table 2; Section 4.1
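
For readers less familiar with the metric, here is a minimal sketch of mean IoU on a toy label map, followed by the delta reported in the results. It uses a generic definition; how the LPCV challenge aggregates the score (per image or over the dataset, and which classes are ignored) is not stated in this summary, so treat those details as assumptions.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps; skipped (one common convention)
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

# Toy 2x3 label maps with three classes.
target = np.array([[0, 0, 1],
                   [1, 2, 2]])
pred = np.array([[0, 1, 1],
                 [1, 2, 0]])
score = mean_iou(pred, target, num_classes=3)

# Delta between the distilled student and the baseline, from the reported values.
baseline, distilled = 0.5365, 0.6056
delta = round(distilled - baseline, 4)
relative_gain = delta / baseline

print(f"toy mean IoU: {score:.3f}")
print(f"reported delta: {delta:+.4f} ({relative_gain:.1%} relative)")
```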

What To Try In 7 Days

Pick a strong off-the-shelf teacher (Swin-v2-T) and train a MobileViT student via logit+feature distillation on your task data.

Measure on-device RAM and FPS on your Jetson; convert student to fp16 and rerun to check throughput gains.

Avoid aggressive pruning on already compact MobileViTs; try head pruning only if you can change tensor shapes or use optimized kernels.
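
The on-device FPS measurement suggested above can be sketched as a small timing harness. The `infer` callable and `fake_infer` workload are placeholders for a real model forward pass, and the warm-up and iteration counts are arbitrary assumptions.

```python
import time

def measure_fps(infer, sample, warmup=5, iters=20):
    """Average frames per second of `infer(sample)` after warm-up runs."""
    for _ in range(warmup):
        infer(sample)  # warm-up: caches, lazy initialization, clock ramp-up
    start = time.perf_counter()
    for _ in range(iters):
        infer(sample)
    elapsed = time.perf_counter() - start
    return iters / elapsed

def fake_infer(x):
    # Placeholder workload standing in for a model forward pass (~10 ms).
    time.sleep(0.01)
    return x

fps = measure_fps(fake_infer, sample=None, warmup=2, iters=10)
print(f"{fps:.1f} FPS")
```

With a PyTorch model on the GPU, synchronize the device (for example with `torch.cuda.synchronize()`) before reading the clock, because CUDA kernels launch asynchronously. Run the same harness before and after converting the student to fp16 so the two numbers are directly comparable.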

Optimization Features

Infra Optimization
targeted deployment on Jetson Nano (4GB)
Model Optimization
structured pruning (heads/filters/linear rows); head pruning (zeroing heads); distillation (logit + feature-level)
System Optimization
measure RAM and swap on Jetson Nano; recommend moving to optimized inference engines
Training Optimization
knowledge distillation from Swin teacher; fine-tuning after pruning
Inference Optimization
fp16 quantization; post-training quantization (limited by PyTorch); recommendation to use ONNX/TensorRT for further gains
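
As a rough illustration of what the fp16 step buys, the sketch below casts a random weight matrix to half precision and reports the storage ratio and the rounding error. It assumes the fp16 step is a plain cast of the weights, which this summary does not confirm, and the random weights are illustrative rather than taken from the model.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative weights only; real layers will have different distributions.
w32 = rng.normal(scale=0.05, size=(256, 256)).astype(np.float32)
w16 = w32.astype(np.float16)

size_ratio = w16.nbytes / w32.nbytes
max_abs_err = float(np.max(np.abs(w32 - w16.astype(np.float32))))

print(f"storage ratio fp16/fp32: {size_ratio}")
print(f"max absolute rounding error: {max_abs_err:.2e}")
```

Halving the storage does not by itself guarantee higher throughput: that depends on whether the runtime has fast half-precision kernels for the layers involved, which matches the summary's note that fp16 helped MobileViT but not ResNet18.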

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Experiments use a single, small LPCV dataset (1,120 training images), so small-data effects may bias the results.

The final on-device setup relied on swap because of memory pressure; headroom within the 4GB of RAM alone is limited.

When Not To Use

When you need per-class performance on rare but critical classes (low-frequency classes suffer under compression).

When you cannot tolerate swap usage or have strict memory or latency SLAs.

Failure Modes

Loss of fine-grained details and of recall on low-frequency classes after compression.

No runtime improvement from pruning if tensor shapes are not reduced.
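
The last failure mode can be made concrete by comparing two ways of pruning attention heads. The sketch below is a NumPy illustration with made-up sizes and an arbitrary choice of heads to keep; it is not the authors' pruning code. Zeroing heads leaves the weight shape, and therefore the dense matmul cost, unchanged, while slicing the heads out actually shrinks the matrix.

```python
import numpy as np

num_heads, head_dim, d_model = 4, 8, 32   # hypothetical sizes
rng = np.random.default_rng(0)
# Projection weight whose rows are grouped by attention head.
w = rng.normal(size=(num_heads * head_dim, d_model))
keep = [0, 2]   # hypothetical choice: keep heads 0 and 2, prune heads 1 and 3

# Option 1: zero the pruned heads. The shape is unchanged.
mask = np.zeros(num_heads, dtype=bool)
mask[keep] = True
w_zeroed = w * np.repeat(mask, head_dim)[:, None]

# Option 2: slice the pruned heads out. The matrix really gets smaller.
w_heads = w.reshape(num_heads, head_dim, d_model)
w_sliced = w_heads[keep].reshape(len(keep) * head_dim, d_model)

def dense_macs(weight, tokens):
    # Multiply-accumulate count for x @ weight.T, where x has `tokens` rows.
    rows, cols = weight.shape
    return tokens * rows * cols

macs_zeroed = dense_macs(w_zeroed, tokens=196)
macs_sliced = dense_macs(w_sliced, tokens=196)

print(f"zeroed: shape {w_zeroed.shape}, {macs_zeroed} MACs")
print(f"sliced: shape {w_sliced.shape}, {macs_sliced} MACs")
```

A dense kernel does the same work on zeros as on any other value, so only the sliced variant reduces compute. Slicing in turn requires downstream layers to accept the smaller shape.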

Core Entities

Models

MobileViT, Swin-v2-T, Swin-v2-B, ResNet18, UPerNet, DeepLabv3, FANet

Metrics

mean IoU, Accuracy, FPS, model size (params), RAM usage

Datasets

LPCV 2023 UAV disaster scene dataset, ImageNet-1k, ADE20K

Benchmarks

LPCV 2023 challenge, ADE20K