Combine pruning, distillation and post-training quantization to run a ViT-style segmenter on a 4GB Jetson Nano with small accuracy loss

Overview

Decision Snapshot: Needs Validation

The pipeline is practical and tested on-device, but results are scoped to one small UAV dataset, depend on PyTorch kernel behavior, and rely on swap for memory, so expect extra engineering before robust production use.

Citations: 3

Evidence Strength: 0.60

Confidence: 0.70

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 65%

Production readiness: 45%

Novelty: 40%

Authors

Eric Youn, Sai Mitheran J, Sanjana Prabhu, Siyuan Chen

Links

Abstract / PDF / Code

Why It Matters For Business

You can run a practical ViT-style segmenter on a $100–$150 Jetson Nano by combining distillation and fp16 quantization, giving near-teacher accuracy while keeping model size and RAM within real device limits.

Who Should Care

ML Engineer, Product Manager, Founder

Summary TLDR

The authors build a practical compression pipeline (structured pruning of heads and weights, distillation from a Swin teacher, and fp16 and post-training quantization) to deploy a lightweight ViT-based segmentation model on a 4GB NVIDIA Jetson Nano. The final UPerNet + MobileViT student reaches about 0.61 mean IoU on the LPCV UAV disaster parsing dataset with 5.6M parameters and runs on the Jetson at roughly 3–4.5 FPS while consuming about 3.7 GB of RAM plus swap. Pruning MobileViT is fragile (accuracy drops early), distillation recovers about +0.07 mean IoU, and fp16 improved throughput for MobileViT but not for ResNet18. Code is available.

Problem Statement

Vision transformers give high segmentation accuracy but are too large for battery- and memory-constrained edge devices (e.g., a Jetson Nano with 4GB). The goal is to compress ViT-style models so they run quickly on-device with minimal loss of segmentation accuracy for UAV disaster scenes.

Main Contribution

A unified, practical compression pipeline combining structured pruning (heads/filters/linear rows), logit+feature distillation, and post-training/fp16 quantization targeted at edge deployment.

Empirical evaluation on the LPCV 2023 UAV disaster scene parsing dataset showing a distilled + fp16-quantized UPerNet+MobileViT student that approaches teacher accuracy while fitting a Jetson Nano memory budget.

Key Findings

Distillation substantially boosts MobileViT segmentation accuracy.

Numbers: MIoU from 0.5365 to 0.6056 (+0.069)

Practical Use: If you need a small ViT for segmentation, add distillation from a stronger Swin teacher; expect roughly +6–7 percentage points MIoU on this dataset.

Evidence Ref: Table 2; Section 4.1
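To make the logit + feature distillation idea concrete, here is a minimal sketch in plain Python for a single pixel. It is not the authors' loss: the temperature, the weights `alpha` and `beta`, the use of KL divergence for logits, and mean squared error for features are all assumptions chosen for illustration, and a real training loop would use framework tensor ops over whole feature maps.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one pixel's class logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened probabilities.

    The temperature**2 factor is the usual rescaling that keeps gradient
    magnitudes comparable across temperatures (assumed, not from the paper).
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    return kl * temperature ** 2


def feature_kd_loss(student_feat, teacher_feat):
    """Mean squared error between same-length feature vectors."""
    squared = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(squared) / len(squared)


def total_loss(task_loss, student_logits, teacher_logits,
               student_feat, teacher_feat,
               alpha=0.5, beta=0.5, temperature=2.0):
    """Task loss plus weighted logit and feature distillation terms.

    alpha and beta are hypothetical weights for illustration only.
    """
    logit_term = logit_kd_loss(student_logits, teacher_logits, temperature)
    feat_term = feature_kd_loss(student_feat, teacher_feat)
    return task_loss + alpha * logit_term + beta * feat_term
```

When the student and teacher agree exactly, both distillation terms vanish and only the task loss remains, which is a quick sanity check for any implementation of this pattern.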

Final compressed model fits and runs on a 4GB Jetson Nano but uses swap.

Numbers: UPerNet+MobileViT (KD, fp16): 5.6M params; uses 3742 MB RAM + 1030 MB swap

Practical Use: You can deploy this student on a 4GB Nano, but expect to rely on swap and to need further optimizations (TensorRT/ONNX) to reduce memory pressure.

Evidence Ref: Section 4.1
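One way to collect RAM and swap figures like the ones above on a Linux device is to read `/proc/meminfo`. The sketch below is an assumption about how such a measurement could be taken, not the authors' tooling, and it reports system-wide usage rather than the footprint of a single process. The helper names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def memory_usage_mb(fields):
    """Return (RAM used, swap used) in MB from parsed meminfo fields."""
    ram_used_kb = fields["MemTotal"] - fields["MemAvailable"]
    swap_used_kb = fields["SwapTotal"] - fields["SwapFree"]
    return ram_used_kb / 1024, swap_used_kb / 1024


def read_usage(path="/proc/meminfo"):
    """Read current system-wide usage; Linux only."""
    with open(path) as handle:
        return memory_usage_mb(parse_meminfo(handle.read()))
```

Sampling `read_usage()` before loading the model and again during steady-state inference gives a rough delta attributable to the workload.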

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
mean IoU (UPerNet + MobileViT, baseline)	0.5365	—	—	LPCV	Table 2; UPerNet (MobileViT) MIoU 0.5365	Table 2
mean IoU (UPerNet + MobileViT, +KD from Swin-v2-T)	0.6056	0.5365	+0.0691	LPCV	Table 2; distillation raises MIoU to 0.6056	Table 2; Section 4.1
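
For readers who want to reproduce the metric in the table, the following sketch computes mean IoU from flat lists of predicted and ground-truth class indices and recomputes the reported delta. It uses the standard per-class definition TP / (TP + FP + FN); skipping classes that appear in neither prediction nor ground truth is an assumption, since the digest does not say how the paper handles them.

```python
def confusion_matrix(pred, target, num_classes):
    """Rows index the ground-truth class, columns the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        matrix[t][p] += 1
    return matrix


def mean_iou(matrix):
    """Average per-class IoU over classes that occur at least once."""
    ious = []
    for c in range(len(matrix)):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp
        fp = sum(row[c] for row in matrix) - tp
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Delta between the two rows of the results table.
baseline_miou = 0.5365
distilled_miou = 0.6056
delta = round(distilled_miou - baseline_miou, 4)
```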

What To Try In 7 Days

Pick a strong off-the-shelf teacher (Swin-v2-T) and train a MobileViT student via logit+feature distillation on your task data.

Measure on-device RAM and FPS on your Jetson; convert student to fp16 and rerun to check throughput gains.

Avoid aggressive pruning on already compact MobileViTs; try head pruning only if you can change tensor shapes or use optimized kernels.
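
The FPS measurement in the second step can be sketched as a small timing harness. This is a generic pattern, not the authors' benchmark script: `infer` stands for any callable that runs one forward pass, and the warmup and iteration counts are arbitrary defaults.

```python
import time


def measure_fps(infer, sample, warmup=5, iters=20):
    """Return frames per second for repeated calls to infer(sample).

    Warmup runs are excluded so one-time costs (allocation, kernel
    selection, caching) do not skew the result. On a GPU, the callable
    should block until the device has finished, otherwise the timer
    only measures kernel launch time.
    """
    for _ in range(warmup):
        infer(sample)
    start = time.perf_counter()
    for _ in range(iters):
        infer(sample)
    elapsed = time.perf_counter() - start
    return iters / elapsed
```

Running the harness once with the fp32 student and once after converting it to fp16, with the same input size, gives a like-for-like throughput comparison.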

Optimization Features

Infra Optimization

targeted deployment on Jetson Nano (4GB)

Model Optimization

structured pruning (heads/filters/linear rows)

head pruning (zeroing heads)

distillation (logit + feature-level)

System Optimization

measure RAM and swap on Jetson Nano

recommend moving to optimized inference engines

Training Optimization

knowledge distillation from Swin teacher

fine-tuning after pruning

Inference Optimization

fp16 quantization

post-training quantization (limited by PyTorch)

recommendation to use ONNX/TensorRT for further gains
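
As a back-of-the-envelope illustration of what fp16 storage means for a 5.6M-parameter model, the sketch below rounds a value to IEEE 754 half precision using the standard library and compares weight storage at 4 and 2 bytes per parameter. It covers weights only; activations and runtime overhead are not modeled, so it says nothing about the total RAM figure reported for the device.

```python
import struct


def to_fp16(value):
    """Round a Python float to IEEE 754 half precision and back.

    struct's "e" format is binary16; values beyond the fp16 range
    (about 65504 in magnitude) raise OverflowError.
    """
    return struct.unpack("<e", struct.pack("<e", value))[0]


def weight_megabytes(num_params, bytes_per_param):
    """Storage for the weights alone, in MB (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


num_params = 5_600_000  # parameter count reported for the student
fp32_mb = weight_megabytes(num_params, 4)
fp16_mb = weight_megabytes(num_params, 2)
rounding_error = abs(to_fp16(0.1) - 0.1)
```

Weight storage halves from about 22.4 MB to about 11.2 MB, and the rounding error on a value like 0.1 is on the order of 1e-5, which is one reason casting to fp16 often costs little accuracy.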

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/chensy7/efficient-vit

Risks & Boundaries

Limitations

Experiments were run on a single small dataset, LPCV (1,120 training images), so small-data effects may bias the results.

The final on-device setup relied on swap because of memory pressure; headroom within the 4GB of physical RAM is limited.

When Not To Use

When you need per-class performance on rare but critical classes (low-frequency classes suffer under compression).

When you cannot tolerate swap usage or have stricter memory-latency SLAs.

Failure Modes

Loss of fine-grained details and low-frequency class recall after compression.

No runtime improvement from pruning if tensor shapes are not reduced.
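
The last failure mode can be illustrated with a toy projection matrix whose rows are grouped by attention head. Zeroing a head keeps the matrix the same size, so a dense kernel does the same amount of work; only physically removing the rows shrinks the computation. The layout (contiguous rows per head) and the helper names are assumptions for illustration, and a real model would also need the matching columns of the output projection removed.

```python
def zero_heads(weight, head_dim, pruned):
    """Mask pruned heads with zeros; the shape is unchanged."""
    return [
        [0.0] * len(row) if r // head_dim in pruned else list(row)
        for r, row in enumerate(weight)
    ]


def remove_heads(weight, head_dim, pruned):
    """Drop the rows of pruned heads; the matrix gets smaller."""
    return [
        list(row)
        for r, row in enumerate(weight)
        if r // head_dim not in pruned
    ]


def dense_macs(weight):
    """Multiply-accumulates for one dense matrix-vector product."""
    return len(weight) * len(weight[0]) if weight else 0


# 4 heads of dimension 2 over an 8-wide input; prune heads 1 and 3.
weight = [[1.0] * 8 for _ in range(8)]
masked = zero_heads(weight, head_dim=2, pruned={1, 3})
shrunk = remove_heads(weight, head_dim=2, pruned={1, 3})
```

Here `dense_macs(masked)` equals the unpruned cost, while `dense_macs(shrunk)` is half of it, which matches the observation that pruning helps runtime only when tensor shapes actually shrink.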

Core Entities

Models

MobileViT, Swin-v2-T, Swin-v2-B, ResNet18, UPerNet, DeepLabv3, FANet

Metrics

mean IoU, Accuracy, FPS, model size (params), RAM usage

Datasets

LPCV 2023 UAV disaster scene dataset, ImageNet-1k, ADE20K

Benchmarks

LPCV 2023 challenge, ADE20K

