Overview
The pipeline is practical and tested on-device, but results are scoped to one small UAV dataset, depend on PyTorch kernel behavior, and rely on swap for memory, so expect extra engineering before robust production use.
Citations: 3
Evidence Strength: 0.60
Confidence: 0.70
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 3/7
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 65%
Production readiness: 45%
Novelty: 40%
Why It Matters For Business
You can run a practical ViT-style segmenter on a $100–$150 Jetson Nano by combining distillation and fp16 quantization, giving near-teacher accuracy while keeping model size and RAM within real device limits.
Summary TLDR
The authors build a practical compression pipeline (structured pruning of heads and weights, distillation from a Swin teacher, and fp16/post-training quantization) to deploy a lightweight ViT-based segmentation model on an NVIDIA Jetson Nano (4GB). The final UPerNet + MobileViT student reaches about 0.61 mean IoU on the LPCV UAV disaster parsing dataset with 5.6M parameters and runs on the Jetson at ~3–4.5 FPS while consuming ~3.7 GB of RAM, which requires swap. Pruning MobileViT is fragile, with accuracy dropping early; distillation recovers about +0.07 mean IoU (0.5365 to 0.6056); fp16 improved MobileViT throughput but not ResNet18 throughput. Code is available.
Problem Statement
Vision transformers give high segmentation accuracy but are too large for battery- and memory-constrained edge devices (e.g., a Jetson Nano with 4GB). The goal is to compress ViT-style models so they run quickly on-device with minimal loss of segmentation accuracy for UAV disaster scenes.
Main Contribution
A unified, practical compression pipeline combining structured pruning (heads/filters/linear rows), logit+feature distillation, and post-training/fp16 quantization targeted at edge deployment.
Empirical evaluation on the LPCV 2023 UAV disaster scene parsing dataset showing a distilled + fp16-quantized UPerNet+MobileViT student that approaches teacher accuracy while fitting a Jetson Nano memory budget.
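The logit+feature distillation objective named above can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: the temperature, the weights `alpha` and `beta`, the choice of KL divergence and mean squared error, and every function name are assumptions made for clarity. A real training loop would compute these terms per pixel on tensors.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_logit_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened class distributions.

    The T^2 factor is a common convention for keeping gradient scale
    comparable across temperatures (assumed here, not taken from the source).
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return kl * temperature ** 2


def feature_loss(student_feat, teacher_feat):
    """Mean squared error between two feature vectors of equal length."""
    assert len(student_feat) == len(teacher_feat)
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / len(student_feat)


def distillation_loss(task_loss, student_logits, teacher_logits,
                      student_feat, teacher_feat,
                      alpha=0.5, beta=0.1, temperature=2.0):
    """Task loss plus weighted logit and feature terms (weights are hypothetical)."""
    return (task_loss
            + alpha * kd_logit_loss(student_logits, teacher_logits, temperature)
            + beta * feature_loss(student_feat, teacher_feat))
```

When the student matches the teacher exactly, both distillation terms vanish and only the task loss remains, which is a quick sanity check for any real implementation.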
Key Findings
Distillation substantially boosts MobileViT segmentation accuracy.
Final compressed model fits and runs on a 4GB Jetson Nano but uses swap.
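A back-of-the-envelope calculation helps put the memory finding in context. The sketch below only converts the reported ~5.6M parameter count into weight storage at fp32 and fp16. It assumes 1 MB = 10^6 bytes and 1 GB = 10^9 bytes, and the function and variable names are illustrative.

```python
def weight_bytes(num_params, bytes_per_param):
    """Storage needed for the model weights alone."""
    return num_params * bytes_per_param


PARAMS = 5_600_000          # ~5.6M parameters, as reported
REPORTED_RAM_BYTES = 3.7e9  # ~3.7 GB observed on-device, as reported

fp32_mb = weight_bytes(PARAMS, 4) / 1e6  # 22.4 MB
fp16_mb = weight_bytes(PARAMS, 2) / 1e6  # 11.2 MB

# Fraction of the reported RAM that fp16 weights account for (about 0.3%).
fp16_share = weight_bytes(PARAMS, 2) / REPORTED_RAM_BYTES
```

Under these assumptions the weights are a small fraction of the reported footprint. The summary does not break down the rest, so profiling on the target device is the only reliable way to see where the memory goes.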
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| mean IoU (UPerNet + MobileViT, baseline) | 0.5365 | — | — | LPCV | Table 2; UPerNet (MobileViT) MIoU 0.5365 | Table 2 |
| mean IoU (UPerNet + MobileViT, +KD from Swin-v2-T) | 0.6056 | 0.5365 | +0.0691 | LPCV | Table 2; distillation raises MIoU to 0.6056 | Table 2; Section 4.1 |
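For readers who want to reproduce the metric in the table, here is a minimal sketch of mean IoU computed from a confusion matrix, followed by a check of the reported delta. The convention for classes absent from both prediction and ground truth (skipped here) is an assumption, and the source summary does not state the exact evaluation protocol.

```python
def mean_iou(confusion):
    """Mean intersection-over-union.

    confusion[i][j] = number of pixels with true class i predicted as class j.
    Classes with no true and no predicted pixels are skipped (assumed convention).
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy two-class example: each class has 3 correct, 1 missed, 1 false alarm.
toy = [[3, 1],
       [1, 3]]
toy_miou = mean_iou(toy)  # 3 / (3 + 1 + 1) = 0.6 for both classes

# Delta between the two table rows.
delta = round(0.6056 - 0.5365, 4)  # 0.0691
```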
What To Try In 7 Days
Pick a strong off-the-shelf teacher (Swin-v2-T) and train a MobileViT student via logit+feature distillation on your task data.
Measure on-device RAM and FPS on your Jetson; convert student to fp16 and rerun to check throughput gains.
Avoid aggressive pruning on already compact MobileViTs; try head pruning only if you can change tensor shapes or use optimized kernels.
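To compare throughput before and after fp16 conversion, a small timing harness is enough. The sketch below is framework-agnostic: `infer` is a hypothetical zero-argument callable that runs one forward pass, and the warmup and iteration counts are arbitrary defaults. On a CUDA device the callable should block until the GPU has finished (in PyTorch, for example, by calling `torch.cuda.synchronize()`), otherwise the timing only measures kernel launches.

```python
import time


def measure_fps(infer, warmup=5, iters=20):
    """Return frames per second for a zero-argument inference callable."""
    for _ in range(warmup):      # let caches, clocks, and lazy init settle
        infer()
    start = time.perf_counter()
    for _ in range(iters):
        infer()
    elapsed = time.perf_counter() - start
    return iters / max(elapsed, 1e-12)  # guard against a zero-length interval


def dummy_infer():
    """Stand-in workload; replace with a real forward pass on one frame."""
    return sum(i * i for i in range(10_000))


fps = measure_fps(dummy_infer)
```

Running the same harness on the fp32 and fp16 variants with identical input size gives a like-for-like comparison; the summary reports that the gain depends on the backbone.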
Risks & Boundaries
Limitations
Experiments were run on a single, small LPCV dataset (1,120 training images), so small-data effects may bias the results.
The final on-device setup relied on swap because of memory pressure; headroom within the 4GB of physical RAM is limited.
When Not To Use
When you need per-class performance on rare but critical classes (low-frequency classes suffer under compression).
When you cannot tolerate swap usage or have strict memory or latency SLAs.
Failure Modes
Loss of fine-grained detail and of recall on low-frequency classes after compression.
No runtime improvement from pruning if tensor shapes are not reduced.
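The last failure mode can be illustrated with a toy dense layer. The sketch below counts multiply-adds for a matrix-vector product when pruned rows are only zeroed (masked) versus physically removed. It is a plain-Python illustration with hypothetical helper names, not the kernels used in the source; sparse-aware kernels could change the picture.

```python
def matvec(weight, x):
    """Dense matrix-vector product; returns (output, multiply-add count)."""
    macs = 0
    out = []
    for row in weight:
        acc = 0.0
        for w, xi in zip(row, x):
            acc += w * xi
            macs += 1
        out.append(acc)
    return out, macs


def mask_rows(weight, pruned):
    """Zero the pruned rows but keep the tensor shape unchanged."""
    return [[0.0] * len(row) if i in pruned else list(row)
            for i, row in enumerate(weight)]


def drop_rows(weight, pruned):
    """Physically remove the pruned rows, shrinking the output dimension."""
    return [list(row) for i, row in enumerate(weight) if i not in pruned]


weight = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0],
          [7.0, 8.0, 9.0],
          [1.0, 0.0, 1.0]]
x = [1.0, 1.0, 1.0]
pruned = {1, 3}

masked_out, masked_macs = matvec(mask_rows(weight, pruned), x)    # 12 MACs
dropped_out, dropped_macs = matvec(drop_rows(weight, pruned), x)  # 6 MACs
```

A dense kernel does the same work on the masked layer as on the original, so only the shape-reducing variant saves compute. The trade-off is that the next layer must be resized to accept the smaller output.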

