EdgeTran co-designs transformer models and edge devices to cut latency, energy and peak power for mobile inference

March 24, 2023 | 8 min

Overview

Decision Snapshot: Needs Validation

The paper provides a full system and empirical numbers on multiple edge devices. Surrogate accuracy and co-design gains are demonstrated, but reproducing them requires nontrivial profiling and compute.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/6

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Shikhar Tuli, Niraj K. Jha

Links

Abstract / PDF / Data

Why It Matters For Business

Co-designing the model and the device cuts operational energy and peak power by roughly an order of magnitude while keeping or slightly improving accuracy. That means lower battery drain, more thermal headroom, and lower cloud costs for on-device NLP.

Who Should Care

Summary TLDR

This paper presents a practical pipeline that jointly searches transformer architectures and edge devices to meet accuracy and hardware targets. Key pieces: FlexiBERT 2.0 (a very large, heterogeneous transformer design space), ProTran (profiles latency, energy, and peak power and trains lightweight surrogates), BOSHCODE/EdgeTran (Bayesian co-design search for model-device pairs), and GPTran (block-level grow-and-prune post-processing). On GLUE, the final model (ET*) is 2.8× smaller than BERT-Base and scores 0.8 percentage points higher. Running on the selected Apple M1 GPU, it achieves ~15% lower latency, ~10× lower energy, and ~10.8× lower peak power than the server GPU baseline (BERT-Base on an Nvidia A100).

Problem Statement

Large transformer models are impractical on low-power edge devices because they can exceed latency, energy, and peak power budgets. Existing NAS or pruning work often optimizes only parameters or FLOPs and ignores energy and peak power across diverse mobile hardware. Searching this huge joint space (models × devices) needs fast profiling and surrogate models plus a codesign search that balances accuracy with latency, energy, and peak power.
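One way to picture "balancing accuracy with latency, energy, and peak power" is a single weighted score per model-device pair. The sketch below is illustrative only: the equal weights, the normalization by the worst candidate, the `Candidate` type, and all numbers are assumptions made for this example, not the paper's objective or measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One hypothetical model-device pair (all numbers are made up)."""
    name: str
    accuracy: float       # task score in [0, 1]
    latency_ms: float
    energy_j: float
    peak_power_w: float


def combined_score(c, max_latency_ms, max_energy_j, max_power_w,
                   weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of accuracy and (1 - normalized cost) per hardware metric."""
    w_acc, w_lat, w_en, w_pow = weights
    return (w_acc * c.accuracy
            + w_lat * (1.0 - c.latency_ms / max_latency_ms)
            + w_en * (1.0 - c.energy_j / max_energy_j)
            + w_pow * (1.0 - c.peak_power_w / max_power_w))


candidates = [
    Candidate("large-on-server", 0.80, latency_ms=10.0, energy_j=4.0, peak_power_w=200.0),
    Candidate("small-on-edge", 0.79, latency_ms=9.0, energy_j=0.5, peak_power_w=20.0),
]

# Normalize each hardware metric by the worst value among the candidates.
max_lat = max(c.latency_ms for c in candidates)
max_en = max(c.energy_j for c in candidates)
max_pow = max(c.peak_power_w for c in candidates)

best = max(candidates, key=lambda c: combined_score(c, max_lat, max_en, max_pow))
print(best.name)
```

Shifting weight toward accuracy or toward a single hardware metric changes which pair wins, which is why a score based only on parameters or FLOPs can pick a poor deployment target.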

Main Contribution

FlexiBERT 2.0: an expanded heterogeneous transformer design space (reported size ~1.7×10^88) and faster, finer-grained weight-transfer.

ProTran: an active-learning profiler that measures and models latency, energy, and peak power across multiple edge platforms and trains surrogates for fast queries.
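The active-learning idea behind a profiler like ProTran can be sketched as: fit a cheap surrogate on the measurements taken so far, then profile next the configuration where the surrogate is least certain. The code below is a minimal stand-in, not ProTran's method. It assumes a single feature (parameter count in millions), uses a bootstrap ensemble of straight-line fits as the surrogate, and all function names and data points are hypothetical.

```python
import random
import statistics


def fit_line(points):
    """Least-squares fit y = a*x + b for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0.0:
        # Degenerate resample (all x equal): fall back to a constant model.
        return 0.0, mean_y
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = cov / var_x
    return a, mean_y - a * mean_x


def bootstrap_ensemble(measured, n_models=30, seed=0):
    """Fit one line per bootstrap resample of the measured points."""
    rng = random.Random(seed)
    return [fit_line([rng.choice(measured) for _ in measured])
            for _ in range(n_models)]


def pick_next_to_profile(measured, unprofiled, n_models=30, seed=0):
    """Return the unprofiled x whose predictions disagree most across the ensemble."""
    ensemble = bootstrap_ensemble(measured, n_models, seed)

    def spread(x):
        return statistics.pstdev([a * x + b for a, b in ensemble])

    return max(unprofiled, key=spread)


# Hypothetical (parameter count in millions, latency in ms) measurements.
measured = [(10, 5.1), (20, 9.8), (30, 15.3), (40, 19.9)]
unprofiled = [25, 45, 120]

next_x = pick_next_to_profile(measured, unprofiled)
print(next_x)
```

Here the far extrapolation point is selected because the ensemble disagrees most there. A real profiler would use richer architecture features and the surrogate families the paper reports.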

Key Findings

Final co-designed model (ET*) is 2.8× smaller than BERT-Base and improves GLUE by 0.8 percentage points

Numbers: 39.6M vs 110M params; GLUE 80.4% vs 79.6%

Practical Use: You can deploy a much smaller transformer with equal-or-better GLUE accuracy on edge devices instead of shipping full BERT-Base.

Evidence Ref: Table VIII; abstract

On the chosen edge device (Apple M1 GPU) ET* reduces latency, energy, and peak power compared to BERT-Base on an A100 server GPU

Numbers: Latency -15.0%; Energy 10.0× lower; Peak power 10.8× lower

Practical Use: Co-design yields large operational savings (battery and thermal) while keeping or slightly improving accuracy, which is useful for mobile real-time apps.

Evidence Ref: Table VIII; abstract
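The headline ratios in these findings can be checked with simple arithmetic from the reported numbers. The last step, deriving an average-power reduction from the energy and latency figures, is an inference added here and not a figure from the paper. It assumes energy and latency are measured per sequence on the same workload.

```python
# Reported values for the first finding.
bert_base_params_m = 110.0
et_star_params_m = 39.6
size_ratio = bert_base_params_m / et_star_params_m   # ≈ 2.78, reported as 2.8×

glue_gain_points = 80.4 - 79.6                       # ≈ 0.8 percentage points

# Reported values for the second finding, as ratios relative to the baseline.
latency_ratio = 1.0 - 0.15      # 15% lower latency
energy_ratio = 1.0 / 10.0       # 10× lower energy

# Average power = energy / time, so the ratios divide (an inference, see above).
avg_power_ratio = energy_ratio / latency_ratio
avg_power_reduction = 1.0 / avg_power_ratio          # ≈ 8.5× lower average power

print(f"size ratio: {size_ratio:.2f}x")
print(f"GLUE gain: {glue_gain_points:.1f} points")
print(f"implied average power reduction: {avg_power_reduction:.1f}x")
```

The implied average-power reduction (about 8.5×) is a different quantity from the reported 10.8× peak-power reduction. Peak power is the maximum draw during inference, not the mean.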

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
GLUE score | 80.4% | BERT-Base 79.6% | +0.8% | GLUE (multi-task) | ET* after EdgeTran + GPTran | Table VIII; Section V-F
Model size (parameters) | 39.6M | BERT-Base 110M | 2.8× smaller | | ET* final model | Table VIII; abstract

What To Try In 7 Days

Profile your target device with ProTran (a few hours) to get latency, energy, and peak-power surrogates.

Run a constrained search in FlexiBERT 2.0 using EdgeTran to find a model-device pair for your latency and power budget.

Apply GPTran grow-and-prune on the converged model to squeeze extra savings in size and pretraining loss.
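The middle step above, finding a model-device pair within a latency and power budget, reduces to a constrained selection once surrogate predictions exist. The sketch below is an exhaustive filter over a tiny hand-written list, not EdgeTran's Bayesian search. The `Prediction` type, the model and device names, and all numbers are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class Prediction(NamedTuple):
    """Surrogate output for one model-device pair (illustrative values only)."""
    model: str
    device: str
    accuracy: float
    latency_ms: float
    peak_power_w: float


def select_pair(preds: Sequence[Prediction],
                max_latency_ms: float,
                max_power_w: float) -> Optional[Prediction]:
    """Return the most accurate pair that meets both budgets, or None."""
    feasible = [p for p in preds
                if p.latency_ms <= max_latency_ms and p.peak_power_w <= max_power_w]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p.accuracy)


predictions = [
    Prediction("model-a", "device-x", accuracy=0.81, latency_ms=40.0, peak_power_w=30.0),
    Prediction("model-b", "device-x", accuracy=0.80, latency_ms=12.0, peak_power_w=9.0),
    Prediction("model-b", "device-y", accuracy=0.80, latency_ms=25.0, peak_power_w=4.0),
    Prediction("model-c", "device-y", accuracy=0.76, latency_ms=8.0, peak_power_w=3.0),
]

choice = select_pair(predictions, max_latency_ms=20.0, max_power_w=10.0)
print(choice)
```

Returning `None` when nothing fits makes an infeasible budget explicit, which is the signal to relax a constraint or profile another device.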

Optimization Features

Infra Optimization: GBDT surrogates for hardware prediction; heteroscedastic neural surrogates for aleatoric uncertainty

Model Optimization: heterogeneous transformer layers (multiple attention types); block-level grow-and-prune (GPTran); attention-head-level weight transfer (ordered or random projection)

System Optimization: multi-framework export (PyTorch, TensorFlow, ONNX, OpenVINO); active-learning sampling to minimize hardware measurements

Training Optimization: weight transfer to avoid full retraining; autotuned fine-tuning per model using TPE

Inference Optimization: Accuracy; surrogate models for fast estimation of hardware metrics

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

GLUE (public), BookCorpus (public), Wikipedia (public), OpenWebText (public), CC-News (public)

Risks & Boundaries

Limitations

FlexiBERT 2.0 pretraining and surrogate construction are expensive (reported ~100 GPU-days).

Surrogates assume fixed batch size; changing batch sizes needs retraining of predictors.

When Not To Use

When you cannot afford initial profiling and surrogate training (no GPU budget).

When the target device runs multiple dynamic workloads whose interference you cannot pre-profile.

Failure Modes

Surrogate model mismatch: predictors may mis-rank architectures on unseen device behaviors.

Weight transfer may underperform for large topology changes, requiring longer fine-tuning.
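The surrogate-mismatch failure mode above can be monitored by comparing how the surrogate ranks a handful of architectures against fresh on-device measurements. The sketch below computes Kendall's tau (the tau-a variant, with ties counted as neither concordant nor discordant). This check is a suggestion added here, not a procedure from the paper, and the latency values are made up.

```python
from itertools import combinations


def kendall_tau(predicted, measured):
    """Kendall tau-a between two equal-length sequences of numbers."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(predicted)), 2):
        sign = (predicted[i] - predicted[j]) * (measured[i] - measured[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(predicted) * (len(predicted) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical latencies (ms) for four architectures on one device.
predicted = [5.0, 9.0, 12.0, 20.0]
measured_ok = [5.5, 8.7, 13.1, 19.0]       # same ordering as predicted
measured_drift = [5.5, 13.1, 8.7, 19.0]    # two architectures swap places

tau_ok = kendall_tau(predicted, measured_ok)
tau_drift = kendall_tau(predicted, measured_drift)
print(tau_ok, round(tau_drift, 3))
```

A tau near 1 means the surrogate still orders architectures correctly even if absolute errors exist. A falling tau on a new device or OS version suggests re-profiling before trusting the search results.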

Core Entities

Models

FlexiBERT 2.0, ProTran, EdgeTran (BOSHCODE), GPTran, ET (ET*), BOSHNAS

Metrics

GLUE score, Latency (ms/seq), Energy (J/seq), Peak power (W), Pre-training MLM loss, Surrogate MSE

Datasets

GLUE, BookCorpus, Wikipedia (English), OpenWebText, CC-News, SST-2

Benchmarks

GLUE

Context Entities

Models

BERT-Base, BERT-Tiny, HAT, AutoTinyBERT

Metrics

FLOPs, Parameter count, Throughput

Datasets

MNLI, MRPC, RTE, STS-B, QQP, QNLI, CoLA, WNLI

Benchmarks

previous HW-NAS approaches (HAT)