EdgeTran co-designs transformer models and edge devices to cut latency, energy and peak power for mobile inference

Overview

Decision Snapshot: Needs Validation

Paper provides a full system and empirical numbers on multiple edge devices; surrogate accuracy and co-design gains are demonstrated but require nontrivial profiling and compute to reproduce.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/6

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Shikhar Tuli, Niraj K. Jha

Links

Abstract / PDF / Data

Why It Matters For Business

Co-designing the model and the device cuts operational energy and peak power by roughly an order of magnitude (relative to BERT-Base on an A100 server GPU) while keeping or slightly improving accuracy. For on-device NLP, that means lower battery drain, less thermal pressure, and lower cloud costs.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

This paper presents a practical pipeline to jointly search transformer architectures and edge devices to meet accuracy and hardware targets. Key pieces: FlexiBERT 2.0 (a very large, heterogeneous transformer design space), ProTran (profiles latency, energy, and peak power and trains lightweight surrogates), BOSHCODE/EdgeTran (Bayesian co-design search for model-device pairs), and GPTran (block-level grow-and-prune post-processing). On GLUE, the final model (ET*) is 2.8× smaller than BERT-Base and scores 0.8 percentage points higher. Running on the selected Apple M1 GPU, it achieves ~15% lower latency, ~10× lower energy, and ~10.8× lower peak power than the server GPU baseline (BERT-Base on an Nvidia A100).

Problem Statement

Large transformer models are impractical on low-power edge devices because they can exceed latency, energy, and peak power budgets. Existing NAS or pruning work often optimizes only parameters or FLOPs and ignores energy and peak power across diverse mobile hardware. Searching this huge joint space (models × devices) needs fast profiling and surrogate models plus a codesign search that balances accuracy with latency, energy, and peak power.
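One way to picture "balancing accuracy with latency, energy, and peak power" is a single weighted score per model-device candidate. The sketch below is illustrative only: the `Candidate` class, the weights, the budget normalization, and all numbers are assumptions made here, not the formulation used in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float       # task score scaled to [0, 1]
    latency_ms: float     # latency per sequence
    energy_j: float       # energy per sequence
    peak_power_w: float   # peak power draw


def codesign_score(c, max_latency_ms, max_energy_j, max_peak_power_w,
                   weights=(0.4, 0.2, 0.2, 0.2)):
    """Weighted sum of accuracy and normalized hardware savings (higher is better).

    Each hardware metric is divided by its budget and clipped to 1.0, so a
    candidate at or over budget earns no credit for that metric.
    The weights are hypothetical and would need tuning per application.
    """
    w_acc, w_lat, w_en, w_pow = weights
    lat = min(c.latency_ms / max_latency_ms, 1.0)
    en = min(c.energy_j / max_energy_j, 1.0)
    pw = min(c.peak_power_w / max_peak_power_w, 1.0)
    return (w_acc * c.accuracy
            + w_lat * (1.0 - lat)
            + w_en * (1.0 - en)
            + w_pow * (1.0 - pw))


# Hypothetical candidates with made-up metrics.
small = Candidate("small-on-edge", 0.80, 50.0, 1.0, 5.0)
large = Candidate("large-on-server", 0.81, 100.0, 10.0, 50.0)

best = max([small, large],
           key=lambda c: codesign_score(c, 100.0, 10.0, 50.0))
print(best.name)
```

With these made-up numbers the smaller candidate wins despite slightly lower accuracy, because the larger one sits exactly at every hardware budget.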

Main Contribution

FlexiBERT 2.0: an expanded heterogeneous transformer design space (reported size ~1.7×10^88) and faster, finer-grained weight-transfer.

ProTran: an active-learning profiler that measures and models latency, energy, and peak power across multiple edge platforms and trains surrogates for fast queries.

Key Findings

Final co-designed model (ET*) is 2.8× smaller than BERT-Base and improves GLUE by 0.8 percentage points

Numbers: 39.6M vs 110M params; GLUE 80.4% vs 79.6%

Practical Use: You can deploy a much smaller transformer with equal-or-better GLUE accuracy on edge devices instead of shipping full BERT-Base.

Evidence Ref: Table VIII; abstract
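The headline figures for this finding follow directly from the reported numbers. A quick arithmetic check, using only the values quoted above:

```python
# Values quoted in the finding above.
bert_base_params_m = 110.0   # BERT-Base parameters, in millions
et_star_params_m = 39.6      # ET* parameters, in millions
bert_base_glue = 79.6        # GLUE score, %
et_star_glue = 80.4          # GLUE score, %

size_ratio = bert_base_params_m / et_star_params_m   # about 2.78
glue_delta_pp = et_star_glue - bert_base_glue        # percentage points

print(f"{size_ratio:.1f}x smaller, {glue_delta_pp:+.1f} pp GLUE")
```

Note that the GLUE gain is an absolute difference in percentage points, not a relative 0.8% improvement.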

On the chosen edge device (Apple M1 GPU) ET* reduces latency, energy, and peak power compared to BERT-Base on an A100 server GPU

Numbers: Latency -15.0%; Energy 10.0× lower; Peak power 10.8× lower

Practical Use: Co-design yields large operational savings (battery and thermal) while keeping or slightly improving accuracy, which is useful for mobile real-time apps.

Evidence Ref: Table VIII; abstract
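The finding mixes a percentage (latency) with fold reductions (energy, peak power). To compare them on one scale, a fold reduction can be converted to a percentage reduction; the helper name below is an illustrative choice, not something from the paper.

```python
def fold_to_percent_reduction(fold: float) -> float:
    """Convert an 'N times lower' figure into a percentage reduction.

    A value that is `fold` times lower is 1/fold of the baseline,
    so the reduction is (1 - 1/fold) * 100 percent.
    """
    if fold <= 0:
        raise ValueError("fold must be positive")
    return (1.0 - 1.0 / fold) * 100.0


energy_reduction = fold_to_percent_reduction(10.0)   # energy: 10.0x lower
power_reduction = fold_to_percent_reduction(10.8)    # peak power: 10.8x lower
print(f"energy: {energy_reduction:.1f}%, peak power: {power_reduction:.1f}%")
```

On this scale the energy and peak power savings are each around 90%, compared with the 15% latency reduction.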

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
GLUE score	80.4%	BERT-Base 79.6%	+0.8 pp	GLUE (multi-task)	ET* after EdgeTran + GPTran	Table VIII; Section V-F
Model size (parameters)	39.6M	BERT-Base 110M	2.8× smaller	—	ET* final model	Table VIII; abstract

What To Try In 7 Days

Profile your target device with ProTran (a few hours) to get latency, energy, and peak power surrogates.

Run a constrained search in FlexiBERT 2.0 using EdgeTran to find a model-device pair that fits your latency and power budget.

Apply GPTran grow-and-prune to the converged model to further reduce its size and pretraining loss.
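The constrained-search step can be pictured as filtering model-device pairs by hard budgets and keeping the most accurate survivor. The sketch below assumes surrogate predictions are already available as plain numbers; the `Pair` type, the names, and every value are hypothetical, and the exhaustive filter stands in for the Bayesian search, which is not reproduced here.

```python
from typing import NamedTuple, Optional, Sequence


class Pair(NamedTuple):
    model: str
    device: str
    accuracy: float       # predicted task score, %
    latency_ms: float     # predicted latency per sequence
    peak_power_w: float   # predicted peak power


def best_feasible(pairs: Sequence[Pair],
                  max_latency_ms: float,
                  max_peak_power_w: float) -> Optional[Pair]:
    """Return the most accurate pair within both budgets, or None if none fit."""
    feasible = [p for p in pairs
                if p.latency_ms <= max_latency_ms
                and p.peak_power_w <= max_peak_power_w]
    return max(feasible, key=lambda p: p.accuracy, default=None)


# Hypothetical surrogate predictions for three model-device pairs.
pairs = [
    Pair("model-a", "device-x", 81.0, 40.0, 12.0),
    Pair("model-b", "device-y", 80.2, 18.0, 4.5),
    Pair("model-c", "device-y", 78.9, 9.0, 3.0),
]

choice = best_feasible(pairs, max_latency_ms=20.0, max_peak_power_w=5.0)
print(choice)
```

Here `model-a` is the most accurate but exceeds both budgets, so the filter settles on `model-b`.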

Optimization Features

Infra Optimization

GBDT surrogates for hardware prediction

heteroscedastic neural surrogates for aleatoric uncertainty
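A heteroscedastic surrogate predicts both a mean and an input-dependent variance for each query. A common way to train such a model is the Gaussian negative log-likelihood, sketched below for a single prediction; whether the paper uses exactly this loss is an assumption made here.

```python
import math


def gaussian_nll(y: float, mean: float, log_var: float) -> float:
    """Negative log-likelihood of y under N(mean, exp(log_var)).

    Predicting the log-variance keeps the variance positive.
    The squared error is scaled by the predicted variance, so the model
    can express aleatoric (data) noise instead of being forced to fit it.
    """
    var = math.exp(log_var)
    return 0.5 * (log_var + (y - mean) ** 2 / var + math.log(2.0 * math.pi))


# The same error of 3.0 costs less when the model admits high variance.
confident = gaussian_nll(3.0, 0.0, 0.0)            # variance 1
uncertain = gaussian_nll(3.0, 0.0, math.log(9.0))  # variance 9
print(confident > uncertain)
```

The `log_var` term penalizes claiming high variance everywhere, so the model is pushed to report uncertainty only where the data is actually noisy.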

Model Optimization

heterogeneous transformer layers (multiple attention types)

block-level grow-and-prune (GPTran)

attention-head–level weight transfer (ordered or random projection)
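As a rough illustration of ordered weight transfer, the sketch below copies the leading sub-block of a larger weight matrix into a smaller one. Reading "ordered" as "take the leading rows and columns" is an assumption made here, the function name is hypothetical, and the random-projection variant is not shown.

```python
from typing import List

Matrix = List[List[float]]


def ordered_transfer(src: Matrix, rows: int, cols: int) -> Matrix:
    """Copy the leading rows x cols sub-block of src into a new matrix."""
    if not src or rows > len(src) or cols > len(src[0]):
        raise ValueError("target shape must fit inside the source matrix")
    return [row[:cols] for row in src[:rows]]


# Hypothetical 3x3 source weights transferred into a 2x2 target.
source = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]
target = ordered_transfer(source, rows=2, cols=2)
print(target)
```

Starting a neighboring architecture from transferred weights, instead of a random initialization, is what lets the search avoid a full retrain per candidate.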

System Optimization

multi-framework export: PyTorch, TensorFlow, ONNX, OpenVINO

active-learning sampling to minimize hardware measurements
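Active-learning sampling can be sketched as "measure next whatever the surrogate is least sure about". The selection rule (maximum predictive standard deviation), the function name, and the toy predictions below are assumptions for illustration, not the paper's acquisition strategy.

```python
from typing import Callable, Sequence, Tuple


def next_to_measure(pool: Sequence[str],
                    predict: Callable[[str], Tuple[float, float]]) -> str:
    """Pick the unmeasured configuration with the largest predicted std.

    `predict(config)` is expected to return (mean, std) from the surrogate.
    """
    return max(pool, key=lambda cfg: predict(cfg)[1])


# Hypothetical surrogate output: config -> (predicted latency in ms, std).
toy_predictions = {
    "config-a": (12.0, 0.4),
    "config-b": (7.5, 2.1),
    "config-c": (30.0, 0.9),
}

chosen = next_to_measure(list(toy_predictions), toy_predictions.__getitem__)
print(chosen)
```

After each real hardware measurement the surrogate would be refit and the loop repeated, so measurements concentrate where the predictor is weakest.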

Training Optimization

weight transfer to avoid full retrain

autotuned fine-tuning per model using TPE

Inference Optimization

Accuracy

surrogate models for fast estimation of hardware metrics

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

GLUE (public), BookCorpus (public), Wikipedia (public), OpenWebText (public), CC-News (public)

Risks & Boundaries

Limitations

FlexiBERT 2.0 pretraining and surrogate construction are expensive (reported ~100 GPU-days).

Surrogates assume fixed batch size; changing batch sizes needs retraining of predictors.

When Not To Use

When you cannot afford initial profiling and surrogate training (no GPU budget).

When the target device runs multiple dynamic workloads whose interference you cannot pre-profile.

Failure Modes

Surrogate model mismatch: predictors may mis-rank architectures on unseen device behaviors.

Weight transfer may underperform for large topology changes, requiring longer fine-tuning.
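One way to watch for the surrogate mis-ranking failure mode is to measure a small held-out set of architectures on the real device and compare the orderings. The sketch below computes Kendall's tau (the tau-a variant, with tied pairs counted as neither concordant nor discordant); the sample values are made up, and any acceptance threshold is left to the reader.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Kendall rank correlation (tau-a) between two equally long sequences.

    +1 means the surrogate ranks every pair as the hardware does,
    -1 means every pair is ranked in the opposite order.
    """
    n = len(predicted)
    if n != len(measured) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (predicted[i] - predicted[j]) * (measured[i] - measured[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) // 2)


# Hypothetical latencies (ms): surrogate predictions vs device measurements.
predicted = [1.0, 2.0, 3.0, 4.0]
measured = [1.1, 3.2, 2.9, 4.5]
tau = kendall_tau(predicted, measured)
print(round(tau, 3))
```

A low tau on a device or workload the surrogate was not profiled on is a sign that the predictor needs re-profiling before its rankings are trusted.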

Core Entities

Models

FlexiBERT 2.0, ProTran, EdgeTran (BOSHCODE), GPTran, ET (ET*), BOSHNAS

Metrics

GLUE score, Latency (ms/seq), Energy (J/seq), Peak power (W), Pre-training MLM loss, Surrogate MSE

Datasets

GLUE, BookCorpus, Wikipedia (English), OpenWebText, CC-News, SST-2

Benchmarks

GLUE

Context Entities

Models

BERT-Base, BERT-Tiny, HAT, AutoTinyBERT

Metrics

FLOPs, Parameter count, Throughput

Datasets

MNLI, MRPC, RTE, STS-B, QQP, QNLI, CoLA, WNLI

Benchmarks

previous HW-NAS approaches (HAT)

