Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
3
Why It Matters For Business
Co-designing the model and the device cuts operational energy and peak power by roughly an order of magnitude while keeping or slightly improving accuracy. For on-device NLP, that means less battery drain, more thermal headroom, and lower cloud costs.
Summary TLDR
This paper presents a practical pipeline that jointly searches transformer architectures and edge devices to meet accuracy and hardware targets. Key pieces: FlexiBERT 2.0 (a very large, heterogeneous transformer design space), ProTran (profiles latency, energy, and peak power and trains lightweight surrogates), BOSHCODE/EdgeTran (Bayesian co-design search for model-device pairs), and GPTran (block-level grow-and-prune post-processing). On GLUE, the final model (ET*) is 2.8× smaller than BERT-Base and scores 0.8% higher. Running on the selected Apple M1 GPU, it achieves ~15% lower latency, ~10× lower energy, and ~10.8× lower peak power than the server GPU baseline (Nvidia A100).
Problem Statement
Large transformer models are impractical on low-power edge devices because they can exceed latency, energy, and peak power budgets. Existing NAS or pruning work often optimizes only parameters or FLOPs and ignores energy and peak power across diverse mobile hardware. Searching this huge joint space (models × devices) needs fast profiling and surrogate models plus a codesign search that balances accuracy with latency, energy, and peak power.
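The three hardware budgets named above can all be derived from one sampled power trace. The following is a minimal sketch, not ProTran's measurement code: the function name `summarize_power_trace`, its parameters, and the sample values are hypothetical, and it assumes power is sampled at a fixed interval while a known number of sequences is processed.

```python
from typing import Dict, Sequence


def summarize_power_trace(
    power_w: Sequence[float], dt_s: float, n_seqs: int
) -> Dict[str, float]:
    """Derive per-sequence latency, per-sequence energy, and peak power.

    power_w: power samples in watts taken at a fixed interval (hypothetical trace).
    dt_s:    sampling interval in seconds.
    n_seqs:  number of input sequences processed during the trace.
    """
    if not power_w or dt_s <= 0 or n_seqs <= 0:
        raise ValueError("need a non-empty trace, positive dt_s, positive n_seqs")
    duration_s = len(power_w) * dt_s
    # Rectangle-rule integral of power over time gives energy in joules.
    energy_j = sum(power_w) * dt_s
    return {
        "latency_ms_per_seq": 1000.0 * duration_s / n_seqs,
        "energy_j_per_seq": energy_j / n_seqs,
        "peak_power_w": max(power_w),
    }


# Made-up samples for illustration, not measurements from the paper.
trace = [4.0, 6.0, 10.0, 8.0]
metrics = summarize_power_trace(trace, dt_s=0.5, n_seqs=2)
print(metrics)
```

Energy and peak power are kept separate because a model can have low average draw yet still spike past a device's power limit.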
Main Contribution
FlexiBERT 2.0: an expanded heterogeneous transformer design space (reported size ~1.7×10^88) and faster, finer-grained weight-transfer.
ProTran: an active-learning profiler that measures and models latency, energy, and peak power across multiple edge platforms and trains surrogates for fast queries.
EdgeTran / BOSHCODE: a hardware-aware co-design search that jointly selects transformer architectures and edge devices using heteroscedastic surrogates and Bayesian optimization.
GPTran: block-level grow-and-prune post-processing that improves pre-training loss and reduces parameters with hardware awareness.
Key Findings
Final co-designed model (ET*) is 2.8× smaller than BERT-Base and improves GLUE by 0.8 percentage points
On the chosen edge device (Apple M1 GPU), ET* achieves ~15% lower latency, ~10× lower energy, and ~10.8× lower peak power than BERT-Base on an A100 server GPU
ProTran surrogate models (GBDT) can reach low error with a few dozen to a few hundred measured models per platform
GPTran reduced pre-training MLM loss by 1.4% while also cutting parameters by 5.9%
Apple M1 SoC (integrated GPU) has far lower energy and peak power than an A100 GPU at similar batch throughput
Results
GLUE score
Model size (parameters)
Latency (ms/seq)
Energy (J/seq)
Peak power (W)
Accuracy
Who Should Care
What To Try In 7 Days
Profile your target device with ProTran (a few hours) to get latency, energy, and peak-power surrogates.
Run a constrained search in FlexiBERT 2.0 using EdgeTran to find a model-device pair for your latency and power budget.
Apply GPTran grow-and-prune to the converged model to squeeze out extra savings in size and pre-training loss.
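The constrained-search step above comes down to choosing the most accurate model-device pair that fits the latency and power budget. The sketch below shows only that selection rule as a brute-force filter; it is not BOSHCODE, which uses Bayesian optimization over surrogates. The `Candidate` class, `pick_pair` function, and all candidate values are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    model: str
    device: str
    glue: float          # predicted or measured GLUE score
    latency_ms: float    # latency per sequence
    peak_power_w: float  # peak power draw


def pick_pair(
    candidates: Iterable[Candidate],
    max_latency_ms: float,
    max_peak_power_w: float,
) -> Optional[Candidate]:
    """Return the highest-GLUE candidate within both budgets, or None."""
    feasible = [
        c for c in candidates
        if c.latency_ms <= max_latency_ms and c.peak_power_w <= max_peak_power_w
    ]
    return max(feasible, key=lambda c: c.glue, default=None)


# Illustrative numbers only; none of these come from the paper.
pool = [
    Candidate("model-a", "device-x", glue=79.0, latency_ms=12.0, peak_power_w=40.0),
    Candidate("model-b", "device-y", glue=77.5, latency_ms=9.0, peak_power_w=6.0),
    Candidate("model-c", "device-y", glue=78.2, latency_ms=15.0, peak_power_w=5.0),
    Candidate("model-d", "device-z", glue=76.0, latency_ms=8.0, peak_power_w=3.0),
]
best = pick_pair(pool, max_latency_ms=10.0, max_peak_power_w=8.0)
print(best)
```

In practice the candidate metrics would come from the surrogates rather than from exhaustive measurement, which is what makes searching the joint space tractable.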
Optimization Features
Infra Optimization
- GBDT surrogates for hardware prediction
- heteroscedastic neural surrogates for aleatoric uncertainty
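A heteroscedastic surrogate predicts both a mean and an input-dependent variance, which is how it captures aleatoric (measurement) noise. This summary does not state the paper's training loss. A common choice, given here as an assumption, is the Gaussian negative log-likelihood, where $\mu_\theta$ and $\sigma_\theta^2$ are the network's two outputs and $(x_i, y_i)$ are $N$ measured configurations:

```latex
\mathcal{L}(\theta)
  = \frac{1}{N} \sum_{i=1}^{N}
    \left[
      \frac{\bigl(y_i - \mu_\theta(x_i)\bigr)^2}{2\,\sigma_\theta^2(x_i)}
      + \frac{1}{2} \log \sigma_\theta^2(x_i)
    \right]
```

The first term down-weights errors where the predicted variance is large. The log term penalizes the model for claiming high variance everywhere.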
Model Optimization
- heterogeneous transformer layers (multiple attention types)
- block-level grow-and-prune (GPTran)
- attention-head–level weight transfer (ordered or random projection)
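The weight-transfer item above can be illustrated on a single weight matrix. This is a sketch under assumptions, not the paper's procedure. It reads "ordered" as copying the leading block of a larger matrix into a smaller one. For the random variant it substitutes a simpler random row and column selection, which is not a true random projection. All function names are hypothetical.

```python
import random
from typing import List

Matrix = List[List[float]]


def _check_shape(src: Matrix, rows: int, cols: int) -> None:
    if not src or rows > len(src) or cols > len(src[0]):
        raise ValueError("target shape must fit inside the source matrix")


def ordered_transfer(src: Matrix, rows: int, cols: int) -> Matrix:
    """Copy the leading rows x cols block of the source weights."""
    _check_shape(src, rows, cols)
    return [row[:cols] for row in src[:rows]]


def random_subset_transfer(
    src: Matrix, rows: int, cols: int, seed: int = 0
) -> Matrix:
    """Stand-in for the random variant: keep randomly chosen rows and columns."""
    _check_shape(src, rows, cols)
    rng = random.Random(seed)
    row_idx = sorted(rng.sample(range(len(src)), rows))
    col_idx = sorted(rng.sample(range(len(src[0])), cols))
    return [[src[r][c] for c in col_idx] for r in row_idx]


source = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
small_ordered = ordered_transfer(source, 2, 2)
small_random = random_subset_transfer(source, 2, 2, seed=0)
print(small_ordered)  # [[1.0, 2.0], [4.0, 5.0]]
```

Either way, the smaller model starts from weights that are already trained instead of from a random initialization, which is what lets the search avoid full retraining for each candidate.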
System Optimization
- multi-framework export: PyTorch, TensorFlow, ONNX, OpenVINO
- active-learning sampling to minimize hardware measurements
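Active-learning sampling aims to spend each costly hardware measurement on the configuration the surrogate knows least about. The sketch below uses a deliberately simple stand-in for uncertainty, the distance to the nearest already-measured configuration, rather than the surrogate-based uncertainty a real profiler would use. The function names and the two-number encoding of a configuration are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Config = Tuple[float, ...]


def nearest_distance(x: Config, measured: Sequence[Config]) -> float:
    """Euclidean distance from x to the closest measured configuration."""
    return min(math.dist(x, m) for m in measured)


def next_to_measure(pool: Sequence[Config], measured: Sequence[Config]) -> Config:
    """Pick the unmeasured configuration farthest from everything measured."""
    unmeasured = [x for x in pool if x not in measured]
    if not unmeasured:
        raise ValueError("every configuration in the pool is already measured")
    return max(unmeasured, key=lambda x: nearest_distance(x, measured))


# Hypothetical encoding: (number of layers, hidden size / 128).
pool: List[Config] = [(2.0, 2.0), (4.0, 2.0), (6.0, 4.0), (12.0, 6.0)]
measured: List[Config] = [(2.0, 2.0)]
order: List[Config] = []
for _ in range(2):
    pick = next_to_measure(pool, measured)
    order.append(pick)
    measured.append(pick)  # in practice: run on the device, then refit the surrogate
print(order)
```

Swapping `nearest_distance` for a surrogate's predictive variance turns this loop into uncertainty-driven sampling without changing its structure.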
Training Optimization
- weight transfer to avoid full retrain
- autotuned fine-tuning per model using TPE
Inference Optimization
- surrogate models for fast estimation of hardware metrics
Reproducibility
Data Urls
- GLUE (public)
- BookCorpus (public)
- Wikipedia (public)
- OpenWebText (public)
- CC-News (public)
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- FlexiBERT 2.0 pre-training and surrogate construction are expensive (reported ~100 GPU-days).
- Surrogates assume a fixed batch size; changing the batch size requires retraining the predictors.
- Dynamic device workloads and OS-level scheduling effects are not modeled.
- No public code or release is stated in the paper for immediate reuse.
When Not To Use
- When you cannot afford initial profiling and surrogate training (no GPU budget).
- When the target device runs multiple dynamic workloads whose interference you cannot pre-profile.
- When you require custom low-level scheduling or a specialized accelerator not supported by the export formats.
Failure Modes
- Surrogate model mismatch: predictors may mis-rank architectures on unseen device behaviors.
- Weight transfer may underperform for large topology changes, requiring longer fine-tuning.
- Device drivers or framework scheduling differences can change measured latency/energy from profiled values.
- NPU targets (e.g., Intel NCS) may show low power but orders-of-magnitude higher latency, breaking real-time requirements.
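The surrogate mis-ranking failure mode can be checked directly. Measure a small held-out set of architectures on the device and compare their ordering with the surrogate's predictions. The sketch below computes Kendall's tau (the tau-a variant, which counts tied pairs as neither concordant nor discordant) in plain Python. The function name and the sample latencies are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Rank agreement in [-1, 1]; 1 means the surrogate orders every pair correctly."""
    n = len(predicted)
    if n != len(measured) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (predicted[i] - predicted[j]) * (measured[i] - measured[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) // 2)


# Illustrative latencies in ms; not values from the paper.
predicted_ms = [10.0, 20.0, 30.0, 40.0]
measured_ms = [12.0, 18.0, 45.0, 33.0]
tau = kendall_tau(predicted_ms, measured_ms)
print(round(tau, 3))
```

A tau well below 1 on a new driver or framework version is a signal to re-profile before trusting the search results.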
Core Entities
Models
- FlexiBERT 2.0
- ProTran
- EdgeTran (BOSHCODE)
- GPTran
- ET (ET*)
- BOSHNAS
Metrics
- GLUE score
- Latency (ms/seq)
- Energy (J/seq)
- Peak power (W)
- Pre-training MLM loss
- Surrogate MSE
Datasets
- GLUE
- BookCorpus
- Wikipedia (English)
- OpenWebText
- CC-News
- SST-2
Benchmarks
- GLUE
Context Entities
Models
- BERT-Base
- BERT-Tiny
- HAT
- AutoTinyBERT
Metrics
- FLOPs
- Parameter count
- Throughput
Datasets
- MNLI
- MRPC
- RTE
- STS-B
- QQP
- QNLI
- CoLA
- WNLI
Benchmarks
- previous HW-NAS approaches (HAT)

