Overview
The method converts pretrained transformers into adaptive-width variants and includes an optimized GPU implementation. It is ready for prototyping but needs per-task tuning and validation under distribution shift.
Citations: 2
Evidence Strength: 0.70
Confidence: 0.78
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Adaptive Computation Modules (ACMs) let you reduce inference compute and GPU latency while retaining model accuracy, enabling cheaper, faster deployment of pretrained transformers in latency- or energy-constrained settings.
Summary TLDR
The paper introduces Adaptive Computation Modules (ACMs): replace selected transformer blocks with a small ordered set of lightweight submodules (“learners”) and a per-token gate that selects how many learners to run. ACMs let the model spend less compute on easy tokens and more on hard ones. Converted ViT and Wav2Vec models show lower FLOPs and wall-clock latency for a range of user budgets while keeping accuracy roughly the same. The authors provide a three-phase conversion (distill learners, pretrain gates, end-to-end finetune) and an optimized Triton GPU implementation.
Problem Statement
Transformer layers typically apply their full width (all parameters) to every input token. Many tokens do not need the full layer capacity, so the model wastes compute on them. The paper asks: can we adapt width per token to cut inference cost while keeping accuracy?
Main Contribution
Adaptive Computation Module (ACM): an ordered set of learners plus a small gating net that selects how many learners to run per token.
A conversion recipe to turn pretrained transformers into ACMized variants: module-wise distillation, gate pretraining with artificial labels, then end-to-end finetuning.
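The module described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the helper names (`make_learner`, `make_gate`, `acm_forward`), the tiny dimensions, the random weights, the argmax gate, and the choice to sum the outputs of the selected learners are all assumptions made for the example. The one property taken from the description is that the gate picks a count k per token and only the first k learners of the ordered set are run.

```python
import random

def make_learner(dim, hidden, rng):
    """Return a tiny two-layer MLP (dim -> hidden -> dim) as a closure (hypothetical learner)."""
    w1 = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(hidden)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(dim)]

    def learner(x):
        # ReLU hidden layer followed by a linear projection back to dim.
        h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
        return [sum(w * hi for w, hi in zip(row, h)) for row in w2]

    return learner

def make_gate(dim, n_learners, rng):
    """Return a linear gate mapping a token to a learner count in 1..n_learners (assumed form)."""
    w = [[rng.gauss(0, 1.0) for _ in range(dim)] for _ in range(n_learners)]

    def gate(x):
        scores = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
        return scores.index(max(scores)) + 1  # argmax over "how many learners"

    return gate

def acm_forward(tokens, learners, gate):
    """Run the first k learners on each token, with k chosen per token by the gate."""
    outputs, counts = [], []
    for x in tokens:
        k = gate(x)
        y = [0.0] * len(x)
        for learner in learners[:k]:  # ordered set: always a prefix
            y = [a + b for a, b in zip(y, learner(x))]  # assumption: outputs are summed
        outputs.append(y)
        counts.append(k)
    return outputs, counts

rng = random.Random(0)
DIM, HIDDEN, N = 8, 4, 4
learners = [make_learner(DIM, HIDDEN, rng) for _ in range(N)]
gate = make_gate(DIM, N, rng)
tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(6)]
outputs, counts = acm_forward(tokens, learners, gate)
print(counts)  # one learner count per token
```

Because the learners are ordered and only a prefix is ever run, the per-token cost grows linearly with k, which is what makes the gate's decision a direct compute budget.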
Key Findings
ACMized ViT-B lies on the Pareto frontier of FLOPs vs. accuracy on ImageNet-1k.
On CommonVoice-es speech recognition, ACMized Wav2Vec models achieve a lower word error rate than MoEfication at every tested compute budget.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | ACMized ViT-B on Pareto frontier; favorable across budgets, especially <12.5 GFLOPs | Original ViT-B and other conditional compute methods (A-ViT, MoEfication, Zero Time Waste) | Better accuracy-vs-FLOPs trade-off under low-FLOPs targets (see Fig.3) | ImageNet-1k validation | Figure 3 | Figure 3 |
| Speech recognition WER vs compute | ACMized Wav2Vec achieves lower WER at every tested computational budget | MoEfication | Lower WER across budgets (see Fig.4) | CommonVoice (es) validation | Figure 4 | Figure 4 |
What To Try In 7 Days
Convert one MLP block of a small ViT to an ACM (N=4 learners) and run the authors' three-phase conversion (distillation, gate pretraining, finetuning) for a few epochs.
Measure average FLOPs and wall-clock latency on your A100 (or target GPU) and compare to baseline.
If you use Triton, implement the gated forward pass and test latency/sorting gains on batched inputs.
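For the measurement steps above, a small bookkeeping sketch may help before any GPU code is written. It assumes each token's gate decision is available as an integer count between 1 and N, that all learners cost the same, and that "sorting" means ordering tokens by their learner count so each learner runs on one contiguous slice. The function names are hypothetical, and the timing helper is a generic CPU-side harness rather than anything from the paper.

```python
import statistics
import time

def average_learner_fraction(counts, n_learners):
    """Fraction of learner evaluations actually run, relative to running all of them."""
    return sum(counts) / (len(counts) * n_learners)

def sort_tokens_by_count(counts):
    """Token indices ordered from most to fewest learners (stable for ties)."""
    return sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)

def tokens_per_learner(counts, n_learners):
    """Learner j (0-based) is needed by every token whose count exceeds j."""
    return [sum(1 for k in counts if k > j) for j in range(n_learners)]

def median_latency(fn, repeats=5):
    """Median wall-clock time of fn() over several runs, in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

counts = [1, 4, 2, 2, 1, 3, 1, 1]  # made-up gate decisions for 8 tokens
N = 4
print(average_learner_fraction(counts, N))  # 0.46875
print(sort_tokens_by_count(counts))         # [1, 5, 2, 3, 0, 4, 6, 7]
print(tokens_per_learner(counts, N))        # [8, 4, 2, 1]
```

After sorting, learner j only has to process the first `tokens_per_learner(counts, N)[j]` tokens of the sorted order, so each learner sees one dense batch. When timing on a GPU, kernels launch asynchronously, so the device needs to be synchronized before each timer read or the measured latency will be too low.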
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Risks & Boundaries
Limitations
Requires a three-phase conversion and finetuning; finetuning from a random initialization converges more slowly.
Performance and routing are sensitive to distribution shift (ImageNet-C showed gating changes and accuracy drops).
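One way to watch for the routing sensitivity noted above is to compare how the gates distribute learner counts on clean versus shifted inputs. The sketch below is an assumption-laden illustration: the two count lists are made-up data, the function names are hypothetical, and total variation distance is simply one convenient drift measure, not a metric taken from the paper.

```python
from collections import Counter

def count_distribution(counts, n_learners):
    """Empirical probability of each learner count 1..n_learners."""
    hist = Counter(counts)
    total = len(counts)
    return [hist.get(k, 0) / total for k in range(1, n_learners + 1)]

def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

N = 4
clean_counts = [1, 1, 2, 2, 2, 3, 3, 4]    # made-up gate decisions on in-distribution data
shifted_counts = [1, 4, 4, 4, 4, 4, 3, 4]  # made-up gate decisions on corrupted data

p = count_distribution(clean_counts, N)
q = count_distribution(shifted_counts, N)
print(total_variation(p, q))  # 0.625
```

A large distance on a held-out shifted set is a cue to re-validate accuracy and compute budgets before deployment, since the average cost per token moves with the gate distribution.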
When Not To Use
When you cannot retrain or finetune the model at all.
Small models where gating overhead outweighs savings.
Failure Modes
Gate collapse: gates choose the same number of learners for all tokens, nullifying adaptivity.
Severe domain shift causes gates to select extreme budgets and degrade accuracy.
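Gate collapse can be checked cheaply from logged gate decisions. The following sketch assumes the per-token learner counts are available as integers and uses normalized Shannon entropy as the indicator. The function names and the 0.1 threshold are illustrative choices, not values from the paper.

```python
import math
from collections import Counter

def gate_entropy(counts, n_learners):
    """Normalized Shannon entropy of the chosen counts: 0 = collapsed, 1 = uniform.

    Requires n_learners > 1 so that the normalizer log(n_learners) is nonzero.
    """
    total = len(counts)
    probs = [c / total for c in Counter(counts).values()]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(n_learners)

def is_collapsed(counts, n_learners, threshold=0.1):
    """Flag a gate whose decisions are (almost) identical for every token."""
    return gate_entropy(counts, n_learners) < threshold

N = 4
print(is_collapsed([3, 3, 3, 3, 3, 3], N))        # True: every token gets 3 learners
print(is_collapsed([1, 2, 3, 4, 1, 2, 3, 4], N))  # False: counts are spread evenly
```

Tracking this value per converted block during gate pretraining and finetuning makes it easier to see which gate stopped adapting and when.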

