Overview
The paper presents profiling evidence and an architecture proposal; practical gains still need engineering validation in a real serving stack, including router and cache changes.
Citations: 0
Evidence Strength: 0.60
Confidence: 0.75
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/4
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
Routing tokens around low-impact layers can cut average compute and latency per request, reducing serving costs and allowing larger models for the same budget.
Who Should Care
Summary TLDR
The paper profiles modern transformers and finds many layers make tiny, token-dependent contributions. It introduces Radial Networks: a new architecture that routes each token between layers with a small trained router, allowing per-token skipping and layer reuse. Profiling on OPT and ViT models shows median residual contributions fall with model size (e.g., OPT-125M ~20% vs OPT-66B ~5.9%), so large models can safely skip many layers. Radial Networks add a unified global KV cache and can be trained from scratch or distilled from sequential models to cut average depth, compute, and serving costs while keeping capacity high.
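The per-token routing idea can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the paper's implementation: `route_token`, the greedy argmax decision, the tanh toy layers, and the extra "exit" column in the router are all hypothetical choices made for illustration, and the global KV cache and router training are not modeled.

```python
import numpy as np

def route_token(h, layers, router_w, max_steps):
    """Send one token's hidden state through layers picked by a router.

    h        : (hidden,) hidden state of a single token
    layers   : list of (hidden, hidden) weight matrices, one per toy layer
    router_w : (hidden, len(layers) + 1) router weights; last column = exit
    Returns the final hidden state and the list of layer indices visited.
    """
    exit_index = len(layers)
    path = []
    for _ in range(max_steps):
        choice = int(np.argmax(h @ router_w))  # greedy routing decision (assumption)
        if choice == exit_index:
            break                              # router chose to stop early
        h = h + np.tanh(h @ layers[choice])    # residual update by the chosen layer
        path.append(choice)                    # a layer may appear more than once
    return h, path

# Synthetic, untrained weights; real routers would be trained or distilled.
rng = np.random.default_rng(0)
hidden, n_layers = 16, 4
layers = [0.1 * rng.normal(size=(hidden, hidden)) for _ in range(n_layers)]
router_w = rng.normal(size=(hidden, n_layers + 1))
h_out, path = route_token(rng.normal(size=hidden), layers, router_w, max_steps=8)
print("layers visited:", path)
```

Because the router picks the next layer at every step, the same loop covers both skipping (a layer never chosen) and reuse (a layer chosen more than once). The `max_steps` cap is an added safeguard so the toy loop always terminates.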
Problem Statement
Modern transformers are getting deeper, but individual layers often contribute little on a per-token basis. Existing methods (early-exit, width sparsity) either skip the wrong layers or need special training. The challenge is to exploit per-token, per-layer variability to reduce compute and latency without large accuracy loss and without changing attention key-value semantics.
Main Contribution
Profiling residual blocks in OPT and ViT families to quantify per-token layer importance using a simple residual-ratio proxy.
Empirical trend: larger models have smaller median residual contributions, creating more opportunities to skip layers at runtime.
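A residual-ratio proxy of this kind can be sketched as follows. The exact formula is an assumption here: the sketch takes the ratio of the branch output norm to the block output norm, ||f(x)|| / ||x + f(x)||, computed per token. The function name `residual_ratio` and the synthetic inputs are hypothetical; in practice `x` and `fx` would be captured from a real model's residual blocks.

```python
import numpy as np

def residual_ratio(x, fx, eps=1e-12):
    """Per-token ratio ||f(x)|| / ||x + f(x)|| for one residual block.

    x  : (tokens, hidden) input to the block (the skip path)
    fx : (tokens, hidden) output of the block's attention or MLP branch
    """
    out = x + fx
    # eps guards against division by zero when the block output is all zeros
    return np.linalg.norm(fx, axis=-1) / (np.linalg.norm(out, axis=-1) + eps)

# Synthetic stand-ins for hidden states, not real model activations.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))
fx = 0.05 * rng.normal(size=(256, 64))  # a branch that barely changes x
ratios = residual_ratio(x, fx)
median_ratio = float(np.median(ratios))
print(f"median residual ratio: {median_ratio:.3f}")
```

A low median ratio for a block suggests its branch changes the hidden state only slightly for most tokens, which is what makes it a skip candidate.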
Key Findings
Per-layer residual contributions shrink as model size grows.
Many tokens do not need the full network depth during generation.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| residual ratio (median) | OPT-125M ≈ 20%; OPT-66B ≈ 5.9% | — | — | WikiText-2 tokens, seq len 256 | Section 4.2, Fig.6 | Fig.6 |
| dynamic depth (active blocks per token) | most tokens use ~40–70 blocks vs full 80 | full depth 80 blocks | up to ~50% fewer active blocks for some tokens | OPT-13B on WikiText-2, seq len 256 | Section 4.3, Fig.9 | Fig.9 |
What To Try In 7 Days
Profile your transformer layers with the residual-ratio proxy on representative inputs to find skip candidates.
Simulate layer skipping with an oracle threshold (e.g., 5%) to estimate compute savings before engineering changes.
Prototype a tiny router MLP that predicts layer importance per token and measure quality vs compute trade-offs.
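The oracle-threshold simulation in the second step can be sketched as below. This is a rough estimate under stated assumptions: `simulate_oracle_skipping` is a hypothetical helper, the ratio matrix is synthetic, and every block is counted as equal cost, which ignores the difference between attention and MLP blocks.

```python
import numpy as np

def simulate_oracle_skipping(ratios, threshold=0.05):
    """Estimate savings if every block below the threshold were skipped.

    ratios    : (tokens, blocks) residual ratios measured on a full forward pass
    threshold : blocks with ratio < threshold count as skipped
    Returns the active-block count per token and the fraction of
    block evaluations saved (equal cost per block is an assumption).
    """
    active = ratios >= threshold
    active_per_token = active.sum(axis=1)
    saved = 1.0 - active.mean()
    return active_per_token, float(saved)

# Synthetic ratios for 256 tokens and 80 blocks; replace with profiled values.
rng = np.random.default_rng(0)
ratios = rng.uniform(0.0, 0.12, size=(256, 80))
active_per_token, saved = simulate_oracle_skipping(ratios, threshold=0.05)
print(f"mean active blocks: {active_per_token.mean():.1f} of {ratios.shape[1]}")
print(f"block evaluations saved: {saved:.1%}")
```

The ratios are computed from block outputs that a real router would not have at decision time, so the result is best read as an upper bound on what a trained router could achieve.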
Agent Features
Memory
Architectures
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Simulations use oracles with access to future information; a trained router's accuracy may be lower than the simulations suggest.
Requires changes to inference stack (routing logic and shared KV cache) and more complex caching.
When Not To Use
Small models where each layer contributes substantially to outputs
Batch-heavy inference setups where per-token routing is hard to apply
Failure Modes
Router mispredictions that skip required layers and degrade output quality
Token hotspots where many tokens need deep paths, causing latency spikes

