Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
Routing tokens around low-impact layers can cut average compute and latency per request, reducing serving costs and allowing larger models for the same budget.
Summary TLDR
The paper profiles modern transformers and finds that many layers make small, token-dependent contributions. It introduces Radial Networks, an architecture in which a small trained router sends each token from layer to layer, allowing per-token layer skipping and layer reuse. Profiling on OPT and ViT models shows that the median residual contribution falls as model size grows (e.g., OPT-125M ~20% vs OPT-66B ~5.9%), which suggests larger models can skip more layers with limited accuracy loss. Radial Networks use a unified global KV cache and can be trained from scratch or distilled from sequential models, with the aim of cutting average depth, compute, and serving costs while keeping capacity high.
Problem Statement
Modern transformers keep getting deeper, but individual layers often contribute little on a per-token basis. Existing methods (early exit, width sparsity) either skip the wrong layers or need special training. The challenge is to exploit per-token, per-layer variability to reduce compute and latency without large accuracy loss and without changing attention key-value semantics.
Main Contribution
Profiling residual blocks in OPT and ViT families to quantify per-token layer importance using a simple residual-ratio proxy.
Empirical trend: larger models have smaller median residual contributions, creating more opportunities to skip layers at runtime.
Radial Networks: a new architecture that routes tokens between layers via a small router MLP, supports layer reuse, and uses a unified global key-value cache.
Two training paths: post-training distillation from sequential networks or joint training of router and layer weights.
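The residual-ratio proxy can be sketched in a few lines. This summary does not give the exact definition, so the version below assumes the ratio of the norm of a block's residual update f(x) to the norm of the block output x + f(x), computed per token. The names `l2_norm` and `residual_ratio` are hypothetical.

```python
import math

def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in v))

def residual_ratio(x, fx, eps=1e-12):
    """Assumed proxy: ||f(x)|| / ||x + f(x)|| for one token at one block.

    x  : hidden state entering the block
    fx : the block's residual update f(x), before it is added to x
    """
    out = [a + b for a, b in zip(x, fx)]
    return l2_norm(fx) / (l2_norm(out) + eps)

# A token whose block update is small relative to its hidden state.
x = [3.0, 4.0]      # incoming hidden state (norm 5)
fx = [0.0, 0.25]    # residual update produced by the block
ratio = residual_ratio(x, fx)
print(f"residual ratio: {ratio:.3f}")
```

A low ratio means the block barely changes the token's representation, which is what makes it a candidate for skipping.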
Key Findings
Per-layer residual contributions shrink as model size grows.
Many tokens do not need the full network depth during generation.
The residual ratio provides a cheap proxy for layer importance, useful for profiling and simulation.
Radial Networks combine token routing with a shared KV cache to preserve attention semantics.
The authors extrapolate that very large models will have even smaller per-layer contributions.
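The routing idea can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the paper's implementation: the router is a bias-free MLP with one ReLU hidden layer, it picks the next layer greedily by argmax, an extra EXIT logit ends the path, and `max_steps` caps the worst-case depth. The weights are hand-set for the demo rather than trained, and all names (`router`, `route_token`, `max_steps`) are hypothetical.

```python
def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

def router(h, w1, w2):
    """Tiny MLP: ReLU hidden layer, then logits over [layer_0, ..., EXIT]."""
    hidden = [max(0.0, z) for z in matvec(w1, h)]
    return matvec(w2, hidden)

def route_token(h, layers, w1, w2, max_steps=8):
    """Apply whichever layer the router picks until it picks EXIT."""
    path = []
    exit_index = len(layers)
    for _ in range(max_steps):              # worst-case depth cap
        logits = router(h, w1, w2)
        choice = max(range(len(logits)), key=logits.__getitem__)
        if choice == exit_index:
            break
        update = layers[choice](h)          # residual update f(h)
        h = [a + b for a, b in zip(h, update)]
        path.append(choice)                 # the same layer may appear twice
    return h, path

# Two toy "layers" that each add a constant residual update.
layers = [lambda h: [1.0, 0.0], lambda h: [0.0, 1.0]]
w1 = [[1.0, 0.0], [0.0, 1.0]]                    # hidden = relu(h)
w2 = [[-1.0, 0.0], [0.5, -1.0], [0.0, 0.2]]      # layer_0, layer_1, EXIT
h_final, path = route_token([0.0, 0.0], layers, w1, w2)
print(path, h_final)
```

Because the router chooses a layer index at every step instead of walking a fixed sequence, skipping and reuse both fall out of the same mechanism.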
Results
residual ratio (median)
dynamic depth (active blocks per token)
skip threshold used in simulations
vision residual ratio (median)
Who Should Care
What To Try In 7 Days
Profile your transformer layers with the residual-ratio proxy on representative inputs to find skip candidates.
Simulate layer skipping with an oracle threshold (e.g., 5%) to estimate compute savings before engineering changes.
Prototype a tiny router MLP that predicts layer importance per token and measure quality vs compute trade-offs.
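The first two steps can be prototyped offline once per-token, per-layer residual ratios have been recorded. The sketch below assumes a table `ratios[t][l]` of such measurements (the values shown are made up for illustration) and a hypothetical helper `simulate_skipping`. It is an oracle: it decides with ratios that are only known after running the layer, so the savings it reports are an upper bound on what a learned router could achieve.

```python
def simulate_skipping(ratios, threshold=0.05):
    """Oracle layer skipping over recorded residual ratios.

    ratios[t][l] is the residual ratio of layer l for token t.
    A layer counts as active for a token when its ratio >= threshold.
    Returns (depth per token, average depth, fraction of layer compute saved).
    """
    depths = [sum(1 for r in token if r >= threshold) for token in ratios]
    n_layers = len(ratios[0])
    avg_depth = sum(depths) / len(depths)
    return depths, avg_depth, 1.0 - avg_depth / n_layers

# Illustrative (made-up) ratios for 3 tokens across 4 layers.
ratios = [
    [0.20, 0.04, 0.06, 0.01],
    [0.30, 0.10, 0.02, 0.03],
    [0.15, 0.03, 0.04, 0.02],
]
depths, avg_depth, saved = simulate_skipping(ratios, threshold=0.05)
print(depths, round(avg_depth, 2), round(saved, 2))
```

Sweeping `threshold` and re-measuring task quality at each setting gives a quality versus compute curve before any changes to the serving stack.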
Agent Features
Memory
- unified global key-value cache (shared KV cache across layers)
Architectures
- token-level routing
- residual-block profiling
Optimization Features
Token Efficiency
- route tokens to fewer layers to lower compute per token
Infra Optimization
- lower average serving compute and latency; supports larger-capacity models for the same cost
Model Optimization
- dynamic layer sparsity (per-token skipping)
- layer reuse via routing
System Optimization
- unified KV cache to handle sparse per-layer activations
Training Optimization
- post-training distillation from sequential models
- joint router + layer co-training
Inference Optimization
- token-level compute variation to reduce average depth
- skip whole-layer compute when residual contribution is low
Reproducibility
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Simulations use oracles with access to future information, so a real router may be less accurate than the simulations suggest.
- Requires changes to inference stack (routing logic and shared KV cache) and more complex caching.
- Overhead from router and worst-case depth caps can reduce or negate savings for some workloads.
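The router-overhead limitation can be made concrete with a back-of-the-envelope cost model. This is an assumption for illustration, not a formula from the paper: every layer costs the same, and the router runs once per executed layer at a cost of `router_overhead` times one layer. The function name `relative_cost` is hypothetical.

```python
def relative_cost(avg_depth, n_layers, router_overhead):
    """Cost of routed inference relative to running all n_layers.

    avg_depth       : average number of layers executed per token
    n_layers        : depth of the sequential baseline
    router_overhead : router cost as a fraction of one layer's cost,
                      assumed to be paid once per executed layer
    """
    return avg_depth * (1.0 + router_overhead) / n_layers

# Half the layers skipped, router costs 5% of a layer: clear savings.
print(relative_cost(16, 32, 0.05))
# Almost nothing skipped: the router overhead outweighs the savings.
print(relative_cost(31, 32, 0.05))
```

Under this model routing pays off only when `avg_depth / n_layers` is below `1 / (1 + router_overhead)`, so workloads where most tokens need nearly full depth can end up slower than the sequential baseline.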
When Not To Use
- Small models where each layer contributes substantially to outputs
- Batch-heavy inference setups where per-token routing is hard to apply
- Hardware lacking fast support for skipping whole-layer compute
Failure Modes
- Router mispredictions that skip required layers and degrade output quality
- Token hotspots where many tokens need deep paths, causing latency spikes
- Shared KV cache management errors that increase memory use or introduce correctness bugs
Core Entities
Models
- OPT-125M
- OPT-350M
- OPT-1.3B
- OPT-2.7B
- OPT-6.7B
- OPT-13B
- OPT-30B
- OPT-66B
- ViT-Base
- ViT-Large
- ViT-Huge
Metrics
- residual ratio
- dynamic depth (active blocks per token)
Datasets
- WikiText-2
- COCO
Context Entities
Models
- MoE

