Radial Networks: token-level routing that skips whole layers to cut compute and latency

April 7, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The paper presents profiling evidence and an architecture proposal; practical gains still need engineering validation in a real serving stack with router and cache changes.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.75

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Jordan Dotzel, Yash Akhauri, Ahmed S. AbouElhamayed, Carly Jiang, Mohamed Abdelfattah, Zhiru Zhang


Why It Matters For Business

Routing tokens around low-impact layers can cut average compute and latency per request, reducing serving costs and allowing larger models for the same budget.

Summary TLDR

The paper profiles modern transformers and finds that many layers make only small, token-dependent contributions. It introduces Radial Networks, an architecture in which a small trained router sends each token between layers, allowing per-token layer skipping and layer reuse. Profiling on OPT and ViT models shows that the median residual contribution falls as model size grows (e.g., OPT-125M ~20% vs OPT-66B ~5.9%), which suggests that large models can skip many layers with little quality loss. Radial Networks add a unified global KV cache and can be trained from scratch or distilled from sequential models to cut average depth, compute, and serving costs while keeping capacity high.

Problem Statement

Modern transformers are getting deeper, but individual layers often contribute little on a per-token basis. Existing methods (early-exit, width sparsity) either skip the wrong layers or need special training. The challenge is to exploit per-token, per-layer variability to reduce compute and latency without a large accuracy loss and without changing attention key-value semantics.

Main Contribution

Profiling residual blocks in OPT and ViT families to quantify per-token layer importance using a simple residual-ratio proxy.

Empirical trend: larger models have smaller median residual contributions, creating more opportunities to skip layers at runtime.
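The residual-ratio proxy can be illustrated with a minimal sketch. It assumes the ratio is defined as the norm of a block's output f(x) divided by the norm of the combined output x + f(x); this summary does not state the exact formula, so check that definition against the paper. The function names and the `traces` layout are hypothetical.

```python
import math
import statistics

def l2_norm(vec):
    """Euclidean norm of a plain list of floats."""
    return math.sqrt(sum(v * v for v in vec))

def residual_ratio(block_input, block_output):
    """Assumed proxy: ||f(x)|| / ||x + f(x)|| for one token at one block.

    block_input  : x, the hidden state entering the residual block
    block_output : f(x), the block output before the skip connection is added
    """
    combined = [x + f for x, f in zip(block_input, block_output)]
    denom = l2_norm(combined)
    return l2_norm(block_output) / denom if denom > 0.0 else 0.0

def median_ratio_per_block(traces):
    """traces[block][token] = (x, f(x)); returns one median ratio per block."""
    return [
        statistics.median(residual_ratio(x, fx) for x, fx in block_traces)
        for block_traces in traces
    ]
```

In practice the (x, f(x)) pairs would be captured from a real model on representative inputs; the sketch only covers the arithmetic applied to them.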

Key Findings

Per-layer residual contributions shrink as model size grows.

Numbers: OPT-125M median residual ratio ≈ 20%; OPT-66B ≈ 5.9%

Practical Use: Large LLMs are good targets for layer skipping; expect more safe skips in bigger models.

Evidence Ref: Section 4.2, Fig. 6

Many tokens do not need the full network depth during generation.

Numbers: For OPT-13B, most tokens needed ~40-70 blocks vs the full 80 blocks

Practical Use: Token-level routing can reduce the average number of active layers per token and cut compute and latency.

Evidence Ref: Section 4.3, Fig. 9
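The dynamic-depth finding can be approximated offline with an oracle threshold. The sketch below assumes a precomputed table of residual ratios, one row per token and one column per block, and counts a block as active when its ratio meets the threshold. The function names and the 5% default are illustrative assumptions. Because the ratio is only known after the block has run, this is an upper bound on savings, not a deployable policy.

```python
def active_blocks(ratios_for_token, threshold=0.05):
    """Number of blocks whose residual ratio is at or above the threshold."""
    return sum(1 for r in ratios_for_token if r >= threshold)

def simulate_skipping(ratio_table, threshold=0.05):
    """ratio_table[token][block] -> per-token depths and aggregate savings.

    Oracle simulation: each ratio was measured by running the block, so the
    reported savings are an optimistic estimate.
    """
    total_blocks = len(ratio_table[0])
    depths = [active_blocks(row, threshold) for row in ratio_table]
    mean_depth = sum(depths) / len(depths)
    return {
        "depths": depths,
        "mean_depth": mean_depth,
        "compute_saved": 1.0 - mean_depth / total_blocks,
    }
```

Sweeping the threshold over a few values gives a rough savings curve to compare against measured quality loss.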

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
residual ratio (median) | OPT-125M ≈ 20%; OPT-66B ≈ 5.9% | - | - | WikiText-2 tokens, seq len 256 | Section 4.2, Fig. 6 | Fig. 6
dynamic depth (active blocks per token) | most tokens use ~40-70 blocks | full depth, 80 blocks | up to ~50% fewer active layers for some tokens | OPT-13B on WikiText-2, seq len 256 | Section 4.3, Fig. 9 | Fig. 9

What To Try In 7 Days

Profile your transformer layers with the residual-ratio proxy on representative inputs to find skip candidates.

Simulate layer skipping with an oracle threshold (e.g., 5%) to estimate compute savings before engineering changes.

Prototype a tiny router MLP that predicts layer importance per token and measure quality vs compute trade-offs.
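The router prototype from the last step could start from something like the sketch below. It is a hypothetical two-layer MLP that maps a token's hidden state to one keep-score per block, plus a report that compares router decisions with oracle decisions. The class name, the per-block scoring scheme, and the random untrained weights are all assumptions made for illustration; the paper's router may be structured differently, and training is not shown.

```python
import math
import random

class TinyRouter:
    """Hypothetical router: token hidden state -> keep-score per block."""

    def __init__(self, d_model, d_hidden, n_blocks, seed=0):
        rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
        s1 = 1.0 / math.sqrt(d_model)
        s2 = 1.0 / math.sqrt(d_hidden)
        self.w1 = [[rng.gauss(0.0, s1) for _ in range(d_model)] for _ in range(d_hidden)]
        self.b1 = [0.0] * d_hidden
        self.w2 = [[rng.gauss(0.0, s2) for _ in range(d_hidden)] for _ in range(n_blocks)]
        self.b2 = [0.0] * n_blocks

    def scores(self, token_state):
        """ReLU hidden layer followed by a sigmoid score per block."""
        hidden = [
            max(0.0, sum(w * x for w, x in zip(row, token_state)) + b)
            for row, b in zip(self.w1, self.b1)
        ]
        logits = [
            sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(self.w2, self.b2)
        ]
        return [1.0 / (1.0 + math.exp(-z)) for z in logits]

    def keep_mask(self, token_state, threshold=0.5):
        """True where the block should run for this token."""
        return [s >= threshold for s in self.scores(token_state)]

def routing_report(router_masks, oracle_masks):
    """Compare router and oracle keep-masks, both indexed [token][block]."""
    total = kept = false_skips = 0
    for r_row, o_row in zip(router_masks, oracle_masks):
        for r_keep, o_keep in zip(r_row, o_row):
            total += 1
            kept += int(r_keep)
            false_skips += int(o_keep and not r_keep)
    return {
        "compute_fraction": kept / total,      # share of blocks still executed
        "false_skip_rate": false_skips / total,  # skipped although oracle kept
    }
```

The false-skip rate is only a stand-in for quality; a real evaluation would measure perplexity or task accuracy with the router in the loop.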

Agent Features

Memory
unified global key-value cache (shared KV cache across layers)
Architectures
token-level routing; residual-block profiling

Optimization Features

Token Efficiency
route tokens to fewer layers to lower compute per token
Infra Optimization
lower average serving compute and latency; supports larger-capacity models for same cost
Model Optimization
dynamic layer sparsity (per-token skipping); layer reuse via routing
System Optimization
unified KV cache to handle sparse per-layer activations
Training Optimization
post-training distillation from sequential models; joint router + layer co-training
Inference Optimization
token-level compute variation to reduce average depth; skip whole-layer compute when residual contribution is low

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Simulations use oracle routing with access to future information, so a real router's accuracy may be lower than the simulations suggest.

Requires changes to inference stack (routing logic and shared KV cache) and more complex caching.

When Not To Use

Small models where each layer contributes substantially to outputs

Batch-heavy inference setups where per-token routing is hard to apply
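One way to see why batching erodes the savings is the simplified model below. It assumes a naive lockstep implementation in which a block runs for the whole batch whenever any token in the batch needs it; that execution model is an assumption for illustration, not something the paper or this summary specifies, and the function names are hypothetical.

```python
def per_token_depths(keep_masks):
    """keep_masks[token][block] is True when that token needs the block."""
    return [sum(1 for keep in mask if keep) for mask in keep_masks]

def lockstep_batch_depth(keep_masks):
    """Blocks executed if a block runs whenever any token in the batch needs it."""
    n_blocks = len(keep_masks[0])
    return sum(
        1 for b in range(n_blocks) if any(mask[b] for mask in keep_masks)
    )
```

With two tokens that each need two of four blocks but disagree on one of them, the lockstep batch still executes three blocks, so the realized saving is smaller than the per-token average suggests.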

Failure Modes

Router mispredictions that skip required layers and degrade output quality

Token hotspots where many tokens need deep paths, causing latency spikes

Core Entities

Models

OPT-125M, OPT-350M, OPT-1.3B, OPT-2.7B, OPT-6.7B, OPT-13B, OPT-30B, OPT-66B, ViT-Base, ViT-Large, ViT-Huge

Metrics

residual ratio; dynamic depth (active blocks per token)

Datasets

WikiText-2, COCO

Context Entities

Models

MoE