Radial Networks: token-level routing that skips whole layers to cut compute and latency

Overview

Decision Snapshot: Needs Validation

The paper presents profiling evidence and an architecture proposal; the practical gains still need engineering validation in a real serving stack, including router and cache changes.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.75

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Jordan Dotzel, Yash Akhauri, Ahmed S. AbouElhamayed, Carly Jiang, Mohamed Abdelfattah, Zhiru Zhang

Links

Abstract / PDF

Why It Matters For Business

Routing tokens around low-impact layers can cut average compute and latency per request, reducing serving costs and allowing larger models for the same budget.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead, Data Scientist

Summary TLDR

The paper profiles modern transformers and finds that many layers make tiny, token-dependent contributions. It introduces Radial Networks, an architecture that routes each token between layers with a small trained router, allowing per-token layer skipping and layer reuse. Profiling on OPT and ViT models shows that median residual contributions fall with model size (e.g., OPT-125M ~20% vs OPT-66B ~5.9%), which suggests that larger models leave more room to skip layers. Radial Networks add a unified global KV cache and can be trained from scratch or distilled from sequential models to cut average depth, compute, and serving costs while keeping capacity high.

Problem Statement

Modern transformers are getting deeper, but individual layers often contribute little on a per-token basis. Existing methods (early exit, width sparsity) either skip the wrong layers or need special training. The challenge is to exploit per-token, per-layer variability to reduce compute and latency without a large accuracy loss and without changing attention key-value semantics.

Main Contribution

Profiling residual blocks in OPT and ViT families to quantify per-token layer importance using a simple residual-ratio proxy.

Empirical trend: larger models have smaller median residual contributions, creating more opportunities to skip layers at runtime.
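The residual-ratio proxy mentioned above can be illustrated with a minimal sketch. The summary does not spell out the formula, so the definition used here (norm of the block's contribution divided by the norm of the block's output, per token) is an assumption, as are the function name `residual_ratio` and the synthetic activations.

```python
import numpy as np

def residual_ratio(block_input, block_residual, eps=1e-12):
    """Per-token ratio ||f(x)|| / ||x + f(x)|| for one residual block.

    ASSUMED definition; check it against the paper before relying on it.
    block_input:    [tokens, hidden] residual stream x entering the block
    block_residual: [tokens, hidden] the block's contribution f(x)
    """
    block_output = block_input + block_residual
    num = np.linalg.norm(block_residual, axis=-1)
    den = np.linalg.norm(block_output, axis=-1)
    return num / np.maximum(den, eps)  # eps guards against a zero output

# Synthetic stand-ins for captured activations (not real OPT or ViT data).
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))                # 256 tokens, hidden size 64
f_small = 0.05 * rng.normal(size=(256, 64))   # block that barely changes x
f_large = 1.00 * rng.normal(size=(256, 64))   # block that changes x a lot

ratio_small = residual_ratio(x, f_small)
ratio_large = residual_ratio(x, f_large)
print("median ratio, low-impact block: ", float(np.median(ratio_small)))
print("median ratio, high-impact block:", float(np.median(ratio_large)))
```

In a real profile, `block_input` and `block_residual` would be captured from each attention and feed-forward block on representative inputs, and the per-block medians compared across model sizes.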

Key Findings

Per-layer residual contributions shrink as model size grows.

Numbers: OPT-125M median residual ratio ≈ 20%; OPT-66B ≈ 5.9%

Practical Use: Large LLMs are good targets for layer skipping; expect more safe skips in bigger models.

Evidence Ref: Section 4.2, Fig.6

Many tokens do not need the full network depth during generation.

Numbers: For OPT-13B, most tokens needed ~40–70 blocks vs the full 80 blocks

Practical Use: Token-level routing can reduce average active layers per token and cut compute and latency.

Evidence Ref: Section 4.3, Fig.9

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
residual ratio (median)	OPT-125M ≈ 20%; OPT-66B ≈ 5.9%	—	—	WikiText-2 tokens, seq len 256	Section 4.2, Fig.6	Fig.6
dynamic depth (active blocks per token)	most tokens use ~40–70 blocks vs full 80	full depth 80 blocks	up to ~50% fewer active blocks for some tokens	OPT-13B on WikiText-2, seq len 256	Section 4.3, Fig.9	Fig.9
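As a worked check of the dynamic-depth row, the relative reduction in active blocks for a token that uses d of the 80 blocks follows directly from the reported range. This counts every block as equal cost, which is a simplifying assumption, since attention and feed-forward blocks differ in cost.

```latex
\text{reduction}(d) = 1 - \frac{d}{80}, \qquad
\text{reduction}(70) = 1 - \frac{70}{80} = 12.5\%, \qquad
\text{reduction}(40) = 1 - \frac{40}{80} = 50\%
```

So the reported range of roughly 40 to 70 active blocks corresponds to between 12.5% and 50% fewer blocks per token.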

What To Try In 7 Days

Profile your transformer layers with the residual-ratio proxy on representative inputs to find skip candidates.

Simulate layer skipping with an oracle threshold (e.g., 5%) to estimate compute savings before engineering changes.

Prototype a tiny router MLP that predicts layer importance per token and measure quality vs compute trade-offs.
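The oracle-threshold simulation in the second step can be sketched as follows. This is a minimal illustration, not the paper's procedure: the function name `simulate_oracle_skipping`, the synthetic ratios, and the equal-cost-per-block accounting are all assumptions. The input is a matrix of per-token, per-block residual ratios such as a profiling run would produce.

```python
import numpy as np

def simulate_oracle_skipping(ratios, threshold=0.05):
    """Estimate savings if every block below `threshold` were skipped.

    ratios: [tokens, blocks] residual ratios measured with the full model,
            which is why this is an oracle: a real router must predict them.
    Returns the skip mask, active blocks per token, and the fraction of
    block evaluations saved (ASSUMES every block costs the same).
    """
    skip = ratios < threshold
    active_per_token = (~skip).sum(axis=1)
    fraction_saved = float(skip.mean())
    return skip, active_per_token, fraction_saved

# Tiny hand-checkable case: blocks 0 and 2 fall below the 5% threshold.
tiny = np.array([[0.01, 0.10, 0.04, 0.30]])
_, tiny_active, tiny_saved = simulate_oracle_skipping(tiny, threshold=0.05)
print(tiny_active, tiny_saved)  # [2] 0.5

# Synthetic ratios for 1000 tokens and 80 blocks (not real model data).
rng = np.random.default_rng(0)
ratios = rng.uniform(0.0, 0.2, size=(1000, 80))
skip, active, saved = simulate_oracle_skipping(ratios, threshold=0.05)
print("mean active blocks per token:", active.mean())
print("fraction of block evaluations saved:", saved)
```

Because the oracle sees the true ratios, the savings it reports are an upper bound on what a trained router could achieve at the same threshold; quality should still be measured separately by actually running the model with the skip mask applied.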

Agent Features

Memory

unified global key-value cache (shared KV cache across layers)

Architectures

token-level routing

residual-block profiling

Optimization Features

Token Efficiency

route tokens to fewer layers to lower compute per token

Infra Optimization

lower average serving compute and latency; supports larger-capacity models for same cost

Model Optimization

dynamic layer sparsity (per-token skipping)

layer reuse via routing

System Optimization

unified KV cache to handle sparse per-layer activations

Training Optimization

post-training distillation from sequential models

joint router + layer co-training

Inference Optimization

token-level compute variation to reduce average depth

skip whole-layer compute when residual contribution is low
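A minimal sketch of per-token layer skipping at inference time is shown here. It is a simplification of the idea, not the Radial Networks architecture: it only skips blocks in a fixed sequential order (no layer reuse or global KV cache), and the toy blocks, the untrained sigmoid router, and all names (`make_block`, `make_router`, `gated_forward`) are hypothetical.

```python
import numpy as np

def make_block(rng, hidden, scale=0.1):
    """Hypothetical residual block: f(x) = scale * tanh(x @ W)."""
    w = rng.normal(size=(hidden, hidden)) / np.sqrt(hidden)
    return lambda x: scale * np.tanh(x @ w)

def make_router(rng, hidden, n_blocks):
    """Untrained stand-in router: one sigmoid score per token and block."""
    v = rng.normal(size=(n_blocks, hidden)) / np.sqrt(hidden)
    def router(x, i):
        return 1.0 / (1.0 + np.exp(-(x @ v[i])))
    return router

def dense_forward(x, blocks):
    """Reference path: every token passes through every block."""
    for block in blocks:
        x = x + block(x)
    return x

def gated_forward(x, blocks, router, threshold):
    """Run each block only on tokens whose router score >= threshold.

    Skipped tokens keep their hidden state unchanged for that block.
    Returns the final states and the number of active blocks per token.
    """
    x = x.copy()
    active = np.zeros(x.shape[0], dtype=np.int64)
    for i, block in enumerate(blocks):
        keep = router(x, i) >= threshold          # bool mask over tokens
        if keep.any():
            x[keep] = x[keep] + block(x[keep])    # compute only kept tokens
        active += keep
    return x, active

rng = np.random.default_rng(0)
hidden, n_blocks = 32, 8
blocks = [make_block(rng, hidden) for _ in range(n_blocks)]
router = make_router(rng, hidden, n_blocks)
tokens = rng.normal(size=(16, hidden))

out, active = gated_forward(tokens, blocks, router, threshold=0.5)
print("active blocks per token:", active)
```

With a threshold of 0 the gated path reduces to the dense path, which gives a simple correctness check before a trained router is swapped in for the random one.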

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Simulations use oracles that have access to future information; a real router's accuracy may be lower than the simulations suggest.

Requires changes to the inference stack (routing logic and a shared KV cache), which makes caching more complex.

When Not To Use

Small models where each layer contributes substantially to outputs

Batch-heavy inference setups where per-token routing is hard to apply

Failure Modes

Router mispredictions that skip required layers and degrade output quality

Token hotspots where many tokens need deep paths, causing latency spikes

Core Entities

Models

OPT-125M, OPT-350M, OPT-1.3B, OPT-2.7B, OPT-6.7B, OPT-13B, OPT-30B, OPT-66B, ViT-Base, ViT-Large, ViT-Huge

Metrics

residual ratio, dynamic depth (active blocks per token)

Datasets

WikiText-2, COCO

Context Entities

Models

MoE
