Radial Networks: token-level routing that skips whole layers to cut compute and latency

April 7, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

0

Authors

Jordan Dotzel, Yash Akhauri, Ahmed S. AbouElhamayed, Carly Jiang, Mohamed Abdelfattah, Zhiru Zhang

Links

Abstract / PDF

Why It Matters For Business

Routing tokens around low-impact layers can cut average compute and latency per request, reducing serving costs and allowing larger models for the same budget.

Summary TLDR

The paper profiles modern transformers and finds that many layers make small, token-dependent contributions. It introduces Radial Networks, an architecture that routes each token between layers with a small trained router, allowing per-token layer skipping and layer reuse. Profiling on OPT and ViT models shows that median residual contributions fall as model size grows (e.g., OPT-125M ~20% vs OPT-66B ~5.9%), which suggests larger models leave more room to skip layers. Radial Networks use a unified global KV cache and can be trained from scratch or distilled from sequential models to cut average depth, compute, and serving costs while keeping capacity high.

Problem Statement

Modern transformers are getting deeper, but individual layers often contribute little on a per-token basis. Existing methods (early-exit, width sparsity) either skip the wrong layers or need special training. The challenge is to exploit per-token, per-layer variability to reduce compute and latency without a large accuracy loss and without changing attention key-value semantics.

Main Contribution

Profiling residual blocks in OPT and ViT families to quantify per-token layer importance using a simple residual-ratio proxy.

Empirical trend: larger models have smaller median residual contributions, creating more opportunities to skip layers at runtime.

Radial Networks: a new architecture that routes tokens between layers via a small router MLP, supports layer reuse, and uses a unified global key-value cache.

Two training paths: post-training distillation from sequential networks or joint training of router and layer weights.
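A minimal sketch of the routing idea described above, not the paper's implementation: a small MLP scores every layer plus an exit option, and the token repeatedly follows the highest-scoring choice, so a layer can be skipped or reused. The names `make_mlp`, `mlp_forward`, `route_token`, the `max_steps` depth cap, the argmax rule, and the toy layers are all illustrative assumptions. The router weights here are random stand-ins for trained ones, and attention and the KV cache are left out.

```python
import math
import random


def make_mlp(in_dim, hidden_dim, out_dim, rng):
    """Random weights for a two-layer MLP (stand-in for trained router weights)."""
    w1 = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(out_dim)]
    return w1, w2


def mlp_forward(weights, x):
    """Linear -> ReLU -> Linear, without biases, returning one logit per choice."""
    w1, w2 = weights
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def route_token(x, layers, router, max_steps):
    """Apply router-chosen layers until the router picks exit or the cap is hit.

    The router emits len(layers) + 1 logits; the last index means "exit".
    Returns the final token state and the list of layer indices visited.
    """
    exit_index = len(layers)
    path = []
    for _ in range(max_steps):
        logits = mlp_forward(router, x)
        choice = max(range(len(logits)), key=logits.__getitem__)
        if choice == exit_index:
            break
        # Residual update: the chosen block's output is added to the state.
        x = [xi + ui for xi, ui in zip(x, layers[choice](x))]
        path.append(choice)
    return x, path


# Toy residual blocks (hypothetical): each returns an update for the token state.
toy_layers = [lambda x, s=s: [s * math.tanh(v) for v in x] for s in (0.1, 0.2, 0.3)]
rng = random.Random(0)
router = make_mlp(in_dim=4, hidden_dim=8, out_dim=len(toy_layers) + 1, rng=rng)
final_state, path = route_token([0.5, -0.2, 0.1, 0.9], toy_layers, router, max_steps=6)
print("layers visited:", path)
```

The `max_steps` cap bounds worst-case depth per token; a real system would also need batching logic for tokens that take different paths.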

Key Findings

Per-layer residual contributions shrink as model size grows.

Numbers: OPT-125M median residual ratio ≈ 20%; OPT-66B ≈ 5.9%

Many tokens do not need the full network depth during generation.

Numbers: For OPT-13B, most tokens needed ~40–70 blocks vs the full 80 blocks

Residual-ratio provides a cheap proxy for layer importance for profiling and simulation.

Numbers: A 5% residual-ratio threshold was used to simulate layer skipping

Radial Networks implement token routing plus a shared KV cache to keep attention semantics.

The authors extrapolate that very large models will have even smaller layer contributions.

Numbers: The paper projects median residual ratios <1% for >1T-parameter models
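The residual-ratio proxy behind these findings can be sketched as follows. This summary does not give the exact formula, so the definition used here (norm of a block's update divided by the norm of the block's output, `||f(x)|| / ||x + f(x)||`) is an assumption, as are the function names and the `trace` layout.

```python
import math
from statistics import median


def l2_norm(vec):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def residual_ratio(block_input, block_update):
    """Assumed proxy: ||f(x)|| / ||x + f(x)|| for one token and one block."""
    output = [x + u for x, u in zip(block_input, block_update)]
    denom = l2_norm(output)
    return l2_norm(block_update) / denom if denom > 0 else float("inf")


def median_ratio_per_block(trace):
    """trace[block][token] = (block_input, block_update); one median per block."""
    return [median(residual_ratio(x, u) for x, u in tokens) for tokens in trace]


# A block whose update is orthogonal to its input, with norms 4 and 3.
print(residual_ratio([3.0, 0.0], [0.0, 4.0]))
```

In practice the `(block_input, block_update)` pairs would be captured from a real model's attention and feed-forward blocks on representative inputs.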

Results

residual ratio (median)

Value: OPT-125M ≈ 20%; OPT-66B ≈ 5.9%

dynamic depth (active blocks per token)

Value: most tokens use ~40–70 blocks vs the full 80

Baseline: full depth of 80 blocks

skip threshold used in simulations

Value: 5% residual ratio

vision residual ratio (median)

Value: ViT-Huge median comparable to OPT-350M

Who Should Care

What To Try In 7 Days

Profile your transformer layers with the residual-ratio proxy on representative inputs to find skip candidates.

Simulate layer skipping with an oracle threshold (e.g., 5%) to estimate compute savings before engineering changes.

Prototype a tiny router MLP that predicts layer importance per token and measure quality vs compute trade-offs.
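The second step above, oracle-threshold simulation, can be sketched as below. It assumes you already have a per-token, per-block table of residual ratios from profiling; `simulate_skipping`, `summarize`, and the sample numbers are hypothetical. Because the oracle reads ratios that are only known after a block has run, the result is an optimistic bound rather than a forecast of what a trained router would achieve.

```python
def simulate_skipping(ratios, threshold=0.05):
    """ratios[token][block] = residual ratio; returns active-block count per token.

    A block counts as active (not skipped) when its ratio meets the threshold.
    """
    return [sum(1 for r in token_ratios if r >= threshold) for token_ratios in ratios]


def summarize(active_counts, full_depth):
    """Average depth, fraction of dense block compute, and worst-case depth."""
    avg = sum(active_counts) / len(active_counts)
    return {
        "avg_depth": avg,
        "compute_fraction": avg / full_depth,
        "worst_case_depth": max(active_counts),
    }


# Made-up ratios for two tokens through a four-block model.
sample = [
    [0.20, 0.01, 0.06, 0.04],  # two blocks fall below the 5% threshold
    [0.30, 0.30, 0.30, 0.30],  # every block is needed
]
counts = simulate_skipping(sample, threshold=0.05)
print(counts, summarize(counts, full_depth=4))
```

The `worst_case_depth` figure matters for latency planning, since tokens that need deep paths set the tail.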

Agent Features

Memory

  • unified global key-value cache (shared KV cache across layers)

Architectures

  • token-level routing
  • residual-block profiling

Optimization Features

Token Efficiency

  • route tokens to fewer layers to lower compute per token

Infra Optimization

  • lower average serving compute and latency; supports larger-capacity models for the same cost

Model Optimization

  • dynamic layer sparsity (per-token skipping)
  • layer reuse via routing

System Optimization

  • unified KV cache to handle sparse per-layer activations

Training Optimization

  • post-training distillation from sequential models
  • joint router + layer co-training

Inference Optimization

  • token-level compute variation to reduce average depth
  • skip whole-layer compute when residual contribution is low

Reproducibility

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Simulations use oracles with access to future information, so a real router is likely to be less accurate than the simulations suggest.
  • Requires changes to inference stack (routing logic and shared KV cache) and more complex caching.
  • Overhead from router and worst-case depth caps can reduce or negate savings for some workloads.
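A back-of-the-envelope way to check the router-overhead limitation, under a simple assumed cost model: every block costs the same, and the router runs once per executed block plus once more to choose exit. The function names, the cost model, and the example figures (an average depth of 55 out of 80 blocks, router cost of 2% of a block) are illustrative assumptions, not measurements from the paper.

```python
def net_compute_fraction(avg_depth, full_depth, router_cost, block_cost=1.0):
    """Routed compute divided by dense compute under the assumed cost model."""
    dense = full_depth * block_cost
    routed = avg_depth * block_cost + (avg_depth + 1) * router_cost
    return routed / dense


def break_even_router_cost(avg_depth, full_depth, block_cost=1.0):
    """Largest per-call router cost at which routing still matches dense compute."""
    return (full_depth - avg_depth) * block_cost / (avg_depth + 1)


print(net_compute_fraction(avg_depth=55, full_depth=80, router_cost=0.02))
print(break_even_router_cost(avg_depth=55, full_depth=80))
```

If the measured router cost approaches the break-even value, or average depth sits close to full depth for your workload, the savings shrink toward zero. This model counts arithmetic only and ignores memory traffic and batching effects.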

When Not To Use

  • Small models where each layer contributes substantially to outputs
  • Batch-heavy inference setups where per-token routing is hard to apply
  • Hardware lacking fast support for skipping whole-layer compute

Failure Modes

  • Router mispredictions that skip required layers and degrade output quality
  • Token hotspots where many tokens need deep paths, causing latency spikes
  • Shared KV cache management errors that inflate memory use or introduce correctness bugs

Core Entities

Models

  • OPT-125M
  • OPT-350M
  • OPT-1.3B
  • OPT-2.7B
  • OPT-6.7B
  • OPT-13B
  • OPT-30B
  • OPT-66B
  • ViT-Base
  • ViT-Large
  • ViT-Huge

Metrics

  • residual ratio
  • dynamic depth (active blocks per token)

Datasets

  • WikiText-2
  • COCO

Context Entities

Models

  • MoE