Train one hybrid reasoning model, get many deployable sizes for free

November 20, 2025 · 8 min

Overview

Production Readiness: 0.7

Novelty Score: 0.6

Cost Impact Score: 0.8

Citation Count: 0

Authors

Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Ruisi Cai, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov

Why It Matters For Business

You can train one large reasoning model and ship multiple quality/latency variants without per‑size retraining. That cuts training‑token costs and deployment memory, simplifies model ops, and makes multiple service tiers cheaper to offer.

Summary TLDR

Nemotron Elastic trains a single hybrid Mamba‑Attention reasoning LLM that contains nested submodels at several sizes (a 12B parent with 9B and 6B submodels). A learned router plus a two‑stage curriculum (short contexts first, then extended 49K‑token contexts) yields submodels that can be extracted zero‑shot, without per‑size fine‑tuning. Producing the 6B and 9B variants from the 12B parent takes 110B training tokens, and the whole family deploys in the memory footprint of the largest model. On standard reasoning benchmarks the nested models match or beat their baselines.

Problem Statement

Training separate LLM sizes is very expensive. Existing compression methods still need large extra token budgets and per‑size distillation runs. Reasoning models add another need: long‑context training for multi‑step inference. The paper asks: can one training run produce multiple reasoning models that are ready to deploy without extra fine‑tuning?

Main Contribution

A many‑in‑one elastic training framework for hybrid Mamba‑Attention reasoning models that embeds nested submodels which can be extracted zero‑shot, with no further fine‑tuning.

A two‑stage curriculum: stabilize the router with short contexts, then adapt all budgets to long contexts (49K tokens) for reasoning.

Depth elastification using normalized MSE layer importance for more reliable layer pruning.

Group‑aware SSM (Mamba) elastification and heterogeneous per‑layer width selection (FFN/heads/channel granularity).

End‑to‑end trainable router (Gumbel‑Softmax) coupled with knowledge distillation (KD) from a frozen or trainable teacher for joint multi‑budget optimization.

Practical cost and memory gains: 110B tokens to produce the 6B and 9B models from a 12B model; deployment memory for the whole family stays equal to that of the largest model.
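The depth‑importance idea from the list above can be illustrated with a toy. This is a minimal sketch, assuming a layer's importance is the MSE between the full model's output and the output with that layer skipped, divided by the mean squared full output. The residual "layers", `make_scale_layer`, and all other helper names are hypothetical and not taken from the paper.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def forward(layers: Sequence[Layer], x: Vector, skip: int = -1) -> Vector:
    """Run residual layers in order, optionally skipping one layer index."""
    h = list(x)
    for i, layer in enumerate(layers):
        if i == skip:
            continue  # a skipped layer leaves the residual stream unchanged
        update = layer(h)
        h = [a + b for a, b in zip(h, update)]
    return h


def normalized_mse(reference: Vector, candidate: Vector) -> float:
    """MSE between two outputs, divided by the mean squared reference output."""
    err = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    energy = sum(r ** 2 for r in reference) / len(reference)
    return err / energy


def layer_importance(layers: Sequence[Layer], x: Vector) -> List[float]:
    """Score each layer by how much removing it perturbs the full output."""
    full = forward(layers, x)
    return [normalized_mse(full, forward(layers, x, skip=i))
            for i in range(len(layers))]


def make_scale_layer(scale: float) -> Layer:
    """Toy layer (assumption): the residual update is a scaled copy of the input."""
    return lambda h: [scale * v for v in h]


toy_layers = [make_scale_layer(s) for s in (0.5, 0.01, 0.2)]
scores = layer_importance(toy_layers, [1.0, -2.0, 3.0])
# Layers sorted from most to least important; the tail is pruned first.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Normalizing by the output energy keeps scores comparable across inputs of different magnitude, which is one plausible reason a normalized score ranks layers more reliably than raw MSE.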

Key Findings

Derive 6B and 9B models from a single 12B run using 110B training tokens.

Numbers: 110B tokens total (Table 2)

Token cost reduced ~7× vs Minitron‑SSM compression and ~360× vs pretraining family from scratch.

Numbers: Nemotron Elastic 110B vs Minitron‑SSM 750B and pretraining 40T (Table 2)

Nested models match baseline accuracy on reasoning benchmarks: 12B average 77.41 vs NanoV2‑12B 77.38.

Numbers: Average accuracy 77.41 vs 77.38 (Table 1)

Extended‑context (stage 2) yields large gains on hard reasoning tasks, especially for smaller models.

Numbers: AIME‑2025 6B: +19.8% relative (56.88 → 68.13, i.e. +11.25 pp); 9B: +9.7% (Table 4)

Deploying all three nested sizes takes only as much memory as the largest model: 24 GB for Nemotron Elastic (6B + 9B + 12B) vs 42 GB for separate NanoV2 9B and 12B models.

Numbers: 24 GB vs 42 GB (Table 3)
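The token‑cost ratios and the AIME‑2025 gain quoted above follow directly from the reported numbers. The short check below reproduces them; the variable and function names are illustrative only.

```python
# Reported token budgets, in billions of tokens.
ELASTIC_TOKENS_B = 110
MINITRON_SSM_TOKENS_B = 750
PRETRAIN_TOKENS_B = 40_000  # 40T tokens


def reduction_factor(baseline_b: float, elastic_b: float) -> float:
    """How many times fewer tokens the elastic run needs than the baseline."""
    return baseline_b / elastic_b


vs_minitron = reduction_factor(MINITRON_SSM_TOKENS_B, ELASTIC_TOKENS_B)
vs_pretrain = reduction_factor(PRETRAIN_TOKENS_B, ELASTIC_TOKENS_B)

# AIME-2025 accuracy of the 6B model before and after the long-context stage.
stage1_acc, stage2_acc = 56.88, 68.13
abs_gain_pp = stage2_acc - stage1_acc           # gain in percentage points
rel_gain_pct = 100.0 * abs_gain_pp / stage1_acc  # gain relative to stage 1

print(f"vs Minitron-SSM: {vs_minitron:.1f}x, vs pretraining: {vs_pretrain:.0f}x")
print(f"6B AIME-2025: +{abs_gain_pp:.2f} pp, +{rel_gain_pct:.1f}% relative")
```

The "+19.8%" figure is therefore a relative improvement, while the absolute improvement is 11.25 percentage points.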

Results

Average accuracy (12B)

Value: 77.41%

Baseline: NanoV2‑12B 77.38%

Token budget to obtain 6B+9B from 12B

Value: 110B tokens

Baseline: Minitron‑SSM estimate 750B; NanoV2 pretraining 40T

Deployment memory (BF16 weights) for family

Value: 24 GB (6B + 9B + 12B nested)

Baseline: NanoV2 9B + 12B separate models, 42 GB

AIME‑2025 improvement from extended context (6B)

Value: +11.25 pp (56.88 → 68.13)

Baseline: Stage 1 short context
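The deployment‑memory figures in the results are consistent with a simple BF16 estimate. The sketch below assumes nominal parameter counts (6B, 9B, 12B), 2 bytes per BF16 weight, decimal gigabytes, and no allowance for router parameters or runtime state; those simplifications are assumptions, not details from the paper.

```python
BYTES_PER_BF16_PARAM = 2


def weights_gb(params_billions: float) -> float:
    """Approximate BF16 weight memory in GB for a model of the given size."""
    # (params_billions * 1e9 params) * 2 bytes / (1e9 bytes per GB)
    return params_billions * BYTES_PER_BF16_PARAM


# Nested family: the 6B and 9B submodels live inside the 12B weights.
nested_family_gb = weights_gb(12)

# Separate checkpoints: each model stores its own weights.
separate_9b_12b_gb = weights_gb(9) + weights_gb(12)

print(nested_family_gb, separate_9b_12b_gb)
```

Because the smaller models reuse the parent's weights, adding more nested sizes does not increase the weight footprint under this estimate.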

Who Should Care

What To Try In 7 Days

Audit current model family cost: compare tokens and checkpoints to projected nested approach.

Run a small proof: convert an existing hybrid or Transformer model into a dynamic masked version and train a tiny two‑budget setup with a frozen teacher.

Implement two‑stage sampling: short context to stabilize router, then long‑context batches to adapt reasoning behavior.
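The two‑stage sampling step can be prototyped before touching any training code. The sketch below is hypothetical: only the 49K long‑context length comes from the summary, while the short‑context length, the budget names, and the stage‑2 weights (skewed toward the full model) are assumptions chosen for illustration.

```python
import random
from typing import Dict, List, Tuple

BUDGETS = ["6B", "9B", "12B"]
SHORT_CONTEXT_TOKENS = 8_192   # assumption: "short" length is not given here
LONG_CONTEXT_TOKENS = 49_000   # "49K" in the summary; exact value not given

# Per-stage sequence length and budget sampling weights (weights are assumed).
STAGES: Dict[str, dict] = {
    "stage1_short": {"seq_len": SHORT_CONTEXT_TOKENS, "weights": [1, 1, 1]},
    "stage2_long": {"seq_len": LONG_CONTEXT_TOKENS, "weights": [0.2, 0.3, 0.5]},
}


def sample_step(stage: str, rng: random.Random) -> Tuple[str, int]:
    """Pick which budget to train on this step and the context length to use."""
    cfg = STAGES[stage]
    budget = rng.choices(BUDGETS, weights=cfg["weights"], k=1)[0]
    return budget, cfg["seq_len"]


def schedule(steps_per_stage: int, seed: int = 0) -> List[Tuple[str, str, int]]:
    """Build a (stage, budget, seq_len) plan: all of stage 1, then all of stage 2."""
    rng = random.Random(seed)
    plan = []
    for stage in ("stage1_short", "stage2_long"):
        for _ in range(steps_per_stage):
            budget, seq_len = sample_step(stage, rng)
            plan.append((stage, budget, seq_len))
    return plan


plan = schedule(steps_per_stage=1000)
```

Weighting stage 2 toward the largest budget is one way to guard against the full model losing accuracy when budgets compete for gradient updates; the right weights would need to be tuned for your own setup.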

Agent Features

Architectures

  • Mamba‑Attention hybrid

Optimization Features

Token Efficiency

  • Single elastic distillation run (110B tokens) vs multi‑run pipelines

Infra Optimization

  • Router parameter overhead < 2% of largest model size (low memory overhead)

Model Optimization

  • Nested weight‑sharing (Matryoshka style)
  • Group‑aware SSM elastification
  • Heterogeneous per‑layer width selection

System Optimization

  • Router coupled to task loss to find Pareto tradeoffs
  • Importance‑based ranking to produce contiguous nested masks
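Importance‑based ranking is what makes the masks contiguous and nested: once units are reordered from most to least important, every budget keeps a prefix of that order. A minimal sketch follows, with hypothetical importance scores and budget names.

```python
from typing import Dict, List, Sequence


def rank_units(importance: Sequence[float]) -> List[int]:
    """Original unit indices, ordered from most to least important."""
    return sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)


def prefix_mask(num_units: int, keep: int) -> List[bool]:
    """Contiguous mask over the ranked order: keep the first `keep` units."""
    return [j < keep for j in range(num_units)]


def nested_masks(num_units: int, keep_counts: Dict[str, int]) -> Dict[str, List[bool]]:
    """One prefix mask per budget; a smaller prefix is a subset of a larger one."""
    return {name: prefix_mask(num_units, k) for name, k in keep_counts.items()}


importance = [0.1, 0.9, 0.4, 0.7]   # hypothetical per-unit scores
order = rank_units(importance)       # permutation applied to the weights once
masks = nested_masks(len(importance), {"small": 2, "medium": 3, "full": 4})

# Original indices that the smallest budget keeps after the permutation.
small_units = [order[j] for j, kept in enumerate(masks["small"]) if kept]
```

Because each mask is a prefix, any unit kept by a smaller budget is also kept by every larger one, which is the nesting property that lets all sizes share one set of weights.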

Training Optimization

  • Two‑stage curriculum (short then extended context)
  • Frozen and trainable teacher knowledge distillation
  • End‑to‑end router learning with Gumbel‑Softmax
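To make the router item concrete, here is a standard‑library sketch of Gumbel‑Softmax sampling over a handful of candidate widths. It shows only the forward sampling step; in a real training setup an autodiff framework would carry gradients through the relaxed sample. The candidate widths, logits, and temperature are hypothetical, and this is not the paper's router.

```python
import math
import random
from typing import List


def gumbel_softmax(logits: List[float], temperature: float,
                   rng: random.Random) -> List[float]:
    """Return a relaxed one-hot sample over the choices scored by `logits`."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)  # subtract the max for a numerically stable softmax
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


CANDIDATE_WIDTHS = [2048, 3072, 4096]  # hypothetical per-layer width options
rng = random.Random(0)
probs = gumbel_softmax([0.1, 0.5, 1.2], temperature=1.0, rng=rng)

# Relaxed (differentiable in a real framework) and discrete width choices.
soft_width = sum(p * w for p, w in zip(probs, CANDIDATE_WIDTHS))
hard_width = CANDIDATE_WIDTHS[probs.index(max(probs))]
```

Lower temperatures push the sample closer to a one‑hot choice, while higher temperatures keep it smoother, so the temperature is typically annealed during training.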

Inference Optimization

  • Zero‑shot slicing (extract submodels without fine‑tuning)
  • Constant deployment memory regardless of family size
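Zero‑shot slicing and constant memory go together: if the parent's units are already ordered by importance, a submodel is just a smaller top‑left block of the same stored weights. The toy below assumes that ordering and uses a single hypothetical 4×4 weight matrix with made‑up budget shapes.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]

# The only stored weights: one parent matrix, rows/columns sorted by importance.
PARENT_WEIGHT: Matrix = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]

# A budget is just metadata: (output rows, input columns) of the block to use.
BUDGET_SHAPES: Dict[str, Tuple[int, int]] = {
    "small": (2, 2),
    "medium": (3, 3),
    "full": (4, 4),
}


def sliced_matvec(weight: Matrix, x: List[float], out_dim: int, in_dim: int) -> List[float]:
    """Multiply using only the top-left (out_dim x in_dim) block of `weight`."""
    return [sum(weight[r][c] * x[c] for c in range(in_dim)) for r in range(out_dim)]


def run(budget: str, x: List[float]) -> List[float]:
    """Apply the layer at the requested budget without copying any weights."""
    out_dim, in_dim = BUDGET_SHAPES[budget]
    return sliced_matvec(PARENT_WEIGHT, x, out_dim, in_dim)


small_out = run("small", [1.0, 1.0, 1.0, 1.0])
full_out = run("full", [1.0, 1.0, 1.0, 1.0])
```

Switching budgets changes only the slice bounds, so serving several sizes from one process does not require loading additional checkpoints.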

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments are limited to a single family (Nemotron NanoV2 12B → 9B/6B) and one compression data blend.
  • Requires extended‑context training (49K tokens) which increases batch memory and training complexity.
  • Router and importance estimators were tuned for hybrid Mamba‑Attention; transfer to other architectures may need re‑tuning.
  • No public code or exact training recipe URLs provided in paper (models noted on Hugging Face but code release not documented).

When Not To Use

  • If you need independently trained models with different architectures or task‑specific fine‑tuning per size.
  • If you cannot afford extended‑context training budget or the memory requirements it implies.
  • If your deployment requires extreme quantization or hardware flows not validated with nested weight sharing.

Failure Modes

  • Uniform budget sampling during long‑context training can collapse full‑model accuracy due to gradient competition (paper observed this).
  • Router may choose suboptimal per‑layer budgets if importance ranking is noisy, hurting smaller or larger budgets.
  • Group‑aware masking errors could violate SSM structural constraints and break sequence modeling if implemented incorrectly.
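A cheap guard against the group‑masking failure mode is to validate masks before applying them. The sketch below assumes the structural constraint is that SSM heads are split evenly into groups and every group must keep the same number of heads; that reading, and both function names, are assumptions rather than details confirmed by the paper.

```python
from typing import List


def group_aware_head_mask(num_heads: int, num_groups: int,
                          keep_per_group: int) -> List[bool]:
    """Keep the first `keep_per_group` heads inside every group."""
    if num_heads % num_groups != 0:
        raise ValueError("heads must divide evenly into groups")
    heads_per_group = num_heads // num_groups
    if not 0 < keep_per_group <= heads_per_group:
        raise ValueError("keep_per_group must be in 1..heads_per_group")
    mask: List[bool] = []
    for _ in range(num_groups):
        mask.extend([True] * keep_per_group)
        mask.extend([False] * (heads_per_group - keep_per_group))
    return mask


def is_group_aware(mask: List[bool], num_groups: int) -> bool:
    """True if every group retains the same number of heads."""
    if len(mask) % num_groups != 0:
        return False
    size = len(mask) // num_groups
    kept_counts = {sum(mask[g * size:(g + 1) * size]) for g in range(num_groups)}
    return len(kept_counts) == 1


good_mask = group_aware_head_mask(num_heads=8, num_groups=2, keep_per_group=3)
# A naive global prefix keeps 4 heads in group 0 but only 2 in group 1.
naive_mask = [True] * 6 + [False] * 2
```

Running `is_group_aware` on every generated mask in a unit test catches the naive global‑prefix mistake before it reaches a training run.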

Core Entities

Models

  • Nemotron Elastic
  • Nemotron Nano V2 12B
  • Nemotron‑Elastic‑12B
  • Nemotron‑Elastic‑9B
  • Nemotron‑Elastic‑6B
  • Minitron‑SSM
  • NanoV2‑9B
  • NanoV2‑12B
  • Qwen3‑8B
  • Mamba / Mamba‑2 (SSM)

Metrics

  • Accuracy
  • Pass@1
  • Training tokens (B tokens)
  • Deployment memory (GB)

Datasets

  • Nemotron NanoV2 compression data blend (used for training and calibration)

Benchmarks

  • MATH‑500
  • AIME‑2024
  • AIME‑2025
  • GPQA
  • LiveCodeBench v5
  • MMLU‑Pro