Train one hybrid reasoning model, get many deployable sizes for free

November 20, 2025 · 8 min

Overview

Decision Snapshot: Ready For Pilot

Claims are supported by end‑to‑end experiments on a 12B hybrid model across multiple benchmarks. Results are strong for the target family sizes, but they were tested on only one model family and a single compression data blend.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Ruisi Cai, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov

Why It Matters For Business

You can train one large reasoning model and ship multiple quality/latency variants without per‑size retraining. That cuts token costs and storage needs, simplifies model ops, and makes offering multiple service tiers cheaper.

Summary TLDR

Nemotron Elastic trains a single hybrid Mamba‑Attention reasoning LLM that contains nested submodels at multiple sizes (a 12B parent with 9B and 6B variants). A learned router plus a two‑stage curriculum (short context first, then an extended context of 49K tokens) yields submodels that can be extracted zero‑shot, which cuts both the training‑token budget and deployment memory. On standard reasoning benchmarks the nested models match or beat baselines, and producing the 6B and 9B variants from the 12B parent takes only 110B training tokens.

Problem Statement

Training separate LLM sizes is very expensive. Existing compression methods still need large extra token budgets and per‑size distillation runs. Reasoning models add another need: long‑context training for multi‑step inference. The paper asks: can one training run produce multiple reasoning models that are ready to deploy without extra fine‑tuning?

Main Contribution

A many‑in‑one elastic training framework for hybrid Mamba‑Attention reasoning models that embeds nested submodels which can be extracted zero‑shot.

A two‑stage curriculum: stabilize router with short contexts, then adapt all budgets to long contexts (49K tokens) for reasoning.
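
The two‑stage curriculum can be pictured as a step‑indexed schedule. The following is a minimal sketch, not the paper's implementation: only the 49K extended context comes from the summary above, while the short‑context length, the stage boundary, and the names `Stage` and `stage_for_step` are hypothetical choices made for illustration.

```python
from dataclasses import dataclass

# Only the ~49K extended context comes from the summary; the short
# context length and the stage boundary are illustrative assumptions.
SHORT_CONTEXT = 8_192    # hypothetical short-context length
LONG_CONTEXT = 49_000    # "49K tokens"; the exact value is not given here
STAGE1_STEPS = 1_000     # hypothetical number of stage-1 steps


@dataclass(frozen=True)
class Stage:
    name: str
    context_len: int


def stage_for_step(step: int) -> Stage:
    """Return the curriculum stage that is active at a training step."""
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < STAGE1_STEPS:
        # Stage 1: short sequences while the router stabilizes.
        return Stage("short-context (router stabilization)", SHORT_CONTEXT)
    # Stage 2: extended sequences so every budget adapts to long reasoning.
    return Stage("extended-context (reasoning adaptation)", LONG_CONTEXT)


if __name__ == "__main__":
    for step in (0, 999, 1_000):
        print(step, stage_for_step(step))
```

In a real training loop the returned `context_len` would drive how batches are packed; the boundary between stages would be tuned rather than fixed.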

Key Findings

Derive 6B and 9B models from a single 12B run using 110B training tokens.

Numbers: 110B tokens total (Table 2)

Practical Use: You can produce multiple deployable sizes with one training run and avoid separate exploratory distillation runs.

Evidence Ref: Table 2

Token cost is reduced ~7× vs Minitron‑SSM compression and ~360× vs pretraining the family from scratch.

Numbers: Nemotron Elastic 110B vs Minitron‑SSM 750B and pretraining 40T (Table 2)

Practical Use: Expect large savings in cloud/GPU token compute when producing a small family of models.

Evidence Ref: Table 2
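
The reduction factors follow directly from the token budgets quoted above. A small worked check, using only those three numbers (the helper name `reduction` is illustrative):

```python
# Token budgets in billions of tokens, as quoted from Table 2 above.
ELASTIC_B = 110          # Nemotron Elastic
MINITRON_SSM_B = 750     # Minitron-SSM estimate
PRETRAIN_B = 40_000      # 40T tokens for pretraining


def reduction(baseline_b: float, ours_b: float) -> float:
    """How many times fewer tokens `ours_b` needs than `baseline_b`."""
    return baseline_b / ours_b


print(f"vs Minitron-SSM: {reduction(MINITRON_SSM_B, ELASTIC_B):.1f}x")  # 6.8x
print(f"vs pretraining:  {reduction(PRETRAIN_B, ELASTIC_B):.0f}x")      # 364x
```

These round to the ~7× and ~360× figures reported in the summary.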

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 77.41% | NanoV2‑12B 77.38% | +0.03 pp | Average over MATH‑500, AIME‑2024, AIME‑2025, GPQA, LiveCodeBench v5, MMLU‑Pro | Table 1 shows Nemotron‑Elastic‑12B 77.41 vs NanoV2‑12B 77.38 | Table 1
Token budget to obtain 6B+9B from 12B | 110B tokens | Minitron‑SSM estimate 750B; NanoV2 pretraining 40T | ~7× reduction vs Minitron‑SSM; ~360× vs pretraining | Total token count for compression pipeline | Table 2 token comparison | Table 2

What To Try In 7 Days

Audit current model family cost: compare tokens and checkpoints to projected nested approach.

Run a small proof: convert an existing hybrid or Transformer model into a dynamic masked version and train a tiny two‑budget setup with a frozen teacher.

Implement two‑stage sampling: short context to stabilize router, then long‑context batches to adapt reasoning behavior.
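
For the proof‑of‑concept step, the core idea of a "dynamic masked version" can be tried on a single layer before touching a real model. The sketch below is a toy in plain Python, under stated assumptions: the importance scores are made up, all function names are hypothetical, and the teacher, router, and distillation loss are left out. It ranks hidden units by importance, reorders them so each smaller budget is a contiguous prefix, and checks that masking the full layer gives the same outputs as slicing out the submodel.

```python
def rank_units(importance):
    """Indices of hidden units, most important first."""
    return sorted(range(len(importance)), key=lambda i: -importance[i])


def permute_rows(weight, order):
    """Reorder rows so the most important units come first."""
    return [weight[i] for i in order]


def forward(weight, x):
    """Matrix-vector product: one output per row (hidden unit)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def masked_forward(weight, x, keep):
    """Full-size layer with a prefix mask: units at index >= keep are zeroed."""
    return [y if i < keep else 0.0 for i, y in enumerate(forward(weight, x))]


def slice_submodel(weight, keep):
    """Zero-shot slicing: keep only the first `keep` rows, no fine-tuning."""
    return weight[:keep]


# Toy 4-unit layer and hypothetical importance scores.
W = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [0.5, 0.5]]
importance = [0.1, 0.9, 0.7, 0.3]

order = rank_units(importance)            # [1, 2, 3, 0]
W_sorted = permute_rows(W, order)
x = [1.0, 2.0]

masked = masked_forward(W_sorted, x, keep=2)       # [4.0, 5.0, 0.0, 0.0]
sliced = forward(slice_submodel(W_sorted, 2), x)   # [4.0, 5.0]
print(masked[:2] == sliced)                        # True
```

Because the mask is a prefix after reordering, the smaller budget is literally a slice of the larger one, which is what makes extraction without retraining possible in this toy setting.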

Agent Features

Architectures: Mamba‑Attention hybrid

Optimization Features

Token Efficiency: Single elastic distillation run (110B tokens) vs multi‑run pipelines
Infra Optimization: Router parameter overhead < 2% of largest model size (low memory overhead)
Model Optimization: Nested weight‑sharing (Matryoshka style); Group‑aware SSM elastification; Heterogeneous per‑layer width selection
System Optimization: Router coupled to task loss to find Pareto tradeoffs; Importance‑based ranking to produce contiguous nested masks
Training Optimization: Two‑stage curriculum (short then extended context); Frozen and trainable teacher knowledge distillation; End‑to‑end router learning with Gumbel‑Softmax
Inference Optimization: Zero‑shot slicing (extract submodels without fine‑tuning); Constant deployment memory regardless of family size
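
The Gumbel‑Softmax entry above refers to a standard trick for making a discrete choice (here, which width a layer should use) differentiable. The following is a minimal sketch of the sampling step only, in plain Python with no gradients; the logits and the three‑way choice are hypothetical, and an actual router would be written in an autograd framework.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=None):
    """Draw a relaxed one-hot sample over choices from router logits.

    Lower `tau` pushes the sample closer to a hard one-hot choice.
    """
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)          # guard against log(0)
        gumbel = -math.log(-math.log(u))      # standard Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical router logits for three candidate widths of one layer.
probs = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5, rng=random.Random(0))
choice = max(range(len(probs)), key=probs.__getitem__)
print(probs, "-> chosen width index:", choice)
```

The output is a probability vector, so it can weight candidate masks during training and be collapsed to a single choice at extraction time.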

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Experiments are limited to a single family (Nemotron NanoV2 12B → 9B/6B) and one compression data blend.

Requires extended‑context training (49K tokens) which increases batch memory and training complexity.

When Not To Use

If you need independently trained models with different architectures or task‑specific fine‑tuning per size.

If you cannot afford extended‑context training budget or the memory requirements it implies.

Failure Modes

Uniform budget sampling during long‑context training can collapse full‑model accuracy due to gradient competition between budgets (the paper reports observing this).

Router may choose suboptimal per‑layer budgets if importance ranking is noisy, hurting smaller or larger budgets.
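
A plausible way to guard against the first failure mode is to sample budgets non‑uniformly so that the full model receives a larger share of updates. The sketch below only illustrates the sampling mechanics; the skewed weights are hypothetical placeholders, not values taken from the paper.

```python
import random
from collections import Counter

BUDGETS = ["12B", "9B", "6B"]
UNIFORM = [1 / 3, 1 / 3, 1 / 3]
SKEWED = [0.5, 0.3, 0.2]   # hypothetical weights favoring the full model


def sample_budgets(weights, n_steps, seed=0):
    """Count how often each budget is picked over n_steps training steps."""
    rng = random.Random(seed)
    return Counter(rng.choices(BUDGETS, weights=weights, k=n_steps))


if __name__ == "__main__":
    print("uniform:", dict(sample_budgets(UNIFORM, 10_000)))
    print("skewed: ", dict(sample_budgets(SKEWED, 10_000)))
```

Whether a given weighting actually preserves full‑model accuracy has to be checked empirically on the target model.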

Core Entities

Models

Nemotron Elastic, Nemotron Nano V2 12B, Nemotron‑Elastic‑12B, Nemotron‑Elastic‑9B, Nemotron‑Elastic‑6B, Minitron‑SSM, NanoV2‑9B, NanoV2‑12B, Qwen3‑8B, Mamba / Mamba‑2 (SSM)

Metrics

Accuracy, Pass@1, Training tokens (B tokens), Deployment memory (GB)

Datasets

Nemotron NanoV2 compression data blend (used for training and calibration)

Benchmarks

MATH‑500, AIME‑2024, AIME‑2025, GPQA, LiveCodeBench v5, MMLU‑Pro