Train one hybrid reasoning model, get many deployable sizes for free

Overview

Decision Snapshot: Ready For Pilot

Claims are supported by end‑to‑end experiments on a 12B hybrid model across multiple benchmarks. Results are strong for the target family sizes, but they were obtained on only one model family and a single compression data blend.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Ruisi Cai, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov

Why It Matters For Business

You can train one large reasoning model and ship multiple quality/latency variants without per‑size retraining. That cuts token costs and storage needs, simplifies model ops, and makes offering multiple service tiers cheaper.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead, Data Scientist

Summary TLDR

Nemotron Elastic trains a single hybrid Mamba‑Attention reasoning LLM that contains nested submodels for multiple sizes (12B → 9B and 6B). A learned router plus a two‑stage curriculum (short context, then extended 49K‑token context) produces submodels that can be extracted zero‑shot, cutting both training tokens and deployment memory. On standard reasoning benchmarks the nested models match or beat baselines, and producing the 6B and 9B variants from the 12B parent takes only 110B training tokens.

Problem Statement

Training separate LLM sizes is very expensive. Existing compression methods still need large extra token budgets and per‑size distillation runs. Reasoning models add another need: long‑context training for multi‑step inference. The paper asks: can one training run produce multiple reasoning models that are ready to deploy without extra fine‑tuning?

Main Contribution

A many‑in‑one elastic training framework for hybrid Mamba‑Attention reasoning models that embeds nested submodels which can be extracted zero‑shot.

A two‑stage curriculum: stabilize router with short contexts, then adapt all budgets to long contexts (49K tokens) for reasoning.

Key Findings

Derive 6B and 9B models from a single 12B run using 110B training tokens.

Numbers: 110B tokens total (Table 2)

Practical Use: You can produce multiple deployable sizes with one training run and avoid separate exploratory distillation runs.

Evidence Ref: Table 2

Token cost is reduced ~7× vs Minitron‑SSM compression and ~360× vs pretraining the family from scratch.

Numbers: Nemotron Elastic 110B vs Minitron‑SSM 750B and pretraining 40T (Table 2)

Practical Use: Expect large savings in cloud/GPU token compute when producing a small family of models.

Evidence Ref: Table 2
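The reduction factors quoted above follow directly from the three token budgets. A minimal check, assuming the figures are exactly as reported here (110B, 750B, 40T); the constant and function names are illustrative only:

```python
# Token budgets as quoted in this summary (Table 2), in billions of tokens.
ELASTIC_TOKENS_B = 110        # Nemotron Elastic: one run yields the 6B and 9B variants
MINITRON_SSM_TOKENS_B = 750   # Minitron-SSM compression estimate
PRETRAIN_TOKENS_B = 40_000    # 40T tokens for pretraining


def reduction_factor(baseline_b: float, elastic_b: float) -> float:
    """Return how many times fewer tokens the elastic run uses than the baseline."""
    return baseline_b / elastic_b


vs_minitron = reduction_factor(MINITRON_SSM_TOKENS_B, ELASTIC_TOKENS_B)
vs_pretrain = reduction_factor(PRETRAIN_TOKENS_B, ELASTIC_TOKENS_B)

print(f"vs Minitron-SSM: {vs_minitron:.1f}x")  # prints "vs Minitron-SSM: 6.8x"
print(f"vs pretraining:  {vs_pretrain:.1f}x")  # prints "vs pretraining:  363.6x"
```

Both values round to the ~7× and ~360× figures stated in the finding.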

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	77.41%	NanoV2‑12B 77.38%	+0.03 pp	Average over MATH‑500, AIME‑2024, AIME‑2025, GPQA, LiveCodeBench v5, MMLU‑Pro	Table 1 shows Nemotron‑Elastic‑12B 77.41 vs NanoV2‑12B 77.38	Table 1
Token budget to obtain 6B+9B from 12B	110B tokens	Minitron‑SSM estimate 750B; NanoV2 pretraining 40T	~7× reduction vs Minitron‑SSM; ~360× vs pretraining	Total token count for compression pipeline	Table 2 token comparison	Table 2

What To Try In 7 Days

Audit current model family cost: compare tokens and checkpoints to projected nested approach.

Run a small proof of concept: convert an existing hybrid or Transformer model into a dynamically masked version and train a tiny two‑budget setup with a frozen teacher.

Implement two‑stage sampling: short context to stabilize router, then long‑context batches to adapt reasoning behavior.
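The two‑stage sampling step could be prototyped as a small batch planner before touching any training code. This is a sketch under stated assumptions, not the paper's schedule: only the 49K extended context comes from the summary, while the short context length, the uniform stage‑1 weights, the stage‑2 weights that favor the full model, and all names are hypothetical placeholders.

```python
import random

BUDGETS = ["6B", "9B", "12B"]

# Only the 49K extended context is taken from the summary; 49_000 is a stand-in
# because the exact token count is not given here. The short context length and
# both weight vectors are placeholder assumptions.
STAGES = {
    "short": {"context_len": 8_192, "weights": [1.0, 1.0, 1.0]},
    "extended": {"context_len": 49_000, "weights": [1.0, 1.0, 2.0]},
}


def sample_batch_plan(stage: str, rng: random.Random) -> dict:
    """Pick the budget to train on for one batch of the given stage."""
    cfg = STAGES[stage]
    budget = rng.choices(BUDGETS, weights=cfg["weights"], k=1)[0]
    return {"stage": stage, "context_len": cfg["context_len"], "budget": budget}


def curriculum(short_steps: int, extended_steps: int, seed: int = 0):
    """Yield batch plans: all short-context steps first, then extended-context steps."""
    rng = random.Random(seed)
    for _ in range(short_steps):
        yield sample_batch_plan("short", rng)
    for _ in range(extended_steps):
        yield sample_batch_plan("extended", rng)


plans = list(curriculum(short_steps=4, extended_steps=4))
for plan in plans:
    print(plan)
```

The non‑uniform stage‑2 weights reflect the failure mode listed under Risks & Boundaries, where uniform budget sampling during long‑context training hurt full‑model accuracy; the actual weighting would need to be tuned.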

Agent Features

Architectures

Mamba‑Attention hybrid

Optimization Features

Token Efficiency

Single elastic distillation run (110B tokens) vs multi‑run pipelines

Infra Optimization

Router parameter overhead < 2% of largest model size (low memory overhead)

Model Optimization

Nested weight‑sharing (Matryoshka style)

Group‑aware SSM elastification

Heterogeneous per‑layer width selection

System Optimization

Router coupled to task loss to find Pareto tradeoffs

Importance‑based ranking to produce contiguous nested masks
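To make the idea of importance‑based ranking with contiguous nested masks concrete, here is a toy sketch in plain Python. It assumes one importance score per channel and a fixed channel count per budget; the scores, budgets, and function names are hypothetical and not taken from the paper.

```python
def nested_masks(importance, budgets):
    """Build one 0/1 mask per budget so that smaller masks nest inside larger ones.

    importance: per-channel scores (higher means more important).
    budgets: number of channels to keep for each submodel.
    """
    # Rank channel indices from most to least important.
    order = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    masks = {}
    for k in budgets:
        keep = set(order[:k])  # the top-k channels under the shared ranking
        masks[k] = [1 if i in keep else 0 for i in range(len(importance))]
    return order, masks


def permute(values, order):
    """Reorder values by rank so that kept channels form a contiguous prefix."""
    return [values[i] for i in order]


scores = [0.2, 0.9, 0.1, 0.7, 0.5, 0.3]  # hypothetical importance scores
order, masks = nested_masks(scores, budgets=[2, 4, 6])

print(order)                      # prints [1, 3, 4, 5, 0, 2]
print(permute(masks[2], order))   # prints [1, 1, 0, 0, 0, 0]
print(permute(masks[4], order))   # prints [1, 1, 1, 1, 0, 0]
```

Because every budget reads from the same ranking, each smaller mask is a subset of the next larger one, and after permutation a submodel is simply a prefix slice of the parent's channels.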

Training Optimization

Two‑stage curriculum (short then extended context)

Frozen and trainable teacher knowledge distillation

End‑to‑end router learning with Gumbel‑Softmax
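For readers unfamiliar with Gumbel‑Softmax, the sketch below shows the standard relaxation in plain Python: add Gumbel noise to the logits, divide by a temperature, and apply softmax. It illustrates the general technique only; how the paper's router parameterizes its logits and temperature is not described in this summary, so the logits and `tau` value here are assumptions.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Return a relaxed one-hot sample over the choices scored by `logits`."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise via inverse transform sampling: g = -log(-log(u)).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
router_logits = [0.1, 0.4, 1.2]  # hypothetical scores for three budget choices
probs = gumbel_softmax(router_logits, tau=0.5, rng=rng)
choice = max(range(len(probs)), key=probs.__getitem__)
print(probs, choice)
```

Lower temperatures push the sample toward a hard one‑hot choice while keeping it differentiable with respect to the logits. In a real training loop one would likely use a framework primitive such as `torch.nn.functional.gumbel_softmax` instead of this hand‑rolled version.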

Inference Optimization

Zero‑shot slicing (extract submodels without fine‑tuning)

Constant deployment memory regardless of family size

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Experiments are limited to a single family (Nemotron NanoV2 12B → 9B/6B) and one compression data blend.

Requires extended‑context training (49K tokens), which increases batch memory and training complexity.

When Not To Use

If you need independently trained models with different architectures or task‑specific fine‑tuning per size.

If you cannot afford extended‑context training budget or the memory requirements it implies.

Failure Modes

Uniform budget sampling during long‑context training can collapse full‑model accuracy due to gradient competition (the paper observed this).

Router may choose suboptimal per‑layer budgets if importance ranking is noisy, hurting smaller or larger budgets.

Core Entities

Models

Nemotron Elastic, Nemotron Nano V2 12B, Nemotron‑Elastic‑12B, Nemotron‑Elastic‑9B, Nemotron‑Elastic‑6B, Minitron‑SSM, NanoV2‑9B, NanoV2‑12B, Qwen3‑8B, Mamba / Mamba‑2 (SSM)

Metrics

Accuracy, Pass@1, Training tokens (B tokens), Deployment memory (GB)

Datasets

Nemotron NanoV2 compression data blend (used for training and calibration)

Benchmarks

MATH‑500, AIME‑2024, AIME‑2025, GPQA, LiveCodeBench v5, MMLU‑Pro

