Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
You can train one large reasoning model and ship multiple quality/latency variants without per‑size retraining. That cuts token costs and storage needs, simplifies model ops, and makes offering multiple service tiers cheaper.
Summary TLDR
Nemotron Elastic trains a single hybrid Mamba‑Attention reasoning LLM that contains nested submodels at multiple sizes (a 12B parent with 9B and 6B variants). A learned router plus a two‑stage curriculum (short contexts first, then extended 49K‑token contexts) yields submodels that can be extracted zero‑shot, sharply cutting training tokens and deployment memory. On standard reasoning benchmarks the nested models match or beat baselines, and producing the 6B and 9B variants from the 12B parent takes only 110B training tokens.
Problem Statement
Training separate LLM sizes is very expensive. Existing compression methods still need large extra token budgets and per‑size distillation runs. Reasoning models add another need: long‑context training for multi‑step inference. The paper asks: can one training run produce multiple reasoning models that are ready to deploy without extra fine‑tuning?
Main Contribution
A many‑in‑one elastic training framework for hybrid Mamba‑Attention reasoning models that embeds nested submodels extracted zero‑shot.
A two‑stage curriculum: stabilize router with short contexts, then adapt all budgets to long contexts (49K tokens) for reasoning.
Depth elastification using normalized MSE layer importance for more reliable layer pruning.
Group‑aware SSM (Mamba) elastification and heterogeneous per‑layer width selection (FFN/heads/channel granularity).
End‑to‑end trainable router (Gumbel‑Softmax) coupled with frozen/full teacher KD for joint multi‑budget optimization.
Practical cost and memory gains: 110B tokens to produce the 6B and 9B models from a 12B model; deployment memory stays constant, equal to that of the largest model.
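The depth‑elastification item above can be illustrated with a toy sketch. It assumes a layer's importance is the MSE between the full model's output and the output with that layer skipped, divided by the mean squared full‑model output. That normalization is an assumption, since this summary does not give the paper's exact formula, and the names `run`, `normalized_mse`, and `layer_importance` are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def run(layers: Sequence[Layer], x: Vector, skip: int = -1) -> Vector:
    """Apply layers in order; the layer at index `skip` is treated as identity."""
    for i, layer in enumerate(layers):
        if i != skip:
            x = layer(x)
    return x


def normalized_mse(ref: Vector, out: Vector) -> float:
    """Squared error divided by the squared norm of the reference (assumed form)."""
    num = sum((r - o) ** 2 for r, o in zip(ref, out))
    den = sum(r ** 2 for r in ref) or 1.0
    return num / den


def layer_importance(layers: Sequence[Layer], inputs: Sequence[Vector]) -> List[float]:
    """Average normalized MSE caused by skipping each layer on calibration inputs."""
    scores = []
    for i in range(len(layers)):
        total = sum(
            normalized_mse(run(layers, x), run(layers, x, skip=i)) for x in inputs
        )
        scores.append(total / len(inputs))
    return scores


def make_residual(scale: float) -> Layer:
    """Toy residual layer: x + scale * x."""
    return lambda x: [v + scale * v for v in x]


layers = [make_residual(0.5), make_residual(0.01), make_residual(0.2)]
calib = [[1.0, -2.0, 0.5], [0.3, 0.3, -1.0]]
scores = layer_importance(layers, calib)
# Most important layers first; the tail of this list is pruned first.
keep_order = sorted(range(len(layers)), key=lambda i: scores[i], reverse=True)
```

In this toy setup the layer with the smallest residual contribution ranks last, so it would be the first candidate for removal when shrinking depth.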
Key Findings
Derive 6B and 9B models from a single 12B run using 110B training tokens.
Token cost is reduced ~7× vs Minitron‑SSM compression and ~360× vs pretraining the model family from scratch.
Nested models match baseline accuracy on reasoning benchmarks: 12B average 77.41 vs NanoV2‑12B 77.38.
Extended‑context (stage 2) yields large gains on hard reasoning tasks, especially for smaller models.
Deploying all three sizes takes only as much memory as the largest model: 24 GB for Nemotron Elastic vs 42 GB for separate NanoV2 models.
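The memory and token figures above can be sanity‑checked with simple arithmetic. The sketch assumes BF16 weights at 2 bytes per parameter and decimal gigabytes. Reading the 42 GB figure as a 9B plus a 12B checkpoint stored separately is an inference from the arithmetic, not something this summary states, and the implied token budgets are rough back‑calculations from the rounded ratios, not reported numbers.

```python
BYTES_PER_PARAM_BF16 = 2  # BF16 stores each weight in 2 bytes


def weights_gb(params_billion: float) -> float:
    """Weight memory in decimal GB: 1e9 params * 2 bytes = 2 GB per billion params."""
    return params_billion * BYTES_PER_PARAM_BF16


# Nested family: the 9B and 6B models live inside the 12B weights.
elastic_gb = weights_gb(12)

# Assumed reading of the baseline: separate 9B and 12B checkpoints.
separate_gb = weights_gb(9) + weights_gb(12)

ELASTIC_TOKENS_B = 110  # from the summary

# Rough budgets implied by the "~7x" and "~360x" ratios (not reported figures).
implied_compression_tokens_b = ELASTIC_TOKENS_B * 7
implied_scratch_tokens_b = ELASTIC_TOKENS_B * 360

print(elastic_gb, separate_gb)
print(implied_compression_tokens_b, implied_scratch_tokens_b)
```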
Results
Accuracy
Token budget to obtain 6B+9B from 12B
Deployment memory (BF16 weights) for family
AIME‑2025 improvement from extended context (6B)
Who Should Care
What To Try In 7 Days
Audit current model family cost: compare tokens and checkpoints to projected nested approach.
Run a small proof: convert an existing hybrid or Transformer model into a dynamic masked version and train a tiny two‑budget setup with a frozen teacher.
Implement two‑stage sampling: short context to stabilize router, then long‑context batches to adapt reasoning behavior.
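The two‑stage sampling step can be prototyped before any model code exists. The sketch below builds a per‑step plan of (stage, sequence length, budget). The short‑context length and both sets of sampling weights are hypothetical placeholders; only the roughly 49K‑token extended context comes from this summary. The second stage is weighted toward the full model because the summary reports that uniform sampling during long‑context training can hurt full‑model accuracy.

```python
import random
from typing import List, Tuple

BUDGETS = ["12B", "9B", "6B"]

# Hypothetical settings: only the ~49K extended context is taken from the summary.
STAGES = {
    "short_context": {
        "seq_len": 8_192,  # assumed value
        "weights": {"12B": 1 / 3, "9B": 1 / 3, "6B": 1 / 3},  # uniform
    },
    "extended_context": {
        "seq_len": 49_000,  # "49K tokens" per the summary
        "weights": {"12B": 0.5, "9B": 0.3, "6B": 0.2},  # assumed, favors full model
    },
}


def sample_budget(stage: str, rng: random.Random) -> str:
    """Draw one budget for a training step according to the stage's weights."""
    weights = STAGES[stage]["weights"]
    return rng.choices(BUDGETS, weights=[weights[b] for b in BUDGETS], k=1)[0]


def schedule(steps_short: int, steps_long: int, seed: int = 0) -> List[Tuple[str, int, str]]:
    """Return a list of (stage, seq_len, budget), short-context steps first."""
    rng = random.Random(seed)
    plan = []
    for stage, n_steps in (("short_context", steps_short), ("extended_context", steps_long)):
        for _ in range(n_steps):
            plan.append((stage, STAGES[stage]["seq_len"], sample_budget(stage, rng)))
    return plan


for step in schedule(steps_short=3, steps_long=3):
    print(step)
```

Keeping the schedule as plain data makes it easy to compare sampling weights across runs without touching the training loop.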
Agent Features
Architectures
- Mamba‑Attention hybrid
Optimization Features
Token Efficiency
- Single elastic distillation run (110B tokens) vs multi‑run pipelines
Infra Optimization
- Router parameter overhead < 2% of largest model size (low memory overhead)
Model Optimization
- Nested weight‑sharing (Matryoshka style)
- Group‑aware SSM elastification
- Heterogeneous per‑layer width selection
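As a rough illustration of group‑aware SSM elastification, the sketch below keeps the same number of heads in every group, so the mapping from heads to groups stays regular after masking. It assumes heads are laid out contiguously by group and that equal per‑group retention is the constraint that matters. The function `group_aware_head_mask` is hypothetical and not taken from the paper.

```python
from typing import List


def group_aware_head_mask(
    importance: List[float], n_groups: int, keep_per_group: int
) -> List[int]:
    """Keep the top `keep_per_group` heads inside each group (1 = keep, 0 = drop)."""
    n_heads = len(importance)
    if n_heads % n_groups != 0:
        raise ValueError("number of heads must be divisible by number of groups")
    group_size = n_heads // n_groups
    if not 0 < keep_per_group <= group_size:
        raise ValueError("keep_per_group must be in 1..group_size")

    mask = [0] * n_heads
    for g in range(n_groups):
        start = g * group_size
        heads = range(start, start + group_size)
        # Rank heads only against others in the same group.
        top = sorted(heads, key=lambda h: importance[h], reverse=True)[:keep_per_group]
        for h in top:
            mask[h] = 1
    return mask


head_importance = [0.9, 0.1, 0.4, 0.3, 0.2, 0.8, 0.7, 0.05]
head_mask = group_aware_head_mask(head_importance, n_groups=2, keep_per_group=2)
print(head_mask)
```

A global top‑k over all heads could leave one group nearly empty; ranking within each group avoids that at the cost of occasionally keeping a weaker head.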
System Optimization
- Router coupled to task loss to find Pareto tradeoffs
- Importance‑based ranking to produce contiguous nested masks
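The importance‑based ranking item can be sketched as follows: once units (neurons, heads, or channels) are sorted by importance, the mask for a budget of k units is simply the first k positions in that order, so every smaller mask is contained in every larger one. The helper `nested_masks` and the toy importance scores are hypothetical.

```python
from typing import Dict, List, Tuple


def nested_masks(
    importance: List[float], budgets: List[int]
) -> Tuple[List[int], Dict[int, List[int]]]:
    """Return the importance order and one 0/1 mask per budget (units kept)."""
    order = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    rank = {unit: r for r, unit in enumerate(order)}
    masks = {
        k: [1 if rank[i] < k else 0 for i in range(len(importance))] for k in budgets
    }
    return order, masks


unit_importance = [0.2, 0.9, 0.5, 0.7]
order, masks = nested_masks(unit_importance, budgets=[2, 3, 4])

# After permuting units into importance order, each mask is a contiguous prefix.
permuted = {k: [m[u] for u in order] for k, m in masks.items()}
print(order)
print(permuted)
```

Storing weights in the permuted order is what makes later extraction a plain prefix slice rather than a gather over scattered indices.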
Training Optimization
- Two‑stage curriculum (short then extended context)
- Frozen and trainable teacher knowledge distillation
- End‑to‑end router learning with Gumbel‑Softmax
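For readers unfamiliar with the Gumbel‑Softmax trick behind the router, here is a minimal standard‑library sketch of the sampling step only. It adds Gumbel noise to the router logits and applies a temperature‑scaled softmax, giving a soft, near‑one‑hot choice over candidate budgets. The logits and temperature are made up, and a real training setup would use a differentiable framework implementation (for example `torch.nn.functional.gumbel_softmax`) rather than plain Python.

```python
import math
import random
from typing import List


def gumbel_softmax(logits: List[float], tau: float, rng: random.Random) -> List[float]:
    """Sample a relaxed one-hot vector: softmax((logits + Gumbel noise) / tau)."""
    # Clamp the uniform sample away from 0 so log() stays finite.
    gumbels = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in logits]
    scores = [(l + g) / tau for l, g in zip(logits, gumbels)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
router_logits = [2.0, 0.5, 0.1]  # hypothetical scores for three width options
sample = gumbel_softmax(router_logits, tau=0.5, rng=rng)
choice = max(range(len(sample)), key=lambda i: sample[i])
print(sample, choice)
```

Lower temperatures push samples closer to one‑hot, which matches discrete extraction at inference time but makes gradients noisier during training.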
Inference Optimization
- Zero‑shot slicing (extract submodels without fine‑tuning)
- Constant deployment memory regardless of family size
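Zero‑shot slicing can be pictured on a single feed‑forward block. Assuming hidden units are already stored in importance order, running the full block with a prefix mask gives the same output as physically slicing the leading rows and columns of the weight matrices. The toy weights and helper names below are hypothetical, and a real model would slice every elastic dimension, not just one FFN.

```python
from typing import List, Optional

Matrix = List[List[float]]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def ffn(up: Matrix, down: Matrix, x: List[float],
        hidden_mask: Optional[List[int]] = None) -> List[float]:
    """ReLU feed-forward block; an optional 0/1 mask zeroes hidden units."""
    h = [max(0.0, v) for v in matvec(up, x)]
    if hidden_mask is not None:
        h = [v * m for v, m in zip(h, hidden_mask)]
    return matvec(down, h)


def slice_linear(weight: Matrix, out_keep: int, in_keep: int) -> Matrix:
    """Keep the leading `out_keep` rows and `in_keep` columns."""
    return [row[:in_keep] for row in weight[:out_keep]]


up = [[0.5, -1.0], [1.0, 0.2], [-0.3, 0.8], [0.7, 0.7]]      # hidden=4, d_model=2
down = [[0.1, 0.2, 0.3, 0.4], [-0.5, 0.6, 0.0, 0.9]]         # d_model=2, hidden=4
x = [1.0, 2.0]
k = 2  # hidden units kept by the smaller submodel

masked_out = ffn(up, down, x, hidden_mask=[1, 1, 0, 0])
sliced_out = ffn(slice_linear(up, k, 2), slice_linear(down, 2, k), x)
print(masked_out, sliced_out)
```

Because the two outputs agree, the smaller model can be served either by masking inside the shared weights or by exporting the sliced tensors, with no fine‑tuning step in between.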
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments are limited to a single family (Nemotron NanoV2 12B → 9B/6B) and one compression data blend.
- Requires extended‑context training (49K tokens) which increases batch memory and training complexity.
- Router and importance estimators were tuned for hybrid Mamba‑Attention; transfer to other architectures may need re‑tuning.
- No public code or exact training recipe URLs provided in paper (models noted on Hugging Face but code release not documented).
When Not To Use
- If you need independently trained models with different architectures or task‑specific fine‑tuning per size.
- If you cannot afford extended‑context training budget or the memory requirements it implies.
- If your deployment requires extreme quantization or hardware flows not validated with nested weight sharing.
Failure Modes
- Uniform budget sampling during long‑context training can collapse full‑model accuracy due to gradient competition (paper observed this).
- Router may choose suboptimal per‑layer budgets if importance ranking is noisy, hurting smaller or larger budgets.
- Group‑aware masking errors could violate SSM structural constraints and break sequence modeling if implemented incorrectly.
Core Entities
Models
- Nemotron Elastic
- Nemotron Nano V2 12B
- Nemotron‑Elastic‑12B
- Nemotron‑Elastic‑9B
- Nemotron‑Elastic‑6B
- Minitron‑SSM
- NanoV2‑9B
- NanoV2‑12B
- Qwen3‑8B
- Mamba / Mamba‑2 (SSM)
Metrics
- Accuracy
- Pass@1
- Training tokens (B tokens)
- Deployment memory (GB)
Datasets
- Nemotron NanoV2 compression data blend (used for training and calibration)
Benchmarks
- MATH‑500
- AIME‑2024
- AIME‑2025
- GPQA
- LiveCodeBench v5
- MMLU‑Pro

