Overview
Production Readiness
0.8
Novelty Score
0.5
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
You can cut long-context serving cost by redesigning a trained model for your hardware: heterogeneous MoE pruning, selective window attention, and calibrated FP8 KV quantization reduce memory and improve throughput while keeping or improving accuracy on evaluated reasoning tasks.
Summary TLDR
The authors extend the Puzzle post-training neural-architecture-search flow to handle mixture-of-experts (MoE) and long-context attention. Starting from gpt-oss-120B, they produce gpt-oss-puzzle-88B by pruning experts heterogeneously, selectively converting some layers to sliding window attention, applying FP8 KV-cache quantization with calibrated scales, and doing short distillation plus RL refinements. Result: up to 1.63× node throughput (64K/64K), 1.22× (4K/4K), and up to 2.82× on a single H100, while matching or slightly exceeding suite-average accuracy across reasoning efforts on the evaluated benchmarks.
Problem Statement
Reasoning models generate long sequences of internal reasoning tokens. This raises KV-cache memory and attention costs, and it makes raw per-token speed metrics misleading as a measure of end-to-end request cost. The paper asks: can post-training architecture search reduce serving cost (latency, throughput, memory) for a large MoE reasoning model without hurting accuracy in either long- or short-context settings?
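To make the memory pressure concrete, here is a minimal sketch of per-sequence KV-cache size, assuming a decoder whose layers are either global attention (cache every token) or sliding-window attention (cache at most the window). The function name, the layer layout, and every number below are hypothetical illustrations, not values taken from the paper.

```python
def kv_cache_bytes(context_len, layer_windows, n_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache held for ONE sequence.

    layer_windows has one entry per attention layer:
    None means global attention, an int means a sliding window of that size.
    """
    total_cached_tokens = 0
    for window in layer_windows:
        cached = context_len if window is None else min(context_len, window)
        total_cached_tokens += cached
    # Factor of 2: one key vector and one value vector per token per KV head.
    return 2 * total_cached_tokens * n_kv_heads * head_dim * bytes_per_elem


# Hypothetical toy model: 8 layers alternating global and window-128 attention.
layers = [None, 128] * 4
short_ctx = kv_cache_bytes(4096, layers, n_kv_heads=8, head_dim=64, bytes_per_elem=2)
long_ctx = kv_cache_bytes(65536, layers, n_kv_heads=8, head_dim=64, bytes_per_elem=2)
print(short_ctx, long_ctx)
```

In this toy setup the global layers dominate the cache at long context, which is why the number of global layers and the bytes per element are the natural levers.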
Main Contribution
Adapt Puzzle NAS to handle MoE layers by ranking and heterogeneously removing experts under expert-parallel constraints.
Introduce a long-context-aware scoring signal to pick which attention layers can be switched to window (sliding) attention.
Combine blockwise substitutions with 84B-token knowledge distillation and a dual-variant RL stage to recover and improve accuracy while keeping generation length controlled.
Apply FP8 KV-cache quantization with max-calibrated per-KV scales to double KV token capacity.
Release a deployment-optimized derivative, gpt-oss-puzzle-88B, derived from gpt-oss-120B and tuned for 8×H100 and single-H100 scenarios.
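The heterogeneous expert pruning in the first contribution can be sketched as a greedy allocation. This is an illustrative stand-in, not the paper's procedure: it assumes a precomputed contribution score per expert, and it models the expert-parallel constraint as "each layer keeps a multiple of `group_size` experts", which is an assumption made here. It also assumes every layer has at least `group_size` experts.

```python
def prune_experts(scores_per_layer, keep_fraction, group_size):
    """Return, per layer, the sorted indices of the experts that are kept.

    Layers may keep different numbers of experts (heterogeneous pruning),
    but always a multiple of group_size.
    """
    # Rank experts inside each layer from most to least important.
    ranked = [
        sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        for scores in scores_per_layer
    ]
    kept = [group_size] * len(scores_per_layer)  # every layer keeps one group
    total = sum(len(scores) for scores in scores_per_layer)
    budget = int(keep_fraction * total) // group_size * group_size

    while sum(kept) + group_size <= budget:
        best_layer, best_gain = None, float("-inf")
        for layer, scores in enumerate(scores_per_layer):
            next_group = ranked[layer][kept[layer]:kept[layer] + group_size]
            if len(next_group) < group_size:
                continue  # this layer already keeps all of its experts
            gain = sum(scores[i] for i in next_group)
            if gain > best_gain:
                best_layer, best_gain = layer, gain
        if best_layer is None:
            break
        kept[best_layer] += group_size

    return [sorted(ranked[l][:kept[l]]) for l in range(len(scores_per_layer))]


# Hypothetical scores: layer 0 relies on all experts, layer 1 mostly on one.
scores = [[0.9, 0.8, 0.7, 0.6], [0.5, 0.1, 0.05, 0.02]]
kept_experts = prune_experts(scores, keep_fraction=0.75, group_size=2)
print(kept_experts)
```

With these made-up scores the budget is spent on layer 0, so layer 1 loses half its experts while layer 0 keeps all of them.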
Key Findings
Node throughput improved 1.63× in long-context (64K/64K) and 1.22× in short-context (4K/4K) on an 8× H100 node
Single-GPU throughput increased up to 2.82× for 64K/64K and 2.44× for 4K/4K
Request-level efficiency (throughput normalized by tokens generated) improved up to 1.29× across evaluated reasoning efforts
Selective window attention reduced KV-cache size by ~40% relative to the parent model in the long-context-optimized configuration
MoE pruning removed ~25% of experts where safe, using per-expert contribution scores
FP8 KV quantization with max-calibrated per-KV scales doubled KV token capacity versus BF16 KV
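The FP8 finding can be illustrated with a simplified simulation of max-calibrated scaling. The sketch assumes the E4M3 format (largest finite value 448), treats "per-KV scales" as one scale for keys and one for values, and uses a rounding helper that ignores subnormals. All function names are hypothetical, and real deployments use hardware FP8 types rather than this emulation.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def calibrate_scale(calibration_values):
    """Max calibration: map the largest observed magnitude onto the FP8 maximum."""
    amax = max(abs(v) for v in calibration_values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def round_to_e4m3(x):
    """Simplified E4M3 rounding: saturate, then keep 1 implicit + 3 mantissa bits."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    mantissa, exponent = math.frexp(x)  # x == mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    return math.ldexp(round(mantissa * 16) / 16, exponent)


def quantize(values, scale):
    return [round_to_e4m3(v / scale) for v in values]


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


# Hypothetical key activations; values would get their own scale the same way.
keys = [0.5, -3.5, 1.0, 0.013]
k_scale = calibrate_scale(keys)
restored = dequantize(quantize(keys, k_scale), k_scale)
print(k_scale, restored)
```

Storing one byte per element instead of the two bytes of BF16 is what lets the same memory hold twice as many cached tokens. The scale matters because values outside the representable range saturate: in this sketch, `round_to_e4m3(1000.0)` returns 448.0.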
Results
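max token throughput (8× H100, 4K/4K): 1.22× the parent
max token throughput (8× H100, 64K/64K): 1.63× the parent
max token throughput (single H100, 64K/64K): up to 2.82× the parent
Accuracy: matches or slightly exceeds the parent's suite average across reasoning efforts
KV-cache footprint reduction: ~40% relative to the parent (long-context optimization)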
Who Should Care
What To Try In 7 Days
Measure request-level efficiency: divide max tok/s by average tokens per request to reflect real cost.
Run replace-1 activation MSE to rank layer importance before pruning.
Test FP8 KV quantization with max-calibrated per-KV scales; verify accuracy vs no-scales baseline on a small validation set.
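For the replace-1 step above, a minimal sketch of the ranking could look like this. It assumes you already have the parent's output activations and, for each candidate block, the output activations of the model with only that block swapped. The helper names and sample numbers are hypothetical, and treating lower MSE as "safer to replace" is the working assumption here.

```python
def mse(a, b):
    """Mean squared error between two equally long activation vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def rank_blocks_by_replace1_mse(parent_out, replaced_outs):
    """Sort blocks from least to most disruptive when replaced alone.

    replaced_outs maps a block name to the activations produced when
    only that block is replaced and every other block is the parent's.
    """
    scores = {name: mse(parent_out, out) for name, out in replaced_outs.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical activations for three candidate replacements.
parent_out = [1.0, 2.0, 3.0]
replaced_outs = {
    "layer_3": [1.0, 2.0, 3.5],
    "layer_5": [1.0, 2.0, 3.0],
    "layer_7": [0.0, 2.0, 5.0],
}
ranking = rank_blocks_by_replace1_mse(parent_out, replaced_outs)
print(ranking)
```

Blocks at the front of the ranking are the first candidates for pruning or conversion; blocks at the back are the ones to leave alone.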
Agent Features
Memory
- KV-cache (quantized FP8)
Tool Use
- Puzzle (post-training NAS)
Frameworks
- Megatron-LM
- vLLM
Architectures
- MoE
- Transformer with selective window attention
- KV-cache quantized serving
Optimization Features
Token Efficiency
- Define request-level efficiency = max tok/s ÷ avg tokens per request
- Track Effort Length Ratio (high/low generation length ratio)
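The two metrics above reduce to simple ratios. The sketch below uses hypothetical function names and made-up measurements to show why a model with higher raw tok/s can still serve fewer requests per second if it generates longer outputs.

```python
def request_level_efficiency(max_tok_per_s, avg_tokens_per_request):
    """Requests per second implied by peak token throughput."""
    return max_tok_per_s / avg_tokens_per_request


def effort_length_ratio(avg_tokens_high_effort, avg_tokens_low_effort):
    """How much longer high-effort generations are than low-effort ones."""
    return avg_tokens_high_effort / avg_tokens_low_effort


# Hypothetical measurements, not figures from the paper.
parent_rps = request_level_efficiency(10_000, 8_000)      # 1.25 requests/s
candidate_rps = request_level_efficiency(12_000, 10_000)  # 1.20 requests/s
relative_request_rate = candidate_rps / parent_rps
ratio = effort_length_ratio(12_000, 3_000)
print(parent_rps, candidate_rps, relative_request_rate, ratio)
```

Here the candidate is 20% faster per token but slower per request, because each request costs 25% more tokens.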
Infra Optimization
- Enables larger effective batch sizes on single H100 by reducing memory footprint
Model Optimization
- Heterogeneous MoE expert pruning
- Selective window (sliding) attention per layer
- RoPE scaling adjustment (32→56)
System Optimization
- Converted 8/18 global attention layers to window attention (64K target)
- Pruned ~25% of experts in MoE layers (4K target)
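To show what converting a global attention layer to window attention changes, here is a small sketch of the two causal masks. The function name and sizes are hypothetical, and the mask is built densely for readability; production kernels do not materialize it this way.

```python
def attention_mask(seq_len, window=None):
    """mask[q][k] is True when query position q may attend to key position k.

    window=None gives causal global attention; an integer restricts each
    query to the most recent `window` positions, itself included.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q
            in_window = window is None or q - k < window
            row.append(causal and in_window)
        mask.append(row)
    return mask


global_mask = attention_mask(4)
window_mask = attention_mask(4, window=2)
print(global_mask[3])
print(window_mask[3])
```

Under the window mask a query never reads keys older than the window, so a window layer only needs to cache the last `window` tokens, whereas a global layer must keep the full context.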
Training Optimization
- 84B-token knowledge distillation with frozen MoE router/experts
- Dual-variant RL stage to recover accuracy while keeping generation length controlled
Inference Optimization
- FP8 KV-cache quantization with max-calibrated scales
- Blockwise replace-1 scoring and MIP selection
- Batch-size and tensor-parallel sweeps to pick best serving config
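The selection step can be pictured as choosing one variant per block so that total quality penalty is minimal under a cost budget. The sketch below swaps in exhaustive search for a real mixed-integer programming solver, which is only workable for a handful of blocks, and it assumes per-block penalties add up. The block names, costs, and penalties are hypothetical.

```python
from itertools import product


def select_architecture(options_per_block, cost_budget):
    """Pick one (name, cost, penalty) option per block.

    Returns (names, total_penalty, total_cost) for the feasible combination
    with the lowest total penalty, or None if nothing fits the budget.
    """
    best = None
    for combo in product(*options_per_block):
        cost = sum(option[1] for option in combo)
        penalty = sum(option[2] for option in combo)
        if cost <= cost_budget and (best is None or penalty < best[1]):
            best = ([option[0] for option in combo], penalty, cost)
    return best


# Hypothetical search space: two attention blocks and one MoE block.
options = [
    [("global", 10, 0.0), ("window", 4, 0.3)],
    [("global", 10, 0.0), ("window", 4, 0.05)],
    [("full_experts", 8, 0.0), ("pruned_experts", 6, 0.1)],
]
choice = select_architecture(options, cost_budget=22)
print(choice)
```

In this toy case the search converts only the attention block whose replacement penalty is small and leaves the rest as in the parent.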
Reproducibility
Data Urls
- nvidia/Llama-Nemotron-Post-Training-Dataset (referenced; dataset name in paper)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Optimizations and gains are benchmark- and hardware-specific (8× H100 and single H100 reported).
- Some tasks and splits still fall below the parent model despite suite-average retention.
- RL fine-tuning can lengthen generations if trained solely on high-effort data; balancing is delicate.
When Not To Use
- If you need exact token-level parity with the parent on an untested task or dataset.
- If your deployment hardware or batch patterns differ drastically from H100 setups used here.
Failure Modes
- Converting the wrong attention layers to window attention can degrade long-range behavior.
- FP8 KV quantization without calibrated scales reduces accuracy (paper found this empirically).
- Over-regularized RL (balanced mix) can reduce controllability of generation length and lower peak accuracy.
Core Entities
Models
- gpt-oss-120B
- gpt-oss-puzzle-88B
- HyperNova-60B
Metrics
- token throughput (tok/s)
- relative request rate (throughput/tokens_per_request)
- Accuracy
- generated tokens per request (K)
- latency (ms/token)
Datasets
- LNPT-gpt-oss
- MMLU-Pro
- AA-LCR
- GPQA-Diamond
- AIME-25
- SciCode
- IFBench
- RULER
Benchmarks
- MMLU-Pro
- HLE
- GPQA-Diamond
- AIME-25
- SciCode
- IFBench
- AA-LCR
- RULER
Context Entities
Models
- Gemma
- Llama 3 / Nemotron
- Nemotron-H
Metrics
- replace-1-block activation MSE
- AA-LCR gap (parent AA-LCR score minus the replace-1 variant's AA-LCR score)
Datasets
- nvidia/Llama-Nemotron-Post-Training-Dataset
Benchmarks
- AA-LCR (used for attention scoring)

