Overview
The paper shows consistent throughput and request-level efficiency gains on H100 nodes and a single H100, with accuracy matched or slightly improved on several reasoning benchmarks; results are tied to specific hardware and the LNPT dataset.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 5/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 80%
Novelty: 50%
Why It Matters For Business
You can cut long-context serving cost by redesigning a trained model for your hardware: heterogeneous MoE pruning, selective window attention, and calibrated FP8 KV quantization reduce memory and improve throughput while keeping or improving accuracy on evaluated reasoning tasks.
Summary TLDR
The authors extend the Puzzle post-training neural-architecture-search flow to handle mixture-of-experts (MoE) and long-context attention. Starting from gpt-oss-120B, they produce gpt-oss-puzzle-88B by pruning experts heterogeneously, selectively converting some layers to sliding window attention, applying FP8 KV-cache quantization with calibrated scales, and doing short distillation plus RL refinements. Result: up to 1.63× node throughput (64K/64K), 1.22× (4K/4K), and up to 2.82× on a single H100, while matching or slightly exceeding suite-average accuracy across reasoning efforts on the evaluated benchmarks.
Problem Statement
Reasoning models generate long chains of internal tokens, which raises KV-cache memory and attention costs and makes raw per-token speed metrics misleading for end-to-end request cost. The paper asks whether post-training architecture search can reduce serving cost (latency, throughput, memory) for a large MoE reasoning model without hurting accuracy across long- and short-context settings.
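To see why per-token speed can mislead, it helps to normalise throughput by how many tokens a request actually consumes. The sketch below is a minimal illustration; the function name and all numbers are hypothetical and are not taken from the paper.

```python
def requests_per_second(max_tokens_per_second: float,
                        avg_tokens_per_request: float) -> float:
    """Request-level efficiency: token throughput divided by tokens per request."""
    if avg_tokens_per_request <= 0:
        raise ValueError("avg_tokens_per_request must be positive")
    return max_tokens_per_second / avg_tokens_per_request


# Hypothetical models: one is faster per token but produces longer reasoning traces.
fast_but_verbose = requests_per_second(10_000.0, 8_000.0)
slower_but_concise = requests_per_second(8_000.0, 4_000.0)

print(f"fast but verbose:   {fast_but_verbose:.2f} req/s")
print(f"slower but concise: {slower_but_concise:.2f} req/s")
```

In this made-up comparison the model with lower raw tok/s serves more requests per second, which is the effect a request-level metric is meant to expose.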
Main Contribution
Adapt Puzzle NAS to handle MoE layers by ranking and heterogeneously removing experts under expert-parallel constraints.
Introduce a long-context-aware scoring signal to pick which attention layers can be switched to window (sliding) attention.
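For readers unfamiliar with window (sliding) attention, the following sketch contrasts a full causal mask with a sliding-window causal mask. It is a generic illustration, not the paper's implementation; the sequence length and window size are arbitrary assumptions.

```python
from typing import List

Mask = List[List[bool]]


def causal_mask(seq_len: int) -> Mask:
    """Full causal attention: query q may attend to every key k <= q."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]


def sliding_window_mask(seq_len: int, window: int) -> Mask:
    """Windowed causal attention: query q attends only to the last `window` keys."""
    return [[k <= q and q - k < window for k in range(seq_len)]
            for q in range(seq_len)]


def attended_pairs(mask: Mask) -> int:
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


full_pairs = attended_pairs(causal_mask(8))
window_pairs = attended_pairs(sliding_window_mask(8, window=3))
print(full_pairs, window_pairs)
```

With a fixed window, each query touches at most `window` keys, so the KV cache for that layer stops growing with context length. That is why converting the right layers matters for long-context cost, and why converting the wrong ones can hurt long-range behavior.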
Key Findings
Node throughput improved 1.63× in long-context (64K/64K) and 1.22× in short-context (4K/4K) on an 8× H100 node
Single-GPU throughput increased up to 2.82× for 64K/64K and 2.44× for 4K/4K
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| max token throughput (8× H100, 4K/4K) | 36.1K tok/s | gpt-oss-120B: 29.6K tok/s | 1.22× | 4K/4K scenario (Table 1) | Table 1: 36.1K vs 29.6K | Table 1 |
| max token throughput (8× H100, 64K/64K) | 9.3K tok/s | gpt-oss-120B: 5.7K tok/s | 1.63× | 64K/64K scenario (Table 1) | Table 1: 9.3K vs 5.7K | Table 1 |
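The Delta column is the ratio of the reported value to the baseline. As a quick arithmetic check, the snippet below recomputes both ratios from the Table 1 figures quoted in the rows above; the variable names are illustrative only.

```python
# (puzzle model K tok/s, gpt-oss-120B K tok/s) as quoted from Table 1
rows = {
    "4K/4K": (36.1, 29.6),
    "64K/64K": (9.3, 5.7),
}

# Speedup = new throughput / baseline throughput, rounded to two decimals
speedups = {name: round(new / base, 2) for name, (new, base) in rows.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio}x")
```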
What To Try In 7 Days
Measure request-level efficiency: divide max tok/s by average tokens per request to reflect real cost.
Run replace-1 activation MSE to rank layer importance before pruning.
Test FP8 KV quantization with max-calibrated per-KV scales; verify accuracy vs no-scales baseline on a small validation set.
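The replace-1 idea from the list above can be sketched on a toy model. This summary does not spell out the paper's exact procedure, so the code assumes the simplest reading: swap exactly one layer for a cheaper candidate, run the same inputs, and score the swap by the MSE between the variant's final activations and the parent's. The toy layers, the identity candidate, and the inputs are all hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def run(layers: Sequence[Layer], x: Vector) -> Vector:
    """Apply the layers in order and return the final activations."""
    for layer in layers:
        x = layer(x)
    return x


def mse(a: Vector, b: Vector) -> float:
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


def replace_one_scores(parent: Sequence[Layer],
                       candidates: Sequence[Layer],
                       inputs: Sequence[Vector]) -> List[float]:
    """Score layer i by the output MSE when only layer i is replaced."""
    scores = []
    for i, candidate in enumerate(candidates):
        variant = list(parent)
        variant[i] = candidate  # replace exactly one layer
        total = sum(mse(run(parent, x), run(variant, x)) for x in inputs)
        scores.append(total / len(inputs))
    return scores


# Toy parent model: scale, tiny shift, square (stand-ins for real blocks).
parent = [
    lambda v: [2.0 * t for t in v],
    lambda v: [t + 0.01 for t in v],
    lambda v: [t * t for t in v],
]
identity = lambda v: list(v)  # hypothetical "cheap" replacement
inputs = [[1.0, 2.0], [0.5, -1.0]]

scores = replace_one_scores(parent, [identity] * len(parent), inputs)
ranking = sorted(range(len(scores)), key=scores.__getitem__)  # least important first
print(scores, ranking)
```

A low score means the model barely notices the replacement, so that layer is a natural first candidate for pruning or simplification. On a real model the inputs would be a calibration set and the candidates would be the pruned or windowed variants under consideration.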
Risks & Boundaries
Limitations
Optimizations and gains are benchmark- and hardware-specific (8× H100 and single H100 reported).
Some tasks and splits still fall below the parent model despite suite-average retention.
When Not To Use
If you need exact token-level parity with the parent on an untested task or dataset.
If your deployment hardware or batch patterns differ drastically from the H100 setups used here.
Failure Modes
Converting the wrong attention layers to window attention can degrade long-range behavior.
FP8 KV quantization without calibrated scales reduces accuracy (the paper reports this empirically).
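One plausible reason uncalibrated FP8 hurts is that small-magnitude KV values land in the coarse low end of the format. The sketch below simulates FP8 E4M3 rounding in pure Python and compares quantize/dequantize error with and without a max-calibrated scale. It is an assumption-laden illustration: it uses a single per-tensor scale, made-up KV values, and a simplified rounding model, and it does not reproduce the paper's calibration procedure.

```python
import math
from typing import List

E4M3_MAX = 448.0  # largest finite value in FP8 E4M3


def round_to_e4m3(x: float) -> float:
    """Simulate saturating round-to-nearest into FP8 E4M3 (no NaN handling)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    _, exp = math.frexp(a)      # a = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)        # -6 is the smallest normal exponent
    step = 2.0 ** (e - 3)       # 3 mantissa bits give 8 steps per binade
    return sign * min(round(a / step) * step, E4M3_MAX)


def max_calibrated_scale(values: List[float]) -> float:
    """Map the largest observed magnitude onto the FP8 maximum."""
    amax = max(abs(v) for v in values)
    return amax / E4M3_MAX if amax > 0 else 1.0


def quant_dequant(values: List[float], scale: float) -> List[float]:
    return [round_to_e4m3(v / scale) * scale for v in values]


def mse(a: List[float], b: List[float]) -> float:
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


# Hypothetical small-magnitude KV entries
kv = [0.004, -0.0031, 0.0102, 0.0007, -0.0089]

mse_no_scale = mse(kv, quant_dequant(kv, 1.0))
mse_scaled = mse(kv, quant_dequant(kv, max_calibrated_scale(kv)))
print(f"no scale: {mse_no_scale:.3e}  max-calibrated: {mse_scaled:.3e}")
```

On these toy values the calibrated scale gives a clearly lower reconstruction error. Whether that carries over to task accuracy has to be checked on a validation set.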

