Post-training NAS (Puzzle) compresses gpt-oss-120B into gpt-oss-puzzle-88B to cut KV-cache and MoE costs while retaining reasoning quality

February 12, 2026 · 9 min

Overview

Decision Snapshot: Ready For Pilot

The paper shows consistent throughput and request-level efficiency gains on H100 nodes and a single H100, with accuracy matched or slightly improved on several reasoning benchmarks; results are tied to specific hardware and the LNPT dataset.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 50%

Authors

Akhiad Bercovich, Nir Ailon, Vladimir Anisimov, Tomer Asida, Nave Assaf, Mohammad Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Roi Koren, Itay Levy, Zach Moshe, Pavlo Molchanov, Najeeb Nabwani, Mostofa Patwari, Omri Puny, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv

Links

Abstract / PDF / Data

Why It Matters For Business

You can cut long-context serving cost by redesigning a trained model for your hardware: heterogeneous MoE pruning, selective window attention, and calibrated FP8 KV quantization reduce memory and improve throughput while keeping or improving accuracy on evaluated reasoning tasks.
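To make the memory argument concrete, here is a minimal back-of-the-envelope sketch of KV-cache size per sequence. Only the 8-of-18 global-layer conversion and the FP8 cache come from this summary; the number of window layers, the window size, the KV head count, the head dimension, and the 2-byte parent cache are illustrative assumptions, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(seq_len, n_global, n_window, window,
                   n_kv_heads, head_dim, bytes_per_elem):
    """Approximate KV-cache size for one sequence.

    Global-attention layers cache every token; window-attention layers
    cache at most `window` tokens.
    """
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem  # K and V
    cached_tokens = n_global * seq_len + n_window * min(seq_len, window)
    return per_token_per_layer * cached_tokens


# Assumed shapes for illustration only (not taken from the paper).
SEQ_LEN, WINDOW, KV_HEADS, HEAD_DIM = 65_536, 128, 8, 64

# Parent: 18 global layers (from the summary) plus 18 assumed window layers,
# with an assumed 2-byte (16-bit) cache.
parent_bytes = kv_cache_bytes(SEQ_LEN, 18, 18, WINDOW, KV_HEADS, HEAD_DIM, 2)

# Derived model: 8 of the 18 global layers converted to window attention,
# with a 1-byte FP8 cache.
student_bytes = kv_cache_bytes(SEQ_LEN, 10, 26, WINDOW, KV_HEADS, HEAD_DIM, 1)

ratio = student_bytes / parent_bytes
print(f"parent:  {parent_bytes / 2**20:.0f} MiB per sequence")
print(f"student: {student_bytes / 2**20:.0f} MiB per sequence")
print(f"ratio:   {ratio:.2f}")
```

Under these assumptions, most of the saving comes from two multiplicative effects: fewer layers that cache the full context, and half the bytes per cached element.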

Who Should Care

Summary TLDR

The authors extend the Puzzle post-training neural-architecture-search flow to handle mixture-of-experts (MoE) and long-context attention. Starting from gpt-oss-120B, they produce gpt-oss-puzzle-88B by pruning experts heterogeneously, selectively converting some layers to sliding window attention, applying FP8 KV-cache quantization with calibrated scales, and finishing with a short distillation pass plus RL refinement. Result: up to 1.63× node throughput at 64K/64K, 1.22× at 4K/4K, and up to 2.82× on a single H100, while matching or slightly exceeding the parent's suite-average accuracy across reasoning efforts on the evaluated benchmarks.

Problem Statement

Reasoning models generate long sequences of internal (reasoning) tokens, which raises KV-cache memory and attention costs and makes raw per-token speed metrics misleading as a proxy for end-to-end request cost. The paper asks: can post-training architecture search reduce serving cost (latency, throughput, memory) for a large MoE reasoning model without hurting accuracy across long- and short-context settings?
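A small sketch of why per-token speed can mislead: this page defines request-level efficiency as max tok/s divided by average tokens per request. The throughput figures below are the 64K/64K node numbers from Table 1; the tokens-per-request values are hypothetical, chosen only to show how longer generations eat into a throughput gain.

```python
def request_rate(max_tok_per_s: float, avg_tokens_per_request: float) -> float:
    """Requests served per second at peak token throughput."""
    if avg_tokens_per_request <= 0:
        raise ValueError("avg_tokens_per_request must be positive")
    return max_tok_per_s / avg_tokens_per_request


# Throughput from Table 1 (8x H100, 64K/64K), in tokens per second.
PARENT_TOK_S = 5_700
STUDENT_TOK_S = 9_300

# Hypothetical generation lengths: suppose the derived model emits
# 10% more tokens per request than the parent.
PARENT_TOKENS_PER_REQ = 8_000
STUDENT_TOKENS_PER_REQ = 8_800

parent_rate = request_rate(PARENT_TOK_S, PARENT_TOKENS_PER_REQ)
student_rate = request_rate(STUDENT_TOK_S, STUDENT_TOKENS_PER_REQ)

token_speedup = STUDENT_TOK_S / PARENT_TOK_S
request_speedup = student_rate / parent_rate

print(f"token-level speedup:   {token_speedup:.2f}x")
print(f"request-level speedup: {request_speedup:.2f}x")
```

In this hypothetical, the request-level gain is smaller than the token-level gain because the faster model also spends more tokens per request; the reverse holds if it spends fewer.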

Main Contribution

Adapt Puzzle NAS to handle MoE layers by ranking and heterogeneously removing experts under expert-parallel constraints.

Introduce a long-context-aware scoring signal to pick which attention layers can be switched to window (sliding) attention.
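The replace-1 scoring idea mentioned later on this page can be illustrated with a toy model: swap one block at a time for a cheaper candidate, run the otherwise unchanged parent, and record the mean squared error against the parent's output. This is a sketch under strong assumptions, not the authors' implementation: blocks are scalar functions rather than transformer layers, and `run` and `replace_1_scores` are hypothetical names.

```python
from typing import Callable, List, Sequence

Block = Callable[[float], float]


def run(blocks: Sequence[Block], x: float) -> float:
    """Apply the blocks in order to a scalar 'activation'."""
    for block in blocks:
        x = block(x)
    return x


def replace_1_scores(parent: Sequence[Block],
                     candidates: Sequence[Block],
                     inputs: Sequence[float]) -> List[float]:
    """Score candidate i by the output MSE when only block i is replaced."""
    reference = [run(parent, x) for x in inputs]
    scores = []
    for i, candidate in enumerate(candidates):
        patched = list(parent)
        patched[i] = candidate  # replace exactly one block
        outputs = [run(patched, x) for x in inputs]
        mse = sum((o - r) ** 2 for o, r in zip(outputs, reference)) / len(inputs)
        scores.append(mse)
    return scores


# Toy parent and cheaper candidates (all values are made up).
parent = [lambda x: 2.0 * x, lambda x: x + 1.0, lambda x: 0.5 * x]
candidates = [lambda x: 1.9 * x, lambda x: x + 0.5, lambda x: 0.5 * x]
calibration_inputs = [1.0, 2.0, 3.0]

scores = replace_1_scores(parent, candidates, calibration_inputs)
# Lower score means the replacement disturbs the output less.
ranking = sorted(range(len(scores)), key=scores.__getitem__)
print(scores, ranking)
```

Blocks at the front of `ranking` are the safest to replace in this toy setup. A real pipeline would compare activation tensors on calibration data and, per this summary, add a long-context signal (the AA-LCR gap) for attention layers.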

Key Findings

Node throughput improved 1.63× in long-context (64K/64K) and 1.22× in short-context (4K/4K) on an 8× H100 node

Numbers: 64K/64K: 9.3K vs 5.7K tok/s; 4K/4K: 36.1K vs 29.6K tok/s (Table 1)

Practical Use: On multi-GPU serving, you can lower cost or serve more requests by swapping the parent for the Puzzle-derived model under similar hardware and batching.

Evidence Ref: Table 1, Section 4.1

Single-GPU throughput increased up to 2.82× for 64K/64K and 2.44× for 4K/4K

Numbers: Single H100 64K/64K: 0.8K vs 0.3K tok/s; 4K/4K: 3.3K vs 1.4K tok/s (Table 1)

Practical Use: On memory-limited devices you can run larger batches or serve more requests without changing hardware.

Evidence Ref: Table 1, Section 4.1.1
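As a quick sanity check, the speedups can be recomputed from the rounded throughput figures quoted above. The sketch below uses only numbers that appear on this page; the dictionary layout and names are illustrative.

```python
# (derived model tok/s in K, parent tok/s in K, reported speedup)
table1 = {
    "8x H100, 64K/64K": (9.3, 5.7, 1.63),
    "8x H100, 4K/4K": (36.1, 29.6, 1.22),
    "1x H100, 64K/64K": (0.8, 0.3, 2.82),
    "1x H100, 4K/4K": (3.3, 1.4, 2.44),
}

recomputed = {
    name: student / parent
    for name, (student, parent, _reported) in table1.items()
}

for name, (_student, _parent, reported) in table1.items():
    print(f"{name}: recomputed {recomputed[name]:.2f}x, reported {reported:.2f}x")
```

The node-level ratios reproduce the reported 1.63× and 1.22×. The single-GPU ratios recompute to roughly 2.67× and 2.36×, below the reported 2.82× and 2.44×. That gap is consistent with the throughput values being rounded to 0.1K tok/s, though this is an assumption rather than something the summary states.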

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| max token throughput (8× H100, 4K/4K) | 36.1K tok/s | gpt-oss-120B: 29.6K tok/s | 1.22× | 4K/4K scenario (Table 1) | Table 1: 36.1K vs 29.6K | Table 1 |
| max token throughput (8× H100, 64K/64K) | 9.3K tok/s | gpt-oss-120B: 5.7K tok/s | 1.63× | 64K/64K scenario (Table 1) | Table 1: 9.3K vs 5.7K | Table 1 |

What To Try In 7 Days

Measure request-level efficiency: divide max tok/s by average tokens per request to reflect real cost.

Run replace-1 activation MSE to rank layer importance before pruning.

Test FP8 KV quantization with max-calibrated per-KV scales; verify accuracy vs no-scales baseline on a small validation set.
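For the FP8 experiment, a pure-Python simulation can show why a max-calibrated scale matters before touching a real serving stack. This is a sketch under assumptions: it assumes the E4M3 variant of FP8 (the summary does not say which format is used), rounds values to a simulated E4M3 grid instead of using hardware FP8, and uses made-up key values. All function names are hypothetical.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def round_to_e4m3(v: float) -> float:
    """Round v to the nearest value on a simulated E4M3 grid (saturating)."""
    if v == 0.0:
        return 0.0
    sign = -1.0 if v < 0 else 1.0
    mag = min(abs(v), E4M3_MAX)  # saturate instead of overflowing
    # Exponents below -6 fall in the subnormal range, which shares one step.
    exp = max(math.floor(math.log2(mag)), -6)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits
    return sign * min(round(mag / step) * step, E4M3_MAX)


def max_calibrated_scale(calibration_values) -> float:
    """Pick the scale so the largest calibration magnitude maps to E4M3_MAX."""
    amax = max(abs(v) for v in calibration_values)
    return amax / E4M3_MAX if amax > 0 else 1.0


def fake_quantize(values, scale: float):
    """Quantize to simulated FP8 with the given scale, then dequantize."""
    return [round_to_e4m3(v / scale) * scale for v in values]


def mse(a, b) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Made-up small-magnitude key values; real KV tensors will differ.
keys = [0.0004, -0.0011, 0.0007, 0.0020]

scale = max_calibrated_scale(keys)
err_calibrated = mse(keys, fake_quantize(keys, scale))
err_unscaled = mse(keys, fake_quantize(keys, 1.0))  # no-scales baseline

print(f"calibrated MSE: {err_calibrated:.3e}")
print(f"unscaled MSE:   {err_unscaled:.3e}")
```

In this toy case the unscaled values land in the subnormal range and mostly collapse to zero or a single step, while the calibrated scale spreads them across the representable range. In practice the same comparison would be run on real K and V tensors, with one scale each, against a small validation set.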

Agent Features

Memory: KV-cache (quantized FP8)
Tool Use: Puzzle (post-training NAS)
Frameworks: Megatron-LM, vLLM
Architectures: MoE; Transformer with selective window attention; KV-cache quantized serving

Optimization Features

Token Efficiency: Define request-level efficiency = max tok/s ÷ avg tokens per request; Track Effort Length Ratio (high/low generation length ratio)
Infra Optimization: Enables larger effective batch sizes on single H100 by reducing memory footprint
Model Optimization: Heterogeneous MoE expert pruning; Selective window (sliding) attention per layer; RoPE scaling adjustment (32→56)
System Optimization: Converted 8/18 global attention layers to window attention (64K target); Pruned ~25% of experts in MoE layers (4K target)
Training Optimization: 84B-token knowledge distillation with frozen MoE router/experts; RL
Inference Optimization: FP8 KV-cache quantization with max-calibrated scales; Blockwise replace-1 scoring and MIP selection; Batch-size and tensor-parallel sweeps to pick best serving config
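The selection step that follows blockwise scoring can be sketched as a small constrained search: pick one variant per layer so that total cost stays within a budget while the summed quality penalty is minimal. The sketch below uses brute-force enumeration as a stand-in for a mixed-integer programming (MIP) solver, which only works at toy scale. The costs, penalties, budget, and the `select` helper are all hypothetical.

```python
from itertools import product

# Per-layer options as (variant name, cost, quality penalty).
# All numbers are made up for illustration.
layer_options = [
    [("global", 4.0, 0.00), ("window", 1.0, 0.30)],
    [("global", 4.0, 0.00), ("window", 1.0, 0.02)],
    [("global", 4.0, 0.00), ("window", 1.0, 0.05)],
]


def select(options, budget):
    """Return (penalty, choice) with minimal penalty under the cost budget.

    Returns None if no combination fits the budget.
    """
    best = None
    for combo in product(*options):
        cost = sum(option[1] for option in combo)
        penalty = sum(option[2] for option in combo)
        if cost <= budget and (best is None or penalty < best[0]):
            best = (penalty, [option[0] for option in combo])
    return best


best_penalty, best_choice = select(layer_options, budget=6.0)
print(best_choice, round(best_penalty, 2))
```

With this budget only one layer can stay global, and the search keeps global attention on the layer whose window replacement scored worst. A real search over dozens of layers and several variants per layer would need an actual MIP solver, since enumeration grows exponentially.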

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

nvidia/Llama-Nemotron-Post-Training-Dataset (referenced; dataset name in paper)

Risks & Boundaries

Limitations

Optimizations and gains are benchmark- and hardware-specific (8× H100 and single H100 reported).

Some tasks and splits still fall below the parent model despite suite-average retention.

When Not To Use

If you need exact token-level parity with the parent on an untested task or dataset.

If your deployment hardware or batch patterns differ drastically from H100 setups used here.

Failure Modes

Converting the wrong attention layers to window attention can degrade long-range behavior.

FP8 KV quantization without calibrated scales reduces accuracy (paper found this empirically).

Core Entities

Models

gpt-oss-120B, gpt-oss-puzzle-88B, HyperNova-60B

Metrics

token throughput (tok/s), relative request rate (throughput/tokens_per_request), accuracy, generated tokens per request (K), latency (ms/token)

Datasets

LNPT-gpt-oss, MMLU-Pro, AA-LCR, GPQA-Diamond, AIME-25, SciCode, IFBench, RULER

Benchmarks

MMLU-Pro, HLE, GPQA-Diamond, AIME-25, SciCode, IFBench, AA-LCR, RULER

Context Entities

Models

Gemma, Llama 3 / Nemotron, Nemotron-H

Metrics

replace-1-block activation MSE, AA-LCR gap (AA-LCR parent - replace-1)

Datasets

nvidia/Llama-Nemotron-Post-Training-Dataset

Benchmarks

AA-LCR (used for attention scoring)