Post-training NAS (Puzzle) compresses gpt-oss-120B into gpt-oss-puzzle-88B to cut KV-cache and MoE costs while retaining reasoning quality

Overview

Decision Snapshot: Ready For Pilot

The paper shows consistent throughput and request-level efficiency gains on H100 nodes and a single H100, with accuracy matched or slightly improved on several reasoning benchmarks; results are tied to specific hardware and the LNPT dataset.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 50%

Authors

Akhiad Bercovich, Nir Ailon, Vladimir Anisimov, Tomer Asida, Nave Assaf, Mohammad Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Roi Koren, Itay Levy, Zach Moshe, Pavlo Molchanov, Najeeb Nabwani, Mostofa Patwari, Omri Puny, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv

Links

Abstract / PDF / Data

Why It Matters For Business

You can cut long-context serving cost by redesigning a trained model for your hardware: heterogeneous MoE pruning, selective window attention, and calibrated FP8 KV quantization reduce memory and improve throughput while keeping or improving accuracy on evaluated reasoning tasks.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead, Data Scientist

Summary TLDR

The authors extend the Puzzle post-training neural-architecture-search flow to handle mixture-of-experts (MoE) and long-context attention. Starting from gpt-oss-120B, they produce gpt-oss-puzzle-88B by pruning experts heterogeneously, selectively converting some layers to sliding window attention, applying FP8 KV-cache quantization with calibrated scales, and doing short distillation plus RL refinements. Result: up to 1.63× node throughput (64K/64K), 1.22× (4K/4K), and up to 2.82× on a single H100, while matching or slightly exceeding suite-average accuracy across reasoning efforts on the evaluated benchmarks.

Problem Statement

Reasoning models generate long internal reasoning traces, which inflate KV-cache memory and attention cost and make raw per-token speed a misleading proxy for end-to-end request cost. The paper asks whether post-training architecture search can reduce serving cost (latency, throughput, memory) for a large MoE reasoning model without hurting accuracy in either long- or short-context settings.
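
To make the memory pressure concrete, the following back-of-the-envelope sketch estimates KV-cache size for a single request. The layer count, head count, and head dimension are hypothetical placeholders, not values from the paper; only the general formula (keys plus values, per layer, per token) is standard.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem):
    """Estimate KV-cache size: keys and values for every layer and token."""
    # The factor 2 accounts for storing both K and V.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Hypothetical configuration for illustration only (not from the paper).
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=65536, batch_size=1)

bf16_bytes = kv_cache_bytes(**cfg, bytes_per_elem=2)  # 16-bit cache
fp8_bytes = kv_cache_bytes(**cfg, bytes_per_elem=1)   # 8-bit cache

print(f"BF16: {bf16_bytes / 2**30:.1f} GiB, FP8: {fp8_bytes / 2**30:.1f} GiB")
```

Under these assumed dimensions, an 8-bit cache halves the footprint of a 16-bit one, which is the kind of saving that frees room for larger batches at long context.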

Main Contribution

Adapt Puzzle NAS to handle MoE layers by ranking and heterogeneously removing experts under expert-parallel constraints.

Introduce a long-context-aware scoring signal to pick which attention layers can be switched to window (sliding) attention.
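
The first contribution can be pictured with a small sketch. It assumes each expert already has an importance score and models the expert-parallel constraint as "the number of kept experts per layer must be divisible by the expert-parallel size"; that reading of the constraint, and the names `prune_experts`, `keep_per_layer`, and `ep_size`, are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def prune_experts(scores, keep_per_layer, ep_size):
    """Keep the top-scoring experts in each layer.

    scores: list of 1-D arrays, one importance score per expert per layer.
    keep_per_layer: how many experts to keep in each layer (may differ by layer).
    ep_size: assumed expert-parallel degree; kept counts must divide evenly.
    """
    kept = []
    for layer_scores, keep in zip(scores, keep_per_layer):
        if keep % ep_size != 0:
            raise ValueError("kept expert count must be divisible by ep_size")
        order = np.argsort(layer_scores)[::-1]  # highest score first
        kept.append(np.sort(order[:keep]))      # indices of surviving experts
    return kept

# Toy example: two layers with different (heterogeneous) keep budgets.
toy_scores = [np.array([0.1, 0.9, 0.5, 0.7]), np.array([0.4, 0.3, 0.2, 0.1])]
kept = prune_experts(toy_scores, keep_per_layer=[2, 4], ep_size=2)
print([k.tolist() for k in kept])
```

The per-layer budget is what makes the pruning heterogeneous: layers whose experts matter less can give up more of them.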

Key Findings

Node throughput improved 1.63× in long-context (64K/64K) and 1.22× in short-context (4K/4K) on an 8× H100 node

Numbers: 64K/64K: 9.3K vs 5.7K tok/s; 4K/4K: 36.1K vs 29.6K tok/s (Table 1)

Practical Use: On multi-GPU serving, you can lower cost or serve more requests by swapping the parent for the Puzzle-derived model under similar hardware and batching.

Evidence Ref: Table 1, Section 4.1

Single-GPU throughput increased up to 2.82× for 64K/64K and 2.44× for 4K/4K

Numbers: Single H100 64K/64K: 0.8K vs 0.3K tok/s; 4K/4K: 3.3K vs 1.4K tok/s (Table 1)

Practical Use: On memory-limited devices you can run larger batches or serve more requests without changing hardware.

Evidence Ref: Table 1, Section 4.1.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
max token throughput (8× H100, 4K/4K)	36.1K tok/s	gpt-oss-120B: 29.6K tok/s	1.22×	4K/4K scenario (Table 1)	Table 1: 36.1K vs 29.6K	Table 1
max token throughput (8× H100, 64K/64K)	9.3K tok/s	gpt-oss-120B: 5.7K tok/s	1.63×	64K/64K scenario (Table 1)	Table 1: 9.3K vs 5.7K	Table 1
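
The Delta column can be reproduced directly from the Value and Baseline columns. A minimal check, using only the figures in the table above:

```python
# (derived model tok/s, parent tok/s) in thousands, copied from the table above
rows = {
    "8x H100, 4K/4K": (36.1, 29.6),
    "8x H100, 64K/64K": (9.3, 5.7),
}

# Speedup is the derived model's throughput divided by the parent's.
speedups = {name: derived / parent for name, (derived, parent) in rows.items()}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.2f}x")
```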

What To Try In 7 Days

Measure request-level efficiency: divide max tok/s by average tokens per request to reflect real cost.

Run replace-1 activation MSE to rank layer importance before pruning.

Test FP8 KV quantization with max-calibrated per-KV scales; verify accuracy vs no-scales baseline on a small validation set.
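
The replace-1 step can be sketched on a toy model: swap one block at a time for a cheaper candidate and measure how far the activations move from the parent's. Measuring the MSE on the final output (rather than at the block boundary) is an assumption here, and the blocks are simple stand-in functions, not real Transformer layers.

```python
import numpy as np

def run(blocks, x):
    """Apply a list of blocks in sequence."""
    for block in blocks:
        x = block(x)
    return x

def replace_one_scores(parent_blocks, candidate_blocks, x):
    """MSE between parent output and output with exactly one block replaced."""
    reference = run(parent_blocks, x)
    scores = []
    for i, candidate in enumerate(candidate_blocks):
        variant = list(parent_blocks)
        variant[i] = candidate  # replace only block i
        out = run(variant, x)
        scores.append(float(np.mean((out - reference) ** 2)))
    return scores

# Toy parent: each block rescales its input; block 0 is effectively a no-op.
parent = [lambda x, s=s: x * (1.0 + s) for s in (0.0, 0.1, 0.5)]
# Toy candidates: replace each block with the identity.
candidates = [lambda x: x] * len(parent)

scores = replace_one_scores(parent, candidates, np.ones((4, 8)))
ranking = sorted(range(len(scores)), key=lambda i: scores[i])
print(scores, ranking)  # low score = cheap to replace
```

Blocks with the lowest score are the safest candidates for replacement or pruning; a high score signals a block the model depends on.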

Agent Features

Memory

KV-cache (quantized FP8)

Tool Use

Puzzle (post-training NAS)

Frameworks

Megatron-LM, vLLM

Architectures

MoE

Transformer with selective window attention

KV-cache quantized serving

Optimization Features

Token Efficiency

Define request-level efficiency = max tok/s ÷ avg tokens per request

Track Effort Length Ratio (high/low generation length ratio)
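
Both metrics are simple ratios. The sketch below uses the 4K/4K node throughputs reported in Table 1; the tokens-per-request and generation-length figures are made-up placeholders, so the printed results illustrate the calculation only.

```python
def request_rate(max_tok_per_s, avg_tokens_per_request):
    """Request-level efficiency: requests served per second at peak throughput."""
    return max_tok_per_s / avg_tokens_per_request

def effort_length_ratio(high_effort_tokens, low_effort_tokens):
    """Ratio of generation length at high vs low reasoning effort."""
    return high_effort_tokens / low_effort_tokens

# Throughputs from Table 1 (8x H100, 4K/4K), in tok/s.
derived_tok_s, parent_tok_s = 36_100, 29_600
# Hypothetical average tokens per request (not from the paper).
derived_tokens, parent_tokens = 4_500, 5_000

relative = request_rate(derived_tok_s, derived_tokens) / request_rate(parent_tok_s, parent_tokens)
print(f"relative request rate: {relative:.2f}x")

# Hypothetical generation lengths at high and low effort.
print(f"effort length ratio: {effort_length_ratio(12_000, 3_000):.1f}")
```

If the derived model also generates fewer tokens per request, the request-level gain exceeds the raw tok/s gain; if it generates more, part of the throughput gain is lost.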

Infra Optimization

Enables larger effective batch sizes on single H100 by reducing memory footprint

Model Optimization

Heterogeneous MoE expert pruning

Selective window (sliding) attention per layer

RoPE scaling adjustment (32→56)

System Optimization

Converted 8/18 global attention layers to window attention (64K target)

Pruned ~25% of experts in MoE layers (4K target)
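
For readers unfamiliar with the distinction, the sketch below builds the attention mask for a global (full causal) layer and for a window layer. The sequence length and window size are arbitrary illustrative values, not the paper's settings.

```python
import numpy as np

def causal_mask(seq_len, window=None):
    """Boolean mask: entry [i, j] is True if query i may attend to key j."""
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    mask = j <= i                    # causal: no attending to the future
    if window is not None:
        mask &= (i - j) < window     # window: only the most recent `window` keys
    return mask

global_mask = causal_mask(5)
window_mask = causal_mask(5, window=2)

print(global_mask.astype(int))
print(window_mask.astype(int))
```

A window layer only ever needs the last `window` keys and values, so its KV-cache stops growing with context length; the trade-off is that it cannot directly reach tokens outside the window.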

Training Optimization

84B-token knowledge distillation with frozen MoE router/experts

RL

Inference Optimization

FP8 KV-cache quantization with max-calibrated scales

Blockwise replace-1 scoring and MIP selection

Batch-size and tensor-parallel sweeps to pick best serving config
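
The idea behind max-calibrated scales can be sketched with a NumPy simulation of FP8 E4M3 rounding. This is not the paper's kernel or any serving framework's API: the rounding routine is a simplified simulation, the single per-tensor scale is an assumption, and the synthetic data is chosen so that unscaled values exceed the FP8 range, which is only one possible way unscaled quantization can lose accuracy.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3

def fake_quant_e4m3(x):
    """Simulate rounding to FP8 E4M3 (3 mantissa bits, min normal exponent -6)."""
    x = np.clip(np.asarray(x, dtype=np.float64), -E4M3_MAX, E4M3_MAX)
    # Exponent of each value; small values share the subnormal step size.
    exp = np.floor(np.log2(np.maximum(np.abs(x), 2.0 ** -6)))
    step = 2.0 ** (exp - 3)  # spacing between representable values
    return np.round(x / step) * step

def max_calibrated_scale(calibration):
    """Scale that maps the largest calibration magnitude onto E4M3_MAX."""
    return float(np.max(np.abs(calibration))) / E4M3_MAX

def kv_roundtrip(x, scale=1.0):
    """Quantize with the given scale, then dequantize."""
    return fake_quant_e4m3(x / scale) * scale

# Synthetic "keys" with magnitudes beyond the FP8 range (illustrative only).
rng = np.random.default_rng(0)
keys = rng.normal(0.0, 300.0, size=(1024, 64))

scale = max_calibrated_scale(keys)  # in practice, use a separate calibration set
mse_scaled = float(np.mean((kv_roundtrip(keys, scale) - keys) ** 2))
mse_unscaled = float(np.mean((kv_roundtrip(keys) - keys) ** 2))
print(f"scaled MSE: {mse_scaled:.1f}, unscaled MSE: {mse_unscaled:.1f}")
```

On this synthetic tensor the calibrated scale avoids clipping and gives a much lower reconstruction error; on real KV tensors the comparison should be made on downstream accuracy, as the 7-day checklist suggests.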

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

nvidia/Llama-Nemotron-Post-Training-Dataset (referenced; dataset name in paper)

Risks & Boundaries

Limitations

Optimizations and gains are benchmark- and hardware-specific (8× H100 and single H100 reported).

Some tasks and splits still fall below the parent model despite suite-average retention.

When Not To Use

If you need exact token-level parity with the parent on an untested task or dataset.

If your deployment hardware or batch patterns differ drastically from H100 setups used here.

Failure Modes

Converting the wrong attention layers to window attention can degrade long-range behavior.

FP8 KV quantization without calibrated scales reduces accuracy (paper found this empirically).

Core Entities

Models

gpt-oss-120B, gpt-oss-puzzle-88B, HyperNova-60B

Metrics

token throughput (tok/s), relative request rate (throughput/tokens_per_request), accuracy, generated tokens per request (K), latency (ms/token)

Datasets

LNPT-gpt-oss, MMLU-Pro, AA-LCR, GPQA-Diamond, AIME-25, SciCode, IFBench, RULER

Benchmarks

MMLU-Pro, HLE, GPQA-Diamond, AIME-25, SciCode, IFBench, AA-LCR, RULER

Context Entities

Models

Gemma, Llama 3 / Nemotron, Nemotron-H

Metrics

replace-1-block activation MSE, AA-LCR gap (AA-LCR parent - replace-1)

Datasets

nvidia/Llama-Nemotron-Post-Training-Dataset

Benchmarks

AA-LCR (used for attention scoring)
