Post-training NAS (Puzzle) compresses gpt-oss-120B into gpt-oss-puzzle-88B to cut KV-cache and MoE costs while retaining reasoning quality

February 12, 2026 · 9 min

Overview

Production Readiness

0.8

Novelty Score

0.5

Cost Impact Score

0.8

Citation Count

0

Authors

Akhiad Bercovich, Nir Ailon, Vladimir Anisimov, Tomer Asida, Nave Assaf, Mohammad Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Roi Koren, Itay Levy, Zach Moshe, Pavlo Molchanov, Najeeb Nabwani, Mostofa Patwari, Omri Puny, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv


Why It Matters For Business

You can cut long-context serving cost by redesigning a trained model for your hardware: heterogeneous MoE pruning, selective window attention, and calibrated FP8 KV quantization reduce memory and improve throughput while keeping or improving accuracy on evaluated reasoning tasks.

Summary TLDR

The authors extend the Puzzle post-training neural-architecture-search flow to handle mixture-of-experts (MoE) and long-context attention. Starting from gpt-oss-120B, they produce gpt-oss-puzzle-88B by pruning experts heterogeneously, converting selected layers to sliding window attention, applying FP8 KV-cache quantization with calibrated scales, and running a short distillation stage followed by RL refinement. Result: 1.63× (64K/64K) and 1.22× (4K/4K) higher throughput on an 8× H100 node, and up to 2.82× on a single H100, while matching or slightly exceeding the parent's suite-average accuracy across reasoning efforts on the evaluated benchmarks.

Problem Statement

Reasoning models generate long internal reasoning traces, which raise KV-cache memory and attention costs and make raw per-token speed metrics misleading as a measure of end-to-end request cost. The paper asks: can post-training architecture search reduce serving cost (latency, throughput, memory) for a large MoE reasoning model without hurting accuracy in both long- and short-context settings?

Main Contribution

Adapt Puzzle NAS to handle MoE layers by ranking and heterogeneously removing experts under expert-parallel constraints.

Introduce a long-context-aware scoring signal to pick which attention layers can be switched to window (sliding) attention.

Combine blockwise substitutions with 84B-token knowledge distillation and a dual-variant RL stage to recover and improve accuracy while keeping generation length controlled.

Apply FP8 KV-cache quantization with max-calibrated per-KV scales to double KV token capacity.

Release a deployment-optimized derivative, gpt-oss-puzzle-88B, derived from gpt-oss-120B and tuned for 8×H100 and single-H100 scenarios.
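The heterogeneous expert pruning described above can be pictured with a minimal sketch. It assumes each expert already has a scalar contribution score and that the number of experts to keep has been chosen per layer. The function name `prune_experts`, the toy scores, and the divisibility check (one simple reading of an "expert-parallel constraint") are illustrative assumptions, not the paper's implementation.

```python
from typing import Dict, List


def prune_experts(
    scores: Dict[int, List[float]],
    keep_per_layer: Dict[int, int],
    ep_size: int = 1,
) -> Dict[int, List[int]]:
    """Return the indices of the experts to keep in each MoE layer.

    scores[layer][i] is a contribution score for expert i (higher = more useful).
    keep_per_layer[layer] may differ from layer to layer (heterogeneous pruning).
    ep_size is a hypothetical expert-parallel group size: the kept count must
    divide evenly across that many devices.
    """
    kept: Dict[int, List[int]] = {}
    for layer, layer_scores in scores.items():
        k = keep_per_layer[layer]
        if not 0 < k <= len(layer_scores):
            raise ValueError(f"layer {layer}: cannot keep {k} experts")
        if k % ep_size != 0:
            raise ValueError(f"layer {layer}: {k} not divisible by ep_size={ep_size}")
        # Rank experts from most to least important and keep the top k.
        ranked = sorted(range(len(layer_scores)),
                        key=lambda i: layer_scores[i], reverse=True)
        kept[layer] = sorted(ranked[:k])
    return kept


# Toy example: layer 0 is pruned to half its experts, layer 1 is left intact.
toy_scores = {0: [0.9, 0.1, 0.5, 0.3], 1: [0.2, 0.8, 0.7, 0.6]}
kept_experts = prune_experts(toy_scores, keep_per_layer={0: 2, 1: 4}, ep_size=2)
print(kept_experts)
```

In a real pipeline the per-layer keep counts would come from the architecture search rather than being set by hand.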

Key Findings

Node throughput improved 1.63× in long-context (64K/64K) and 1.22× in short-context (4K/4K) on an 8× H100 node

Numbers: 64K/64K: 9.3K vs 5.7K tok/s; 4K/4K: 36.1K vs 29.6K tok/s (Table 1)

Single-GPU throughput increased up to 2.82× for 64K/64K and 2.44× for 4K/4K

Numbers: Single H100 64K/64K: 0.8K vs 0.3K tok/s; 4K/4K: 3.3K vs 1.4K tok/s (Table 1)

Request-level efficiency (throughput normalized by tokens generated) improved up to 1.29× across evaluated reasoning efforts

Numbers: Up to 1.29× higher request-level efficiency; suite-average accuracy retention 100.8%–108.2% (Section 4 intro, Figure 1)

Selective window attention reduced KV-cache size by ~40% relative to parent for long-context optimization

Numbers: KV-cache size 40% smaller after converting 8 of 18 global attention layers (Section 2)

MoE pruning removed ~25% of experts where safe, using per-expert contribution scores

Numbers: Per-layer expert counts chosen from an 8–128 choice set; overall, 25% of experts in MoE layers removed for the 4K/4K target (Section 2)

FP8 KV quantization with max-calibrated per-KV scales doubled KV token capacity versus BF16 KV

Numbers: FP8 KV enabled ~2× KV-cache token capacity and faster attention modules (Section 3)
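The two KV-cache findings above follow from simple arithmetic, sketched here under stated assumptions. A global attention layer caches every token, a window layer caches at most its window, and FP8 stores one byte per element where BF16 stores two. Only the 18 global layers mentioned above are modeled; the head count, head dimension, and window size are placeholder values, so the printed reduction is not expected to reproduce the reported ~40%.

```python
from typing import List, Optional


def kv_cache_bytes(
    seq_len: int,
    layer_windows: List[Optional[int]],
    num_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int,
) -> int:
    """Per-sequence KV-cache size in bytes.

    layer_windows holds one entry per attention layer: None for global
    attention, or an integer window size for sliding window attention.
    """
    total = 0
    for window in layer_windows:
        cached_tokens = seq_len if window is None else min(seq_len, window)
        # Factor 2 accounts for storing both keys and values.
        total += 2 * cached_tokens * num_kv_heads * head_dim * bytes_per_elem
    return total


SEQ_LEN = 64 * 1024
KV_HEADS, HEAD_DIM = 8, 64      # placeholder shapes, not taken from the paper
WINDOW = 4096                   # hypothetical window for converted layers

parent_layers = [None] * 18                  # 18 global attention layers
child_layers = [None] * 10 + [WINDOW] * 8    # 8 of the 18 converted to window

parent_bf16 = kv_cache_bytes(SEQ_LEN, parent_layers, KV_HEADS, HEAD_DIM, 2)
parent_fp8 = kv_cache_bytes(SEQ_LEN, parent_layers, KV_HEADS, HEAD_DIM, 1)
child_bf16 = kv_cache_bytes(SEQ_LEN, child_layers, KV_HEADS, HEAD_DIM, 2)

print("BF16 / FP8 size ratio:", parent_bf16 / parent_fp8)
print("window-conversion reduction:", 1 - child_bf16 / parent_bf16)
```

Halving the bytes per element is what allows roughly twice as many tokens to fit in the same cache budget; the small overhead of storing quantization scales is ignored in this sketch.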

Results

max token throughput (8× H100, 4K/4K)

Value: 36.1K tok/s

Baseline (gpt-oss-120B): 29.6K tok/s

max token throughput (8× H100, 64K/64K)

Value: 9.3K tok/s

Baseline (gpt-oss-120B): 5.7K tok/s

max token throughput (single H100, 64K/64K)

Value: 0.8K tok/s

Baseline (gpt-oss-120B): 0.3K tok/s

Accuracy

Value: High 58.67%; Medium 54.93%; Low 48.38%

Baseline (gpt-oss-120B, BF16 KV): High 59.20%; Medium 53.66%; Low 45.41%

KV-cache footprint reduction

Value: 40% smaller KV-cache (selected layers)

Baseline: parent gpt-oss-120B KV-cache

Who Should Care

What To Try In 7 Days

Measure request-level efficiency: divide max tok/s by average tokens per request to reflect real cost.

Run replace-1 activation MSE to rank layer importance before pruning.

Test FP8 KV quantization with max-calibrated per-KV scales; verify accuracy vs no-scales baseline on a small validation set.
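For the FP8 experiment in the last item, a small pure-Python simulation can show what max calibration does before touching a serving stack. The sketch assumes the E4M3 format (largest finite value 448) and one scale per tensor name, "k" and "v". The source does not specify the granularity of "per-KV scales" here, so that choice, the helper names, and the toy data are all assumptions. The rounding routine is a simplified stand-in for a real FP8 cast.

```python
import math
from typing import Dict, List

FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def fake_quant_e4m3(x: float) -> float:
    """Simulate a saturating cast to FP8 E4M3 (3 mantissa bits)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)          # saturate out-of-range values
    if mag < 2.0 ** -6:                      # subnormal range: fixed step 2^-9
        return sign * round(mag / 2.0 ** -9) * 2.0 ** -9
    mantissa, exponent = math.frexp(mag)     # mag = mantissa * 2**exponent
    # mantissa is in [0.5, 1): keeping 4 significant bits = multiples of 1/16.
    return sign * round(mantissa * 16) / 16 * 2.0 ** exponent


def max_calibrated_scales(calib: Dict[str, List[float]]) -> Dict[str, float]:
    """One scale per tensor so the largest calibration value maps to 448."""
    scales = {}
    for name, values in calib.items():
        amax = max(abs(v) for v in values)
        scales[name] = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return scales


def quant_dequant(values: List[float], scale: float) -> List[float]:
    return [fake_quant_e4m3(v / scale) * scale for v in values]


def mse(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy calibration data: the keys contain outliers beyond the FP8 range.
calib = {"k": [0.5, -3.0, 120.0, 900.0, -1500.0], "v": [0.02, -0.4, 1.5, -2.0]}
scales = max_calibrated_scales(calib)
k_err_scaled = mse(calib["k"], quant_dequant(calib["k"], scales["k"]))
k_err_unscaled = mse(calib["k"], quant_dequant(calib["k"], 1.0))
print("k scale:", scales["k"], "scaled MSE:", k_err_scaled,
      "unscaled MSE:", k_err_unscaled)
```

In this toy data only the key tensor exceeds the FP8 range, so the unscaled cast saturates there and the calibrated scale removes that error. A real check would compare task accuracy with and without scales on a validation set, as the item suggests.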

Agent Features

Memory

  • KV-cache (quantized FP8)

Tool Use

  • Puzzle (post-training NAS)

Frameworks

  • Megatron-LM
  • vLLM

Architectures

  • MoE
  • Transformer with selective window attention
  • KV-cache quantized serving

Optimization Features

Token Efficiency

  • Define request-level efficiency = max tok/s ÷ avg tokens per request
  • Track Effort Length Ratio (high/low generation length ratio)
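Both token-efficiency metrics above are single divisions, sketched below. The throughput figures are the 8× H100 4K/4K values reported earlier; the average tokens per request and the high/low generation lengths are hypothetical placeholders that you would replace with your own measurements.

```python
def request_level_efficiency(max_tok_per_s: float,
                             avg_tokens_per_request: float) -> float:
    """Requests served per second: max throughput / average tokens per request."""
    return max_tok_per_s / avg_tokens_per_request


def effort_length_ratio(high_effort_len: float, low_effort_len: float) -> float:
    """Ratio of average generation length at high effort to that at low effort."""
    return high_effort_len / low_effort_len


# Throughput from the reported 8x H100, 4K/4K setting (tok/s).
child_tps, parent_tps = 36_100.0, 29_600.0
# Hypothetical average generated tokens per request for each model.
child_tokens, parent_tokens = 5_000.0, 5_200.0

child_eff = request_level_efficiency(child_tps, child_tokens)
parent_eff = request_level_efficiency(parent_tps, parent_tokens)
print("relative request-level efficiency:", child_eff / parent_eff)

# Hypothetical generation lengths at high and low reasoning effort.
print("effort length ratio:", effort_length_ratio(12_000.0, 3_000.0))
```

A model that is faster per token but generates longer answers can come out worse on the first metric, which is the reason to track it alongside raw tok/s.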

Infra Optimization

  • Enables larger effective batch sizes on single H100 by reducing memory footprint

Model Optimization

  • Heterogeneous MoE expert pruning
  • Selective window (sliding) attention per layer
  • RoPE scaling adjustment (32→56)

System Optimization

  • Converted 8/18 global attention layers to window attention (64K target)
  • Pruned ~25% of experts in MoE layers (4K target)

Training Optimization

  • 84B-token knowledge distillation with frozen MoE router/experts
  • Dual-variant RL refinement after distillation

Inference Optimization

  • FP8 KV-cache quantization with max-calibrated scales
  • Blockwise replace-1 scoring and MIP selection
  • Batch-size and tensor-parallel sweeps to pick best serving config
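The scoring-then-selection step in the list above can be sketched as a small constrained choice: each block has candidate variants with a cost and a quality penalty, and one variant per block is picked so that the total penalty is minimal within a budget. The sketch swaps the MIP solver for exhaustive search, which is only viable for toy sizes. The variant names, costs, and penalties are invented for illustration; the penalties stand in for replace-1 scores.

```python
from itertools import product
from typing import List, Optional, Tuple

Variant = Tuple[str, float, float]  # (name, cost, penalty)


def select_variants(
    candidates: List[List[Variant]], budget: float
) -> Optional[Tuple[float, List[str]]]:
    """Pick one variant per block, minimizing total penalty under a cost budget.

    Exhaustive search is used as a stand-in for a MIP solver. Returns
    (total_penalty, chosen_names), or None if no combination fits the budget.
    """
    best: Optional[Tuple[float, List[str]]] = None
    for combo in product(*candidates):
        cost = sum(variant[1] for variant in combo)
        penalty = sum(variant[2] for variant in combo)
        if cost <= budget and (best is None or penalty < best[0]):
            best = (penalty, [variant[0] for variant in combo])
    return best


# Hypothetical blocks: two attention layers and one MoE layer.
blocks = [
    [("global_attn", 4.0, 0.0), ("window_attn", 1.0, 0.30)],
    [("global_attn", 4.0, 0.0), ("window_attn", 1.0, 0.05)],
    [("moe_128", 6.0, 0.0), ("moe_96", 4.5, 0.10)],
]
selection = select_variants(blocks, budget=11.0)
print(selection)
```

With these toy numbers the search converts only the second attention layer, the one whose replacement costs the least quality, which mirrors the idea of converting layers selectively rather than uniformly.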

Reproducibility

Data Urls

  • nvidia/Llama-Nemotron-Post-Training-Dataset (referenced; dataset name in paper)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Optimizations and gains are benchmark- and hardware-specific (8× H100 and single H100 reported).
  • Some tasks and splits still fall below the parent model despite suite-average retention.
  • RL fine-tuning can lengthen generations if trained solely on high-effort data; balancing is delicate.

When Not To Use

  • If you need exact token-level parity with the parent on an untested task or dataset.
  • If your deployment hardware or batch patterns differ drastically from H100 setups used here.

Failure Modes

  • Converting the wrong attention layers to window attention can degrade long-range behavior.
  • FP8 KV quantization without calibrated scales reduces accuracy (paper found this empirically).
  • Over-regularized RL (balanced mix) can reduce controllability of generation length and lower peak accuracy.

Core Entities

Models

  • gpt-oss-120B
  • gpt-oss-puzzle-88B
  • HyperNova-60B

Metrics

  • token throughput (tok/s)
  • relative request rate (throughput/tokens_per_request)
  • Accuracy
  • generated tokens per request (K)
  • latency (ms/token)

Datasets

  • LNPT-gpt-oss
  • MMLU-Pro
  • AA-LCR
  • GPQA-Diamond
  • AIME-25
  • SciCode
  • IFBench
  • RULER

Benchmarks

  • MMLU-Pro
  • HLE
  • GPQA-Diamond
  • AIME-25
  • SciCode
  • IFBench
  • AA-LCR
  • RULER

Context Entities

Models

  • Gemma
  • Llama 3 / Nemotron
  • Nemotron-H

Metrics

  • replace-1-block activation MSE
  • AA-LCR gap (parent AA-LCR score minus replace-1 score)

Datasets

  • nvidia/Llama-Nemotron-Post-Training-Dataset

Benchmarks

  • AA-LCR (used for attention scoring)