Large-scale empirical benchmark showing how attention variants, PEFT, MoE, and int4 quantization trade performance for memory, latency, and 

May 20, 2025 · 10 min

Overview

Decision Snapshot: Ready For Pilot

The paper reports real-hardware measurements across many model sizes on modern GPUs, so its recommendations are practical; the results are still bounded by the hardware used (GH200/H200) and the chosen datasets.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 50%

Authors

Zhengqing Yuan, Weixiang Sun, Yixin Liu, Huichi Zhou, Rong Zhou, Yiyang Li, Zheyuan Zhang, Wei Song, Yue Huang, Haolong Jia, Keerthiram Murugesan, Yu Wang, Lifang He, Jianfeng Gao, Lichao Sun, Yanfang Ye

Why It Matters For Business

EfficientLLM translates technical trade-offs into concrete cost, latency, and energy numbers so teams can choose methods that match budgets and SLAs instead of guessing.

Who Should Care

Summary TLDR

EfficientLLM is a large, hardware-grounded benchmark that measures efficiency trade-offs for LLMs across three lifecycle stages: architecture pretraining (attention variants and MoE), fine-tuning (PEFT variants like LoRA/RSLoRA/DoRA), and inference (bfloat16/float16 and post‑training int4 quantization). Run on a 48×GH200 + 8×H200 cluster, the study evaluates 100+ model–technique pairs using six system-aware metrics (memory, compute, latency, throughput, energy, compression). Main practical takeaways: no single method wins all axes; int4 yields ~3.9× memory/energy gains at ~3–5% average-task drop; MoE improves quality and reduces FLOPs per token but increases VRAM (~40%); PEFT choice should, 

Problem Statement

Practitioners lack a single, large-scale empirical guide that measures real-world efficiency trade-offs (memory, latency, throughput, energy, compression) across architectures, fine-tuning methods, and low‑bit inference on modern GPUs. Without that, teams choose techniques by hearsay and risk suboptimal cost or performance in production.

Main Contribution

A unified, three-axis benchmark (architecture pretraining, fine-tuning, bit‑width quantization) with real-hardware measurements on GH200/H200 clusters.

A concise metric suite tailored to deployment: AMU, PCU, AL, TT/ST/IT, AEC, and MCR that captures memory, compute, latency, throughput, energy, and compression trade-offs.
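As a rough illustration of how such a metric suite can be computed from raw measurements, here is a minimal sketch. The record layout (`RunSample`), the function name `summarize`, and the exact formulas are assumptions made for this example based on a plain reading of the metric names; the paper's own definitions may differ.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunSample:
    """One measured request (hypothetical record layout)."""
    latency_s: float    # wall-clock time for the request, in seconds
    tokens: int         # tokens produced by the request
    energy_j: float     # energy drawn while serving it, in joules
    mem_used_gb: float  # device memory in use while serving it, in GB


def summarize(samples, mem_total_gb, base_size_gb, compressed_size_gb):
    """Aggregate per-request samples into deployment-style metrics.

    All formulas below are illustrative assumptions, not the paper's definitions.
    """
    total_time = sum(s.latency_s for s in samples)
    total_tokens = sum(s.tokens for s in samples)
    total_energy = sum(s.energy_j for s in samples)
    return {
        # Memory: mean fraction of device memory in use.
        "AMU": mean(s.mem_used_gb for s in samples) / mem_total_gb,
        # Latency: mean seconds per request.
        "AL_s": mean(s.latency_s for s in samples),
        # Throughput: tokens per second over the whole run.
        "tokens_per_s": total_tokens / total_time,
        # Energy: joules spent per token.
        "energy_j_per_token": total_energy / total_tokens,
        # Compression: baseline model size over compressed model size.
        "MCR": base_size_gb / compressed_size_gb,
    }


samples = [
    RunSample(latency_s=2.0, tokens=100, energy_j=500.0, mem_used_gb=40.0),
    RunSample(latency_s=4.0, tokens=200, energy_j=1000.0, mem_used_gb=60.0),
]
report = summarize(samples, mem_total_gb=100.0, base_size_gb=16.0, compressed_size_gb=4.0)
print(report)
```

Keeping all metrics in one report makes it easier to see a trade-off (for example, a higher MCR paired with a higher latency) than tracking each number separately.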

Key Findings

No single technique is best across all efficiency axes.

Practical Use: Pick techniques based on your dominant constraint (memory, latency, or energy) rather than seeking a universal winner.

Evidence Ref: Section 2.1, overall observations

Post-training int4 quantization reduces memory and energy by up to 3.9× while causing a ~3–5% average-task score drop on the evaluated benchmarks.

Numbers: MCR ≈ 3.9×; avg score drop ≈ 3–5%

Practical Use: For memory- or cost-limited deployment, use int4 to gain ~4× compression and large throughput gains, but validate task accuracy on your data first.

Evidence Ref: Abstract; Sections 2.1 and 5.5; Table 9
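The compression figure can be sanity-checked with simple arithmetic. Going from 16-bit to 4-bit weights gives an ideal 4× ratio; group-wise quantization schemes also store a scale and zero point per group, which pulls the ratio slightly below 4×. The sketch below assumes a group-wise scheme with hypothetical group sizes and metadata widths. It is one plausible reading of why a measured ratio sits just under 4×, not the paper's stated explanation.

```python
def effective_bits_per_weight(weight_bits=4, group_size=128,
                              scale_bits=16, zero_point_bits=4):
    """Average storage per weight when each group of weights shares one
    scale and one zero point (assumed group-wise quantization layout)."""
    return weight_bits + (scale_bits + zero_point_bits) / group_size


def compression_vs_baseline(bits_per_weight, baseline_bits=16):
    """Size ratio relative to a 16-bit (bfloat16/float16) baseline."""
    return baseline_bits / bits_per_weight


# Hypothetical group sizes; smaller groups mean more metadata overhead.
for group_size in (32, 64, 128):
    bits = effective_bits_per_weight(group_size=group_size)
    ratio = compression_vs_baseline(bits)
    print(f"group_size={group_size}: {bits:.3f} bits/weight, {ratio:.2f}x smaller")
```

Layers that stay in 16-bit precision (commonly embeddings and normalization weights) would lower the whole-model ratio further, so a real measurement is still needed.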

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
int4 quantization compression | up to 3.9× MCR vs bf16 | bfloat16 | ≈3.9× smaller memory | LLaMA/DeepSeek/Qwen families (1.5B–34B) | Table 9 reports MCR ≈3.64–3.9 for int4 across multiple models | Table 9; Section 5.5
Average task score drop from int4 | ≈3–5 percentage points drop | bfloat16 | ≈-3% to -5% average score | evaluated benchmarks (MMLU-Pro, BBH, GPQA, etc.) | Abstract and Section 2.1 note measured 3–5% average-task score drop | Abstract; Section 2.1

What To Try In 7 Days

Run int4 post-training quantization on a production-size model and measure task accuracy against memory and throughput to check whether you get the expected 3–4× memory savings.

If tuning a >14B model, trial RSLoRA and compare latency/energy to LoRA; use freezing if interactive tuning speed is essential.

For memory-limited serving, swap standard attention for MQA and run a short latency+memory profile to confirm lower AMU and latency.
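Before profiling MQA on real hardware, a back-of-the-envelope estimate of KV-cache size shows why it helps memory-limited serving: the cache scales with the number of key/value heads, and MQA keeps only one. The sketch below uses the standard KV-cache size formula; the model configuration (32 layers, 32 query heads, head dimension 128) is a hypothetical example, not a model from the benchmark.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """KV-cache size in bytes; the leading 2 covers keys and values.
    bytes_per_elem=2 corresponds to bfloat16/float16 storage."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical serving configuration, for illustration only.
cfg = dict(n_layers=32, head_dim=128, seq_len=4096, batch_size=8)

mha = kv_cache_bytes(n_kv_heads=32, **cfg)  # one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **cfg)   # 4 query heads share each KV head
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)   # all query heads share one KV head

for name, size in (("MHA", mha), ("GQA", gqa), ("MQA", mqa)):
    print(f"{name}: {size / 2**30:.2f} GiB")
```

This only covers cache memory. Latency and quality effects depend on kernels and the model itself, which is why the short profile suggested above is still worth running.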

Optimization Features

Token Efficiency
Data filtering and deduplication (FineWeb-Edu); Curriculum learning
Infra Optimization
Benchmarking on GH200/H200 GPU clusters; NVLink/InfiniBand high-bandwidth interconnect
Model Optimization
MoE; Efficient attention (MQA, GQA, MLA, NSA); Attention-free backbones (Mamba, RWKV)
System Optimization
DeepSpeed ZeRO (memory offload/3); Megatron-Core 3D parallelism; FlashAttention kernels
Training Optimization
LoRA; Mixed-precision training (bfloat16/float16); Scaling-law informed pretraining (Chinchilla-style)
Inference Optimization
Post-training quantization to int4; bfloat16-first precision strategy; PagedAttention / FlashAttention kernels

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

FineWeb-Edu (described in paper; sampling instructions provided in repo)

OpenO1-SFT (referenced dataset release notes in paper)

Risks & Boundaries

Limitations

Experiments ran on a specific GH200/H200 cluster, so results may shift on TPUs or different GPU generations.

Int8 results were excluded due to backend instability; int4 findings don't guarantee int8 behavior.

When Not To Use

Do not use int4 quantization when small accuracy drops are unacceptable (e.g., high‑stakes medical reasoning) without task-specific validation.

Avoid MoE if GPU memory is the binding resource or if your infrastructure cannot hold the expert parameters.

Failure Modes

Int4 quantization causing catastrophic failure on precision‑sensitive tasks (math or long numerical reasoning).

MoE routing overhead or memory blow-up when expert parameter storage exceeds device limits.
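The MoE memory risk comes from the gap between parameters that must be stored and parameters that are actually used per token. A minimal sketch of that gap follows, assuming a plain two-matrix feed-forward expert with top-k routing. The layer sizes are hypothetical and are not meant to reproduce the ~40% VRAM increase reported in the study.

```python
def moe_ffn_params(d_model, d_ff, n_experts, top_k, n_layers):
    """Return (stored, active) feed-forward parameter counts for an MoE stack.

    Assumes each expert is an up-projection plus a down-projection with no
    biases, and a linear router per layer (illustrative simplification).
    """
    per_expert = 2 * d_model * d_ff
    router = d_model * n_experts
    stored = n_layers * (n_experts * per_expert + router)  # must fit in memory
    active = n_layers * (top_k * per_expert + router)      # used per token
    return stored, active


# Hypothetical configuration, for illustration only.
stored, active = moe_ffn_params(d_model=4096, d_ff=14336,
                                n_experts=8, top_k=2, n_layers=32)

bytes_per_param = 2  # bfloat16/float16
print(f"stored: {stored * bytes_per_param / 2**30:.1f} GiB of expert weights")
print(f"active per token: {active / stored:.0%} of stored parameters")
```

Compute per token tracks the active count while memory tracks the stored count, so checking the stored figure against device capacity is a cheap first test before committing to an MoE deployment.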

Core Entities

Models

LLaMA-3 (1B/3B/8B/70B); DeepSeek-R1 distill variants (1.5B/8B/14B); Qwen-2.5 (7B/14B/32B/72B); Phi-3.5-mini/Phi-4 (3.5B/14B); Yi-34B; Mistral-7B; Mixtral (MoE) variants; Mamba, Pythia, RWKV (attention-free/alternative backbones); DiT-style LVMs (various sizes); Stable Diffusion 3.5 (LVM)

Metrics

AMU (Average Memory Utilization); PCU (Peak Compute Utilization); AL (Average Latency); TT (Token Throughput); ST (Sample Throughput); IT (Inference Throughput); AEC (Average Energy Consumption); MCR (Model Compression Rate); Perplexity (PPL); Tokens/s

Datasets

FineWeb-Edu (350B sample); SFT; ChatQA; Disney Organized; WikiArt Sargent

Benchmarks

MMLU-Pro; BBH; GPQA; IFEval; MATH; MuSR; HumanEval; HARDMath; FID (vision quality)