Large-scale empirical benchmark showing how attention variants, PEFT, MoE, and int4 quantization trade performance for memory, latency, and 

May 20, 2025 · 10 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.8

Citation Count

0

Authors

Zhengqing Yuan, Weixiang Sun, Yixin Liu, Huichi Zhou, Rong Zhou, Yiyang Li, Zheyuan Zhang, Wei Song, Yue Huang, Haolong Jia, Keerthiram Murugesan, Yu Wang, Lifang He, Jianfeng Gao, Lichao Sun, Yanfang Ye

Links

Abstract / PDF

Why It Matters For Business

EfficientLLM translates technical trade-offs into concrete cost, latency, and energy numbers so teams can choose methods that match budgets and SLAs instead of guessing.

Summary TLDR

EfficientLLM is a large, hardware-grounded benchmark that measures efficiency trade-offs for LLMs across three lifecycle stages: architecture pretraining (attention variants and MoE), fine-tuning (PEFT variants like LoRA/RSLoRA/DoRA), and inference (bfloat16/float16 and post‑training int4 quantization). Run on a 48×GH200 + 8×H200 cluster, the study evaluates 100+ model–technique pairs using six system-aware metrics (memory, compute, latency, throughput, energy, compression). Main practical takeaways: no single method wins all axes; int4 yields ~3.9× memory/energy gains at ~3–5% average-task drop; MoE improves quality and reduces FLOPs per token but increases VRAM (~40%); PEFT choice should, 

Problem Statement

Practitioners lack a single, large-scale empirical guide that measures real-world efficiency trade-offs (memory, latency, throughput, energy, compression) across architectures, fine-tuning methods, and low‑bit inference on modern GPUs. Without that, teams choose techniques by hearsay and risk suboptimal cost or performance in production.

Main Contribution

A unified, three-axis benchmark (architecture pretraining, fine-tuning, bit‑width quantization) with real-hardware measurements on GH200/H200 clusters.

A concise metric suite tailored to deployment: AMU, PCU, AL, TT/ST/IT, AEC, and MCR that captures memory, compute, latency, throughput, energy, and compression trade-offs.

A 100+ model-technique empirical study (0.5B–72B params) comparing efficient attention (MQA/GQA/MLA/NSA), MoE, PEFT variants (LoRA/RSLoRA/DoRA/LoRA-plus), and post-training quantization to int4.

Open release of evaluation pipelines, datasets, and leaderboards to help reproduce and extend efficiency comparisons.

Key Findings

No single technique is best across all efficiency axes.

Post-training int4 quantization reduces memory and energy by up to 3.9×, at the cost of a ~3–5% drop in average task score on the evaluated benchmarks.

Numbers: MCR ≈ 3.9×; avg score drop ≈ 3–5%
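The gap between an ideal 4× (16 bits down to 4) and the reported ≈3.9× can be explored with a little arithmetic. The sketch below assumes group-wise quantization in which each group of weights shares one 16-bit scale. The group size of 128 and the scale width are illustrative assumptions, not settings taken from the benchmark, and the benchmark's own MCR definition may differ from a plain size ratio.

```python
# Illustrative arithmetic only. The group size and scale width are assumptions,
# not values reported by the benchmark.

def bits_per_weight(weight_bits: int, group_size: int, scale_bits: int = 16) -> float:
    """Effective storage per weight when each group shares one scale."""
    return weight_bits + scale_bits / group_size


def compression_ratio(baseline_bits: int = 16, weight_bits: int = 4,
                      group_size: int = 128, scale_bits: int = 16) -> float:
    """Size ratio of a 16-bit checkpoint to its group-quantized version."""
    return baseline_bits / bits_per_weight(weight_bits, group_size, scale_bits)


def weight_memory_gb(n_params: float, bits: float) -> float:
    """Weight-only memory in GB (10^9 bytes); ignores activations and KV cache."""
    return n_params * bits / 8 / 1e9


if __name__ == "__main__":
    eff_bits = bits_per_weight(4, 128)
    print(f"effective bits per weight: {eff_bits:.3f}")
    print(f"compression vs 16-bit:     {compression_ratio():.2f}x")
    # Hypothetical 7B-parameter model, weights only.
    print(f"7B weights: {weight_memory_gb(7e9, 16):.1f} GB -> "
          f"{weight_memory_gb(7e9, eff_bits):.1f} GB")
```

Under these assumptions the ratio comes out at about 3.88×, in the same range as the reported figure. Smaller groups or added zero-points would lower it further.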

Mixture‑of‑Experts (MoE) improves quality and cuts training FLOPs but raises VRAM by ≈40%.

Numbers: VRAM ↑ ≈ 40%; training FLOPs ↓ 1.8×; accuracy gain up to +3.5 pp
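The split between memory cost and compute cost follows from how MoE layers are used: every expert must be stored, but each token is routed through only a few of them. The sketch below makes that explicit with a simplified parameter count. The layer sizes, expert count, and top-k value are hypothetical, and the example is not meant to reproduce the 40% or 1.8× figures, which also depend on activations, optimizer state, and routing overhead.

```python
# Simplified MoE accounting. All sizes below are hypothetical.

def moe_params(shared: float, expert: float, n_experts: int, top_k: int):
    """Return (total, active) parameter counts for a simple MoE model.

    shared: parameters every token uses (attention, embeddings, ...)
    expert: parameters in a single expert
    """
    total = shared + n_experts * expert   # must all be held in memory
    active = shared + top_k * expert      # used for any single token
    return total, active


if __name__ == "__main__":
    total, active = moe_params(shared=2e9, expert=1e9, n_experts=8, top_k=2)
    print(f"stored parameters:           {total / 1e9:.0f}B")
    print(f"active parameters per token: {active / 1e9:.0f}B")
    # Weight memory tracks `total`; per-token compute tracks `active`.
    print(f"stored / active ratio:       {total / active:.2f}")
```

The useful comparison in practice is against a dense model with the same active parameter count, which is the baseline the results below use for the VRAM figure.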

Efficient attention variants have different optima: MQA minimizes memory+latency; MLA gives lowest perplexity; NSA minimizes energy per token on tested scales.

Numbers: AMU best for MQA (≈42 GB at 1.5B); MLA yields lowest PPL across scales; NSA lowest AEC in tests
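MQA's memory advantage is easiest to see in the key-value cache, whose size scales with the number of KV heads. The sketch below uses the standard cache-size formula for multi-head, grouped-query, and multi-query attention. The model shape (32 layers, 32 query heads, head dimension 128, 4096-token context) is a hypothetical configuration, not one from the benchmark, and MLA and NSA are not covered because they change the cache layout rather than only the head count.

```python
# KV-cache sizing for MHA / GQA / MQA. The model shape is hypothetical.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one batch of sequences."""
    # Factor 2 accounts for storing both K and V in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


if __name__ == "__main__":
    shape = dict(n_layers=32, head_dim=128, seq_len=4096)
    variants = {"MHA (32 KV heads)": 32, "GQA (8 KV heads)": 8, "MQA (1 KV head)": 1}
    for name, kv_heads in variants.items():
        size = kv_cache_bytes(n_kv_heads=kv_heads, **shape)
        print(f"{name:<18} {size / 1024**2:8.0f} MiB")
```

For a fixed shape the cache shrinks in direct proportion to the number of KV heads, which is why MQA sits at the low-memory end of the trade-off.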

PEFT methods scale differently: LoRA/variants excel at 1–3B, RSLoRA outperforms at ≥14B, and parameter freezing yields the lowest tuning latency (≈3× faster).

Numbers: Freeze latency ≈3× lower; RSLoRA wins beyond 14B
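For readers unfamiliar with how LoRA and RSLoRA differ, the sketch below shows the one change between them: LoRA scales the low-rank update by alpha / r, while rank-stabilized LoRA scales it by alpha / sqrt(r). It also counts the trainable parameters a rank-r adapter adds to one weight matrix. The alpha, rank, and matrix sizes are illustrative assumptions, and this difference alone is not offered as the explanation for the scale-dependent results above.

```python
import math

# Adapter scaling and size. The alpha, rank, and layer sizes are illustrative.

def lora_scale(alpha: float, rank: int) -> float:
    """Standard LoRA scaling applied to the low-rank update B @ A."""
    return alpha / rank


def rslora_scale(alpha: float, rank: int) -> float:
    """Rank-stabilized LoRA scaling: divides by sqrt(rank) instead of rank."""
    return alpha / math.sqrt(rank)


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in A (rank x d_in) plus B (d_out x rank) for one matrix."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    for rank in (8, 64, 256):
        print(f"rank={rank:<4} LoRA scale={lora_scale(16, rank):.4f}  "
              f"RSLoRA scale={rslora_scale(16, rank):.4f}  "
              f"params per 4096x4096 matrix={lora_trainable_params(4096, 4096, rank):,}")
```

As the rank grows, the LoRA factor shrinks faster than the RSLoRA factor, so the two methods diverge most at high ranks.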

On Hopper-class GPUs, bfloat16 usually beats float16, with ~6% lower latency and ~9% lower energy.

Numbers: Latency ≈6% lower; energy ≈9% lower (bfloat16 vs float16)

Results

int4 quantization compression

Value: up to 3.9× MCR vs bf16

Baseline: bfloat16

Average task score drop from int4

Value: ≈3–5 percentage-point drop

Baseline: bfloat16

MoE VRAM increase

Value: ≈40% VRAM increase vs dense equivalent

Baseline: dense model of equivalent active params

PEFT latency reduction via freezing

Value: ≈3× lower fine-tuning latency

Baseline: PEFT variants (LoRA variants)

bfloat16 vs float16 efficiency

Value: ≈6% latency and ≈9% energy improvement for bfloat16

Baseline: float16

What To Try In 7 Days

Run int4 post-training quantization on a model you serve in production and measure task accuracy alongside memory and throughput to check whether the 3–4× memory savings hold for your workload.

If tuning a >14B model, trial RSLoRA and compare latency/energy to LoRA; use freezing if interactive tuning speed is essential.

For memory-limited serving, swap standard attention for MQA and run a short latency+memory profile to confirm lower AMU and latency.
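A short latency and memory profile of the kind suggested above can be sketched with the Python standard library alone. The helper name `profile_callable` and its parameters are hypothetical. The memory figure comes from `tracemalloc`, which tracks host-side Python allocations only, so for GPU serving you would replace it with your framework's device-memory counters (for example `torch.cuda.max_memory_allocated` in PyTorch).

```python
import statistics
import time
import tracemalloc

# Hypothetical profiling helper. Measures host-side Python memory only.

def profile_callable(fn, *, warmup: int = 2, runs: int = 10) -> dict:
    """Time `fn` over several runs, then trace one extra run for peak memory."""
    for _ in range(warmup):          # discard cold-start effects
        fn()

    latencies = []
    for _ in range(runs):            # timed runs, without tracing overhead
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)

    tracemalloc.start()              # separate traced run for memory
    fn()
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    ordered = sorted(latencies)
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "mean_latency_s": statistics.mean(latencies),
        "p95_latency_s": ordered[p95_index],
        "peak_host_bytes": peak_bytes,
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a call into your own model.
    report = profile_callable(lambda: sum(i * i for i in range(100_000)))
    for key, value in report.items():
        print(f"{key}: {value}")
```

Running the same helper on the baseline and on the candidate configuration (quantized weights, a different adapter, or MQA) gives a like-for-like comparison on your own hardware.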

Optimization Features

Token Efficiency

  • Data filtering and deduplication (FineWeb-Edu)
  • Curriculum learning

Infra Optimization

  • Benchmarking on GH200/H200 GPU clusters
  • NVLink/InfiniBand high-bandwidth interconnect

Model Optimization

  • MoE
  • Efficient attention (MQA, GQA, MLA, NSA)
  • Attention-free backbones (Mamba, RWKV)

System Optimization

  • DeepSpeed ZeRO (stage 3, memory offload)
  • Megatron-Core 3D parallelism
  • FlashAttention kernels

Training Optimization

  • LoRA
  • Mixed-precision training (bfloat16/float16)
  • Scaling-law informed pretraining (Chinchilla-style)

Inference Optimization

  • Post-training quantization to int4
  • bfloat16-first precision strategy
  • PagedAttention / FlashAttention kernels

Reproducibility

Data Urls

  • FineWeb-Edu (described in paper; sampling instructions provided in repo)
  • OpenO1-SFT (referenced dataset release notes in paper)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments run on a specific GH200/H200 cluster; results may shift on TPUs or different GPU generations.
  • Int8 results were excluded due to backend instability; int4 findings don't guarantee int8 behavior.
  • Scope excludes systems-level scheduling, RLHF alignment costs, and some test-time acceleration techniques.
  • Metric normalization (min-max) can obscure absolute differences between metrics; inspect raw numbers for economic decisions.
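The normalization caveat in the last limitation can be seen in a small example. The sketch below applies min-max scaling to two made-up metric series, one spanning 2% and one spanning 9×. The numbers are invented for illustration and are not benchmark results.

```python
# Min-max normalization demo. The metric values are invented for illustration.

def min_max(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:                     # constant series: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    latency_ms = [100, 101, 102]     # methods differ by at most 2%
    energy_j = [100, 500, 900]       # methods differ by up to 9x
    print("latency normalized:", min_max(latency_ms))
    print("energy normalized: ", min_max(energy_j))
```

Both series normalize to the same values, so a normalized score cannot show that one spread is negligible and the other is large. This is why raw numbers matter for cost decisions.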

When Not To Use

  • Do not use int4 quantization when small accuracy drops are unacceptable (e.g., high‑stakes medical reasoning) without task-specific validation.
  • Avoid MoE if GPU memory is the binding resource or if your infrastructure cannot hold the full set of expert parameters.
  • Don't assume LoRA results at small scale will hold at very large scales; test RSLoRA/other methods for ≥14B models.

Failure Modes

  • Int4 quantization causing catastrophic failure on precision‑sensitive tasks (math or long numerical reasoning).
  • MoE routing overhead or memory blow-up when expert parameter storage exceeds device limits.
  • PEFT methods that reduce latency but harm accuracy beyond acceptable thresholds in certain domains.

Core Entities

Models

  • LLaMA-3 (1B/3B/8B/70B)
  • DeepSeek-R1 distill variants (1.5B/8B/14B)
  • Qwen-2.5 (7B/14B/32B/72B)
  • Phi-3.5-mini/Phi-4 (3.5B/14B)
  • Yi-34B
  • Mistral-7B
  • Mixtral (MoE) variants
  • Mamba, Pythia, RWKV (attention-free/alternative backbones)
  • DiT-style LVMs (various sizes)
  • Stable Diffusion 3.5 (LVM)

Metrics

  • AMU (Average Memory Utilization)
  • PCU (Peak Compute Utilization)
  • AL (Average Latency)
  • TT (Token Throughput)
  • ST (Sample Throughput)
  • IT (Inference Throughput)
  • AEC (Average Energy Consumption)
  • MCR (Model Compression Rate)
  • Perplexity (PPL)
  • Tokens/s

Datasets

  • FineWeb-Edu (350B sample)
  • SFT
  • ChatQA
  • Disney Organized
  • WikiArt Sargent

Benchmarks

  • MMLU-Pro
  • BBH
  • GPQA
  • IFEval
  • MATH
  • MuSR
  • HumanEval
  • HARDMath
  • FID (vision quality)