Large-scale empirical benchmark showing how attention variants, PEFT, MoE, and int4 quantization trade performance for memory, latency, and

Overview

Decision Snapshot: Ready For Pilot

The paper reports real-hardware measurements across modern GPUs and many model sizes, so its recommendations are practical; results still reflect the hardware and dataset scope (GH200/H200 GPUs and the chosen datasets).

Citations: 0

Evidence Strength: 0.85

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 50%

Authors

Zhengqing Yuan, Weixiang Sun, Yixin Liu, Huichi Zhou, Rong Zhou, Yiyang Li, Zheyuan Zhang, Wei Song, Yue Huang, Haolong Jia, Keerthiram Murugesan, Yu Wang, Lifang He, Jianfeng Gao, Lichao Sun, Yanfang Ye

Links

Abstract / PDF / Code / Data

Why It Matters For Business

EfficientLLM translates technical trade-offs into concrete cost, latency, and energy numbers so teams can choose methods that match budgets and SLAs instead of guessing.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead, Founder, Data Scientist

Summary TLDR

EfficientLLM is a large, hardware-grounded benchmark that measures efficiency trade-offs for LLMs across three lifecycle stages: architecture pretraining (attention variants and MoE), fine-tuning (PEFT variants like LoRA/RSLoRA/DoRA), and inference (bfloat16/float16 and post‑training int4 quantization). Run on a 48×GH200 + 8×H200 cluster, the study evaluates 100+ model–technique pairs using six system-aware metrics (memory, compute, latency, throughput, energy, compression). Main practical takeaways: no single method wins all axes; int4 yields ~3.9× memory/energy gains at ~3–5% average-task drop; MoE improves quality and reduces FLOPs per token but increases VRAM (~40%); PEFT choice should, 

Problem Statement

Practitioners lack a single, large-scale empirical guide that measures real-world efficiency trade-offs (memory, latency, throughput, energy, compression) across architectures, fine-tuning methods, and low‑bit inference on modern GPUs. Without that, teams choose techniques by hearsay and risk suboptimal cost or performance in production.

Main Contribution

A unified, three-axis benchmark (architecture pretraining, fine-tuning, bit‑width quantization) with real-hardware measurements on GH200/H200 clusters.

A concise metric suite tailored to deployment: AMU, PCU, AL, TT/ST/IT, AEC, and MCR that captures memory, compute, latency, throughput, energy, and compression trade-offs.
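The summary does not give formulas for these metrics, so the following is only a minimal sketch of how two of them could be measured in practice. It assumes AL is the mean wall-clock latency per request and TT is the number of generated tokens per second; `profile` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in for a real model call.

```python
import time
from statistics import mean

def profile(generate, prompts, warmup=1):
    """Return mean latency (s) and token throughput (tokens/s) for `generate`.

    `generate(prompt)` is assumed to return a list of output tokens.
    """
    for p in prompts[:warmup]:            # warm-up calls are not timed
        generate(p)
    latencies, total_tokens = [], 0
    for p in prompts:
        start = time.perf_counter()
        tokens = generate(p)
        latencies.append(time.perf_counter() - start)
        total_tokens += len(tokens)
    total_time = sum(latencies)
    return {
        "avg_latency_s": mean(latencies),
        "tokens_per_s": total_tokens / total_time if total_time > 0 else float("inf"),
        "requests": len(prompts),
    }

def fake_generate(prompt):
    # Hypothetical stand-in for a model: sleeps briefly, returns whitespace tokens.
    time.sleep(0.001)
    return prompt.split()

stats = profile(fake_generate, ["a b c", "d e", "f g h i"])
print(stats)
```

Swapping `fake_generate` for a real inference call gives a rough latency and throughput baseline; memory and energy would need device-level counters that this sketch does not cover.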

Key Findings

No single technique is best across all efficiency axes.

Practical Use: Pick techniques based on your dominant constraint (memory, latency, or energy) rather than seeking a universal winner.

Evidence Ref: Section 2.1, overall observations

Post-training int4 quantization reduces memory and energy up to 3.9× while causing ~3–5% average-task score drop on evaluated benchmarks.

Numbers: MCR ≈ 3.9×; avg score drop ≈ 3–5%

Practical Use: For memory- or cost-limited deployment, use int4 to gain ~4× compression and large throughput gains, but validate task accuracy on your data first.

Evidence Ref: Abstract; Sections 2.1 and 5.5; Table 9
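A back-of-the-envelope calculation shows why a compression rate slightly below 4× is plausible for int4. The sketch below counts weight storage only and assumes group-wise quantization in which every 128 weights share one 16-bit scale; the group size, scale width, and the 7B parameter count are illustrative assumptions, not values taken from the paper.

```python
def weight_bytes(n_params, bits_per_weight):
    """Bytes needed to store n_params weights at the given bit width."""
    return n_params * bits_per_weight / 8

def int4_effective_bits(group_size=128, scale_bits=16):
    """Effective bits per weight when each group of weights shares one scale.

    group_size and scale_bits are illustrative assumptions.
    """
    return 4 + scale_bits / group_size

n_params = 7_000_000_000                       # hypothetical 7B-parameter model
bf16 = weight_bytes(n_params, 16)
int4 = weight_bytes(n_params, int4_effective_bits())
ratio = bf16 / int4
print(f"bf16: {bf16 / 1e9:.1f} GB, int4: {int4 / 1e9:.2f} GB, ratio: {ratio:.2f}x")
# prints "bf16: 14.0 GB, int4: 3.61 GB, ratio: 3.88x"
```

Under these assumptions the per-group scales cost 0.125 extra bits per weight, which pulls the ratio from 4× down to about 3.88×. Activations and the KV cache are not counted here.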

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
int4 quantization compression	up to 3.9× MCR vs bf16	bfloat16	≈3.9× smaller memory	LLaMA/DeepSeek/Qwen families (1.5B–34B)	Table 9 reports MCR ≈3.64–3.9 for int4 across multiple models	Table 9; Section 5.5
Average task score drop from int4	≈3–5% drop	bfloat16	≈-3% to -5% average score	evaluated benchmarks (MMLU-Pro, BBH, GPQA, etc.)	Abstract and Section 2.1 note a measured 3–5% average-task score drop	Abstract; Section 2.1

What To Try In 7 Days

Run int4 post-training quantization on a model you serve in production, then measure task accuracy against memory and throughput to check whether the reported 3–4× memory savings hold for your workload.

If tuning a >14B model, trial RSLoRA and compare latency/energy to LoRA; use freezing if interactive tuning speed is essential.

For memory-limited serving, swap standard attention for MQA and run a short latency+memory profile to confirm lower AMU and latency.
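For the MQA suggestion, most of the memory saving comes from the key/value cache, which scales with the number of KV heads. The sketch below uses the standard cache-size formula (two tensors per layer, one for keys and one for values); the model shape is hypothetical and chosen only to make the arithmetic easy to follow.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Size of the key/value cache: 2 tensors (K and V) per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical model shape, for illustration only (16-bit cache entries).
shape = dict(layers=32, head_dim=128, seq_len=8192, batch=8)

mha = kv_cache_bytes(kv_heads=32, **shape)   # multi-head: one KV head per query head
gqa = kv_cache_bytes(kv_heads=8, **shape)    # grouped-query: 8 shared KV heads
mqa = kv_cache_bytes(kv_heads=1, **shape)    # multi-query: a single shared KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**30:.2f} GiB")
# prints MHA: 32.00 GiB, GQA: 8.00 GiB, MQA: 1.00 GiB
```

With this shape the cache shrinks in direct proportion to the KV head count. Any quality cost of sharing heads still has to be measured on your own tasks.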

Optimization Features

Token Efficiency

Data filtering and deduplication (FineWeb-Edu)

Curriculum learning

Infra Optimization

Benchmarking on GH200/H200 GPU clusters

NVLink/InfiniBand high-bandwidth interconnect

Model Optimization

MoE

Efficient attention (MQA, GQA, MLA, NSA)

Attention-free backbones (Mamba, RWKV)

System Optimization

DeepSpeed ZeRO (memory offload/3)

Megatron-Core 3D parallelism

FlashAttention kernels

Training Optimization

LoRA

Mixed-precision training (bfloat16/float16)

Scaling-law informed pretraining (Chinchilla-style)
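As a rough illustration of why LoRA-style methods are cheap to train, the sketch below counts the trainable parameters of one low-rank adapter and compares the update scaling used by standard LoRA (alpha / r) with the rank-stabilized variant RSLoRA (alpha / sqrt(r)). The hidden size, ranks, and alpha value are hypothetical, and nothing here reproduces the benchmark's own configuration.

```python
import math

def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

def lora_scale(alpha, rank, rank_stabilized=False):
    """Scale on the low-rank update: alpha/r for LoRA, alpha/sqrt(r) for RSLoRA."""
    return alpha / math.sqrt(rank) if rank_stabilized else alpha / rank

d = 4096          # hypothetical hidden size of one square projection matrix
full = d * d      # parameters updated by full fine-tuning of that matrix
for r in (8, 64):
    lora = lora_trainable_params(d, d, r)
    print(f"rank={r}: {lora} trainable params ({100 * lora / full:.2f}% of full), "
          f"LoRA scale={lora_scale(16, r):.3f}, "
          f"RSLoRA scale={lora_scale(16, r, True):.3f}")
```

The standard scale shrinks quickly as the rank grows, while the rank-stabilized scale shrinks more slowly.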

Inference Optimization

Post-training quantization to int4

bfloat16-first precision strategy

PagedAttention / FlashAttention kernels

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://dlyuangod.github.io/EfficientLLM/

https://huggingface.co/Tyrannosaurus/EfficientLLM

https://arxiv.org/pdf/2505.13840v1

Data URLs

FineWeb-Edu (described in paper; sampling instructions provided in repo)

OpenO1-SFT (referenced dataset release notes in paper)

Risks & Boundaries

Limitations

Experiments ran on a specific GH200/H200 cluster; results may shift on TPUs or different GPU generations.

Int8 results were excluded due to backend instability; int4 findings don't guarantee int8 behavior.

When Not To Use

Do not use int4 quantization when small accuracy drops are unacceptable (e.g., high‑stakes medical reasoning) without task-specific validation.

Avoid MoE if GPU memory is the binding resource or if your infrastructure cannot hold the expert parameters.
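The task-specific validation mentioned above can be turned into a simple acceptance gate. The sketch below compares per-task scores of a baseline and a quantized model and rejects the quantized model if either the average or the worst single-task relative drop exceeds a threshold. All names, scores, and thresholds are hypothetical; pick limits that match your own risk tolerance.

```python
def relative_drop(baseline, candidate):
    """Relative score drop of candidate vs baseline, as a fraction of baseline."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (baseline - candidate) / baseline

def accept_quantized(baseline_scores, quantized_scores,
                     max_avg_drop=0.05, max_task_drop=0.10):
    """Return (ok, report); thresholds are illustrative assumptions."""
    drops = {task: relative_drop(baseline_scores[task], quantized_scores[task])
             for task in baseline_scores}
    avg_drop = sum(drops.values()) / len(drops)
    worst_task = max(drops, key=drops.get)
    ok = avg_drop <= max_avg_drop and drops[worst_task] <= max_task_drop
    return ok, {"avg_drop": avg_drop, "worst_task": worst_task, "drops": drops}

# Made-up scores for two tasks, purely for illustration.
baseline = {"qa": 0.80, "math": 0.50}
quantized = {"qa": 0.78, "math": 0.40}
ok, report = accept_quantized(baseline, quantized)
print(ok, report["worst_task"], round(report["avg_drop"], 4))
```

In this made-up example the gate fails: the QA score barely moves, but the math score falls by 20% relative to baseline. An average over many tasks can hide exactly this pattern, which is why the gate also checks the worst task.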

Failure Modes

Int4 quantization causing catastrophic failure on precision‑sensitive tasks (math or long numerical reasoning).

MoE routing overhead or memory blow-up when expert parameter storage exceeds device limits.
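The MoE memory risk follows from the gap between stored and active parameters: every expert must sit in device memory, but only the routed experts compute for a given token. The sketch below counts feed-forward parameters for one MoE layer under simplifying assumptions (each expert is two d_model × d_ff matrices, plus a linear router). The layer shape is hypothetical and is not meant to reproduce the paper's ~40% VRAM figure.

```python
def moe_ffn_params(d_model, d_ff, n_experts, top_k, n_matrices=2):
    """Return (stored, active) FFN parameter counts for one MoE layer.

    Assumes each expert consists of n_matrices matrices of size d_model x d_ff
    and that a d_model x n_experts router is always resident and always used.
    """
    per_expert = n_matrices * d_model * d_ff
    router = d_model * n_experts
    stored = n_experts * per_expert + router   # must fit in device memory
    active = top_k * per_expert + router       # used per token (drives FLOPs)
    return stored, active

# Hypothetical layer shape with 8 experts and top-2 routing.
stored, active = moe_ffn_params(d_model=4096, d_ff=14336, n_experts=8, top_k=2)
print(f"stored: {stored / 1e6:.0f}M params, active per token: {active / 1e6:.0f}M params, "
      f"ratio: {stored / active:.2f}x")
```

Under these assumptions the layer stores roughly four times the parameters it uses per token, so compute per token can fall while memory demand rises.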

Core Entities

Models

LLaMA-3 (1B/3B/8B/70B)

DeepSeek-R1 distill variants (1.5B/8B/14B)

Qwen-2.5 (7B/14B/32B/72B)

Phi-3.5-mini/Phi-4 (3.5B/14B)

Yi-34B

Mistral-7B

Mixtral (MoE) variants

Mamba, Pythia, RWKV (attention-free/alternative backbones)

DiT-style LVMs (various sizes)

Stable Diffusion 3.5 (LVM)

Metrics

AMU (Average Memory Utilization)

PCU (Peak Compute Utilization)

AL (Average Latency)

TT (Token Throughput)

ST (Sample Throughput)

IT (Inference Throughput)

AEC (Average Energy Consumption)

MCR (Model Compression Rate)

Perplexity (PPL)

Tokens/s

Datasets

FineWeb-Edu (350B sample)

SFT

ChatQA

Disney Organized

WikiArt Sargent

Benchmarks

MMLU-Pro

BBH

GPQA

IFEval

MATH

MuSR

HumanEval

HARDMath

FID (vision quality)
