Overview
Production Readiness: 0.7
Novelty Score: 0.5
Cost Impact Score: 0.8
Citation Count: 0
Why It Matters For Business
EfficientLLM translates technical trade-offs into concrete cost, latency, and energy numbers so teams can choose methods that match budgets and SLAs instead of guessing.
Summary TLDR
EfficientLLM is a large, hardware-grounded benchmark that measures efficiency trade-offs for LLMs across three lifecycle stages: architecture pretraining (attention variants and MoE), fine-tuning (PEFT variants like LoRA/RSLoRA/DoRA), and inference (bfloat16/float16 and post‑training int4 quantization). Run on a 48×GH200 + 8×H200 cluster, the study evaluates 100+ model–technique pairs using six system-aware metrics (memory, compute, latency, throughput, energy, compression). Main practical takeaways: no single method wins all axes; int4 yields ~3.9× memory/energy gains at ~3–5% average-task drop; MoE improves quality and reduces FLOPs per token but increases VRAM (~40%); PEFT choice should,
Problem Statement
Practitioners lack a single, large-scale empirical guide that measures real-world efficiency trade-offs (memory, latency, throughput, energy, compression) across architectures, fine-tuning methods, and low‑bit inference on modern GPUs. Without that, teams choose techniques by hearsay and risk suboptimal cost or performance in production.
Main Contribution
A unified, three-axis benchmark (architecture pretraining, fine-tuning, bit‑width quantization) with real-hardware measurements on GH200/H200 clusters.
A concise metric suite tailored to deployment (AMU, PCU, AL, TT/ST/IT, AEC, and MCR) that captures memory, compute, latency, throughput, energy, and compression trade-offs.
A 100+ model-technique empirical study (0.5B–72B params) comparing efficient attention (MQA/GQA/MLA/NSA), MoE, PEFT variants (LoRA/RSLoRA/DoRA/LoRA-plus), and post-training quantization to int4.
Open release of evaluation pipelines, datasets, and leaderboards to help reproduce and extend efficiency comparisons.
Key Findings
No single technique is best across all efficiency axes.
Post-training int4 quantization reduces memory and energy up to 3.9× while causing ~3–5% average-task score drop on evaluated benchmarks.
Mixture‑of‑Experts (MoE) improves quality and cuts training FLOPs but raises VRAM by ≈40%.
Efficient attention variants have different optima: MQA minimizes memory and latency, MLA gives the lowest perplexity, and NSA minimizes energy per token at the tested scales.
PEFT methods scale differently: LoRA/variants excel at 1–3B, RSLoRA outperforms at ≥14B, and parameter freezing yields the lowest tuning latency (≈3× faster).
On Hopper-class GPUs, bfloat16 usually beats float16, with ~6% lower latency and ~9% lower energy.
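The int4 figure above sits slightly below the ideal 16/4 = 4×. One common reason in group-wise quantization schemes is per-group metadata. The sketch below works through that arithmetic. The group size of 128 and the 16-bit scale per group are assumptions chosen for illustration, not values stated in the source, and the source's 3.9× is a measured memory and energy number rather than a weights-only estimate.

```python
def bits_per_weight(weight_bits: int = 4,
                    group_size: int = 128,
                    scale_bits: int = 16,
                    zero_point_bits: int = 0) -> float:
    """Effective storage per weight once per-group metadata is amortized.

    group_size, scale_bits and zero_point_bits are illustrative assumptions.
    """
    return weight_bits + (scale_bits + zero_point_bits) / group_size


def compression_ratio(baseline_bits: int = 16, **quant_kwargs) -> float:
    """Ratio of 16-bit weight storage to quantized weight storage."""
    return baseline_bits / bits_per_weight(**quant_kwargs)


def weight_memory_gib(num_params: int, bits: float) -> float:
    """Weight-only memory footprint in GiB (ignores activations and KV cache)."""
    return num_params * bits / 8 / 2**30


ratio = compression_ratio()
num_params = 7_000_000_000  # hypothetical 7B-parameter model
print(f"effective bits per weight: {bits_per_weight():.3f}")
print(f"compression vs 16-bit:     {ratio:.2f}x")
print(f"16-bit weights: {weight_memory_gib(num_params, 16):.2f} GiB")
print(f"int4 weights:   {weight_memory_gib(num_params, bits_per_weight()):.2f} GiB")
```

Adding a zero point per group or shrinking the group size pushes the ratio further below 4×, which is why the measured savings should be checked on the actual quantization backend.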
Results
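- int4 quantization compression: up to 3.9× (memory and energy)
- Average task score drop from int4: ~3–5%
- MoE VRAM increase: ≈40%
- PEFT latency reduction via freezing: ≈3×
- bfloat16 vs float16 efficiency: ~6% lower latency and ~9% lower energy on Hopper-class GPUs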
What To Try In 7 Days
Run int4 post-training quantization on a production-scale model and measure task accuracy against memory and throughput to check whether the reported 3–4× memory savings hold for your workload.
If tuning a ≥14B model, trial RSLoRA and compare latency and energy to LoRA; use freezing if interactive tuning speed is essential.
For memory-limited serving, swap standard attention for MQA and run a short latency+memory profile to confirm lower AMU and latency.
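Before profiling MQA, a back-of-the-envelope estimate of key/value cache size shows why fewer KV heads lowers serving memory. The sketch below uses the standard KV-cache size formula. The layer count, head count, head dimension, and sequence length are hypothetical, not a configuration from the study.

```python
def kv_cache_bytes(num_layers: int,
                   num_kv_heads: int,
                   head_dim: int,
                   seq_len: int,
                   batch_size: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for every layer and token.

    The leading factor of 2 accounts for storing both K and V.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical decoder: 32 layers, 32 query heads, head_dim 128, 4096-token context.
common = dict(num_layers=32, head_dim=128, seq_len=4096)
mha = kv_cache_bytes(num_kv_heads=32, **common)  # one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **common)   # 4 query heads share a KV head
mqa = kv_cache_bytes(num_kv_heads=1, **common)   # all query heads share one KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB per sequence")
```

The estimate covers only the cache; measured AMU and latency also depend on kernels, batch size, and the serving stack, so it complements a real profile rather than replacing it.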
Optimization Features
Token Efficiency
- Data filtering and deduplication (FineWeb-Edu)
- Curriculum learning
Infra Optimization
- Benchmarking on GH200/H200 GPU clusters
- NVLink/InfiniBand high-bandwidth interconnect
Model Optimization
- MoE
- Efficient attention (MQA, GQA, MLA, NSA)
- Attention-free backbones (Mamba, RWKV)
System Optimization
- DeepSpeed ZeRO (ZeRO-3, memory offload)
- Megatron-Core 3D parallelism
- FlashAttention kernels
Training Optimization
- LoRA
- Mixed-precision training (bfloat16/float16)
- Scaling-law informed pretraining (Chinchilla-style)
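For readers comparing LoRA with RSLoRA, the main difference is how the low-rank update is scaled with rank. The sketch below follows the commonly published formulas (alpha / r for LoRA, alpha / sqrt(r) for rank-stabilized LoRA) and counts adapter parameters for one weight matrix. The alpha value, ranks, and matrix shape are hypothetical, and the source does not state which hyperparameters the study used.

```python
import math


def lora_scaling(alpha: float, rank: int) -> float:
    """Standard LoRA multiplier applied to the low-rank update B @ A."""
    return alpha / rank


def rslora_scaling(alpha: float, rank: int) -> float:
    """Rank-stabilized LoRA multiplier: shrinks with sqrt(rank) instead of rank."""
    return alpha / math.sqrt(rank)


def adapter_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one adapted matrix: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


alpha = 16.0          # hypothetical value
d_in = d_out = 4096   # hypothetical projection shape
for rank in (8, 64, 256):
    print(f"r={rank:>3}  LoRA scale={lora_scaling(alpha, rank):.4f}  "
          f"RSLoRA scale={rslora_scaling(alpha, rank):.4f}  "
          f"adapter params={adapter_params(d_in, d_out, rank):,}")
```

At higher ranks the LoRA multiplier decays much faster than the RSLoRA one, which is one plausible reason the two methods behave differently as models and ranks grow; the study's results should be read as empirical rather than derived from this formula.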
Inference Optimization
- Post-training quantization to int4
- bfloat16-first precision strategy
- PagedAttention / FlashAttention kernels
Reproducibility
Data Urls
- FineWeb-Edu (described in paper; sampling instructions provided in repo)
- OpenO1-SFT (referenced dataset release notes in paper)
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments ran on a specific GH200/H200 cluster; results may shift on TPUs or different GPU generations.
- Int8 results were excluded due to backend instability; int4 findings don't guarantee int8 behavior.
- Scope excludes systems-level scheduling, RLHF alignment costs, and some test-time acceleration techniques.
- Metric normalization (min-max) can obscure absolute differences between metrics; inspect raw numbers for economic decisions.
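The min-max caveat above can be seen with a small example: two sets of measurements with very different absolute spreads normalize to identical scores. This is a minimal sketch with made-up latency values, not data from the study.

```python
def min_max_normalize(values: list[float]) -> list[float]:
    """Rescale values to [0, 1]; returns all zeros when every value is equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up latencies in milliseconds for three methods in two scenarios.
tight_spread = [100.0, 101.0, 102.0]   # methods differ by at most 2%
wide_spread = [100.0, 300.0, 500.0]    # methods differ by up to 5x

print(min_max_normalize(tight_spread))
print(min_max_normalize(wide_spread))
```

Both lists normalize to the same scores even though only the second scenario involves a difference that would change a hardware budget, which is why raw numbers matter for cost decisions.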
When Not To Use
- Do not use int4 quantization when small accuracy drops are unacceptable (e.g., high‑stakes medical reasoning) without task-specific validation.
- Avoid MoE if GPU memory is the binding resource or if your infrastructure cannot store the expert parameters.
- Don't assume LoRA results at small scale will hold at very large scales; test RSLoRA/other methods for ≥14B models.
Failure Modes
- Int4 quantization causing catastrophic failure on precision‑sensitive tasks (math or long numerical reasoning).
- MoE routing overhead or memory blow-up when expert parameter storage exceeds device limits.
- PEFT methods that reduce latency but harm accuracy beyond acceptable thresholds in certain domains.
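The MoE failure mode above follows from a simple asymmetry: compute per token scales with the experts that are active, while memory scales with all experts. The sketch below illustrates that with hypothetical parameter counts and a hypothetical 80% headroom rule; it does not reproduce the ≈40% VRAM figure, which was measured against the study's own baselines.

```python
def moe_parameter_counts(shared_params: int,
                         params_per_expert: int,
                         num_experts: int,
                         top_k: int) -> tuple[int, int]:
    """Return (total, active) parameter counts for a top-k routed MoE.

    total drives weight memory; active drives per-token compute.
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active


def fits_on_device(total_params: int,
                   bytes_per_param: int,
                   device_bytes: int,
                   headroom: float = 0.8) -> bool:
    """Weights-only check; headroom reserves space for activations and KV cache."""
    return total_params * bytes_per_param <= headroom * device_bytes


# Hypothetical model: 1B shared parameters, 8 experts of 0.5B each, top-2 routing.
total, active = moe_parameter_counts(1_000_000_000, 500_000_000, 8, 2)
print(f"total params:  {total:,}")
print(f"active params: {active:,} per token")
print("fits in 16 GiB at 2 bytes/param:", fits_on_device(total, 2, 16 * 2**30))
print("fits in 8 GiB at 2 bytes/param: ", fits_on_device(total, 2, 8 * 2**30))
```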
Core Entities
Models
- LLaMA-3 (1B/3B/8B/70B)
- DeepSeek-R1 distill variants (1.5B/8B/14B)
- Qwen-2.5 (7B/14B/32B/72B)
- Phi-3.5-mini/Phi-4 (3.8B/14B)
- Yi-34B
- Mistral-7B
- Mixtral (MoE) variants
- Mamba, Pythia, RWKV (attention-free/alternative backbones)
- DiT-style LVMs (various sizes)
- Stable Diffusion 3.5 (LVM)
Metrics
- AMU (Average Memory Utilization)
- PCU (Peak Compute Utilization)
- AL (Average Latency)
- TT (Token Throughput)
- ST (Sample Throughput)
- IT (Inference Throughput)
- AEC (Average Energy Consumption)
- MCR (Model Compression Rate)
- Perplexity (PPL)
- Tokens/s
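The metric names above can be made concrete with a small sketch that derives several of them from logged samples. The formulas here are one plausible reading of the names (time-averaged memory fraction, integrated power, tokens per second, size ratio); the source lists the metrics without formal definitions, so the benchmark's own definitions may differ. The `Sample` type and all numbers are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    t_s: float        # timestamp in seconds
    mem_bytes: float  # device memory in use at that time
    power_w: float    # instantaneous power draw in watts


def _trapezoid(ts: list[float], ys: list[float]) -> float:
    """Trapezoidal integral of ys over ts."""
    return sum((t1 - t0) * (y0 + y1) / 2
               for t0, t1, y0, y1 in zip(ts, ts[1:], ys, ys[1:]))


def average_memory_utilization(samples: list[Sample], capacity_bytes: float) -> float:
    """Assumed AMU: time-averaged fraction of device memory in use."""
    ts = [s.t_s for s in samples]
    frac = [s.mem_bytes / capacity_bytes for s in samples]
    return _trapezoid(ts, frac) / (ts[-1] - ts[0])


def energy_joules(samples: list[Sample]) -> float:
    """Assumed basis for AEC: power integrated over the run."""
    return _trapezoid([s.t_s for s in samples], [s.power_w for s in samples])


def token_throughput(num_tokens: int, elapsed_s: float) -> float:
    """Assumed TT: tokens processed per second."""
    return num_tokens / elapsed_s


def average_latency(latencies_s: list[float]) -> float:
    """Assumed AL: mean per-request latency."""
    return mean(latencies_s)


def model_compression_rate(original_bytes: float, compressed_bytes: float) -> float:
    """Assumed MCR: original size divided by compressed size."""
    return original_bytes / compressed_bytes


samples = [Sample(0.0, 40e9, 300.0), Sample(1.0, 60e9, 500.0), Sample(2.0, 60e9, 500.0)]
amu = average_memory_utilization(samples, capacity_bytes=80e9)
energy = energy_joules(samples)
print(f"AMU={amu:.4f}  energy={energy:.0f} J  "
      f"TT={token_throughput(4096, 2.0):.0f} tok/s  "
      f"MCR={model_compression_rate(16e9, 4e9):.1f}x")
```

In practice the samples would come from a device monitoring tool polled during the run; the sampling source and interval are left open here because the source does not specify them.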
Datasets
- FineWeb-Edu (350B-token sample)
- OpenO1-SFT
- ChatQA
- Disney Organized
- WikiArt Sargent
Benchmarks
- MMLU-Pro
- BBH
- GPQA
- IFEval
- MATH
- MuSR
- HumanEval
- HARDMath
- FID (vision quality)

