Overview
The paper provides many real-hardware measurements across modern GPUs and many model sizes, so recommendations are practical; results still reflect hardware and dataset scope (GH200/H200 and the chosen datasets).
Citations0
Evidence Strength0.85
Confidence0.80
Risk Signals10
Trust Signals
Findings with numeric evidence: 5/6
Findings with evidence refs: 6/6
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 50%
Why It Matters For Business
EfficientLLM translates technical trade-offs into concrete cost, latency, and energy numbers so teams can choose methods that match budgets and SLAs instead of guessing.
Who Should Care
Summary TLDR
EfficientLLM is a large, hardware-grounded benchmark that measures efficiency trade-offs for LLMs across three lifecycle stages: architecture pretraining (attention variants and MoE), fine-tuning (PEFT variants like LoRA/RSLoRA/DoRA), and inference (bfloat16/float16 and post‑training int4 quantization). Run on a 48×GH200 + 8×H200 cluster, the study evaluates 100+ model–technique pairs using six system-aware metrics (memory, compute, latency, throughput, energy, compression). Main practical takeaways: no single method wins all axes; int4 yields ~3.9× memory/energy gains at ~3–5% average-task drop; MoE improves quality and reduces FLOPs per token but increases VRAM (~40%); PEFT choice should,
Problem Statement
Practitioners lack a single, large-scale empirical guide that measures real-world efficiency trade-offs (memory, latency, throughput, energy, compression) across architectures, fine-tuning methods, and low‑bit inference on modern GPUs. Without that, teams choose techniques by hearsay and risk suboptimal cost or performance in production.
Main Contribution
A unified, three-axis benchmark (architecture pretraining, fine-tuning, bit‑width quantization) with real-hardware measurements on GH200/H200 clusters.
A concise metric suite tailored to deployment: AMU, PCU, AL, TT/ST/IT, AEC, and MCR that captures memory, compute, latency, throughput, energy, and compression trade-offs.
Key Findings
No single technique is best across all efficiency axes.
Post-training int4 quantization reduces memory and energy up to 3.9× while causing ~3–5% average-task score drop on evaluated benchmarks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| int4 quantization compression | up to 3.9× MCR vs bf16 | bfloat16 | ≈3.9× smaller memory | LLama/DeepSeek/Qwen families (1.5B–34B) | Table 9 reports MCR ≈3.64–3.9 for int4 across multiple models | Table 9; Section 5.5 |
| Average task score drop from int4 | ≈3–5 percentage points drop | bfloat16 | ≈-3% to -5% average score | evaluated benchmarks (MMLU-Pro, BBH, GPQA, etc.) | Abstract and Section 2.1 note measured 3–5% average-task score drop | Abstract; Section 2.1 |
What To Try In 7 Days
Run int4 post-training quantization on a production-weight model and measure task accuracy vs memory/throughput to estimate 3–4× memory savings.
If tuning a >14B model, trial RSLoRA and compare latency/energy to LoRA; use freezing if interactive tuning speed is essential.
For memory-limited serving, swap standard attention for MQA and run a short latency+memory profile to confirm lower AMU and latency.
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Reproducibility
Code URLs
Data URLs
Risks & Boundaries
Limitations
Experiments run on a specific GH200/H200 cluster—results may shift on TPUs or different GPU generations.
Int8 results were excluded due to backend instability; int4 findings don't guarantee int8 behavior.
When Not To Use
Do not use int4 quantization when small accuracy drops are unacceptable (e.g., high‑stakes medical reasoning) without task-specific validation.
Avoid MoE if GPU memory is the binding resource or if your infra cannot handle expert parameters storage.
Failure Modes
Int4 quantization causing catastrophic failure on precision‑sensitive tasks (math or long numerical reasoning).
MoE routing overhead or memory blow-up when expert parameter storage exceeds device limits.

