Overview
Solid empirical baselines for LLaMA inference on specific GPUs and datasets; generality beyond the tested configurations is limited, and no optimization techniques such as quantization were applied.
Citations: 12
Evidence Strength: 0.75
Confidence: 0.78
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 7/7
Findings with evidence refs: 7/7
Results with explicit delta: 6/6
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
LLM inference can draw hundreds to thousands of watts per deployed model instance; the choice of GPU type, shard layout, and power cap materially changes operating cost and latency.
Who Should Care
Summary TLDR
This paper measures compute, power, and throughput for LLaMA models (7B/13B/65B) on NVIDIA V100 and A100 GPUs. Key takeaways: A100s deliver higher throughput but draw more instantaneous power; LLaMA 65B inference draws roughly 300 W to 1 kW depending on shard count; energy per decoded token is about 3–4 J for common settings; model sharding increases wattage and often underuses GPU memory (20%–27%). Capping GPU power from 250 W to 175 W cut total energy by about 23% while adding about 6–7% latency. The results are empirical baselines, limited to LLaMA and two datasets, with no quantization or other inference optimizations.
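As a rough consistency check on the power-capping figures, the sketch below assumes that total energy equals average power times run time and that run time scales with the reported latency increase. The function name and the 6.5% midpoint are illustrative choices, not taken from the paper.

```python
def implied_power_ratio(energy_ratio: float, time_ratio: float) -> float:
    """Average-power ratio implied by an energy ratio and a run-time ratio.

    Energy = average power x run time, so power_ratio = energy_ratio / time_ratio.
    """
    if time_ratio <= 0:
        raise ValueError("time_ratio must be positive")
    return energy_ratio / time_ratio


# Reported figures: ~23% less energy, ~6-7% more latency (6.5% used as a midpoint).
ratio = implied_power_ratio(energy_ratio=1 - 0.23, time_ratio=1.065)
print(f"implied average power: {ratio:.1%} of the uncapped run")
print(f"nominal cap ratio:     {175 / 250:.1%}")
```

Under these assumptions the implied drop in average power is a little under 30%, which is close to the nominal 250 W to 175 W cap reduction.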
Problem Statement
Large language models are widely used but their inference energy and hardware costs are under-measured. The paper aims to quantify energy use, throughput, and resource utilization for multi-GPU inference at realistic scales, so practitioners can plan costs and trade-offs.
Main Contribution
Empirical benchmarks of inference throughput and energy for LLaMA 7B, 13B, and 65B on NVIDIA V100 and A100 GPUs.
Multi-node, multi-GPU (up to 32 GPUs) measurements showing how sharding and batch size affect watts, joules per token, and response energy.
Key Findings
A100 gives higher throughput than V100 but draws more power (higher wattage).
Inference power draw for LLaMA 65B ranges from about 300 W to about 1 kW depending on shard count.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| A100 vs V100 throughput | 7B ~2× faster on A100; 13B ~1.25× faster on A100 | V100 throughput | 7B +100% ; 13B +25% | Alpaca & GSM8K (Fig.2) | Measured words/tokens/responses per second on V100 and A100 | Fig.2 |
| Power draw (LLaMA 65B) | ≈300 W at low shard counts to ≈1000 W at high shard counts | 8 GPU shards | Up to ~3× increase | Alpaca & GSM8K (Fig.4, Fig.5) | Aggregate GPU energy divided by run time | Fig.4, Fig.5 |
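
The two derived columns in the table can be reproduced with simple arithmetic. The sketch below is illustrative only: the helper names and the example energy readings are hypothetical, and it assumes the power figure is aggregate GPU energy divided by run time, as the Evidence column states.

```python
from typing import Sequence


def speedup_to_delta_pct(speedup: float) -> float:
    """Convert a speedup factor (e.g. 2.0 for 2x) into a percent delta."""
    return (speedup - 1.0) * 100.0


def average_power_watts(per_gpu_energy_joules: Sequence[float], run_time_s: float) -> float:
    """Average aggregate power: total energy over all GPUs divided by run time."""
    if run_time_s <= 0:
        raise ValueError("run_time_s must be positive")
    return sum(per_gpu_energy_joules) / run_time_s


print(speedup_to_delta_pct(2.0))   # 7B row: 2x faster
print(speedup_to_delta_pct(1.25))  # 13B row: 1.25x faster

# Hypothetical readings: 8 GPUs, 3000 J each, over a 60 s run.
print(average_power_watts([3000.0] * 8, 60.0))
```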
What To Try In 7 Days
Measure joules-per-token on your own LLM tasks to get a baseline cost-per-query.
If using sharded large models, test fewer shards to find where energy-per-token improves.
Try conservative power capping (e.g., 175 W) on A100s in a staging environment and measure latency and energy trade-offs.
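One way to get a joules-per-token baseline is to log timestamped GPU power readings during a run and integrate them. The sketch below assumes the readings are already collected as (seconds, watts) pairs, for example from periodic `nvidia-smi --query-gpu=power.draw --format=csv` polling; the function names and the sample trace are hypothetical and this is not the paper's measurement harness.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, power in watts)


def energy_joules(samples: Sequence[Sample]) -> float:
    """Integrate power over time with the trapezoidal rule."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t1 < t0:
            raise ValueError("timestamps must be non-decreasing")
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


def joules_per_token(samples: Sequence[Sample], tokens_generated: int) -> float:
    """Energy per decoded token for one run."""
    if tokens_generated <= 0:
        raise ValueError("tokens_generated must be positive")
    return energy_joules(samples) / tokens_generated


# Hypothetical trace: three readings one second apart, 200 tokens decoded.
trace = [(0.0, 300.0), (1.0, 320.0), (2.0, 310.0)]
print(f"{joules_per_token(trace, 200):.3f} J/token")
```

Running the same calculation for each shard count or power cap gives directly comparable numbers; for multi-GPU runs, sum the per-GPU energies before dividing by the token count.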
Optimization Features
Infra Optimization
Model Optimization
System Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Experiments limited to LLaMA models; results may not generalize to other architectures.
Did not measure output correctness or task quality when trading energy vs latency.
When Not To Use
Do not generalize the exact joules-per-token numbers to different models, hardware, or workloads without remeasuring.
Avoid using these power/latency trade-offs as definitive guidance for user-facing SLAs without testing on your workload.
Failure Modes
Sharding increases instantaneous power and can worsen energy-per-token if not tuned.
Power capping can sharply increase latency if set too low for the workload.

