Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.8
Citation Count
12
Why It Matters For Business
LLM inference can draw hundreds to thousands of watts per deployed model instance; the choice of GPU type, shard layout, and power cap materially changes operating costs and latency.
Summary TLDR
This paper measures compute, power, and throughput for LLaMA models (7B/13B/65B) on NVIDIA V100 and A100 GPUs. Key takeaways: A100s give higher throughput but draw more instantaneous power; LLaMA 65B inference draws ~300 W to 1 kW depending on shard count; energy per decoded token is about 3–4 J for common settings; model sharding increases wattage and often underuses GPU memory (20%–27% utilization). Power capping (250 W → 175 W) cut total energy by ~23% while adding ~6–7% latency. Results are empirical baselines, limited to LLaMA and two datasets, with no quantization or other inference optimizations.
Problem Statement
Large language models are widely used but their inference energy and hardware costs are under-measured. The paper aims to quantify energy use, throughput, and resource utilization for multi-GPU inference at realistic scales, so practitioners can plan costs and trade-offs.
Main Contribution
Empirical benchmarks of inference throughput and energy for LLaMA 7B, 13B, and 65B on NVIDIA V100 and A100 GPUs.
Multi-node, multi-GPU (up to 32 GPUs) measurements showing how sharding and batch size affect watts, joules per token, and response energy.
A short study of GPU power capping impact showing sizable energy savings with modest latency increases, and utilization stats that reveal low GPU memory use enabling co-location opportunities.
Key Findings
A100 gives higher throughput but draws more power than V100.
Inference power draw for LLaMA 65B ranges widely with shard count.
Energy per decoded token is roughly a few joules under tested settings.
Sharding (more GPUs) increases instantaneous power and often reduces energy efficiency.
GPU power capping can cut total energy substantially with a modest latency hit.
GPU memory is underused in sharded 65B runs, while SM (compute) is highly used.
Minimum hardware to run 65B without compression is substantial.
Results
A100 vs V100 throughput
Power draw (LLaMA 65B)
Energy per token
Power cap effects
GPU utilization
Minimum hardware to run 65B
Who Should Care
What To Try In 7 Days
Measure joules-per-token on your own LLM tasks to get a baseline cost-per-query.
If using sharded large models, test fewer shards to find where energy-per-token improves.
Try conservative power capping (e.g., 175W) on A100s in a staging environment and measure latency and energy trade-offs.
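The first step above, a joules-per-token baseline, can be sketched as follows. This is a minimal illustration rather than the paper's measurement code: it assumes you already have timestamped GPU power samples from some monitoring tool, and the function names and sample values are hypothetical.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power (W) over time (s) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w) or len(timestamps_s) < 2:
        raise ValueError("need at least two matching samples")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def joules_per_token(energy_j: float, tokens: int) -> float:
    """Average energy spent per decoded token."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return energy_j / tokens


def kwh_per_million_tokens(j_per_token: float) -> float:
    """Convert J/token to kWh per million tokens (1 kWh = 3.6e6 J)."""
    return j_per_token * 1_000_000 / 3_600_000


# Hypothetical samples: one power reading per second over a 4-second run.
timestamps = [0.0, 1.0, 2.0, 3.0, 4.0]
watts = [300.0, 320.0, 310.0, 330.0, 300.0]
decoded_tokens = 360  # hypothetical token count for the same window

energy = energy_joules(timestamps, watts)
per_token = joules_per_token(energy, decoded_tokens)
print(f"{energy:.0f} J total, {per_token:.2f} J/token, "
      f"{kwh_per_million_tokens(per_token):.3f} kWh per million tokens")
```

Multiplying the kWh figure by your electricity price gives a rough energy cost per million tokens, which is only as good as the sampling rate and the token count you feed in.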
Optimization Features
Infra Optimization
- Choose A100 for throughput but check energy per token
- Avoid unnecessary sharding to reduce wattage
Model Optimization
- Model sharding (FairScale) used for multi-GPU inference
- Paper mentions quantization and distillation as future/related techniques
System Optimization
- Co-location potential due to low memory utilization
- Use of MPS/MIG for GPU sharing suggested
Inference Optimization
- GPU power capping (tested at 250/175/150W)
- Batch size tuning across shard configs
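A power-cap experiment comes down to comparing two measured runs. The sketch below shows one way to express that comparison; the `RunStats` type and the numbers are hypothetical, chosen only to mirror the approximate ratios reported above (about 23% less energy for about 6–7% more latency), and are not data from the paper.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    power_cap_w: int       # configured GPU power cap
    total_energy_j: float  # measured energy for the whole run
    latency_s: float       # measured end-to-end latency


def percent_change(baseline: float, candidate: float) -> float:
    """Signed percent change of candidate relative to baseline."""
    return 100.0 * (candidate - baseline) / baseline


def compare(baseline: RunStats, capped: RunStats) -> dict:
    """Summarize the energy/latency trade-off of a capped run."""
    return {
        "energy_change_pct": percent_change(baseline.total_energy_j,
                                            capped.total_energy_j),
        "latency_change_pct": percent_change(baseline.latency_s,
                                             capped.latency_s),
    }


# Hypothetical measurements, not values from the paper.
baseline = RunStats(power_cap_w=250, total_energy_j=100_000.0, latency_s=10.0)
capped = RunStats(power_cap_w=175, total_energy_j=77_000.0, latency_s=10.65)

result = compare(baseline, capped)
latency_budget_pct = 10.0  # assumed tolerance; set from your own SLA
acceptable = result["latency_change_pct"] <= latency_budget_pct
print(result, "within latency budget:", acceptable)
```

Running the same comparison at each candidate cap makes it easier to see where latency starts rising faster than energy falls.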
Reproducibility
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments limited to LLaMA models; results may not generalize to other architectures.
- Did not measure output correctness or task quality when trading energy vs latency.
- No experiments with quantization, distillation, or other inference-specific optimizations.
- Energy aggregation multiplies rank-0 energy by node count, which may hide node-level variance.
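The aggregation caveat can be made concrete with a small sketch. It assumes, per the limitation above, that total energy is estimated as the rank-0 node's energy times the node count, and compares that with summing each node's own measurement. The per-node values are hypothetical.

```python
from typing import Sequence


def rank0_scaled_energy(per_node_j: Sequence[float]) -> float:
    """Estimate total energy as rank-0 energy times the number of nodes."""
    return per_node_j[0] * len(per_node_j)


def summed_energy(per_node_j: Sequence[float]) -> float:
    """Total energy from each node's own measurement."""
    return sum(per_node_j)


# Hypothetical per-node energy in joules; index 0 is the rank-0 node.
per_node = [1000.0, 1100.0, 950.0, 1050.0]

estimate = rank0_scaled_energy(per_node)
actual = summed_energy(per_node)
error_pct = 100.0 * (estimate - actual) / actual
print(f"estimate={estimate:.0f} J, summed={actual:.0f} J, error={error_pct:.2f}%")
```

If nodes draw noticeably different power, the two totals diverge, so logging per-node energy is worth the extra effort when reproducing these measurements.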
When Not To Use
- Do not generalize the exact joules-per-token numbers to different models, hardware, or workloads without remeasuring.
- Avoid using these power/latency trade-offs as definitive guidance for user-facing SLAs without testing on your workload.
Failure Modes
- Sharding increases instantaneous power and can worsen energy-per-token if not tuned.
- Power capping can sharply increase latency if set too low for the workload.
- Low memory utilization assumptions may not hold for different input sizes or other model variants.
Core Entities
Models
- LLaMA 7B
- LLaMA 13B
- LLaMA 65B
Metrics
- Words/sec
- Tokens/sec
- Responses/sec
- Power (W, i.e., energy per second)
- Energy per token (J)
- Energy per response (J)
- GPU SM utilization (%)
- GPU memory utilization (%)
Datasets
- Alpaca (instruction-following, 52k examples; 4,096 sampled)
- GSM8K (math; 4,096 sampled)
Benchmarks
- Throughput (words/tokens/responses per second)
- Power (W), energy per token (J), energy per response (J)

