Measured energy and throughput trade-offs for multi‑GPU LLaMA inference on V100/A100

October 4, 2023 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.8

Citation Count

12

Authors

Siddharth Samsi, Dan Zhao, Joseph McDonald, Baolin Li, Adam Michaleas, Michael Jones, William Bergeron, Jeremy Kepner, Devesh Tiwari, Vijay Gadepally

Why It Matters For Business

LLM inference can draw from hundreds of watts to about a kilowatt per deployed model instance; the choice of GPU type, shard layout, and power cap materially changes operating cost and latency.

Summary TLDR

This paper measures compute, power, and throughput for LLaMA models (7B/13B/65B) on NVIDIA V100 and A100 GPUs. Key takeaways: A100s give higher throughput but draw more instantaneous power; LLaMA 65B inference draws ~300 W to ~1 kW depending on shard count; energy per decoded token is about 3–4 J for common settings; adding shards increases wattage, and sharded runs often leave GPU memory underused (about 20%–27% utilization). Power capping (250 W → 175 W) cut total energy by ~23% while adding ~6–7% latency. The results are empirical baselines, limited to LLaMA and two datasets, with no quantization or other inference optimizations.

Problem Statement

Large language models are widely used but their inference energy and hardware costs are under-measured. The paper aims to quantify energy use, throughput, and resource utilization for multi-GPU inference at realistic scales, so practitioners can plan costs and trade-offs.

Main Contribution

Empirical benchmarks of inference throughput and energy for LLaMA 7B, 13B, and 65B on NVIDIA V100 and A100 GPUs.

Multi-node, multi-GPU (up to 32 GPUs) measurements showing how sharding and batch size affect watts, joules per token, and response energy.

A short study of GPU power capping impact showing sizable energy savings with modest latency increases, and utilization stats that reveal low GPU memory use enabling co-location opportunities.

Key Findings

A100 gives higher throughput but uses more power per second than V100.

Numbers: 7B ~2× throughput gain on A100; 13B ~1.25× (Fig. 2, Fig. 3)

Inference power draw for LLaMA 65B ranges widely with shard count.

Numbers: Energy per second ~300 W (8 GPUs) to ~1,000 W (32 GPUs)

Energy per decoded token is roughly a few joules under tested settings.

Numbers: Energy per token ≈ 3–4 J for max gen length 512 (Fig. 6)
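
One way to obtain a comparable joules-per-token figure on your own hardware is to integrate sampled GPU power over the run and divide by the number of decoded tokens. The sketch below assumes you already have timestamped power readings (for example, polled through nvidia-smi or NVML); the function names and the sample values are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate power (W) over time (s) with the trapezoidal rule to get energy (J)."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have the same length")
    if len(timestamps_s) < 2:
        raise ValueError("need at least two samples to integrate")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt < 0:
            raise ValueError("timestamps must be non-decreasing")
        # Area of the trapezoid between two consecutive samples.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def joules_per_token(timestamps_s: Sequence[float], power_w: Sequence[float],
                     decoded_tokens: int) -> float:
    """Energy per decoded token: total energy divided by tokens generated."""
    if decoded_tokens <= 0:
        raise ValueError("decoded_tokens must be positive")
    return energy_joules(timestamps_s, power_w) / decoded_tokens


# Made-up illustration: a steady 300 W for 10 s while decoding 1,000 tokens.
ts = [0.0, 5.0, 10.0]
pw = [300.0, 300.0, 300.0]
print(joules_per_token(ts, pw, 1000))  # 3.0
```

For a multi-GPU run, the power samples would need to be summed across all participating GPUs before integrating, otherwise the result covers only one device.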

Sharding (more GPUs) increases instant power and often reduces energy efficiency.

Numbers: Energy per second increases with shard count even at the same batch size (Sec. IV-B)

GPU power capping can cut total energy substantially with modest latency hit.

Numbers: 250 W → 175 W: inference time +6.7%, total energy −23.2%; 150 W: time +19.5%, energy ≈ −33% (Table III)
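
Because total energy is average power multiplied by run time, the time and energy changes quoted for power capping also imply a change in average power draw. The sketch below works through that arithmetic; the helper name is hypothetical, and the 150 W result is only as precise as the approximate −33% energy figure.

```python
def implied_avg_power_change(time_change: float, energy_change: float) -> float:
    """Fractional change in average power implied by fractional changes
    in run time and total energy.

    Uses E = P_avg * t, so P'/P = (1 + energy_change) / (1 + time_change).
    """
    if time_change <= -1.0:
        raise ValueError("time_change must be greater than -1")
    return (1.0 + energy_change) / (1.0 + time_change) - 1.0


# Changes relative to the 250 W setting, as quoted for power capping.
for cap_w, dt, de in [(175, 0.067, -0.232), (150, 0.195, -0.33)]:
    dp = implied_avg_power_change(dt, de)
    print(f"{cap_w} W cap: average power change {dp:+.1%}")
```

Under these figures the implied average power is roughly 28% lower at the 175 W cap and roughly 44% lower at the 150 W cap, which is why total energy still falls even though the run takes longer.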

GPU memory is underused in sharded 65B runs, while SM (compute) is highly used.

Numbers: Memory util ~20%–27%; SM util ~94%–99% (Tables IV–V)

Minimum hardware to run 65B without compression is substantial.

Numbers: At least 8× V100 32GB or 4× A100 80GB required (Table II)

Results

A100 vs V100 throughput

Value: 7B ~2× faster on A100; 13B ~1.25× faster on A100

Baseline: V100 throughput

Power draw (LLaMA 65B)

Value: ≈300 W at low shard counts to ≈1,000 W at high shard counts

Baseline: 8 GPU shards

Energy per token

Value: ≈3–4 J per decoded token (max gen length 512)

Baseline: Max gen length 512

Power cap effects

Value: 250 W → 175 W: time +6.7%, energy −23.2%; 150 W: time +19.5%, energy ≈ −33%

Baseline: 250 W cap

GPU utilization

Value: SM util ≈94%–99%; memory util ≈6%–27% depending on shards

Baseline: Various shard configs (A100/V100, Tables IV–V)

Minimum hardware to run 65B

Value: 8× V100 32GB or 4× A100 80GB

Baseline: Single GPU

What To Try In 7 Days

Measure joules-per-token on your own LLM tasks to get a baseline cost-per-query.

If using sharded large models, test fewer shards to find where energy-per-token improves.

Try conservative power capping (e.g., 175W) on A100s in a staging environment and measure latency and energy trade-offs.
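
Once a staging run has produced latency and energy numbers for each candidate setting (power cap or shard count), picking a setting comes down to a simple rule: take the lowest joules-per-token option that stays within a latency budget. The sketch below is one way to express that rule; the `RunResult` type, the labels, and all measurements are made up for illustration and are not results from the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RunResult:
    label: str          # e.g. a power cap or shard count (hypothetical labels)
    latency_s: float    # end-to-end time for the benchmark workload
    energy_j: float     # total energy across all GPUs for the run
    tokens: int         # decoded tokens produced by the run

    @property
    def joules_per_token(self) -> float:
        return self.energy_j / self.tokens


def pick_config(runs: Iterable[RunResult], baseline_label: str,
                max_latency_increase: float) -> RunResult:
    """Return the run with the lowest J/token whose latency stays within
    (1 + max_latency_increase) times the baseline latency."""
    if max_latency_increase < 0:
        raise ValueError("max_latency_increase must be non-negative")
    runs = list(runs)
    baseline = next((r for r in runs if r.label == baseline_label), None)
    if baseline is None:
        raise ValueError(f"no run labelled {baseline_label!r}")
    limit_s = baseline.latency_s * (1.0 + max_latency_increase)
    eligible = [r for r in runs if r.latency_s <= limit_s]
    return min(eligible, key=lambda r: r.joules_per_token)


# Made-up staging measurements for three power-cap settings.
runs = [
    RunResult("cap-250W", 100.0, 40_000.0, 10_000),
    RunResult("cap-175W", 106.7, 30_720.0, 10_000),
    RunResult("cap-150W", 119.5, 26_800.0, 10_000),
]
best = pick_config(runs, "cap-250W", max_latency_increase=0.10)
print(best.label, round(best.joules_per_token, 3))  # cap-175W 3.072
```

The same helper applies to a shard-count sweep: label each run by its shard count and use the current production layout as the baseline.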

Optimization Features

Infra Optimization

  • Choose A100 for throughput but check energy per token
  • Avoid unnecessary sharding to reduce wattage

Model Optimization

  • Model sharding (FairScale) used for multi-GPU inference
  • Paper mentions quantization and distillation as future/related techniques

System Optimization

  • Co-location potential due to low memory utilization
  • Use of MPS/MIG for GPU sharing suggested

Inference Optimization

  • GPU power capping (tested at 250/175/150W)
  • Batch size tuning across shard configs

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments limited to LLaMA models; results may not generalize to other architectures.
  • Did not measure output correctness or task quality when trading energy vs latency.
  • No experiments with quantization, distillation, or other inference-specific optimizations.
  • Energy aggregation multiplies rank-0 energy by the node count, which may hide node-level variance.
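
To see why the last limitation matters, the sketch below contrasts scaling the rank-0 reading by the node count with summing per-node readings. The function names and the per-node values are hypothetical; they only illustrate how the two aggregation methods can diverge when nodes do not draw identical energy.

```python
from typing import Sequence


def total_energy_rank0_scaled(rank0_energy_j: float, num_nodes: int) -> float:
    """Estimate total energy by assuming every node matches rank 0."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return rank0_energy_j * num_nodes


def total_energy_summed(per_node_energy_j: Sequence[float]) -> float:
    """Total energy from an individual reading on every node."""
    return sum(per_node_energy_j)


# Made-up per-node energy readings (J); index 0 plays the role of rank 0.
per_node = [1000.0, 1040.0, 960.0, 1100.0]

scaled = total_energy_rank0_scaled(per_node[0], len(per_node))
summed = total_energy_summed(per_node)
gap = (scaled - summed) / summed

print(f"rank-0 scaled: {scaled:.0f} J, summed: {summed:.0f} J, gap: {gap:+.1%}")
```

If per-node logging is available in your setup, summing the readings avoids this source of error at the cost of collecting data from every node.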

When Not To Use

  • Do not generalize the exact joules-per-token numbers to different models, hardware, or workloads without remeasuring.
  • Avoid using these power/latency trade-offs as definitive guidance for user-facing SLAs without testing on your workload.

Failure Modes

  • Sharding increases instantaneous power and can worsen energy-per-token if not tuned.
  • Power capping can sharply increase latency if set too low for the workload.
  • Low memory utilization assumptions may not hold for different input sizes or other model variants.

Core Entities

Models

  • LLaMA 7B
  • LLaMA 13B
  • LLaMA 65B

Metrics

  • Words/sec
  • Tokens/sec
  • Responses/sec
  • Energy per second (W)
  • Energy per token (J)
  • Energy per response (J)
  • GPU SM utilization (%)
  • GPU memory utilization (%)

Datasets

  • Alpaca (instruction-following; 4,096 samples drawn from 52k)
  • GSM8K (math; 4,096 samples)

Benchmarks

  • Throughput (words/tokens/responses per second)
  • Energy (W), energy per token (J), energy per response (J)