Measured energy and throughput trade-offs for multi‑GPU LLaMA inference on V100/A100

October 4, 2023 · 9 min

Overview

Decision Snapshot: Ready For Pilot

Solid empirical baselines for LLaMA inference on specific GPUs and datasets; limited generality beyond tested configs and no optimization techniques like quantization were applied.

Citations: 12

Evidence Strength: 0.75

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 40%

Authors

Siddharth Samsi, Dan Zhao, Joseph McDonald, Baolin Li, Adam Michaleas, Michael Jones, William Bergeron, Jeremy Kepner, Devesh Tiwari, Vijay Gadepally

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLM inference can draw from hundreds of watts to roughly a kilowatt per deployed model instance; the choice of GPU type, shard layout, and power cap materially changes operating cost and latency.

Summary TLDR

This paper measures compute, power, and throughput for LLaMA models (7B/13B/65B) on NVIDIA V100 and A100 GPUs. Key takeaways: A100s deliver higher throughput but draw more instantaneous power; LLaMA 65B inference draws roughly 300 W to 1 kW depending on shard count; energy per decoded token is about 3–4 J for common settings; model sharding increases wattage and often underuses GPU memory (20%–27%). Power capping from 250 W to 175 W cut total energy by ~23% while adding ~6–7% latency. The results are empirical baselines, limited to LLaMA and two datasets, with no quantization or other inference optimizations applied.
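The power-capping figures above can be cross-checked with simple arithmetic. The sketch below assumes that the reported latency increase applies to the whole run time (an assumption, not something the summary states) and uses the midpoint of the 6–7% range. Since energy equals average power times run time, the implied change in average power follows from the other two ratios.

```python
# Reported in the summary: ~23% less total energy, ~6-7% more latency
# when capping from 250 W to 175 W.
energy_ratio = 1.0 - 0.23    # capped energy / uncapped energy
time_ratio = 1.0 + 0.065     # ASSUMPTION: midpoint of 6-7%, applied to total run time

# energy = average_power * time, so the average-power ratio is the quotient.
implied_power_ratio = energy_ratio / time_ratio

# For comparison: ratio of the configured power limits themselves.
cap_ratio = 175.0 / 250.0

print(f"implied average power ratio: {implied_power_ratio:.3f}")
print(f"configured cap ratio:        {cap_ratio:.3f}")
```

Under these assumptions the implied drop in average power (about 28%) is close to the 30% reduction in the configured limit, which suggests the reported energy and latency figures are mutually consistent.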

Problem Statement

Large language models are widely used but their inference energy and hardware costs are under-measured. The paper aims to quantify energy use, throughput, and resource utilization for multi-GPU inference at realistic scales, so practitioners can plan costs and trade-offs.

Main Contribution

Empirical benchmarks of inference throughput and energy for LLaMA 7B, 13B, and 65B on NVIDIA V100 and A100 GPUs.

Multi-node, multi-GPU (up to 32 GPUs) measurements showing how sharding and batch size affect watts, joules per token, and response energy.

Key Findings

A100 gives higher throughput but draws more power (energy per second) than V100.

Numbers: 7B: ~2× throughput gain on A100; 13B: ~1.25× (Fig. 2, Fig. 3)

Practical Use: Expect faster responses on A100s but higher instantaneous power draw; evaluate cost per token, not only latency.

Evidence Ref: Fig. 2, Fig. 3

Inference power draw for LLaMA 65B ranges widely with shard count.

Numbers: Energy per second ~300 W (8 GPUs) to ~1,000 W (32 GPUs)

Practical Use: Plan power capacity and cooling for hundreds to thousands of watts per model instance when sharding across many GPUs.

Evidence Ref: Sec. IV-B, Fig. 4, Fig. 5

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
A100 vs V100 throughput | 7B ~2× faster on A100; 13B ~1.25× faster on A100 | V100 throughput | 7B +100%; 13B +25% | Alpaca & GSM8K (Fig. 2) | Measured words/tokens/responses per second on V100 and A100 | Fig. 2
Power draw (LLaMA 65B) | ≈300 W at low shard counts to ≈1,000 W at high shard counts | 8 GPU shards | Up to ~3× increase | Alpaca & GSM8K (Fig. 4, Fig. 5) | Aggregate GPU energy divided by run time | Fig. 4, Fig. 5
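The power-draw row is described as aggregate GPU energy divided by run time. A minimal sketch of that calculation, together with the related joules-per-token metric, might look like the following. The function names and all input numbers are hypothetical and chosen only to illustrate the arithmetic; they are not measurements from the paper.

```python
def average_power_watts(total_energy_joules: float, run_time_seconds: float) -> float:
    """Average power over a run: aggregate energy divided by wall-clock time."""
    if run_time_seconds <= 0:
        raise ValueError("run_time_seconds must be positive")
    return total_energy_joules / run_time_seconds


def energy_per_token_joules(total_energy_joules: float, tokens_generated: int) -> float:
    """Energy cost of one decoded token, averaged over the run."""
    if tokens_generated <= 0:
        raise ValueError("tokens_generated must be positive")
    return total_energy_joules / tokens_generated


# HYPOTHETICAL run: energy summed over all GPUs serving one model instance.
energy_j = 180_000.0   # joules, illustrative only
run_time_s = 600.0     # seconds, illustrative only
tokens = 50_000        # decoded tokens, illustrative only

avg_power = average_power_watts(energy_j, run_time_s)
j_per_token = energy_per_token_joules(energy_j, tokens)
print(f"average power: {avg_power:.0f} W, energy per token: {j_per_token:.2f} J")
```

Summing energy across every GPU in the shard group before dividing matters here: a per-GPU average would understate the draw of a model instance spread over many devices.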

What To Try In 7 Days

Measure joules-per-token on your own LLM tasks to get a baseline cost-per-query.

If using sharded large models, test fewer shards to find where energy-per-token improves.

Try conservative power capping (e.g., 175W) on A100s in a staging environment and measure latency and energy trade-offs.
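One way to run these checks is to log timestamped power readings during a baseline run and a capped run, integrate them into joules, and compare. The sketch below assumes the samples are already collected (for example from `nvidia-smi --query-gpu=power.draw --format=csv`, with the cap set via `nvidia-smi -pl 175`, which typically needs administrator rights). The helper names and all sample values are hypothetical.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power (W) over time (s) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w) or len(timestamps_s) < 2:
        raise ValueError("need at least two samples and equal-length inputs")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt < 0:
            raise ValueError("timestamps must be non-decreasing")
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def percent_change(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return 100.0 * (new - old) / old


# HYPOTHETICAL samples for the same workload, uncapped vs capped at 175 W.
baseline_t = [0.0, 1.0, 2.0, 3.0, 4.0]
baseline_p = [240.0, 250.0, 250.0, 250.0, 240.0]
capped_t = [0.0, 1.0, 2.0, 3.0, 4.0, 4.25]
capped_p = [175.0, 175.0, 175.0, 175.0, 175.0, 175.0]

baseline_energy = energy_joules(baseline_t, baseline_p)
capped_energy = energy_joules(capped_t, capped_p)
energy_change_pct = percent_change(capped_energy, baseline_energy)
latency_change_pct = percent_change(capped_t[-1], baseline_t[-1])

print(f"energy change: {energy_change_pct:+.1f}%, latency change: {latency_change_pct:+.1f}%")
```

Dividing each run's energy by its decoded token count gives the joules-per-token baseline; the same logs can be reused when comparing shard counts.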

Optimization Features

Infra Optimization
Choose A100 for throughput but check energy per token
Avoid unnecessary sharding to reduce wattage
Model Optimization
Model sharding (FairScale) used for multi-GPU inference
Paper mentions quantization and distillation as future/related techniques
System Optimization
Co-location potential due to low memory utilization
Use of MPS/MIG for GPU sharing suggested
Inference Optimization
GPU power capping (tested at 250/175/150 W)
Batch size tuning across shard configs
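Tuning batch size across shard configurations amounts to a small grid search over measured runs. As a rough sketch, assuming each run has been reduced to total energy and decoded token count, the configuration with the lowest energy per token can be picked as follows. The grid values are hypothetical placeholders, not results from the paper.

```python
from typing import Dict, Tuple

Config = Tuple[int, int]          # (shard_count, batch_size)
Run = Tuple[float, int]           # (total_energy_joules, tokens_generated)


def energy_per_token_by_config(measurements: Dict[Config, Run]) -> Dict[Config, float]:
    """Convert raw (energy, tokens) measurements into joules per token."""
    return {cfg: energy / tokens for cfg, (energy, tokens) in measurements.items()}


# HYPOTHETICAL measurements; replace with numbers from your own runs.
measurements: Dict[Config, Run] = {
    (8, 64): (300_000.0, 100_000),
    (16, 64): (420_000.0, 120_000),
    (16, 128): (480_000.0, 150_000),
    (32, 64): (640_000.0, 160_000),
}

per_token = energy_per_token_by_config(measurements)
best_config = min(per_token, key=per_token.get)

for cfg in sorted(per_token):
    print(f"shards={cfg[0]:>2} batch={cfg[1]:>3}: {per_token[cfg]:.2f} J/token")
print(f"lowest energy per token: shards={best_config[0]}, batch={best_config[1]}")
```

Energy per token is only one axis; a configuration that wins here may still miss a throughput or latency target, so it is worth keeping those columns alongside it.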

Risks & Boundaries

Limitations

Experiments limited to LLaMA models; results may not generalize to other architectures.

Did not measure output correctness or task quality when trading energy vs latency.

When Not To Use

Do not generalize the exact joules-per-token numbers to different models, hardware, or workloads without remeasuring.

Avoid using these power/latency trade-offs as definitive guidance for user-facing SLAs without testing on your workload.

Failure Modes

Sharding increases instantaneous power and can worsen energy-per-token if not tuned.

Power capping can sharply increase latency if set too low for the workload.
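To guard against the power-capping failure mode, one option is to sweep several caps and keep only those that stay inside a latency budget. The sketch below is one way to express that selection; the `CapResult` type, the 10% budget, and every number in the sweep are hypothetical and would need to be replaced with measurements from the target workload.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class CapResult:
    cap_watts: int
    energy_joules: float
    latency_seconds: float


def best_cap(
    results: Iterable[CapResult],
    baseline: CapResult,
    max_latency_increase: float = 0.10,  # ASSUMPTION: 10% latency budget
) -> Optional[CapResult]:
    """Return the lowest-energy result whose latency stays within the budget."""
    limit = baseline.latency_seconds * (1.0 + max_latency_increase)
    within_budget = [r for r in results if r.latency_seconds <= limit]
    if not within_budget:
        return None
    return min(within_budget, key=lambda r: r.energy_joules)


# HYPOTHETICAL sweep; values are illustrative only.
baseline = CapResult(cap_watts=250, energy_joules=1000.0, latency_seconds=10.0)
sweep = [
    baseline,
    CapResult(cap_watts=175, energy_joules=770.0, latency_seconds=10.65),
    CapResult(cap_watts=150, energy_joules=740.0, latency_seconds=12.5),
]

choice = best_cap(sweep, baseline)
print(choice)
```

In this made-up sweep the 150 W setting uses the least energy but exceeds the latency budget, so the 175 W setting is selected instead.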

Core Entities

Models

LLaMA 7B, LLaMA 13B, LLaMA 65B

Metrics

Words/sec, Tokens/sec, Responses/sec, Energy per second (W), Energy per token (J), Energy per response (J), GPU SM utilization (%), GPU memory utilization (%)

Datasets

Alpaca (instruction-following, 52k examples; 4,096 sampled), GSM8K (math; 4,096 sampled)

Benchmarks

Throughput (words/tokens/responses per second); Energy (W), energy per token (J), energy per response (J)