Measured energy and throughput trade-offs for multi‑GPU LLaMA inference on V100/A100

Overview

Decision Snapshot: Ready For Pilot

Solid empirical baselines for LLaMA inference on specific GPUs and datasets; limited generality beyond tested configs and no optimization techniques like quantization were applied.

Citations: 12

Evidence Strength: 0.75

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 40%

Authors

Siddharth Samsi, Dan Zhao, Joseph McDonald, Baolin Li, Adam Michaleas, Michael Jones, William Bergeron, Jeremy Kepner, Devesh Tiwari, Vijay Gadepally

Why It Matters For Business

LLM inference can draw hundreds to thousands of watts per deployed model instance; the choice of GPU type, shard layout, and power cap materially changes operating cost and latency.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

This paper measures compute, power, and throughput for LLaMA models (7B/13B/65B) using NVIDIA V100 and A100 GPUs. Key takeaways: A100s give higher throughput but higher instantaneous power; LLaMA 65B inference draws ~300 W to 1 kW depending on shard count; energy per decoded token is about 3–4 J for common settings; model sharding increases wattage and often underuses GPU memory (20%–27%). Power capping (250→175W) cut total energy ~23% while adding ~6–7% latency. Results are empirical baselines, limited to LLaMA, two datasets, and no quantization or other inference optimizations.

Problem Statement

Large language models are widely used but their inference energy and hardware costs are under-measured. The paper aims to quantify energy use, throughput, and resource utilization for multi-GPU inference at realistic scales, so practitioners can plan costs and trade-offs.

Main Contribution

Empirical benchmarks of inference throughput and energy for LLaMA 7B, 13B, and 65B on NVIDIA V100 and A100 GPUs.

Multi-node, multi-GPU (up to 32 GPUs) measurements showing how sharding and batch size affect watts, joules per token, and response energy.

Key Findings

A100 gives higher throughput but draws more power than V100.

Numbers: 7B: ~2× throughput gain on A100; 13B: ~1.25× (Fig. 2, Fig. 3)

Practical Use: Expect faster responses on A100s but higher instantaneous power draw; evaluate cost per token, not only latency.

Evidence Ref: Fig. 2, Fig. 3

Inference power draw for LLaMA 65B ranges widely with shard count.

Numbers: Energy per second ~300 W (8 GPUs) to ~1,000 W (32 GPUs)

Practical Use: Plan power capacity and cooling for hundreds to thousands of watts per model instance when sharding across many GPUs.

Evidence Ref: Sec. IV-B, Fig. 4, Fig. 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
A100 vs V100 throughput	7B ~2× faster on A100; 13B ~1.25× faster on A100	V100 throughput	7B +100% ; 13B +25%	Alpaca & GSM8K (Fig.2)	Measured words/tokens/responses per second on V100 and A100	Fig.2
Power draw (LLaMA 65B)	≈300 W at low shard counts to ≈1000 W at high shard counts	8 GPU shards	Up to ~3× increase	Alpaca & GSM8K (Fig.4, Fig.5)	Aggregate GPU energy divided by run time	Fig.4, Fig.5
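
The Evidence column describes power draw as aggregate GPU energy divided by run time. The arithmetic can be sketched as below; the function names and the input values are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
def average_power_watts(total_energy_joules: float, runtime_seconds: float) -> float:
    """Average power draw: aggregate energy divided by run time."""
    if runtime_seconds <= 0:
        raise ValueError("runtime_seconds must be positive")
    return total_energy_joules / runtime_seconds


def energy_per_token_joules(total_energy_joules: float, tokens_generated: int) -> float:
    """Energy per decoded token: aggregate energy divided by token count."""
    if tokens_generated <= 0:
        raise ValueError("tokens_generated must be positive")
    return total_energy_joules / tokens_generated


# Illustrative inputs only (assumed, not measured values from the paper).
energy_j = 180_000.0   # energy summed over all GPUs in the run
runtime_s = 600.0      # wall-clock duration of the run
tokens = 50_000        # tokens decoded during the run

print(average_power_watts(energy_j, runtime_s))     # 300.0
print(energy_per_token_joules(energy_j, tokens))    # 3.6
```

Summing energy over all GPUs before dividing keeps the result comparable across shard counts, since more shards contribute more devices to the total.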

What To Try In 7 Days

Measure joules-per-token on your own LLM tasks to get a baseline cost-per-query.

If using sharded large models, test fewer shards to find where energy-per-token improves.

Try conservative power capping (e.g., 175W) on A100s in a staging environment and measure latency and energy trade-offs.
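
A joules-per-token baseline can be sketched by integrating timestamped power readings over a run and dividing by the tokens generated. The sketch below assumes you already have `(timestamp_seconds, power_watts)` samples summed across all GPUs, for example from periodically polling `nvidia-smi --query-gpu=power.draw --format=csv`; the function names and sample values are hypothetical, and this is not the paper's measurement pipeline.

```python
from typing import Sequence, Tuple

PowerSample = Tuple[float, float]  # (timestamp in seconds, power in watts)


def integrate_energy_joules(samples: Sequence[PowerSample]) -> float:
    """Trapezoidal integration of power over time, giving energy in joules."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t1 < t0:
            raise ValueError("samples must be sorted by timestamp")
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    return energy


def joules_per_token(samples: Sequence[PowerSample], tokens_generated: int) -> float:
    """Energy per generated token for one measured run."""
    if tokens_generated <= 0:
        raise ValueError("tokens_generated must be positive")
    return integrate_energy_joules(samples) / tokens_generated


# Made-up readings for illustration: 1 s apart, power in watts.
samples = [(0.0, 200.0), (1.0, 240.0), (2.0, 240.0), (3.0, 200.0)]
print(integrate_energy_joules(samples))   # 680.0
print(joules_per_token(samples, 200))     # 3.4
```

Running the same measurement with and without a power cap, or at different shard counts, gives directly comparable joules-per-token figures for your own workload.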

Optimization Features

Infra Optimization

Choose A100 for throughput but check energy per token

Avoid unnecessary sharding to reduce wattage

Model Optimization

Model sharding (FairScale) used for multi-GPU inference

Paper mentions quantization and distillation as future/related techniques

System Optimization

Co-location potential due to low memory utilization

Use of MPS/MIG for GPU sharing suggested

Inference Optimization

GPU power capping (tested at 250/175/150 W)

Batch size tuning across shard configs
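
The power-capping trade-off reported in the summary (about 23% less total energy for about 6–7% more latency when going from 250 W to 175 W) can be sanity-checked with the identity energy = average power × run time. The sketch below is a back-of-the-envelope calculation under the assumption that the latency increase stretches the whole run uniformly; the function names are hypothetical and the 1.065 run-time ratio is simply the midpoint of the reported range.

```python
def energy_ratio(power_ratio: float, runtime_ratio: float) -> float:
    """Relative energy after a change, given E = P * t."""
    return power_ratio * runtime_ratio


def implied_power_ratio(observed_energy_ratio: float, runtime_ratio: float) -> float:
    """Average-power ratio implied by an observed energy and run-time change."""
    if runtime_ratio <= 0:
        raise ValueError("runtime_ratio must be positive")
    return observed_energy_ratio / runtime_ratio


observed_energy_ratio = 1.0 - 0.23   # ~23% less total energy (from the summary)
runtime_ratio = 1.065                # assumed midpoint of ~6-7% added latency

avg_power_ratio = implied_power_ratio(observed_energy_ratio, runtime_ratio)
print(f"implied average power ratio: {avg_power_ratio:.3f}")

# Round trip: the implied power ratio reproduces the observed energy ratio.
print(f"energy ratio check: {energy_ratio(avg_power_ratio, runtime_ratio):.2f}")
```

Under these assumptions the implied average power under the cap is roughly 0.72 of the uncapped value, which is close to the nominal cap ratio of 175/250 = 0.70. On a staging machine the cap itself is typically set with `nvidia-smi -pl 175`, which requires administrator privileges.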

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/facebookresearch/llama

https://github.com/tatsu-lab/stanford_alpaca

Data URLs

https://github.com/tatsu-lab/stanford_alpaca

https://github.com/openai/gsm8k (dataset referenced as GSM8K)

Risks & Boundaries

Limitations

Experiments limited to LLaMA models; results may not generalize to other architectures.

Did not measure output correctness or task quality when trading energy vs latency.

When Not To Use

Do not generalize the exact joules-per-token numbers to different models, hardware, or workloads without remeasuring.

Avoid using these power/latency trade-offs as definitive guidance for user-facing SLAs without testing on your workload.

Failure Modes

Sharding increases instantaneous power and can worsen energy-per-token if not tuned.

Power capping can sharply increase latency if set too low for the workload.

Core Entities

Models

LLaMA 7B, LLaMA 13B, LLaMA 65B

Metrics

Words/sec, Tokens/sec, Responses/sec, Energy per second (W), Energy per token (J), Energy per response (J), GPU SM utilization (%), GPU memory utilization (%)

Datasets

Alpaca (instruction-following, 4,096 sampled from 52k)

GSM8K (math, 4,096 sampled)

Benchmarks

Throughput (words/tokens/responses per second)

Energy (W), energy per token (J), energy per response (J)
