throttLL'eM: cut LLM inference energy by throttling GPU frequency and right-sizing instances while keeping latency SLOs

August 5, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

0

Authors

Andreas Kosmas Kakolyris, Dimosthenis Masouros, Petros Vavaroutsos, Sotirios Xydis, Dimitrios Soudris

Links

Abstract / PDF

Why It Matters For Business

In the reported tests, throttLL'eM lowered GPU energy use by up to 43.8% (24.7% on average without autoscaling) while keeping latency SLOs in most cases, which reduces operating cost and carbon footprint for LLM inference services.

Summary TLDR

throttLL'eM is a serving system that predicts future KV-cache use and batch sizes, then picks the lowest GPU frequency and engine size that still meet latency SLOs. On A100/A10G tests with LLaMa-family models and an Azure invocation trace, it cuts energy by up to 43.8% and boosts tokens-per-Joule by ~1.7× while keeping p99 end-to-end latency and time-between-tokens (TBT) SLOs in most cases. The system relies on a fast gradient-boosted tree predictor (≈3 ms inference) and cheap projections (≈2 ms), but it needs profiling and accurate generation-length prediction to avoid SLO risk.

Problem Statement

LLM inference uses a lot of GPU power. Static power caps or coarse autoscaling either break latency SLOs or leave energy savings on the table. What is needed is a runtime method that dynamically reduces GPU energy (via frequency scaling and inference-engine sizing) while still meeting latency and token-generation SLOs.

Main Contribution

throttLL'eM: an SLO-aware framework that combines generation-length prediction, KV-cache and batch-size projection, an ML performance predictor, GPU frequency selection, and autoscaling.

A lightweight ML-based iteration-level performance model (gradient-boosted trees) that predicts iterations per second (IPS) with R² ≥ 0.97 and MAE below 1 IPS.

Empirical study showing KV-cache and batch dynamics strongly affect throughput, and that frequency/batch trade-offs produce an energy-efficiency sweet spot.

Key Findings

Performance model predicts iteration throughput accurately.

Numbers: R² ≥ 0.97; MAE < 1 IPS (Table 3)

KV-cache usage strongly correlates with performance.

Numbers: Pearson correlation 0.92 (KV vs. TBT), −0.92 (KV vs. throughput)

KV and batch projection are precise in practice.

Numbers: average errors of 0.19% (batch) and 2.26% (KV); per-iteration drift 0.43 ms

Energy drops substantially with combined throttling + autoscaling.

Numbers: up to 43.8% energy reduction; 24.7% on average without autoscaling

Energy-efficiency (tokens per Joule) improves over baseline.

Numbers: 1.71×–1.78× TPJ improvement; baseline 0.69 TPJ → throttLL'eM 1.19–1.23 TPJ

Results

Performance model accuracy (R²)

Value: ≥0.97 (90/10 split), ≥0.96 (10/90 split)

Performance model error (MAE)

Value: <1 IPS (MAE 0.69–0.99 iterations/s)

Projection errors

Value: batch 0.19% avg; KV 2.26% avg; drift 0.43 ms/iteration

Energy consumption reduction

Value: up to 43.8% less energy (with autoscaling)

Baseline: Triton at maximum frequency

Energy-efficiency (TPJ)

Value: 0.69 → 1.19–1.23 TPJ (1.71×–1.78×)

Baseline: Triton TP4

Scheduler + throttling runtime

Value: 35 ms combined under heavy load

Who Should Care

What To Try In 7 Days

Profile your engine at a few frequencies to map IPS against KV-cache usage and batch size; the predictive model needs this data.

Add a lightweight per-iteration IPS predictor (GBDT/XGBoost) and measure prediction latency (~3 ms).

Start with non-critical traffic and apply frequency scaling in a conservative band near the identified efficiency sweet spot (e.g., ~1050 MHz) to measure energy vs. latency trade-offs.
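A conservative frequency sweep could be planned with a small helper like the one below. This is a minimal sketch, not the tooling used by the authors: it only builds `nvidia-smi` command lines (using the `--lock-gpu-clocks` and `--reset-gpu-clocks` options) and does not run them. The function names, the GPU index, and the 15 MHz step are assumptions for illustration; supported clock values differ per GPU and can be listed with `nvidia-smi -q -d SUPPORTED_CLOCKS`.

```python
from typing import List


def lock_clock_cmd(gpu_index: int, freq_mhz: int) -> List[str]:
    """Build the command that pins the graphics clock to one frequency.

    Locking clocks usually requires administrator privileges.
    """
    return ["nvidia-smi", "-i", str(gpu_index),
            f"--lock-gpu-clocks={freq_mhz},{freq_mhz}"]


def reset_clock_cmd(gpu_index: int) -> List[str]:
    """Build the command that restores default clock behavior."""
    return ["nvidia-smi", "-i", str(gpu_index), "--reset-gpu-clocks"]


def sweep_plan(gpu_index: int, center_mhz: int,
               half_width_mhz: int, step_mhz: int) -> List[List[str]]:
    """Commands for a sweep over a band around a candidate sweet spot.

    Hypothetical helper: the band and step are illustrative, not values
    prescribed by the paper.
    """
    freqs = range(center_mhz - half_width_mhz,
                  center_mhz + half_width_mhz + 1, step_mhz)
    return [lock_clock_cmd(gpu_index, f) for f in freqs]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in sweep_plan(0, 1050, 30, 15):
        print(" ".join(cmd))
    print(" ".join(reset_clock_cmd(0)))
```

Each planned command can be passed to `subprocess.run` once the band has been checked against the clocks the GPU actually supports. Measure energy and latency at each step, then reset the clocks afterwards.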

Agent Features

Memory

  • KV cache projection (tracks future allocated KV blocks)
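To make the idea of KV-cache projection concrete, here is a minimal sketch under simplifying assumptions: every active request produces one token per iteration, KV memory is allocated in fixed-size blocks, and a request releases its blocks once its predicted output length is reached. The `Request` fields, the block size of 16 tokens, and both function names are hypothetical and not taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    tokens_so_far: int        # prompt + generated tokens already in the KV cache
    predicted_remaining: int  # output tokens still expected (length predictor)


def project_kv_blocks(requests: List[Request], horizon: int,
                      block_size: int = 16) -> List[int]:
    """Projected allocated KV blocks for each of the next `horizon` iterations."""
    usage = []
    for step in range(1, horizon + 1):
        blocks = 0
        for r in requests:
            if step <= r.predicted_remaining:  # request is still generating
                blocks += math.ceil((r.tokens_so_far + step) / block_size)
        usage.append(blocks)
    return usage


def project_batch_size(requests: List[Request], horizon: int) -> List[int]:
    """Projected number of active requests for each future iteration."""
    return [sum(1 for r in requests if step <= r.predicted_remaining)
            for step in range(1, horizon + 1)]


if __name__ == "__main__":
    reqs = [Request(tokens_so_far=30, predicted_remaining=3),
            Request(tokens_so_far=10, predicted_remaining=1)]
    print(project_kv_blocks(reqs, horizon=4))   # [3, 2, 3, 0]
    print(project_batch_size(reqs, horizon=4))  # [2, 1, 1, 0]
```

The two projected series are the kind of per-iteration inputs a performance predictor would consume. If the length predictor under-estimates, the projection drops requests too early, which is why a conservative margin on `predicted_remaining` may be worth adding.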

Tool Use

  • GPU DVFS (frequency scaling)
  • Autoscaling (tensor-parallel engine sizing)
  • Admission control / scheduling

Optimization Features

Token Efficiency

  • Batch-size increase due to lower frequencies (avg +24.5% at 0% length-prediction error)
  • Reduced iterations (avg −24.3% at 0% length-prediction error)

Infra Optimization

  • Prefer tensor parallel (TP) intra-node configurations for efficiency

System Optimization

  • Shadow instancing to hide provisioning latency
  • Binary search over frequencies per admission
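The binary search over frequencies can be sketched as follows. This is an illustration, not the authors' implementation: it assumes predicted throughput never decreases as frequency rises, and it uses a hypothetical linear `toy_predictor` in place of the learned performance model. The 210 to 1410 MHz grid in 15 MHz steps is likewise an illustrative assumption.

```python
from typing import Callable, Optional, Sequence


def lowest_feasible_frequency(
    freqs_mhz: Sequence[int],
    predict_ips: Callable[[int], float],
    required_ips: float,
) -> Optional[int]:
    """Return the lowest frequency whose predicted IPS meets `required_ips`.

    Assumes `freqs_mhz` is sorted ascending and `predict_ips` is
    non-decreasing in frequency. Returns None if no frequency is feasible.
    """
    lo, hi = 0, len(freqs_mhz) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if predict_ips(freqs_mhz[mid]) >= required_ips:
            best = freqs_mhz[mid]   # feasible: try something lower
            hi = mid - 1
        else:
            lo = mid + 1            # too slow: need a higher clock
    return best


def toy_predictor(freq_mhz: int) -> float:
    # Hypothetical stand-in for the learned model: IPS grows linearly with clock.
    return freq_mhz / 50.0


FREQS = list(range(210, 1411, 15))  # illustrative frequency grid in MHz

if __name__ == "__main__":
    print(lowest_feasible_frequency(FREQS, toy_predictor, 20.0))  # 1005
```

With roughly 80 candidate frequencies, the search needs about seven predictor calls per admission, which helps keep the decision cheap when the predictor itself takes a few milliseconds.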

Inference Optimization

  • GPU frequency scaling (DVFS)
  • Autoscaling engine size (tensor parallelism)
  • Iteration-level performance prediction
  • KV-cache and batch-size projection
  • Admission control to meet TBT and E2E SLOs
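One way to picture the admission check is shown below. It is a simplified sketch under stated assumptions: each iteration emits one token per active request, so TBT is approximated by the iteration time 1/IPS, and the predictor is passed in as a plain function of batch size and KV blocks. The `admit` signature and its parameters are hypothetical, not the paper's interface.

```python
from typing import Callable, List, Sequence


def admit(
    batch_per_iter: Sequence[int],       # projected batch size per future iteration
    kv_blocks_per_iter: Sequence[int],   # projected KV blocks per future iteration
    finish_iter: Sequence[int],          # 1-based iteration at which each request ends
    time_budget_s: Sequence[float],      # remaining E2E budget for each request
    predict_ips: Callable[[int, int], float],
    tbt_slo_s: float,
) -> bool:
    """Return True if both the TBT and the E2E SLOs are predicted to hold.

    Assumes every entry of `finish_iter` lies within the projected horizon.
    """
    elapsed = 0.0
    cumulative: List[float] = []
    for batch, kv in zip(batch_per_iter, kv_blocks_per_iter):
        iter_time = 1.0 / predict_ips(batch, kv)
        if iter_time > tbt_slo_s:        # a single slow iteration breaks TBT
            return False
        elapsed += iter_time
        cumulative.append(elapsed)
    # E2E check: each request must finish inside its remaining budget.
    return all(cumulative[f - 1] <= budget
               for f, budget in zip(finish_iter, time_budget_s))


if __name__ == "__main__":
    def constant_ips(batch: int, kv_blocks: int) -> float:
        return 20.0                      # 50 ms per iteration, for illustration only

    print(admit([2, 1, 1], [3, 2, 3], [3, 1], [0.2, 0.1], constant_ips, 0.1))
```

A scheduler could run this check for a candidate frequency and queue the new request when it fails. The failure modes listed in this summary note that such queuing raises time-to-first-token.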

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires per-model profiling (can take up to a day for large models).
  • Relies on generation-length predictor; high errors force conservative frequency choices and reduce savings.
  • Tensor-parallel approach favors intra-node scaling and may not generalize to multi-node setups without extra orchestration.

When Not To Use

  • When the generation-length predictor is unavailable or highly inaccurate for your workload.
  • If startup/provisioning latency is extremely high and autoscaling cannot be masked (no shadowing).
  • For engines constrained to very small KV capacity where throttling cannot meet SLOs (e.g., limited batch/KV blocks).

Failure Modes

  • Under-prediction of generation length → SLO violations unless conservative margin used.
  • Excessive queuing from admission checks → increased Time-To-First-Token (TTFT) and user-perceived latency.
  • Rapid traffic spikes faster than autoscaling/grace periods → transient p99 violations.

Core Entities

Models

  • LLaMa2-13B
  • LLaMa3-8B
  • LLaMa3-70B
  • Mixtral 8x7B

Metrics

  • IPS (iterations per second)
  • TBT (time-between-tokens)
  • E2E latency (p99)
  • TPJ (tokens per Joule)
  • R², MAE, MAPE (prediction metrics)
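The serving metrics above reduce to short formulas. The sketch below assumes energy is approximated as average power times duration and that token timestamps are available per request. The function names and the sample numbers are illustrative and are not measurements from the paper.

```python
from typing import Sequence


def tokens_per_joule(tokens: int, avg_power_w: float, duration_s: float) -> float:
    """TPJ: generated tokens divided by energy (average watts x seconds)."""
    energy_j = avg_power_w * duration_s
    return tokens / energy_j


def mean_tbt(token_timestamps_s: Sequence[float]) -> float:
    """Mean time between consecutive tokens of one request, in seconds."""
    gaps = [b - a for a, b in zip(token_timestamps_s, token_timestamps_s[1:])]
    return sum(gaps) / len(gaps)


def iterations_per_second(iterations: int, duration_s: float) -> float:
    """IPS: engine iterations completed per second."""
    return iterations / duration_s


if __name__ == "__main__":
    # Illustrative numbers only: 12,000 tokens at 400 W average over 25 s.
    print(tokens_per_joule(12_000, 400.0, 25.0))
    print(mean_tbt([0.0, 0.05, 0.10, 0.20]))
    print(iterations_per_second(500, 25.0))
```

Comparing TPJ between a baseline run and a throttled run over the same trace gives the kind of efficiency ratio reported in the findings.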

Datasets

  • Azure LLM inference invocation trace (used for workload shape)

Benchmarks

  • MLPerf (used to define E2E SLOs)

Context Entities

Datasets

  • Synthetic queries matched to Azure trace (privacy-preserving)