throttLL'eM: cut LLM inference energy by throttling GPU frequency and right-sizing instances while keeping latency SLOs

August 5, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The system combines known building blocks (DVFS, autoscaling, GBDT predictors) in a new, LLM-aware control loop; results are consistent across GPUs and models but depend on profiling and length-prediction quality.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 50%

Authors

Andreas Kosmas Kakolyris, Dimosthenis Masouros, Petros Vavaroutsos, Sotirios Xydis, Dimitrios Soudris

Why It Matters For Business

throttLL'eM can lower GPU energy costs by tens of percent while keeping latency SLOs, reducing operating cost and carbon footprint for LLM inference services.

Who Should Care

Summary TLDR

throttLL'eM is a serving system that predicts future KV-cache usage and batch sizes, then picks the lowest GPU frequency and smallest engine size that still meet latency SLOs. In tests on A100 and A10G GPUs with LLaMa-family models and an Azure invocation trace, it cuts energy consumption by up to 43.8% and raises tokens per Joule by ~1.7× while meeting p99 end-to-end latency and time-between-tokens (TBT) SLOs in most cases. The system relies on a fast gradient-boosted tree predictor (≈3 ms per inference) and cheap projections (≈2 ms), but it needs per-model profiling and accurate generation-length prediction to avoid SLO risk.

Problem Statement

LLM inference draws a lot of GPU power. Static power caps and coarse autoscaling either break latency SLOs or leave energy savings on the table. We need a runtime method that dynamically reduces GPU energy (via frequency scaling and inference-engine sizing) while provably meeting latency and token-generation SLOs.

Main Contribution

throttLL'eM: an SLO-aware framework that combines generation-length prediction, KV-cache and batch-size projection, an ML performance predictor, GPU frequency selection, and autoscaling.

A lightweight ML-based iteration-level performance model (gradient-boosted trees) that predicts iterations per second (IPS) with R² ≥ 0.97 and MAE < 1 IPS.

Key Findings

Performance model predicts iteration throughput accurately.

Numbers: R² ≥ 0.97; MAE < 1 IPS (Table 3)

Practical Use: Use a small tree model to forecast per-iteration IPS in ~3 ms and guide safe frequency reductions.

Evidence Ref: Sec. 5.2, Table 3
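To make the two reported metrics concrete, here is a minimal sketch of how R² and MAE could be computed for an IPS predictor. The sample values are invented for illustration and are not taken from Table 3; the function names are hypothetical.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(actual, predicted):
    """Average absolute gap between measured and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical measured vs. predicted iterations per second (IPS).
measured_ips = [10.0, 20.0, 30.0, 40.0]
predicted_ips = [10.5, 19.5, 30.5, 39.5]

r2 = r2_score(measured_ips, predicted_ips)
mae = mean_absolute_error(measured_ips, predicted_ips)
print(f"R2={r2:.3f} MAE={mae:.2f} IPS")
```

An MAE below 1 IPS means the predictor is, on average, off by less than one iteration per second, which is the margin that matters when choosing a frequency close to the SLO boundary.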

KV-cache usage strongly correlates with performance.

Numbers: Pearson correlation 0.92 (KV vs TBT), -0.92 (KV vs throughput)

Practical Use: Project KV cache blocks to predict slowdown and avoid overcommitting memory that would spike latency.

Evidence Ref: Sec. 3.2, Fig. 3d
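A correlation like this can be checked on your own profiling data. The sketch below computes the Pearson coefficient in plain Python; the KV-block, TBT, and throughput samples are made-up placeholders, not measurements from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Hypothetical profiling samples: allocated KV blocks per iteration,
# the observed time between tokens, and the observed throughput.
kv_blocks = [100, 200, 300, 400, 500]
tbt_ms = [20.0, 24.0, 27.0, 33.0, 36.0]
throughput_ips = [50.0, 42.0, 37.0, 30.0, 28.0]

print(f"KV vs TBT:        {pearson(kv_blocks, tbt_ms):+.2f}")
print(f"KV vs throughput: {pearson(kv_blocks, throughput_ips):+.2f}")
```

A strongly positive value for TBT and a strongly negative one for throughput would suggest that KV usage is a useful input feature for a performance model on your hardware too.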

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy (R²) | ≥0.97 (90/10 split), ≥0.96 (10/90 split) | - | - | Table 3, evaluated LLM engines | R² scores in Table 3 | Sec. 5.2, Table 3
Performance model error (MAE) | <1 IPS (MAE 0.69-0.99 iters/s) | - | - | Table 3 | MAE per engine in Table 3 | Sec. 5.2, Table 3

What To Try In 7 Days

Profile your engine at a few frequencies to map IPS vs KV and batch — needed for the predictive model.

Add a lightweight per-iteration IPS predictor (GBDT/XGBoost) and measure prediction latency (~3 ms).

Start with non-critical traffic and apply frequency scaling in a conservative band near the identified efficiency sweet spot (e.g., ~1050 MHz) to measure energy vs latency trade-offs.
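For the second step, prediction latency can be measured with a small timing harness like the one sketched below. `dummy_predict` is a hypothetical stand-in for a trained GBDT/XGBoost model, and its feature tuple (KV blocks, batch size, frequency) is an assumption about what such a model might consume; swap in your real predictor to get meaningful numbers.

```python
import statistics
import time


def measure_latency_ms(predict, inputs, warmup=10):
    """Time each predict(x) call and return mean and p99 latency in ms."""
    for x in inputs[:warmup]:  # warm caches before timing
        predict(x)
    samples = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p99_index = min(len(samples) - 1, int(0.99 * len(samples)))
    return {"mean_ms": statistics.fmean(samples), "p99_ms": samples[p99_index]}


def dummy_predict(features):
    """Hypothetical placeholder for a trained IPS predictor."""
    kv_blocks, batch_size, freq_mhz = features
    return freq_mhz / (1.0 + 0.001 * kv_blocks + 0.01 * batch_size)


# Hypothetical inputs: varying KV usage at batch size 8 and 1050 MHz.
inputs = [(kv, 8, 1050) for kv in range(0, 1000, 10)]
stats = measure_latency_ms(dummy_predict, inputs)
print(stats)
```

If the measured p99 is well above the ~3 ms figure, the predictor may be too slow to consult on every admission decision.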

Agent Features

Memory
KV cache projection (tracks future allocated KV blocks)
Tool Use
GPU DVFS (frequency scaling)
Autoscaling (tensor-parallel engine sizing)
Admission control / scheduling
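The KV cache projection idea can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes one new token per active request per iteration, a paged KV cache with a fixed block size (16 tokens here, an assumed value), and a per-request predicted remaining length from some length predictor. The `Request` type and `project_kv_blocks` function are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    tokens_so_far: int        # prompt tokens plus tokens generated so far
    predicted_remaining: int  # predicted tokens still to be generated


def project_kv_blocks(requests, horizon, block_size=16):
    """Return (batch_size, kv_blocks) for each of the next `horizon` iterations.

    Assumes each active request emits one token per iteration and leaves
    the batch once its predicted remaining tokens are exhausted.
    """
    projection = []
    for step in range(1, horizon + 1):
        active = [r for r in requests if r.predicted_remaining >= step]
        blocks = sum(
            math.ceil((r.tokens_so_far + step) / block_size) for r in active
        )
        projection.append((len(active), blocks))
    return projection


# Two hypothetical in-flight requests.
reqs = [Request(tokens_so_far=30, predicted_remaining=3),
        Request(tokens_so_far=100, predicted_remaining=5)]
proj = project_kv_blocks(reqs, horizon=5)
print(proj)
```

The projected batch size and block count per future iteration are the kind of inputs a performance predictor could use to estimate IPS before a new request is admitted.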

Optimization Features

Token Efficiency
Batch-size increase due to lower frequencies (avg +24.5% at 0% error)
Reduced iterations (avg −24.3% at 0% error)
Infra Optimization
Prefer tensor parallel (TP) intra-node configurations for efficiency
System Optimization
Shadow instancing to hide provisioning latency
Binary search over frequencies per admission
Inference Optimization
GPU frequency scaling (DVFS)
Autoscaling engine size (tensor parallelism)
Iteration-level performance prediction
KV-cache and batch-size projection
Admission control to meet TBT and E2E SLOs
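A binary search over frequencies might look like the sketch below. It assumes predicted IPS is non-decreasing in frequency and that the TBT SLO translates into a minimum IPS of 1000 / TBT (in ms). The frequency grid and the linear `toy_predictor` are illustrative placeholders rather than a real performance model, and `lowest_safe_frequency` is a hypothetical name.

```python
def lowest_safe_frequency(freqs_mhz, predict_ips, required_ips):
    """Binary-search the lowest frequency whose predicted IPS meets the target.

    Assumes `freqs_mhz` is sorted ascending and `predict_ips` is
    non-decreasing in frequency. Returns None if even the highest
    frequency cannot meet the target.
    """
    lo, hi = 0, len(freqs_mhz) - 1
    if predict_ips(freqs_mhz[hi]) < required_ips:
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if predict_ips(freqs_mhz[mid]) >= required_ips:
            hi = mid       # mid is safe, try lower frequencies
        else:
            lo = mid + 1   # mid is too slow, go higher
    return freqs_mhz[lo]


def toy_predictor(freq_mhz):
    """Hypothetical stand-in for the ML predictor: IPS grows with frequency."""
    return 0.02 * freq_mhz


freqs = list(range(210, 1411, 15))    # illustrative grid, 210..1410 MHz
tbt_slo_ms = 50.0                     # assumed time-between-tokens SLO
required_ips = 1000.0 / tbt_slo_ms    # 20 iterations per second

chosen = lowest_safe_frequency(freqs, toy_predictor, required_ips)
print(chosen)
```

With 81 candidate frequencies the search needs at most about seven predictor calls, so a per-call latency of a few milliseconds keeps the total overhead per admission small.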

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Requires per-model profiling (can take up to a day for large models).

Relies on generation-length predictor; high errors force conservative frequency choices and reduce savings.

When Not To Use

When the generation-length predictor is unavailable or highly inaccurate for your workload.

If startup/provisioning latency is extremely high and autoscaling cannot be masked (no shadowing).

Failure Modes

Under-prediction of generation length → SLO violations unless conservative margin used.

Excessive queuing from admission checks → increased Time-To-First-Token (TTFT) and user-perceived latency.
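One simple way to apply a conservative margin against under-prediction is to inflate each predicted generation length before it is used in projections. The sketch below is an assumption about how such a margin could be applied, not the paper's method; the 20% margin and the 4096-token cap are arbitrary example values.

```python
def padded_length(predicted_tokens, margin_pct=20, max_tokens=4096):
    """Inflate a predicted generation length by margin_pct percent.

    Uses integer arithmetic to round up, then caps at max_tokens.
    """
    padded = (predicted_tokens * (100 + margin_pct) + 99) // 100
    return min(max_tokens, padded)


print(padded_length(100))    # 20% margin on a 100-token prediction
print(padded_length(4000))   # capped at the maximum length
```

A larger margin lowers the risk of SLO violations but makes the controller pick higher frequencies, which reduces the energy savings.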

Core Entities

Models

LLaMa2-13B
LLaMa3-8B
LLaMa3-70B
Mixtral 8x7B

Metrics

IPS (iterations per second)
TBT (time-between-tokens)
E2E latency (p99)
TPJ (tokens per Joule)
R², MAE, MAPE (prediction metrics)
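As a worked example of the TPJ metric, energy in Joules is average power in watts multiplied by duration in seconds, and TPJ is tokens divided by that energy. The power and token figures below are invented for illustration and do not come from the paper's measurements.

```python
def tokens_per_joule(tokens, avg_power_watts, duration_s):
    """Tokens generated per Joule of GPU energy (energy = power * time)."""
    energy_joules = avg_power_watts * duration_s
    return tokens / energy_joules


# Hypothetical one-minute window serving the same 12,000 tokens.
baseline_tpj = tokens_per_joule(12_000, avg_power_watts=300.0, duration_s=60.0)
throttled_tpj = tokens_per_joule(12_000, avg_power_watts=200.0, duration_s=60.0)

improvement = throttled_tpj / baseline_tpj
print(f"baseline={baseline_tpj:.3f} throttled={throttled_tpj:.3f} "
      f"improvement={improvement:.2f}x")
```

In this made-up case, cutting average power by a third at equal throughput yields a 1.5× TPJ improvement.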

Datasets

Azure LLM inference invocation trace (used for workload shape)

Benchmarks

MLPerf (used to define E2E SLOs)

Context Entities

Datasets

Synthetic queries matched to Azure trace (privacy-preserving)