throttLL'eM: cut LLM inference energy by throttling GPU frequency and right-sizing instances while keeping latency SLOs

Overview

Decision Snapshot: Ready For Pilot

The system combines known building blocks (DVFS, autoscaling, GBDT predictors) in a new, LLM-aware control loop; results are consistent across GPUs and models but depend on profiling and length-prediction quality.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 50%

Authors

Andreas Kosmas Kakolyris, Dimosthenis Masouros, Petros Vavaroutsos, Sotirios Xydis, Dimitrios Soudris

Why It Matters For Business

throttLL'eM can lower GPU energy costs by tens of percent while keeping latency SLOs, reducing operating cost and carbon footprint for LLM inference services.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

throttLL'eM is a serving system that predicts future KV-cache use and batch sizes, then picks the lowest GPU frequency and engine size that still meets latency SLOs. On A100/A10G tests with LLaMa-family models and an Azure invocation trace, it cuts energy up to 43.8% and boosts tokens-per-Joule by ~1.7× while keeping p99 and token-TBT SLOs in most cases. The system relies on a fast Gradient-Boosted tree predictor (≈3 ms inference) and cheap projections (≈2 ms) but needs profiling and accurate length prediction to avoid SLO risk.

Problem Statement

LLM inference draws a large amount of GPU power. Static power caps and coarse autoscaling either break latency SLOs or leave energy savings on the table. What is needed is a runtime method that dynamically reduces GPU energy (through frequency scaling and inference-engine sizing) while still meeting end-to-end latency and token-generation SLOs.

Main Contribution

throttLL'eM: an SLO-aware framework that combines generation-length prediction, KV-cache and batch-size projection, an ML performance predictor, GPU frequency selection, and autoscaling.

A lightweight ML-based iteration-level performance model (Gradient-Boosted trees) that predicts iterations-per-second (IPS) with R2 ≥ 0.97 and <1 IPS MAE.

Key Findings

Performance model predicts iteration throughput accurately.

Numbers: R2 ≥ 0.97; MAE < 1 IPS (Table 3)

Practical Use: Use a small tree model to forecast per-iteration IPS in ~3 ms and guide safe frequency reductions.

Evidence Ref: Sec. 5.2, Table 3
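For readers who want to check a predictor of their own against the two metrics quoted above, the following is a minimal sketch of how R2 and MAE are conventionally computed. The sample values are illustrative placeholders, not measurements from the paper, and the function names are our own.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(actual, predicted):
    """Average absolute gap between measured and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Illustrative iterations-per-second samples (assumed, not from the paper).
measured_ips = [20.0, 25.0, 30.0, 35.0]
predicted_ips = [21.0, 24.0, 30.5, 34.5]

print("R2 :", r2_score(measured_ips, predicted_ips))
print("MAE:", mean_absolute_error(measured_ips, predicted_ips))
```

An MAE expressed in IPS is easier to act on than R2 alone, because it says directly how far off a frequency decision's throughput estimate might be.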

KV-cache usage strongly correlates with performance.

Numbers: Pearson correlation 0.92 (KV vs TBT), -0.92 (KV vs throughput)

Practical Use: Project KV cache blocks to predict slowdown and avoid overcommitting memory that would spike latency.

Evidence Ref: Sec. 3.2, Fig. 3d
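To test whether the same relationship holds on your own engine, you can compute the Pearson coefficient from profiling samples. This is a minimal sketch using only the standard library; the sample arrays are hypothetical and are not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical profiling samples: allocated KV blocks, time between tokens,
# and iteration throughput observed at the same moments.
kv_blocks = [100, 200, 300, 400, 500]
tbt_ms = [20.0, 24.0, 27.0, 33.0, 36.0]
throughput_ips = [50.0, 41.0, 37.0, 30.0, 28.0]

print("KV vs TBT       :", round(pearson(kv_blocks, tbt_ms), 3))
print("KV vs throughput:", round(pearson(kv_blocks, throughput_ips), 3))
```

A strong positive value for TBT and a strong negative value for throughput would suggest KV-cache usage is a useful input feature for your performance model as well.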

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Performance model accuracy (R2)	≥0.97 (90/10 split), ≥0.96 (10/90 split)	—	—	Table 3, evaluated LLM engines	R2 scores in Table 3	Sec. 5.2, Table 3
Performance model error (MAE)	<1 IPS (MAE 0.69–0.99 iters/s)	—	—	Table 3	MAE per engine in Table 3	Sec.5.2 Table 3

What To Try In 7 Days

Profile your engine at a few GPU frequencies to map IPS against KV-cache usage and batch size; the predictive model is trained on this data.

Add a lightweight per-iteration IPS predictor (GBDT/XGBoost) and measure prediction latency (~3 ms).

Start with non-critical traffic and apply frequency scaling in a conservative band near the identified efficiency sweet spot (e.g., ~1050 MHz) to measure energy vs latency trade-offs.
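The profiling step above can be sketched as a small frequency sweep. This example assumes an NVIDIA GPU whose clocks can be locked with `nvidia-smi --lock-gpu-clocks` (this usually requires administrator rights and a supported GPU). The `run_benchmark` callable is a hypothetical placeholder for whatever measures IPS on your engine, and the sweep runs in dry-run mode by default so it only prints the commands.

```python
import subprocess


def lock_clock_command(gpu_index, mhz):
    """Command that pins the GPU core clock to a single frequency."""
    return ["nvidia-smi", "-i", str(gpu_index), f"--lock-gpu-clocks={mhz},{mhz}"]


def reset_clock_command(gpu_index):
    """Command that restores the default clock behaviour."""
    return ["nvidia-smi", "-i", str(gpu_index), "--reset-gpu-clocks"]


def _execute(cmd, dry_run):
    if dry_run:
        print("would run:", " ".join(cmd))
    else:
        subprocess.run(cmd, check=True)


def sweep(gpu_index, frequencies_mhz, run_benchmark, dry_run=True):
    """Run the benchmark once per frequency and return {mhz: result}."""
    results = {}
    try:
        for mhz in frequencies_mhz:
            _execute(lock_clock_command(gpu_index, mhz), dry_run)
            results[mhz] = run_benchmark()
    finally:
        # Always undo the clock lock, even if the benchmark raises.
        _execute(reset_clock_command(gpu_index), dry_run)
    return results


# Hypothetical benchmark that would return measured iterations per second.
print(sweep(0, [900, 1050, 1200], run_benchmark=lambda: 0.0))
```

Recording KV-cache usage and batch size alongside each measurement gives the rows a tree-based predictor could later be trained on.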

Agent Features

Memory

KV cache projection (tracks future allocated KV blocks)
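One way to picture this projection is the sketch below, which is our own simplification and not the authors' implementation. It assumes a paged KV cache with a fixed block size (16 tokens here, an assumed value), that every running request produces one token per iteration, and that a request frees its blocks once its predicted remaining length is exhausted.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    tokens_so_far: int        # prompt tokens plus tokens generated so far
    predicted_remaining: int  # output of a generation-length predictor


def project(requests, horizon, block_size=16):
    """Return (kv_blocks, batch_size) lists for the next `horizon` iterations."""
    kv_blocks, batch_sizes = [], []
    for step in range(1, horizon + 1):
        blocks = 0
        active = 0
        for r in requests:
            if step <= r.predicted_remaining:  # request still generating
                active += 1
                blocks += math.ceil((r.tokens_so_far + step) / block_size)
        kv_blocks.append(blocks)
        batch_sizes.append(active)
    return kv_blocks, batch_sizes


# Two hypothetical in-flight requests.
running = [Request(tokens_so_far=30, predicted_remaining=3),
           Request(tokens_so_far=10, predicted_remaining=1)]
print(project(running, horizon=4))  # ([3, 2, 3, 0], [2, 1, 1, 0])
```

The two projected series are the kind of inputs a per-iteration performance model would consume when estimating future throughput.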

Tool Use

GPU DVFS (frequency scaling)

Autoscaling (tensor-parallel engine sizing)

Admission control / scheduling

Optimization Features

Token Efficiency

Batch-size increase due to lower frequencies (avg +24.5% at 0% error)

Reduced iterations (avg −24.3% at 0% error)

Infra Optimization

Prefer tensor parallel (TP) intra-node configurations for efficiency

System Optimization

Shadow instancing to hide provisioning latency

Binary search over frequencies per admission
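The binary search over frequencies can be sketched as follows. This is a minimal illustration under one stated assumption: if a frequency meets the SLO, every higher frequency does too. The `meets_slo` callable stands in for the performance predictor plus SLO check, and the frequency grid and toy latency model are hypothetical.

```python
def lowest_safe_frequency(frequencies_mhz, meets_slo):
    """Lowest frequency (from an ascending list) for which meets_slo is True.

    Assumes monotonicity: once a frequency meets the SLO, all higher ones do.
    Returns None when even the highest frequency violates the SLO.
    """
    lo, hi = 0, len(frequencies_mhz) - 1
    if not meets_slo(frequencies_mhz[hi]):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if meets_slo(frequencies_mhz[mid]):
            hi = mid        # mid is safe; try lower frequencies
        else:
            lo = mid + 1    # mid is too slow; search higher frequencies
    return frequencies_mhz[lo]


# Hypothetical grid and toy latency model: time between tokens shrinks as 1/f.
grid = list(range(210, 1411, 15))


def predicted_tbt_ms(mhz):
    return 60000.0 / mhz


print(lowest_safe_frequency(grid, lambda f: predicted_tbt_ms(f) <= 50.0))  # 1200
```

With a grid of about 80 frequencies the search needs roughly seven predictor calls, which is why a predictor with millisecond-level latency keeps the per-admission cost small.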

Inference Optimization

GPU frequency scaling (DVFS)

Autoscaling engine size (tensor parallelism)

Iteration-level performance prediction

KV-cache and batch-size projection

Admission control to meet TBT and E2E SLOs

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/WilliamBlaskowicz/throttLL-eM

Risks & Boundaries

Limitations

Requires per-model profiling (can take up to a day for large models).

Relies on generation-length predictor; high errors force conservative frequency choices and reduce savings.

When Not To Use

When the generation-length predictor is unavailable or highly inaccurate for your workload.

When startup or provisioning latency is very high and cannot be masked by shadow instancing.

Failure Modes

Under-prediction of generation length → SLO violations unless a conservative margin is used.

Excessive queuing from admission checks → increased Time-To-First-Token (TTFT) and user-perceived latency.
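A conservative margin against under-prediction can be as simple as padding the predicted length before it is used for planning. The sketch below is one possible form, not the paper's method; the 20% default and the `max_new_tokens` cap are assumptions chosen for illustration.

```python
def padded_length(predicted_tokens, margin_pct=20, max_new_tokens=None):
    """Inflate a predicted generation length by margin_pct percent, rounding up.

    Integer arithmetic avoids floating-point rounding surprises.
    """
    padded = (predicted_tokens * (100 + margin_pct) + 99) // 100
    if max_new_tokens is not None:
        # Never plan for more tokens than the request is allowed to generate.
        padded = min(padded, max_new_tokens)
    return padded


print(padded_length(100))                      # 120
print(padded_length(33))                       # 40
print(padded_length(500, max_new_tokens=512))  # 512
```

A larger margin lowers the risk of SLO violations but pushes the controller toward higher frequencies, so it directly trades away energy savings.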

Core Entities

Models

LLaMa2-13B, LLaMa3-8B, LLaMa3-70B, Mixtral 8x7B

Metrics

IPS (iterations per second)

TBT (time-between-tokens)

E2E latency (p99)

TPJ (tokens per Joule)

R2, MAE, MAPE (prediction metrics)
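As a worked example of the TPJ metric, the sketch below derives tokens per Joule from average power and run time (energy in Joules is power in Watts times time in seconds). The two runs are made-up numbers for illustration, not results from the paper.

```python
def energy_joules(avg_power_w, duration_s):
    """Energy consumed over a run: average power times elapsed time."""
    return avg_power_w * duration_s


def tokens_per_joule(tokens, avg_power_w, duration_s):
    """Generated tokens divided by the energy spent generating them."""
    return tokens / energy_joules(avg_power_w, duration_s)


# Hypothetical runs serving the same 10,000 tokens in the same 100 s window.
baseline_tpj = tokens_per_joule(10_000, avg_power_w=300, duration_s=100)
throttled_tpj = tokens_per_joule(10_000, avg_power_w=200, duration_s=100)

energy_saving = 1 - energy_joules(200, 100) / energy_joules(300, 100)
print(f"TPJ gain: {throttled_tpj / baseline_tpj:.2f}x")
print(f"Energy saving: {energy_saving:.1%}")
```

When the token count is identical in both runs, the TPJ gain and the energy saving describe the same change; they diverge when throttling also changes how many tokens are served.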

Datasets

Azure LLM inference invocation trace (used for workload shape)

Benchmarks

MLPerf (used to define E2E SLOs)

Context Entities

Datasets

Synthetic queries matched to Azure trace (privacy-preserving)

