Overview
The system combines known building blocks (DVFS, autoscaling, GBDT predictors) in a new, LLM-aware control loop; results are consistent across GPUs and models but depend on profiling and length-prediction quality.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.86
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/6
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 50%
Why It Matters For Business
throttLL'eM can lower GPU energy costs by tens of percent while keeping latency SLOs, reducing operating cost and carbon footprint for LLM inference services.
Who Should Care
Summary TLDR
throttLL'eM is a serving system that predicts future KV-cache use and batch sizes, then picks the lowest GPU frequency and engine size that still meet latency SLOs. On A100/A10G tests with LLaMa-family models and an Azure invocation trace, it cuts energy by up to 43.8% and boosts tokens-per-Joule by ~1.7× while keeping p99 latency and time-between-tokens (TBT) SLOs in most cases. The system relies on a fast gradient-boosted tree predictor (≈3 ms inference) and cheap projections (≈2 ms), but it needs profiling and accurate length prediction to avoid SLO risk.
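The frequency choice described above can be sketched as a scan from the lowest clock upward. This is a minimal illustration, not the authors' implementation: `toy_predict_ips`, the frequency list, and the SLO values are hypothetical, and the sketch assumes each decoding iteration emits one token per in-flight request, so a TBT budget of T ms needs at least 1000/T iterations per second.

```python
from typing import Callable, Optional, Sequence


def required_ips(tbt_slo_ms: float) -> float:
    """Iterations per second needed so that one iteration fits in the TBT budget."""
    return 1000.0 / tbt_slo_ms


def lowest_safe_frequency(
    freqs_mhz: Sequence[int],
    predict_ips: Callable[[int, float, int], float],
    kv_usage: float,
    batch_size: int,
    tbt_slo_ms: float,
) -> Optional[int]:
    """Return the lowest frequency whose predicted IPS meets the SLO, else None."""
    target = required_ips(tbt_slo_ms)
    for freq in sorted(freqs_mhz):
        if predict_ips(freq, kv_usage, batch_size) >= target:
            return freq
    return None  # no candidate is safe; a caller might scale out instead


def toy_predict_ips(freq_mhz: int, kv_usage: float, batch_size: int) -> float:
    # Hypothetical stand-in for a trained performance model: IPS rises with
    # clock speed and falls with KV-cache usage (0..1) and batch size.
    return 0.02 * freq_mhz * (1.0 - 0.5 * kv_usage) / (1.0 + 0.01 * batch_size)


freqs = [210, 630, 1050, 1410]  # illustrative clock steps in MHz
choice = lowest_safe_frequency(freqs, toy_predict_ips, kv_usage=0.4,
                               batch_size=16, tbt_slo_ms=100.0)
print(choice)  # 1050 with these toy numbers
```

Returning `None` rather than the maximum frequency keeps the "no safe setting" case explicit, which is where an autoscaling decision would take over.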
Problem Statement
LLM inference uses a lot of GPU power. Static power caps or coarse autoscaling either break latency SLOs or leave energy savings on the table. What is needed is a runtime method that dynamically reduces GPU energy (via GPU frequency scaling and inference-engine sizing) while still meeting latency and token-generation SLOs.
Main Contribution
throttLL'eM: an SLO-aware framework that combines generation-length prediction, KV-cache and batch-size projection, an ML performance predictor, GPU frequency selection, and autoscaling.
A lightweight ML-based iteration-level performance model (Gradient-Boosted trees) that predicts iterations-per-second (IPS) with R² ≥ 0.97 and <1 IPS MAE.
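The KV-cache and batch-size projection named above can be illustrated with a small sketch. It assumes each active request adds one token to the KV cache per iteration and releases its cache once its predicted generation length is reached; the `Request` type and `project` function are hypothetical names, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    prompt_tokens: int               # tokens already in the KV cache from the prompt
    generated_tokens: int            # tokens generated so far
    predicted_total_generation: int  # output of a length predictor (assumed given)

    def remaining(self) -> int:
        return max(0, self.predicted_total_generation - self.generated_tokens)


def project(requests: List[Request], horizon: int) -> List[Tuple[int, int]]:
    """Return (batch_size, kv_tokens) for each of the next `horizon` iterations."""
    projection = []
    for step in range(1, horizon + 1):
        batch = 0
        kv_tokens = 0
        for r in requests:
            if r.remaining() >= step:  # request is still generating at this step
                batch += 1
                kv_tokens += r.prompt_tokens + r.generated_tokens + step
        projection.append((batch, kv_tokens))
    return projection


in_flight = [Request(100, 10, 12), Request(50, 0, 5)]
print(project(in_flight, horizon=3))  # [(2, 162), (2, 164), (1, 53)]
```

The projected pairs are the kind of features a performance model would consume to estimate IPS at each future iteration.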
Key Findings
Performance model predicts iteration throughput accurately.
KV-cache usage strongly correlates with performance.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Performance model fit (R²) | ≥0.97 (90/10 split), ≥0.96 (10/90 split) | — | — | Table 3, evaluated LLM engines | R² scores in Table 3 | Sec.5.2 Table 3 |
| Performance model error (MAE) | <1 IPS (MAE 0.69–0.99 iters/s) | — | — | Table 3 | MAE per engine in Table 3 | Sec.5.2 Table 3 |
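For readers who want to reproduce the two metrics in the table on their own profiling data, the sketch below computes R² and MAE from measured and predicted IPS values. The sample numbers are made up for illustration and are not taken from the paper.

```python
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error, in the same unit as the inputs (here iterations/s)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


measured_ips = [10.0, 20.0, 30.0, 40.0]   # illustrative measurements
predicted_ips = [11.0, 19.0, 31.0, 39.0]  # illustrative model outputs

print(mae(measured_ips, predicted_ips))  # 1.0
print(r2(measured_ips, predicted_ips))
```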
What To Try In 7 Days
Profile your engine at a few frequencies to map IPS vs KV and batch — needed for the predictive model.
Add a lightweight per-iteration IPS predictor (GBDT/XGBoost) and measure prediction latency (~3 ms).
Start with non-critical traffic and apply frequency scaling in a conservative band near the identified efficiency sweet spot (e.g., ~1050 MHz) to measure the energy vs latency trade-off.
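To compare runs from the steps above, energy and tokens-per-Joule can be computed from sampled power readings. This is a rough sketch under stated assumptions: power samples in watts with timestamps in seconds come from whatever monitoring is available, energy is approximated with the trapezoidal rule, and all numbers below are invented for illustration.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate power over time with the trapezoidal rule."""
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def tokens_per_joule(tokens: int, timestamps_s: Sequence[float],
                     power_w: Sequence[float]) -> float:
    return tokens / energy_joules(timestamps_s, power_w)


# Hypothetical runs serving the same 2400 tokens:
# the throttled run draws less power but takes one second longer.
base_t, base_p = [0, 1, 2, 3, 4], [300.0] * 5
slow_t, slow_p = [0, 1, 2, 3, 4, 5], [200.0] * 6

base_tpj = tokens_per_joule(2400, base_t, base_p)
slow_tpj = tokens_per_joule(2400, slow_t, slow_p)
saving = 1.0 - energy_joules(slow_t, slow_p) / energy_joules(base_t, base_p)
print(base_tpj, slow_tpj, round(saving, 3))
```

Tracking the extra wall-clock time alongside the energy saving makes the latency side of the trade-off visible in the same measurement.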
Agent Features
Memory
Tool Use
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Requires per-model profiling (can take up to a day for large models).
Relies on generation-length predictor; high errors force conservative frequency choices and reduce savings.
When Not To Use
When the generation-length predictor is unavailable or highly inaccurate for your workload.
If startup/provisioning latency is extremely high and autoscaling cannot be masked (no shadowing).
Failure Modes
Under-prediction of generation length → SLO violations unless conservative margin used.
Excessive queuing from admission checks → increased Time-To-First-Token (TTFT) and user-perceived latency.
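One way to pick the conservative margin mentioned in the first failure mode is to derive it from past prediction errors. The sketch below is an assumption for illustration, not the paper's method: it takes a quantile of the observed actual/predicted length ratios and uses it to pad new predictions. The function names, the 90% coverage target, and the sample data are all hypothetical.

```python
import math
from typing import Sequence


def safety_factor(predicted: Sequence[int], actual: Sequence[int],
                  coverage: float = 0.9) -> float:
    """Multiplier such that roughly `coverage` of past requests would not
    have exceeded their padded prediction. Never returns less than 1.0."""
    ratios = sorted(a / p for p, a in zip(predicted, actual) if p > 0)
    idx = min(len(ratios) - 1, math.ceil(coverage * len(ratios)) - 1)
    return max(1.0, ratios[idx])


def padded_length(predicted_len: int, factor: float) -> int:
    """Inflate a predicted generation length by the safety factor."""
    return math.ceil(predicted_len * factor)


# Illustrative history: ten requests, each predicted at 100 tokens.
past_predicted = [100] * 10
past_actual = [80, 90, 95, 100, 100, 105, 110, 120, 130, 150]

factor = safety_factor(past_predicted, past_actual, coverage=0.9)
print(factor, padded_length(200, factor))
```

A larger coverage target lowers the chance of under-prediction but, as noted under Limitations, pushes the controller toward higher frequencies and smaller savings.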

