Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
throttLL'eM can cut GPU energy use by up to 43.8% in the reported tests while keeping latency SLOs in most cases, which lowers operating cost and carbon footprint for LLM inference services.
Summary TLDR
throttLL'eM is a serving system that predicts future KV-cache use and batch sizes, then picks the lowest GPU frequency and engine size that still meets latency SLOs. On A100/A10G tests with LLaMa-family models and an Azure invocation trace, it cuts energy up to 43.8% and boosts tokens-per-Joule by ~1.7× while keeping p99 and token-TBT SLOs in most cases. The system relies on a fast Gradient-Boosted tree predictor (≈3 ms inference) and cheap projections (≈2 ms) but needs profiling and accurate length prediction to avoid SLO risk.
Problem Statement
LLM inference draws a lot of GPU power. Static power caps and coarse autoscaling either break latency SLOs or leave energy savings on the table. What is needed is a runtime method that dynamically reduces GPU energy (through frequency scaling and inference-engine sizing) while still meeting end-to-end latency and token-generation SLOs.
Main Contribution
throttLL'eM: an SLO-aware framework that combines generation-length prediction, KV-cache and batch-size projection, an ML performance predictor, GPU frequency selection, and autoscaling.
A lightweight ML-based iteration-level performance model (Gradient-Boosted trees) that predicts iterations per second (IPS) with R² ≥ 0.97 and an MAE below 1 IPS.
Empirical study showing KV-cache and batch dynamics strongly affect throughput, and that frequency/batch trade-offs produce an energy-efficiency sweet spot.
Key Findings
Performance model predicts iteration throughput accurately.
KV-cache usage strongly correlates with performance.
KV and batch projection are precise in practice.
Energy drops substantially with combined throttling + autoscaling.
Energy-efficiency (tokens per Joule) improves over baseline.
Results
Accuracy
Performance model error (MAE)
Projection errors
Energy consumption reduction
Energy-efficiency (TPJ)
Scheduler + throttling runtime
Who Should Care
What To Try In 7 Days
Profile your engine at a few frequencies to map IPS against KV-cache usage and batch size. The predictive model needs this data.
Add a lightweight per-iteration IPS predictor (GBDT/XGBoost) and measure its prediction latency (the paper reports ≈3 ms).
Start with non-critical traffic and apply frequency scaling in a conservative band near the identified efficiency sweet spot (e.g., ~1050 MHz) to measure energy vs. latency trade-offs.
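The steps above can be roughed out in a few lines of Python. This is a minimal sketch, not the paper's tooling: `profiling_grid`, `conservative_band`, and `mean_latency_ms` are hypothetical helper names, the frequency, batch, and KV values are illustrative, and `dummy_predictor` only stands in for a trained GBDT model.

```python
import time
from itertools import product


def profiling_grid(frequencies_mhz, batch_sizes, kv_block_counts):
    """Enumerate every (frequency, batch size, KV blocks) point to measure."""
    return list(product(frequencies_mhz, batch_sizes, kv_block_counts))


def conservative_band(frequencies_mhz, sweet_spot_mhz, width_mhz):
    """Keep only the frequencies within +/- width_mhz of the sweet spot."""
    return [f for f in frequencies_mhz if abs(f - sweet_spot_mhz) <= width_mhz]


def mean_latency_ms(predict, sample, repeats=1000):
    """Average wall-clock latency of predict(sample), in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        predict(sample)
    return (time.perf_counter() - start) * 1000.0 / repeats


# Illustrative values only; use the clocks your GPU actually supports.
frequencies = [900, 1050, 1200, 1410]
grid = profiling_grid(frequencies, batch_sizes=[1, 8, 32], kv_block_counts=[64, 512])
band = conservative_band(frequencies, sweet_spot_mhz=1050, width_mhz=150)


def dummy_predictor(point):
    """Placeholder for a trained IPS model; the formula is made up."""
    freq_mhz, batch, kv_blocks = point
    return freq_mhz / (40.0 + batch + kv_blocks / 64.0)


latency_ms = mean_latency_ms(dummy_predictor, grid[0])
print(len(grid), band, f"{latency_ms:.4f} ms")
```

Swapping `dummy_predictor` for a real model's predict call gives a quick check on whether the per-iteration prediction cost stays in the low-millisecond range the summary reports.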
Agent Features
Memory
- KV cache projection (tracks future allocated KV blocks)
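One way to picture KV-cache projection is a small simulation over future iterations. The sketch below is an assumption-laden illustration rather than the paper's implementation: `project_kv_and_batch` is a hypothetical name, each request is reduced to a pair of (tokens held so far, predicted remaining tokens), every active request gains one token per iteration, and the block size of 16 tokens is an assumed default.

```python
import math


def project_kv_and_batch(requests, horizon, block_size=16):
    """Project total KV blocks and batch size for the next `horizon` iterations.

    requests: list of (tokens_so_far, predicted_remaining_tokens) pairs.
    Assumes each active request produces exactly one token per iteration
    and frees its blocks as soon as it finishes.
    """
    kv_blocks, batch_sizes = [], []
    for step in range(horizon):
        total_blocks = 0
        active = 0
        for tokens_so_far, remaining in requests:
            if step < remaining:  # request is still generating at this step
                tokens_held = tokens_so_far + step + 1
                total_blocks += math.ceil(tokens_held / block_size)
                active += 1
        kv_blocks.append(total_blocks)
        batch_sizes.append(active)
    return kv_blocks, batch_sizes


# Two in-flight requests: one nearly done, one with five tokens to go.
kv, batch = project_kv_and_batch([(30, 2), (16, 5)], horizon=4)
print(kv, batch)
```

The projected series can then be fed to the performance predictor to estimate throughput at each future iteration, which is why errors in the predicted remaining length propagate directly into the projection.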
Tool Use
- GPU DVFS (frequency scaling)
- Autoscaling (tensor-parallel engine sizing)
- Admission control / scheduling
Optimization Features
Token Efficiency
- Batch-size increase due to lower frequencies (avg +24.5% at 0% error)
- Reduced iterations (avg −24.3% at 0% error)
Infra Optimization
- Prefer tensor parallel (TP) intra-node configurations for efficiency
System Optimization
- Shadow instancing to hide provisioning latency
- Binary search over frequencies per admission
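The binary search over frequencies can be sketched as follows. This is a minimal illustration under stated assumptions: `lowest_feasible_frequency` and `toy_ips_model` are hypothetical names, predicted IPS is assumed to be non-decreasing in frequency, and the TBT SLO is converted to a required IPS on the assumption that each iteration emits one token per request (so TBT ≈ 1 / IPS).

```python
def lowest_feasible_frequency(frequencies_mhz, predicted_ips, required_ips):
    """Return the lowest frequency whose predicted IPS meets the requirement.

    frequencies_mhz must be sorted ascending, and predicted_ips(freq) is
    assumed to be non-decreasing in freq. Returns None if even the highest
    frequency falls short.
    """
    lo, hi = 0, len(frequencies_mhz) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if predicted_ips(frequencies_mhz[mid]) >= required_ips:
            best = frequencies_mhz[mid]  # feasible; try a lower frequency
            hi = mid - 1
        else:
            lo = mid + 1  # too slow; search the higher frequencies
    return best


def toy_ips_model(freq_mhz):
    """Made-up stand-in for the trained performance predictor."""
    return freq_mhz / 50.0


frequencies = [210, 510, 810, 1050, 1230, 1410]  # illustrative clock steps
required_ips = 20  # e.g. a 50 ms TBT SLO under the TBT = 1 / IPS assumption
print(lowest_feasible_frequency(frequencies, toy_ips_model, required_ips))
```

Binary search needs only about log2(N) predictor calls per admission decision, which keeps the cost small when each call takes a few milliseconds. A `None` result is the signal that throttling alone cannot meet the SLO at the current engine size.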
Inference Optimization
- GPU frequency scaling (DVFS)
- Autoscaling engine size (tensor parallelism)
- Iteration-level performance prediction
- KV-cache and batch-size projection
- Admission control to meet TBT and E2E SLOs
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires per-model profiling (can take up to a day for large models).
- Relies on generation-length predictor; high errors force conservative frequency choices and reduce savings.
- Tensor-parallel approach favors intra-node scaling and may not generalize to multi-node setups without extra orchestration.
When Not To Use
- When the generation-length predictor is unavailable or highly inaccurate for your workload.
- If startup/provisioning latency is extremely high and autoscaling cannot be masked (no shadowing).
- For engines constrained to very small KV capacity where throttling cannot meet SLOs (e.g., limited batch/KV blocks).
Failure Modes
- Under-prediction of generation length → SLO violations unless a conservative margin is used.
- Excessive queuing from admission checks → increased Time-To-First-Token (TTFT) and user-perceived latency.
- Traffic spikes that outpace autoscaling and grace periods → transient p99 violations.
Core Entities
Models
- LLaMa2-13B
- LLaMa3-8B
- LLaMa3-70B
- Mixtral 8x7B
Metrics
- IPS (iterations per second)
- TBT (time-between-tokens)
- E2E latency (p99)
- TPJ (tokens per Joule)
- R², MAE, MAPE (prediction metrics)
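For readers who want to compute these metrics on their own measurements, the standard definitions can be written out directly. The helper names below are hypothetical and the sample numbers are made up for illustration; energy is taken as average power times duration, which is an approximation when power varies.

```python
def energy_joules(avg_power_watts, seconds):
    """Energy as average power times duration (1 W for 1 s = 1 J)."""
    return avg_power_watts * seconds


def tokens_per_joule(tokens, joules):
    """TPJ: generated tokens divided by energy consumed."""
    return tokens / joules


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be nonzero)."""
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination (y_true must not be constant)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up measured vs. predicted IPS values.
measured = [10.0, 20.0, 30.0]
predicted = [11.0, 19.0, 30.0]
print(mae(measured, predicted), mape(measured, predicted), r2(measured, predicted))
print(tokens_per_joule(1000, energy_joules(250.0, 2.0)))
```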
Datasets
- Azure LLM inference invocation trace (used for workload shape)
Benchmarks
- MLPerf (used to define E2E SLOs)
Context Entities
Datasets
- Synthetic queries matched to Azure trace (privacy-preserving)

