65 papers found

Monarch Mixer: replace attention and MLPs with sub-quadratic GEMM-friendly layers to speed long-context models

0.50
0.70
0.70
14

If you run models with long contexts or want lower parameter cost, M2 can cut compute or model size and improve throughput on many GPUs while keeping accuracy; expect implementation and kernel work before production parity on all hardware.

Key finding

M2-BERT matches BERT-base GLUE while cutting parameters.

Numbers: GLUE 79.9 vs 79.6; −27% params (M2 80M vs BERT 110M)

BurstGPT: 10.3M real-world LLM traces (Azure) to test and tune serving systems

0.70
0.40
0.60
7

Using real LLM traces uncovers burst-driven failures and KV-cache pressure that synthetic tests miss, letting teams fix reliability and reduce wasted GPU costs before production.

Key finding

Real LLM traces are highly bursty and differ from common cloud/function workloads.

Numbers: Mean RPS: MAF 1.64 vs LLM conv 0.019, LLM API 0.21 (ChatGPT)

Serve thousands of LoRA adapters from one machine by paging adapters and batching LoRA compute

0.80
0.60
0.80
7

If you sell many small fine-tuned models (per-user or per-task), S-LoRA lets one machine host thousands of adapters, cutting GPU costs and raising throughput compared to naive merging or swapping.

Key finding

S-LoRA scales to thousands of adapters on one machine.

Numbers: Served 2,000 adapters on a single A100 (80GB) in experiments
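The idea of batching LoRA compute can be sketched in a few lines: the base weight product is shared across the whole batch, and each request adds only its own low-rank correction. This is a toy illustration in plain Python, not S-LoRA's implementation (which relies on custom GPU kernels and adapter paging); the function names, adapter IDs, and matrices are all hypothetical.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, adequate for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_batch_forward(
    xs: Matrix,
    adapter_ids: List[str],
    base_w: Matrix,
    adapters: Dict[str, Tuple[Matrix, Matrix]],
) -> Matrix:
    """Compute x @ W + x @ A_i @ B_i for each request i in the batch.

    The base product x @ W is done once for the whole batch; only the
    small low-rank term differs per request, so adapters never need to
    be merged into (or swapped out of) the base weights.
    """
    base_out = matmul(xs, base_w)  # shared by every request
    outputs = []
    for x, adapter_id, base_row in zip(xs, adapter_ids, base_out):
        a, b = adapters[adapter_id]            # (d x r) and (r x d) factors
        delta = matmul(matmul([x], a), b)[0]   # low-rank correction for this request
        outputs.append([p + q for p, q in zip(base_row, delta)])
    return outputs

# Hypothetical toy data: identity base weights and two rank-1 adapters.
base_w = [[1, 0], [0, 1]]
adapters = {
    "u1": ([[1], [0]], [[0, 1]]),
    "u2": ([[0], [1]], [[2, 0]]),
}
result = lora_batch_forward([[1, 2], [3, 4]], ["u1", "u2"], base_w, adapters)
print(result)
```

Because the per-request term has rank r much smaller than the hidden size, its cost stays small even when every request in the batch uses a different adapter.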

W4A8KV4 (4-bit weight, 8-bit activation, 4-bit KV) plus system kernels to double LLM serving throughput on common GPUs

0.80
0.60
0.80
6

QServe turns 4-bit quantization into real GPU speedups and memory savings, cutting serving cost per token (authors report ~3× dollar cost reduction by using L40S+QServe versus A100+TensorRT-LLM).

Key finding

Prior 4-bit methods incur large runtime dequantization overhead on GPUs.

Numbers: 20–90% runtime overhead reported for dequantization

Practical guide to cutting cloud and AI infra costs 28–90% using instance choices, quantization, and FinOps

0.80
0.40
0.90
4

Cloud and AI costs can be the largest operational line items; small architecture and model choices can cut spend by tens of percent (28–90% in this guide) while preserving user experience.

Key finding

GPU compute often dominates early AI budgets.

Numbers: GPU = 40–60% of technical budgets (first 2 years)

Serve large LLMs on mixed-GPU clusters with phase-aware partitioning and adaptive mixed-precision quantization

0.70
0.60
0.80
3

If you run batched LLM workloads, LLM-PQ lets you use mixed low- and high-end GPUs together to significantly raise throughput and lower cost while keeping model quality.

Key finding

LLM-PQ consistently increases token-generation throughput versus state-of-the-art baselines by selecting mixed precisions and phase-aware partitions.

Numbers: Up to 2.88× speed-up; 2.26× average speed-up (Table 4, multiple clusters).

Swap transformer blocks for small ordered 'learners' and run only as many as each token needs to cut inference cost with minimal accuracy, n

0.60
0.60
0.70
2

ACMs let you reduce inference compute and GPU latency while retaining model accuracy, enabling cheaper, faster deployment of pretrained transformers in latency- or energy-constrained settings.

Key finding

ACMized ViT-B achieves the Pareto frontier of FLOPs vs accuracy on ImageNet-1k.

Numbers: Advantage especially below 12.5 GFLOPs (Fig.3)

Reorder quantized weights to avoid shared-memory bank conflicts and speed up LLM inference up to ~1.9×

0.70
0.60
0.70
2

QUICK delivers 20–90%+ throughput improvements for batched LLM inference by eliminating shared-memory write stalls, lowering GPU cost per token and allowing larger batch inference using quantized models.

Key finding

QUICK reduces shared-memory bank conflicts that bottleneck mixed-precision GEMM during dequantization.

FlashInfer: a JIT‑compiled, block‑sparse attention engine that cuts LLM inference latency and supports custom attention variants

0.80
0.60
0.75
2

FlashInfer can cut inference latency and increase throughput in production LLM services, lowering GPU costs per query and improving user responsiveness.

Key finding

FlashInfer reduces inter‑token latency versus a Triton backend in LLM serving.

Numbers: 29–69% ITL reduction (Sec. 4, Abstract)

Fine-grained, weight-only quantization that cuts model size and boosts LLM throughput up to 3.65×

0.70
0.60
0.80
2

FineQuant reduces LLM memory and inference cost with little accuracy loss. That lowers GPU counts and raises throughput, directly cutting serving cost for large models on A100-class nodes.

Key finding

Adaptive fine-grained grouping prevents catastrophic accuracy drops from low-bit quantization.

Numbers: Recovered >94% lost BLEU by doubling granularity for four matrices (Section 3.3, Fig.3)

Flash-LLM: run sparsified LLMs on tensor cores with up to ~3–3.8× real inference speedups and lower GPU cost

0.80
0.65
0.80
1

Flash-LLM cuts inference GPU cost and increases throughput for production LLM serving by enabling practical unstructured sparsity on tensor cores.

Key finding

Flash-LLM speeds kernel SpMM 2–3.6× over Sputnik and 1.4–1.6× over SparTA depending on sparsity.

Numbers: avg 3.6×/1.4× at 70% sparsity; 3.0×/1.4× at 80%; 2.0×/1.6× at 90%

Reorder quantized weights to avoid inter-GPU communication and cut LLM inference latency up to ~1.8×

0.70
0.60
0.70
1

A low-complexity, offline reorder can cut inter-GPU communication and speed up quantized LLM inference, lowering latency and increasing throughput for multi-GPU serving without changing model weights.

Key finding

TP-Aware Dequantization speeds up MLP-layer inference in distributed LLMs.

Numbers: up to 1.81× (Llama-70B, A100) and up to 1.83× (Granite-20B, A100)

Use channel flattening to enable per-tensor INT4/INT8 math and halve compute time for large-batch LLM inference

0.70
0.60
0.80
1

FlattenQuant cuts compute time and GPU memory in large-batch/long-sequence inference by enabling INT4/INT8 TensorCore math, which can lower infrastructure cost and increase throughput when hardware supports INT4.

Key finding

FlattenQuant can convert roughly half of transformer linear layers to INT4 with small accuracy loss.

Numbers: 48.29% INT4 layers on OPT-30B (Table 4)

Automatically pick a cheapest mix of GPU types for an LLM service using profiling + an ILP bin-packing solver

0.65
0.50
0.80
1

Picking the right mix of GPU types can cut cloud GPU costs up to ~77% for conversational LLMs while keeping latency targets, lowering monthly infrastructure bills without modifying models or inference logic.

Key finding

GPU cost efficiency (tokens per dollar) varies with request size; no single GPU is best for all sizes.

Numbers: A10G up to 2.6× T/$ over A100 for small requests; A100 up to 1.5× for large requests
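The tokens-per-dollar comparison behind this finding is simple arithmetic, sketched below. The throughput and price figures are made up purely for illustration (they are not taken from the paper), and the sketch only picks the best GPU per request-size bucket; it does not attempt the ILP bin-packing step described in the title.

```python
from typing import Dict

def tokens_per_dollar(tokens_per_second: float, dollars_per_hour: float) -> float:
    """Tokens generated for each dollar of GPU rental."""
    return tokens_per_second * 3600 / dollars_per_hour

# HYPOTHETICAL profiling results: tokens/second per GPU type and request size.
throughput: Dict[str, Dict[str, float]] = {
    "small": {"A10G": 500.0, "A100": 700.0},
    "large": {"A10G": 100.0, "A100": 500.0},
}
# HYPOTHETICAL on-demand prices in dollars per hour.
price: Dict[str, float] = {"A10G": 1.0, "A100": 3.7}

def best_gpu_per_bucket(
    throughput: Dict[str, Dict[str, float]], price: Dict[str, float]
) -> Dict[str, str]:
    """For each request-size bucket, return the GPU with the highest tokens per dollar."""
    return {
        bucket: max(gpus, key=lambda g: tokens_per_dollar(gpus[g], price[g]))
        for bucket, gpus in throughput.items()
    }

best = best_gpu_per_bucket(throughput, price)
print(best)
```

With these illustrative inputs the cheaper GPU wins for small requests and the larger GPU wins for large ones, which is the qualitative pattern the entry describes.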

Aladdin schedules LLM requests and scales GPUs together to cut serving cost while meeting token-level SLOs.

0.60
0.60
0.80
1

Aladdin can cut GPU spending by tens of percent while keeping token-level SLOs, turning large inference clusters from a fixed-cost bottleneck into a more efficient, demand-driven service.

Key finding

Aladdin cuts required GPUs by up to 71% vs. default vLLM for the same SLOs in simulated high-demand workloads.

Numbers: up to 71% GPU reduction

Inf-MLLM: keep multimodal LLMs streaming on a single GPU by caching only recent + relevant tokens

0.70
0.60
0.70
1

Inf-MLLM lets you run continuous multimodal inference on a single GPU, cutting cloud costs and privacy exposure while avoiding out-of-memory failures for long videos and multi-round dialogs.

Key finding

Inf-MLLM enables stable language-model perplexity on extremely long text (up to 4 million tokens) and outperforms window/H2O/StreamingLLM on ranges tested.

Numbers: tested up to 4,000,000 tokens; better PPL than baselines up to 20K (Fig.5)

Harvest millisecond GPU idle cycles by slicing work into tokens, layers, and tiny KV checkpoints.

0.70
0.60
0.80
1

You can run batch jobs (benchmarks, analytics) on the same expensive GPUs used for live LLM inference without degrading customer-facing latency. That turns idle capacity into usable throughput and reduces waste from overprovisioning.

Key finding

ConServe reduces online tail latency while co-serving.

Numbers: P99 online latency reduced by 2.9× on average (as reported in paper)

Use the host CPU to offload decoding attention and KV cache so GPUs can batch larger and give higher online throughput

0.70
0.60
0.80
1

NEO lets you squeeze more online throughput from existing GPU servers by using the host CPU, lowering per-token serving cost where GPU memory is the bottleneck.

Key finding

NEO raises throughput significantly on memory-limited GPUs.

Numbers: up to 7.5× on T4; 26% on A10G; 14% on H100 (reported maxima)

Transformer-Lite: run 2–10× faster LLM inference on phone GPUs via symbolic shapes, FP4, and KV-cache tricks

0.70
0.50
0.60
1

On-device LLM inference can cut cloud cost and latency while improving privacy; Transformer-Lite shows practical engineering steps to boost phone GPU throughput enough to make interactive mobile LLM apps feasible.

Key finding

Transformer-Lite boosts prefill speed over MLC-LLM and FastLLM and improves decoding speed.

Numbers: prefill >10×; decoding 2–3× (reported across Gemma 2B and ChatGLM2 6B)

Block-wise Adam that lets you full-finetune 8B+ LLMs on a single 24GB GPU

0.70
0.45
0.65
1

BAdam lets teams do full-parameter finetuning of 8B+ LLMs on single 24GB GPUs, cutting infrastructure cost and widening access to higher-quality fine-tuned models.

Key finding

BAdam reduces total GPU memory needed to finetune Llama 3-8B to ~23.5GB vs ~144.8GB+ for Adam.

Numbers: 23.5GB (BAdam) vs 144.8GB+ (Adam); Table 2
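A rough back-of-the-envelope sketch shows why updating one block at a time shrinks memory so much. It assumes a common mixed-precision accounting of 18 bytes per parameter for Adam (fp16 weights, fp32 master weights, fp32 gradients, and two fp32 moments) and keeps that 16-byte fp32 state only for the active block in the block-wise case. The byte counts, the 32-block split, and the function names are assumptions for illustration; activations and framework overhead are ignored, so the results will not match Table 2 exactly.

```python
GB = 1e9  # decimal gigabytes, chosen here for simplicity

def adam_mixed_precision_gb(n_params: float) -> float:
    """Assumed accounting: 2 (fp16 weights) + 4 (fp32 master weights)
    + 4 (fp32 gradients) + 4 + 4 (Adam moments) = 18 bytes per parameter."""
    return 18 * n_params / GB

def blockwise_adam_gb(n_params: float, n_blocks: int) -> float:
    """Assumed accounting: fp16 weights for the whole model (2 bytes/param),
    plus 16 bytes/param of fp32 state for the single active block only."""
    return (2 * n_params + 16 * n_params / n_blocks) / GB

n_params = 8e9   # roughly an 8B-parameter model
n_blocks = 32    # hypothetical number of blocks
print(f"Adam:       {adam_mixed_precision_gb(n_params):.1f} GB")
print(f"Block-wise: {blockwise_adam_gb(n_params, n_blocks):.1f} GB")
```

Under these assumptions the optimizer-related state drops from well over 100 GB to a figure that fits on a 24GB card, which is consistent in magnitude with the numbers reported above.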

throttLL'eM: cut LLM inference energy by throttling GPU frequency and right-sizing instances while keeping latency SLOs

0.70
0.50
0.70
0

throttLL'eM can lower GPU energy costs by tens of percent while keeping latency SLOs, reducing operating cost and carbon footprint for LLM inference services.

Key finding

Performance model predicts iteration throughput accurately.

Numbers: R² ≥ 0.97; MAE < 1 IPS (Table 3)

Finetune 65B LLMs in 2/3/4-bit on a single consumer GPU by combining LoRA and modern quantizers

0.70
0.70
0.80
0

ModuLoRA lets teams finetune very large LLMs on commodity GPUs, cutting infrastructure cost and cycle time while preserving task performance.

Key finding

ModuLoRA runs 65B finetuning on a single 24GB GPU at 2-bit precision.

Numbers: 65B finetune in 2-bit on one RTX 3090 24GB (paper claim)
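The weight-memory arithmetic behind this claim can be checked with a short sketch. It counts only the quantized base weights and ignores quantizer metadata (scales, zero points), the LoRA adapters, activations, and optimizer state, so real usage is higher; the function name is hypothetical.

```python
def quantized_weight_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory for the quantized weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 65e9  # 65B-parameter model
for bits in (2, 3, 4, 16):
    print(f"{bits:>2}-bit: {quantized_weight_gb(n_params, bits):.2f} GB")
```

At 2 bits the weights alone come to about 16 GB, which leaves headroom on a 24GB card for the adapters and activations; at 16 bits the same weights would need about 130 GB.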

Speed up LLM serving by aggregating small models, adapting speculation length, and pipelining verification

0.70
0.65
0.70
0

Minions can materially reduce serving latency and multiply throughput for conversational LLMs without retraining large models, lowering operational cost and improving user responsiveness on evaluated workloads.

Key finding

Majority voting raises acceptance rates of SSM outputs, improving throughput.

Numbers: OPT-13B acceptance rates up to 0.87/0.89/0.78 (finance/chatbot/dialogue); Llama2-70B-chat ~0.54/0.49/0.55
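To see why acceptance rate matters for throughput, the sketch below uses the standard speculative-decoding estimate for the expected number of tokens produced per verification pass of the large model. It assumes each drafted token is accepted independently with the same probability, which is a simplification and not a description of Minions' own pipeline; the draft length of 4 is a hypothetical choice.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per large-model verification pass.

    Assumes i.i.d. acceptance with probability `acceptance_rate` for each of
    `draft_len` drafted tokens; the verifier always contributes one token,
    giving the geometric sum (1 - a^(k+1)) / (1 - a).
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if acceptance_rate == 1.0:
        return float(draft_len + 1)
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)

draft_len = 4  # hypothetical speculation length
for rate in (0.87, 0.54):  # example acceptance rates from the entry above
    tokens = expected_tokens_per_step(rate, draft_len)
    print(f"acceptance {rate:.2f}: about {tokens:.2f} tokens per verification pass")
```

Under this model a higher acceptance rate translates directly into more tokens per expensive forward pass, which is why raising acceptance (for example, through majority voting) improves throughput.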

Learn quantization grids that pay attention to loss sensitivity, enabling accurate 2–4-bit LLM compression at large scale

0.70
0.60
0.80
0

LeanQuant lets teams compress state-of-the-art LLMs to 2–4 bits with less accuracy loss and on common 48GB GPUs, cutting model memory and inference cost while remaining compatible with standard kernels.

Key finding

LeanQuant reduces task loss error and improves accuracy compared to GPTQ and other baselines in low-bit regimes.

Numbers: 3-bit Llama-3-8B avg zero-shot accuracy +18.38% vs GPTQ; +17.18% vs OmniQuant (Table 1)