52 papers found

PETALS: run and fine-tune 50B+ LLMs by pooling unreliable consumer GPUs over the Internet

0.75
0.60
0.80
13

PETALS lets teams share idle consumer GPUs to run 50B+ models interactively, cutting the need for expensive multi‑GPU servers and lowering inference latency versus RAM offloading. Privacy and trust tradeoffs need to be considered.

Key finding

Distributed approach (PETALS) gives big interactive speedups vs single‑GPU offloading.

Numbers: ≥10× faster for autoregressive generation (paper claim)

INT4 (4-bit) gives big latency wins for encoder models with little accuracy loss, but breaks decoder-only generators; optimized INT4 kernels

0.70
0.40
0.80
6

On Ampere GPUs, INT4 computation can sharply reduce latency and cost for encoder-based workloads (search, classification, embedding). But it is risky to use for autoregressive generation (chatbots, text generation) until activation-quantization problems are solved.

Key finding

Encoder models (BERT) keep accuracy under W4A4 QAT+KD.

Numbers: BERT-base MNLI 84.20 (FP32) → 84.31 (W4A4 symmetric)

Cut serverless LLM cold-starts 10–200× by local checkpoint formats, token-only live migration, and startup-aware scheduling

0.70
0.60
0.80
6

Cutting cold-start time from minutes to seconds reduces user-visible latency, lowers GPU idle costs, and increases successful request completion for large models, improving SLA and capacity efficiency.

Key finding

ServerlessLLM's checkpoint loader speeds cold loads 3.6–8.2× over existing loaders

Numbers: 3.6–8.2× faster loading (OPT-2.7B to LLaMA-2-70B)

DeepSpeed-FastGen: up to 2.3x effective throughput and much lower tail latency for LLM serving

0.80
0.45
0.65
5

FastGen can double effective throughput and sharply cut tail latency on long-prompt, streaming chat workloads, which lowers GPU cost per successful request and improves user-perceived responsiveness.

Key finding

Effective throughput improved up to 2.3× versus vLLM under chat-style SLAs.

Numbers: up to 2.3x effective throughput (Section 4.3)

EdgeTran co-designs transformer models and edge devices to cut latency, energy and peak power for mobile inference

0.60
0.60
0.70
3

Co-designing model and device cuts operational energy and peak power by an order of magnitude while keeping or slightly improving accuracy, lowering battery drain, thermal limits and cloud costs for on-device NLP.

Key finding

Final co-designed model (ET*) is 2.8× smaller than BERT-Base and improves GLUE by 0.8 percentage points

Numbers: 39.6M vs 110M params; GLUE 80.4% vs 79.6%

Practical survey linking Vision Transformer quantization methods to hardware accelerators

0.70
0.30
0.80
3

Quantizing ViTs to 8-bit often preserves accuracy while halving memory and improving throughput on INT8-capable hardware, enabling real-time and edge deployment with lower cost.

Key finding

8-bit quantization typically keeps near-original ImageNet accuracy for DeiT-Base.

Numbers: PTQ methods: 81.20–82.67 vs FP32 81.85 Top-1

MemServe: a MemPool that adds context caching to disaggregated LLM serving, cutting job times and first-token delays

0.70
0.70
0.80
2

MemServe lets operators combine caching and disaggregated inference to cut end-to-end latency and tail times for many chat and long-context workloads, lowering hardware cost per request and improving user-perceived responsiveness.

Key finding

Disaggregated inference plus MemPool reduces job completion time compared to colocated baseline.

Numbers: JCT improved up to 42% (P99) on ShareGPT

Train the controller to shorten the critical execution path so parallel agent teams run much faster without losing accuracy

0.60
0.60
0.70
2

When you run multiple LLM-based agents in parallel, overall response time depends on the slowest chain of steps (the critical path). Training the orchestration policy to minimize that path reduces latency a lot without sacrificing accuracy, which helps interactive products and time-sensitive workflows.

Key finding

LAMaS reduced critical-path length substantially compared to MaAS on three benchmarks.

Numbers: CP len reduced by 38.0% (GSM8K), 42.4% (HumanEval), 46.1% (MATH)
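
The critical-path idea in this summary can be illustrated with a minimal sketch. It assumes a hypothetical agent workflow expressed as a dependency graph with made-up step durations; the step names, durations, and the `critical_path_length` helper are illustrative assumptions, not taken from the paper. It shows why parallel latency tracks the longest dependency chain rather than the total work.

```python
from functools import lru_cache


def critical_path_length(durations, deps):
    """Return the total duration of the longest dependency chain in a DAG.

    durations: dict mapping step name -> duration in seconds (hypothetical values)
    deps: dict mapping step name -> list of steps that must finish first
    """

    @lru_cache(maxsize=None)
    def finish_time(step):
        # A step starts once its slowest prerequisite has finished.
        start = max((finish_time(d) for d in deps.get(step, ())), default=0.0)
        return start + durations[step]

    return max(finish_time(step) for step in durations)


# Hypothetical workflow: "search" and "code" run in parallel after "plan".
durations = {"plan": 2.0, "search": 5.0, "code": 4.0, "review": 3.0, "answer": 1.0}
deps = {
    "search": ["plan"],
    "code": ["plan"],
    "review": ["code"],
    "answer": ["search", "review"],
}

critical_path = critical_path_length(durations, deps)  # plan -> code -> review -> answer
total_work = sum(durations.values())
print(f"critical path: {critical_path}s, total work: {total_work}s")
```

In this toy graph, shortening "search" would not change end-to-end latency because it is off the critical path, whereas shortening "code" or "review" would.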

Save and reuse attention KV caches across turns to cut LLM serving latency and cloud cost

0.80
0.60
0.85
1

If your product uses chat or multi-turn flows, caching KV states and overlapping cache IO with GPU work can cut latency and cloud GPU costs dramatically.

Key finding

Time-to-first-token (TTFT) drops dramatically when cached KV hits occur

Numbers: TTFT reduced by up to 87% (Figure 14)
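
As a rough illustration of why a cache hit shortens time-to-first-token, the sketch below models a prefix-keyed KV cache with plain Python objects. The `PrefixKVCache` class, the token IDs, and the string standing in for KV tensors are all hypothetical; a real serving system would store GPU or host tensors and overlap cache IO with compute, which this sketch does not attempt.

```python
class PrefixKVCache:
    """Toy cache mapping token prefixes to an opaque KV-state placeholder."""

    def __init__(self):
        self._store = {}

    def put(self, tokens, kv_state):
        # Key on the exact token sequence whose KV state was computed.
        self._store[tuple(tokens)] = kv_state

    def longest_prefix(self, tokens):
        """Return (hit_length, kv_state) for the longest cached prefix of tokens."""
        for n in range(len(tokens), 0, -1):
            state = self._store.get(tuple(tokens[:n]))
            if state is not None:
                return n, state
        return 0, None


def tokens_to_prefill(cache, tokens):
    """Number of tokens that still need prefill after reusing cached KV state."""
    hit_length, _ = cache.longest_prefix(tokens)
    return len(tokens) - hit_length


cache = PrefixKVCache()
turn1 = [101, 7, 8, 9]               # hypothetical token IDs for the first turn
cache.put(turn1, "kv-for-turn-1")    # placeholder for real KV tensors

turn2 = turn1 + [42, 43]             # second turn extends the same conversation
remaining = tokens_to_prefill(cache, turn2)
print(f"prefill {remaining} of {len(turn2)} tokens")
```

The linear scan over prefix lengths keeps the example short; production systems typically use a trie or block-hash index instead.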

Aladdin schedules LLM requests and scales GPUs together to cut serving cost while meeting token-level SLOs.

0.60
0.60
0.80
1

Aladdin can cut GPU spending by tens of percent while keeping token-level SLOs, turning large inference clusters from a fixed-cost bottleneck into a more efficient, demand-driven service.

Key finding

Aladdin cuts required GPUs by up to 71% vs. default vLLM for the same SLOs in simulated high-demand workloads.

Numbers: up to 71% GPU reduction

MC-SF: memory-aware online batching that cuts LLM latency under KV-cache limits

0.70
0.60
0.70
1

Better online scheduling of token decoding can cut average latency and GPU-hours: simulations show ~31% lower latency versus a memory-aware baseline, implying fewer GPUs and lower energy costs.

Key finding

Deterministic online algorithms can perform arbitrarily worse in the worst case.

Numbers: competitive ratio ≥ Ω(√n) (Theorem 4.1)

NinjaLLM: cost-focused RAG on AWS Trainium with near‑GPT‑4 accuracy

0.70
0.50
0.80
1

You can deploy competitive RAG QA at lower cloud cost by using Trainium/Inferentia2 plus serving optimizations, enabling cheaper fine-tuning and elastic scaling for production assistants.

Key finding

NinjaLLM matches or exceeds several open LLM baselines on two QA benchmarks.

Numbers: NQ 62.22%, HotPotQA 58.84% (Table 1)

throttLL'eM: cut LLM inference energy by throttling GPU frequency and right-sizing instances while keeping latency SLOs

0.70
0.50
0.70
0

throttLL'eM can lower GPU energy costs by tens of percent while keeping latency SLOs, reducing operating cost and carbon footprint for LLM inference services.

Key finding

Performance model predicts iteration throughput accurately.

Numbers: R² ≥ 0.97; MAE < 1 IPS (Table 3)

Microserving APIs and unified KV cache to reprogram LLM serving and cut job completion time by up to 47%

0.70
0.60
0.70
0

Microserving lets ops teams reprogram serving coordination at the router level without restarting engines, cutting tail latency and compute waste on long-input workloads and enabling live tuning of prefill/decode balance.

Key finding

Balanced prefill-decode disaggregation cuts tail job time on long inputs.

Numbers: P99 JCT reduced up to 47% (synthetic long-input dataset)

OrbitFlow adaptively reconfigures per-request KV cache placements to meet token-level latency SLOs for long-context LLM serving

0.80
0.70
0.70
0

OrbitFlow reduces token-level latency violations and raises throughput for long-context LLM services, improving user-perceived responsiveness and allowing more requests per GPU under real workloads.

Key finding

Solver-driven, per-request placement substantially improves SLO attainment.

Numbers: TPOT +62% and TBT +66% SLO attainment (evaluated traces)

Precompute table-level KV caches (guided by primary–foreign keys) to cut Text-to‑SQL prefill latency up to 3.62× while keeping accuracy.

0.70
0.60
0.70
0

TableCache cuts Text-to‑SQL response latency by precomputing and reusing table caches, improving user experience and lowering repeated GPU compute costs in applications where users query shared tables.

Key finding

TableCache greatly reduces prefill latency (TTFT) on Text-to‑SQL benchmarks.

Numbers: up to 3.62× TTFT speedup (reported max)

Speed up LLM serving by aggregating small models, adapting speculation length, and pipelining verification

0.70
0.65
0.70
0

Minions can materially reduce serving latency and multiply throughput for conversational LLMs without retraining large models, lowering operational cost and improving user responsiveness on evaluated workloads.

Key finding

Majority voting raises acceptance rates of SSM outputs, improving throughput.

Numbers: OPT-13B acceptance rates up to 0.87/0.89/0.78 (finance/chatbot/dialogue); Llama2-70B-chat ~0.54/0.49/0.55
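
The intuition behind majority voting can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: several small draft models each propose a token sequence, a per-position majority vote forms the combined proposal, and acceptance is measured as the length of the prefix that matches the large model's output. The token values and both helper functions are hypothetical.

```python
from collections import Counter


def majority_vote(drafts):
    """Pick the most common token at each position across equal-length drafts."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*drafts)]


def accepted_prefix_len(proposal, target):
    """Count leading tokens of the proposal that match the target sequence."""
    accepted = 0
    for proposed, expected in zip(proposal, target):
        if proposed != expected:
            break
        accepted += 1
    return accepted


# Hypothetical drafts from three small models, and the large model's tokens.
drafts = [
    [1, 2, 3, 9],
    [1, 2, 4, 5],
    [1, 7, 3, 5],
]
target = [1, 2, 3, 5]

voted = majority_vote(drafts)
print("voted proposal:", voted)
print("accepted by vote:", accepted_prefix_len(voted, target))
print("accepted per draft:", [accepted_prefix_len(d, target) for d in drafts])
```

In this toy case each draft is wrong at a different position, so the vote recovers a longer accepted prefix than any single draft. When drafts share the same error, voting gives no benefit.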

Halve embedding size and prune heads to cut transformer memory and latency for edge devices

0.40
0.40
0.60
0

Cutting model size and latency lets teams run transformers on phones and edge devices, reducing server costs and improving responsiveness.

Key finding

Memory footprint roughly halved versus the original transformer.

Numbers: 1,122,304 → 536,576 bytes (−52%)

Joint Hessian-aware 8-bit quantization plus CPU–GPU expert scheduling for MoE edge deployment

0.70
0.60
0.80
0

HAQ plus CPU–GPU scheduling lets teams run large MoE models on consumer GPUs with near-full-precision accuracy, lowering GPU costs and enabling predictable, lower-latency edge services.

Key finding

HAQ matches full-precision perplexity closely on Mixtral-8×7B

Numbers: Wikitext2: FP16 3.840 vs HAQ 3.864; C4: FP16 7.401 vs HAQ 7.427

Prune far-away masks and stop confident tokens early to make diffusion LLMs much faster at inference

0.75
0.65
0.85
0

Streaming-dLLM cuts inference compute and latency dramatically for diffusion LLMs without retraining. That reduces cloud GPU cost and improves responsiveness for production services that use dLLMs for long or batch generation.

Key finding

Large throughput gains while preserving task accuracy.

Numbers: 68.2× speedup on MBPP with LLaDA-1.5 (gen length 512); accuracy 38.4%

ThunderServe: schedule and split LLM inference across diverse cloud GPUs to raise throughput and cut latency and cost

0.80
0.45
0.80
0

ThunderServe lets you run more LLM replicas for the same cloud spend by mapping compute-heavy and memory-bound phases to different GPU types and by cutting KV transfer costs, so you can increase throughput and meet SLOs without buying only high-end GPUs.

Key finding

Throughput gains over heterogeneous-cloud baseline

Numbers: up to 2.1×, average 1.7× (throughput) vs state-of-the-art on tested cloud setup

HydraServe cuts serverless LLM cold starts by parallel fetch, overlap, and consolidation

0.85
0.60
0.70
0

HydraServe makes serverless LLMs more reliable by cutting time-to-first-token and raising SLO attainment for bursty, long-tail workloads while keeping costs similar or lower.

Key finding

HydraServe reduces cold-start TTFT substantially versus prior serverless systems.

Numbers: 1.7×–4.7× TTFT reduction on evaluated testbeds

OD-LLM: SVD + token normalization to run LLM recommenders on-device at half size

0.70
0.60
0.70
0

OD-LLM trims LLM memory in half while keeping ranking quality and cutting inference time, making on-device personalization feasible for latency- and privacy-sensitive apps.

Key finding

50% compressed OD-LLM matches or exceeds uncompressed LC-Rec ranking metrics on evaluated datasets.

Numbers: Instruments HR@5: OD-LLM 0.0993 vs LC-Rec 0.0997; Arts HR@5: OD-LLM 0.1173 vs LC-Rec 0.1007

Route each token to a small or large model to cut memory movement and speed up LLM decoding

0.70
0.70
0.80
0

CITER lowers memory-transfer costs and latency during decoding by routing only the important tokens to the big model, cutting cloud GPU bills and improving real-time responsiveness while keeping quality.

Key finding

CITER reduces inference data-transfer cost vs prior token-level baseline on evaluated benchmarks.

Numbers: Up to 27% fewer data transfers (Qwen family); up to 32% on Llama3.1