27 papers found

PETALS: run and fine-tune 50B+ LLMs by pooling unreliable consumer GPUs over the Internet

0.75
0.60
0.80
13

PETALS lets teams share idle consumer GPUs to run 50B+ models interactively, cutting the need for expensive multi‑GPU servers and lowering inference latency versus RAM offloading; consider privacy and trust tradeoffs.

Key finding

Distributed approach (PETALS) gives big interactive speedups vs single‑GPU offloading.

Numbers: ≥10× faster for autoregressive generation (paper claim)

Serve thousands of LoRA adapters from one machine by paging adapters and batching LoRA compute

0.80
0.60
0.80
7

If you sell many small fine-tuned models (per-user or per-task), S-LoRA lets one machine host thousands of adapters, cutting GPU costs and raising throughput compared to naive merging or swapping.

Key finding

S-LoRA scales to thousands of adapters on one machine.

Numbers: Served 2,000 adapters on a single A100 (80GB) in experiments

Serve large LLMs on mixed-GPU clusters with phase-aware partitioning and adaptive mixed-precision quantization

0.70
0.60
0.80
3

If you run batched LLM workloads, LLM-PQ lets you use mixed low- and high-end GPUs together to significantly raise throughput and lower cost while keeping model quality.

Key finding

LLM-PQ consistently increases token-generation throughput versus state-of-the-art baselines by selecting mixed precisions and phase-aware partitions.

Numbers: Up to 2.88× speed-up; 2.26× average speed-up (Table 4, multiple clusters).

MemServe: a MemPool that adds context caching to disaggregated LLM serving, cutting job times and first-token delays

0.70
0.70
0.80
2

MemServe lets operators combine caching and disaggregated inference to cut end-to-end latency and tail times for many chat and long-context workloads, lowering hardware cost per request and improving user-perceived responsiveness.

Key finding

Disaggregated inference plus MemPool reduces job completion time compared to colocated baseline.

Numbers: JCT improved up to 42% (P99) on ShareGPT

LLMEasyQuant: modular, hardware-aware quantization runtime for multi‑GPU and distributed LLM serving

0.80
0.60
0.80
2

LLMEasyQuant lowers calibration and deployment overhead, lets you fit larger models on the same GPUs, and delivers small steady throughput gains—helpful when you must amortize expensive GPU fleets.

Key finding

LLMEasyQuant achieves 2,156 tokens/s on LLaMA-7B with INT8 quantization.

Numbers: Throughput 2,156 tok/s (LLaMA-7B, 8K context)

Stream KV-cache to cut pipeline bubbles, reduce GPU memory, and recover fast for pipeline-parallel LLMs

0.75
0.60
0.70
1

DéjàVu cuts wasted GPU time, reduces memory needs, and shortens recovery after node failures—so you can serve larger LLMs cheaper and more reliably in pipeline-parallel clusters.

Key finding

Disaggregation increases throughput versus a pipeline-parallel baseline

Numbers: Up to throughput improvement vs FasterTransformer on OPT-66B and BLOOM-176B

Reorder quantized weights to avoid inter-GPU communication and cut LLM inference latency up to ~1.8x

0.70
0.60
0.70
1

A low-complexity, offline reorder can cut inter-GPU communication and speed up quantized LLM inference, lowering latency and increasing throughput for multi-GPU serving without changing model weights.

Key finding

TP-Aware Dequantization speeds up MLP-layer inference in distributed LLMs.

Numbers: up to 1.81x (Llama-70B, A100) and up to 1.83x (Granite-20B, A100)

Aladdin schedules LLM requests and scales GPUs together to cut serving cost while meeting token-level SLOs.

0.60
0.60
0.80
1

Aladdin can cut GPU spending by tens of percent while keeping token-level SLOs, turning large inference clusters from a fixed-cost bottleneck into a more efficient, demand-driven service.

Key finding

Aladdin cuts required GPUs by up to 71% vs. default vLLM for the same SLOs in simulated high-demand workloads.

Numbers: up to 71% GPU reduction

Microserving APIs and unified KV cache to reprogram LLM serving and cut job completion time by up to 47%

0.70
0.60
0.70
0

Microserving lets ops teams reprogram serving coordination at the router level without restarting engines, cutting tail latency and compute waste on long-input workloads and enabling live tuning of prefill/decode balance.

Key finding

Balanced prefill-decode disaggregation cuts tail job time on long inputs.

Numbers: P99 JCT reduced up to 47% (synthetic long-input dataset)

Reduce KV memory and bandwidth for MoE inference by sharding, routing, compression, and adaptive scheduling

0.60
0.60
0.70
0

If you serve MoE models or long-context LLMs, KV cache memory and network traffic are major cost drivers. PiKV helps cut per-GPU memory and inter-GPU bandwidth by sharding, selective access, and compression—reducing infrastructure needs or enabling longer contexts on the same cluster.

Key finding

Expert-sharded storage changes per-device memory scaling from proportional to E·L to proportional to L/G + L/E (analytic).

Numbers: per-device memory: O(E·L) -> O(L/G + L/E) (Section 3.1 analytic)
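To make the scaling change concrete, the sketch below evaluates both expressions for sample values. It assumes E is the number of experts, L is the context length in tokens, and G is the number of GPUs; the listing does not define these symbols, and the sample values are hypothetical. Constants hidden by the O(·) notation are ignored, so only the ratio between the two results is meaningful.

```python
def naive_kv_units(num_experts: int, context_len: int) -> float:
    """Per-device KV footprint under the O(E*L) scaling, in arbitrary units."""
    return float(num_experts * context_len)


def sharded_kv_units(num_experts: int, context_len: int, num_gpus: int) -> float:
    """Per-device KV footprint under the O(L/G + L/E) scaling, in arbitrary units."""
    return context_len / num_gpus + context_len / num_experts


# Hypothetical configuration chosen only for illustration.
E, L, G = 8, 32768, 4
before = naive_kv_units(E, L)
after = sharded_kv_units(E, L, G)
print(f"before={before:.0f} after={after:.0f} ratio={before / after:.1f}x")
```

Under these assumed values the sharded expression is roughly 21 times smaller, but a real deployment would see a different factor depending on the constants the analysis omits.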

Cut model-reload downtime by preserving GPU state and doing small P2P migrations

0.70
0.70
0.75
0

AnchorTP cuts recovery downtime from tens of seconds or minutes to a few seconds and shortens time to regain peak throughput. That improves user-facing latency SLOs, reduces required redundancy, and lowers cost compared to always-running replicas.

Key finding

AnchorTP reduces time-to-first-success (TFS) by about 10× on evaluated models compared to restart-and-reload elastic TP.

Numbers: Qwen3-30B-A3B: 4.5s vs 48.4s (≈10.8×) at 25% failure point

APEX: fast, extensible simulator that finds cost- and energy-efficient parallel plans for LLM serving

0.70
0.60
0.80
0

APEX lets ops teams find faster or cheaper LLM serving configurations without burning expensive GPU hours, enabling targeted trade-offs between latency, throughput, and energy while meeting SLOs.

Key finding

APEX prediction fidelity is high.

Numbers: average relative error = 10.7%

Faster CPU inference: SlimAttention, INT8 KV cache, and oneCCL-based distributed serving

0.70
0.50
0.70
0

This paper shows practical, deployable techniques to run big LLMs on commodity x86 servers. That reduces reliance on GPUs, lowers memory barriers for long contexts and large batches, and can cut latency through multi-socket scaling.

Key finding

SlimAttention greatly lowers per-attention-layer time versus FlashAttention on CPU.

Numbers: Input=1024: Flash 61.57 ms vs Slim 16.02 ms (per layer, first token)

ThunderServe: schedule and split LLM inference across diverse cloud GPUs to raise throughput and cut latency and cost

0.80
0.45
0.80
0

ThunderServe lets you run more LLM replicas for the same cloud spend by mapping compute-heavy and memory-bound phases to different GPU types and by cutting KV transfer costs—so you can increase throughput and meet SLOs without buying only high-end GPUs.

Key finding

Throughput gains over heterogeneous-cloud baseline

Numbers: up to 2.1×, average 1.7× (throughput) vs state-of-the-art on tested cloud setup

Make decentralized LLM markets pay for quality per cost, not just raw accuracy

0.60
0.45
0.80
0

If you run or buy decentralized LLM inference, paying for raw accuracy alone wastes money. Cost-aware PoQ rewards quality per cost, pushing demand toward nodes that give the best quality per latency/compute. That reduces marketplace waste and lets small evaluators compete with big models.

Key finding

A bi-encoder trained on semantic textual similarity (STS-DistilRoBERTa) aligns best with ground truth and GPT judgments.

Numbers: Pearson r ≈ 0.66 vs F1; ≈ 0.29 vs GPT

Run full LLM inference on one wafer-scale chip; up to 10–600× speedups vs GPUs

0.70
0.65
0.80
0

Wafer-scale chips can cut per-request latency and improve tokens-per-dollar for long outputs and high-throughput serving, making them worth testing for production LLM serving and cost-sensitive, long-context workloads.

Key finding

WaferLLM achieves far higher accelerator utilization than prior methods.

Numbers: up to 200× higher accelerator utilization than SOTA methods

Jointly place LLM layers on edge servers and quantize them to cut latency and memory while keeping accuracy.

0.40
0.50
0.60
0

If you serve LLMs from edge nodes, joint per-layer placement and quantization can cut weight storage and network load by ~87% while keeping task accuracy almost unchanged, which lowers cost and reduces latency for latency-sensitive apps.

Key finding

DILEMMA can reduce total parameter-bit usage to about 12.5% of the original (i.e., ~87.5% reduction).

Numbers: quantization ratio = 12.50% (table rows δ=0.01,0.1,1.0)
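The 12.5% figure can be read as a weighted bit budget: the bits used after per-layer quantization divided by the bits used at the original precision. The sketch below computes that ratio. The 16-bit baseline, the layer sizes, and the uniform 2-bit assignment are assumptions for illustration; the listing does not state which bit widths DILEMMA selects.

```python
def quantization_ratio(layer_params, layer_bits, baseline_bits=16):
    """Fraction of the original parameter bits kept after per-layer quantization.

    layer_params: number of parameters in each layer
    layer_bits:   bit width chosen for each layer
    baseline_bits: assumed original precision (hypothetical default of 16)
    """
    if len(layer_params) != len(layer_bits):
        raise ValueError("layer_params and layer_bits must have the same length")
    used = sum(p * b for p, b in zip(layer_params, layer_bits))
    total = sum(layer_params) * baseline_bits
    return used / total


# Hypothetical three-layer model with every layer quantized to 2 bits.
params = [100, 200, 300]
print(quantization_ratio(params, [2, 2, 2]))   # 2/16 of the baseline
print(quantization_ratio(params, [4, 2, 2]))   # keeping one layer at 4 bits raises the ratio
```

Under these assumptions a uniform 2-bit assignment gives exactly 0.125, which matches the reported ratio; other combinations of baseline and bit widths could produce the same number.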

Model‑agnostic hybrid sharding to run large models across heterogeneous, privacy-preserving nodes

0.50
0.60
0.70
0

BSNS lowers hardware and bandwidth barriers so companies can run large models across existing, heterogeneous machines while keeping user data private and model execution auditable.

Key finding

Switching model communication and weights from 16‑bit to 8‑bit caused a negligible accuracy drop on evaluated NLP benchmarks.

Numbers: HellaSwag Llama‑8B: 0.76 -> 0.76; Mixtral 8x7B: 0.78 -> 0.77 (Table 1)

A runtime-driven simulator that models heterogeneous accelerators, disaggregated memory, batching, and power for realistic LLM serving

0.70
0.60
0.70
0

Run realistic what-if tests for heterogeneous hardware and disaggregated serving to choose cheaper or more energy-efficient deployments before costly hardware changes.

Key finding

Simulator reproduces key serving metrics with very low average error across evaluated setups.

Numbers: Average error 0.97% across throughput, latency, memory, and power

Move KV-cache fetching and decompression off GPUs to SmartNICs to eliminate interference

0.70
0.55
0.60
0

If you serve LLMs over limited network links or on low-bandwidth GPU instances, offloading KV-cache fetch and decompression to SmartNICs can cut per-token latency and improve throughput without changing compression code.

Key finding

Offloading decompression to the SmartNIC cuts per-output-token latency under load.

Numbers: 1.06–2.19× lower loaded TPOT across configs; up to 2.2× reported

DualMap: dual-hash scheduling that preserves KV-cache reuse while balancing load for LLM serving

0.70
0.60
0.70
0

DualMap can serve more latency-sensitive requests and lower per-request compute cost by combining cache reuse with balanced load, improving throughput and reducing tail latency under real skewed workloads.

Key finding

DualMap increases effective request capacity up to 2.25× versus state-of-the-art schedulers on evaluated traces.

Numbers: up to 2.25× effective request capacity (abstract, §5)
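The listing does not describe how DualMap's dual-hash scheduling works internally. As a rough intuition only, the sketch below shows a generic two-choice hashing router: each prompt prefix always maps to the same two candidate instances (so cached KV state for that prefix stays on at most two machines), and the less loaded candidate receives the request. The function names, salts, and load counter are hypothetical, and this is not presented as the paper's algorithm.

```python
import hashlib


def _bucket(key: str, salt: str, num_instances: int) -> int:
    """Map a key to an instance index with a salted SHA-256 hash."""
    digest = hashlib.sha256((salt + key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_instances


def candidates(prefix: str, num_instances: int) -> tuple:
    """Return two candidate instances for a prompt prefix (deterministic)."""
    first = _bucket(prefix, "h1:", num_instances)
    second = _bucket(prefix, "h2:", num_instances)
    if second == first:
        # Force a distinct second choice when both hashes collide.
        second = (first + 1) % num_instances
    return first, second


def route(prefix: str, loads: list) -> int:
    """Send the request to the less loaded candidate and record the load."""
    first, second = candidates(prefix, len(loads))
    target = first if loads[first] <= loads[second] else second
    loads[target] += 1
    return target


loads = [0, 0, 0, 0]
for _ in range(3):
    print(route("You are a helpful assistant.", loads), loads)
```

Because the candidate pair depends only on the prefix, repeated requests with the same prefix can hit a warm cache, while the load comparison keeps a hot prefix from overloading a single instance.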

Flying Serving: instant DP↔TP switching to cut tail latency, keep throughput, and scale context

0.80
0.70
0.70
0

Flying Serving reduces tail latency during bursts and keeps throughput near DP, letting operators serve mixed-priority and long-context traffic without costly restarts or wasted GPUs.

Key finding

Live DP↔TP switching reduces burst P90 TTFT up to 4.79× vs static TP on tested models.

Numbers: P90 TTFT reduction up to 4.79× (Nemotron-8B) under bursty trace

Open-source, low-cost platform that secures RAG chatbots for small businesses using k3s clusters and layered prompt-defences

0.78
0.48
0.80
0

Small businesses can run secure, low-cost RAG chatbots on commodity hardware while keeping strong tenant isolation and practical defenses against prompt injection.

Key finding

Guard prompts block prompt-injection attacks almost perfectly in the case study.

Numbers: Recall 99.6–100%, F1 ~100% (Table 1)

Share the prefill and KV cache across fine‑tuned models to cut tail latency and boost throughput in multi‑model agent serving.

0.75
0.65
0.80
0

If your product runs multiple fine‑tuned models over shared prompts (agents, planners, coders), PrefillShare can cut tail latency and GPU cost by reusing one prefill and KV cache across models while keeping task accuracy.

Key finding

PrefillShare matches full fine‑tuning accuracy across math, coding, and tool‑calling benchmarks.

Numbers: Accuracy within ≈1% of Full‑FT on evaluated benchmarks (Table 1).