Serve thousands of LoRA adapters from one machine by paging adapters and batching LoRA compute

November 6, 2023 · 8 min

Overview

Production Readiness

0.8

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

7

Authors

Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica

Links

Abstract / PDF

Why It Matters For Business

If you sell many small fine-tuned models (per-user or per-task), S-LoRA lets one machine host thousands of adapters, cutting GPU costs and raising throughput compared to naive merging or swapping.

Summary TLDR

S-LoRA is a system for serving very large numbers of LoRA adapters (small fine-tuned parameter sets) from a single machine. It keeps all adapters in host memory, moves only the active adapters to the GPU, and combines a unified memory pool, prefetching, custom CUDA kernels, and a new tensor-parallel scheme to batch and parallelize LoRA computation. On Llama models, S-LoRA serves up to ~2,000 adapters on one GPU and reports up to 4× higher throughput than naive vLLM-packed and up to 30× higher than HuggingFace PEFT on the evaluated workloads. The design trades a small on-the-fly LoRA compute cost for much larger batching gains and host-memory scalability.

Problem Statement

Serving many task-specific LoRA adapters is inefficient with standard methods. Merging adapters into full models or naively swapping weights wastes GPU memory, fragments KV cache memory, prevents batching across adapters, and sharply limits the number of adapters per machine.
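The alternative to merging is to keep the base weights shared and add the adapter term on the fly. A minimal sketch of why the two are numerically equivalent follows; the tiny matrices and the helper names `matmul` and `add` are made up for illustration and are not taken from the S-LoRA code.

```python
# Pure-Python illustration; shapes and values are invented for the example.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Base weight W (2x2) and a rank-1 adapter: A is 2x1, B is 1x2.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5], [1.0]]
B = [[2.0, -1.0]]
x = [[1.0, 1.0]]  # one token, hidden size 2

# Merged: fold the adapter into the weights, then do a single matmul.
# Every adapter needs its own full-size copy of W + AB.
merged = matmul(x, add(W, matmul(A, B)))

# Unmerged: one shared base matmul plus a small adapter-specific term.
# Requests using different adapters can still share the x @ W computation.
unmerged = add(matmul(x, W), matmul(matmul(x, A), B))

print(merged, unmerged)
```

Both paths give the same output, so the choice between them is about memory and batching rather than accuracy.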

Main Contribution

Unified Paging: a unified paged memory pool on the GPU that stores KV caches and adapter weights together, reducing fragmentation as adapters are paged in from host memory, where the full adapter set lives.

Heterogeneous batching kernels: custom CUDA/Triton kernels (MBGMM/MBGMV) that batch LoRA ops across variable ranks and sequence lengths in non-contiguous memory.

S-LoRA tensor-parallel strategy: a partitioning and communication schedule that keeps extra LoRA communication negligible vs. the base model and scales across GPUs.

Practical serving features: adapter clustering, demand-based prefetching, and an early-abort admission control to protect SLOs.
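To make the heterogeneous-batching idea concrete, here is a rough sketch of a batch in which every token uses a different adapter with a different rank. It is a plain Python loop standing in for what a fused GPU kernel would do, not the MBGMM/MBGMV kernels themselves; `matvec`, `batched_lora`, and the adapter names are hypothetical.

```python
def matvec(x, m):
    """Row vector x (length n) times matrix m (n rows) -> row vector."""
    return [sum(xi * mij for xi, mij in zip(x, col)) for col in zip(*m)]

def batched_lora(xs, W, adapters, adapter_ids):
    """Compute x @ W + (x @ A_i) @ B_i for each token x with its own adapter i."""
    outputs = []
    for x, aid in zip(xs, adapter_ids):
        base = matvec(x, W)              # shared base-model term (one GEMM on a GPU)
        A, B = adapters[aid]             # gather this request's adapter; rank may differ
        delta = matvec(matvec(x, A), B)  # shrink to rank r, then expand back
        outputs.append([b + d for b, d in zip(base, delta)])
    return outputs

W = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "rank1": ([[1.0], [1.0]], [[1.0, 2.0]]),                        # A: 2x1, B: 1x2
    "rank2": ([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]),  # A: 2x2, B: 2x2
}
xs = [[1.0, 2.0], [3.0, 4.0]]
out = batched_lora(xs, W, adapters, ["rank1", "rank2"])
print(out)
```

No padding to a common rank is needed here because each adapter is looked up by index; the source attributes the same property to its custom kernels.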

Key Findings

S-LoRA scales to thousands of adapters on one machine.

Numbers: Served 2,000 adapters on a single A100 (80GB) in experiments

Throughput improvements vs. baselines.

Numbers: Up to 4× vs. vLLM-packed and up to 30× vs. HuggingFace PEFT on evaluated traces

Custom kernels and unified memory reduce fragmentation and latency overhead.

Numbers: S-LoRA outperforms S-LoRA-no-unify-mem and S-LoRA-bmm across adapter counts (higher throughput, lower latency)

LoRA communication is small under new TP strategy.

Numbers: Added LoRA communication is negligible compared to base-model communication; throughput increases >2× from 2→4 GPUs due

Results

throughput (requests/s) on Llama-7B, single A100 (80GB), small adapter set

Value: S-LoRA 8.05 req/s; vLLM-packed 2.04 req/s; PEFT 0.88 req/s

Baseline: vLLM-packed

scale of adapters served

Value: Successfully served 2,000 adapters on a single GPU (A100 80GB) with small overhead

relative improvement vs. PEFT across experiments

Value: Up to 30× higher throughput on evaluated workloads

Baseline: HuggingFace PEFT
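As a quick sanity check, the speedups implied by the single Llama-7B configuration reported above can be computed directly from the three throughput values. Only those numbers come from the results; the variable names are illustrative.

```python
# Throughput in requests/s for the Llama-7B, single A100 (80GB) configuration above.
throughput = {"S-LoRA": 8.05, "vLLM-packed": 2.04, "PEFT": 0.88}

speedup_vs_vllm_packed = throughput["S-LoRA"] / throughput["vLLM-packed"]
speedup_vs_peft = throughput["S-LoRA"] / throughput["PEFT"]

print(f"{speedup_vs_vllm_packed:.2f}x vs vLLM-packed")  # 3.95x vs vLLM-packed
print(f"{speedup_vs_peft:.2f}x vs PEFT")                # 9.15x vs PEFT
```

This configuration gives roughly 4× over vLLM-packed and roughly 9× over PEFT, so the "up to 30×" figure must come from other evaluated workloads rather than this one.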

Who Should Care

What To Try In 7 Days

Run the S-LoRA GitHub demo on a Llama-7B instance and replay a small real trace to measure throughput gains.

Measure host RAM vs GPU usage for your adapter set; confirm adapters fit host memory before deploying.

Prototype unified paging: store adapters off-GPU and prefetch the hottest set to test the latency impact.
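For the host-RAM check, a back-of-the-envelope estimate may be enough to start. The sketch below assumes a Llama-7B-sized model (32 layers, hidden size 4096), LoRA applied to four square attention projections per layer, rank 8, and 16-bit weights. The rank, the choice of adapted matrices, and the function name `adapter_bytes` are assumptions for illustration; substitute the values from your own adapter configs.

```python
def adapter_bytes(num_layers, hidden_size, rank,
                  adapted_matrices_per_layer=4, bytes_per_param=2):
    """Approximate size of one LoRA adapter on square hidden x hidden projections."""
    params_per_matrix = 2 * hidden_size * rank  # A is hidden x r, B is r x hidden
    return num_layers * adapted_matrices_per_layer * params_per_matrix * bytes_per_param

# Assumed configuration: Llama-7B dimensions, rank 8, fp16, four adapted matrices.
one_adapter = adapter_bytes(num_layers=32, hidden_size=4096, rank=8)
catalog = 2000 * one_adapter

print(f"one adapter: {one_adapter / 2**20:.0f} MiB")
print(f"2,000 adapters: {catalog / 2**30:.2f} GiB")
```

Under these assumptions one adapter is about 16 MiB and a 2,000-adapter catalog is about 31 GiB, which fits in host RAM on many servers but not alongside the base model on an 80GB GPU.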

Optimization Features

Infra Optimization

  • LoRA
  • Non-contiguous memory-aware CUDA/Triton kernels to avoid padding overheads

System Optimization

  • Unified Paging: shared paged memory pool for KV cache and adapters
  • Adapter clustering and early-abort admission control to protect SLOs
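The shared-pool idea can be sketched as a toy allocator in which KV-cache entries and adapter weights draw fixed-size pages from one free list. This is a simplified model under assumed granularity (one page per cached token, one page per adapter rank row), not the S-LoRA implementation; `UnifiedPagePool` and its methods are hypothetical names.

```python
class UnifiedPagePool:
    """Fixed-size pages shared by KV-cache entries and adapter weights (toy model)."""

    def __init__(self, num_pages):
        self.free = list(range(num_pages))  # indices of unused pages
        self.owner = {}                     # page index -> (kind, key)

    def allocate(self, kind, key, num_pages):
        """Take any free pages; they need not be contiguous."""
        if num_pages > len(self.free):
            raise MemoryError(f"need {num_pages} pages, only {len(self.free)} free")
        pages = [self.free.pop() for _ in range(num_pages)]
        for p in pages:
            self.owner[p] = (kind, key)
        return pages

    def release(self, kind, key):
        """Return every page held by (kind, key) to the shared free list."""
        pages = [p for p, o in self.owner.items() if o == (kind, key)]
        for p in pages:
            del self.owner[p]
            self.free.append(p)
        return len(pages)

pool = UnifiedPagePool(num_pages=8)
pool.allocate("adapter", "adapter-A", 2)  # assumed: one page per rank row
pool.allocate("kv", "request-1", 5)       # assumed: one page per cached token
pool.release("kv", "request-1")           # request finished; pages become reusable
pages_for_b = pool.allocate("adapter", "adapter-B", 4)  # reuses freed KV pages
print(pages_for_b, len(pool.free))
```

Because both kinds of tensor use the same page size, pages freed by a finished request can hold adapter weights and vice versa, which is how a single pool avoids the fragmentation of two separate ones.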

Inference Optimization

  • LoRA
  • Prefetch adapters to overlap I/O with compute
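One simple way to decide what to prefetch is to look at the requests waiting in the queue and load the adapters the next batch will need while the current batch is still running. The sketch below shows only that selection step; the queue layout, the `resident` set, and the function name are assumptions, and the actual host-to-GPU copy is left out.

```python
from collections import deque

def adapters_to_prefetch(waiting, resident, next_batch_size):
    """Adapters needed by the next batch that are not yet on the GPU, in queue order."""
    needed = []
    for _request_id, adapter_id in list(waiting)[:next_batch_size]:
        if adapter_id not in resident and adapter_id not in needed:
            needed.append(adapter_id)
    return needed

# (request id, adapter id) pairs in arrival order.
waiting = deque([("r1", "A"), ("r2", "C"), ("r3", "A"), ("r4", "D"), ("r5", "E")])
resident = {"A", "B"}  # adapters already in GPU memory

to_load = adapters_to_prefetch(waiting, resident, next_batch_size=4)
print(to_load)  # ['C', 'D']
```

Adapter E is not fetched yet because its request falls outside the next batch; a deeper lookahead would trade GPU memory for fewer stalls.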

Reproducibility

Code Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Requires host RAM large enough to hold all adapters; limited by main memory capacity.
  • Adds on-the-fly LoRA compute cost (xAB) compared to merged weights; trade-off depends on workload mix.
  • Relies on custom CUDA/Triton kernels and modified TP which increase engineering complexity.
  • Evaluation focuses on LoRA; other adapter methods and broader model families are not extensively tested.
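The size of the on-the-fly cost in the second limitation can be estimated by counting multiply-accumulates per token. The sketch below compares the low-rank path with the base projection for an assumed 4096 × 4096 layer. It counts arithmetic only and ignores memory traffic and kernel efficiency, so treat it as a lower bound on the real overhead; `lora_overhead_ratio` is a hypothetical helper.

```python
def lora_overhead_ratio(h_in, h_out, rank):
    """Multiply-accumulates for (x @ A) @ B relative to x @ W, per token."""
    base = h_in * h_out               # x @ W
    lora = h_in * rank + rank * h_out  # x @ A, then (x @ A) @ B
    return lora / base

# Assumed layer shape (4096 x 4096) and two illustrative ranks.
for r in (8, 64):
    print(f"rank {r}: {lora_overhead_ratio(4096, 4096, r):.4f} of base-layer compute")
```

For square layers the ratio reduces to 2r/h, so the extra arithmetic stays small as long as the rank is much smaller than the hidden size.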

When Not To Use

  • If you serve only one or a few adapters under strict latency requirements; merging the adapter into the base weights may be faster.
  • When host memory is insufficient to hold the adapter catalog.
  • If you cannot deploy custom kernels or modify tensor-parallel runtime.

Failure Modes

  • High adapter churn and poor prefetch predictions can cause I/O stalls and increased latency.
  • Insufficient host memory leads to inability to host many adapters or higher paging overhead.
  • Incorrect admission-control tuning may drop requests unfairly or reduce user satisfaction.
  • Kernel or partitioning bugs can produce incorrect computations or severe performance regressions.

Core Entities

Models

  • Llama-7B
  • Llama-13B
  • Llama-30B
  • Llama-70B

Metrics

  • throughput
  • average request latency
  • first token latency
  • SLO attainment
  • user satisfaction

Context Entities

Models

  • Megatron-LM partitioning (used as baseline TP design)

Metrics

  • throughput
  • SLO attainment