Serve thousands of LoRA adapters from one machine by paging adapters and batching LoRA compute

November 6, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The system shows clear throughput and scale gains on Llama models and real traces; code is available, but deploying requires custom kernels and memory changes.

Citations: 7

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/3

Reproducibility

Status: Partial assets available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 60%

Authors

Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica

Why It Matters For Business

If you sell many small fine-tuned models (per-user or per-task), S-LoRA lets one machine host thousands of adapters, cutting GPU costs and raising throughput compared to naive merging or swapping.

Who Should Care

Summary TLDR

S-LoRA is a system for serving very large numbers of LoRA adapters (small fine-tuned parameter sets) from a single machine. It keeps adapters in host memory, moves only active adapters to GPU, and combines a unified memory pool, prefetching, custom CUDA kernels, and a new tensor-parallel scheme to batch and parallelize LoRA computation. On Llama models S-LoRA serves up to ~2,000 adapters on one GPU and reports up to 4× throughput vs. naive vLLM-packed and up to 30× vs. HuggingFace PEFT on evaluated workloads. The design trades a small on-the-fly LoRA compute cost for much larger batching gains and host-memory scalability.

Problem Statement

Serving many task-specific LoRA adapters is inefficient with standard methods. Merging adapters into full models or naively swapping weights wastes GPU memory, fragments KV cache memory, prevents batching across adapters, and sharply limits the number of adapters per machine.

Main Contribution

Unified Paging: a unified paged memory pool that stores KV caches and adapter weights together to reduce fragmentation and enable many adapters in host memory.

Heterogeneous batching kernels: custom CUDA/Triton kernels (MBGMM/MBGMV) that batch LoRA ops across variable ranks and sequence lengths in non-contiguous memory.
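To make "batching LoRA ops across variable ranks" concrete, here is a minimal pure-Python sketch. It is not the paper's CUDA/Triton kernels and says nothing about their memory layout; it only shows the arithmetic they batch: the base product x·W is computed once for the whole batch, and each request then gathers its own adapter (A, B), whose rank may differ from its neighbours', and adds x·A·B. The names `matmul` and `batched_lora_forward` and the toy dimensions are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix multiply: (m x k) @ (k x n) -> (m x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def batched_lora_forward(
    xs: Matrix,                                   # one input row vector per request
    adapter_ids: List[str],                       # adapter used by each request
    base_w: Matrix,                               # shared base weight, d_in x d_out
    adapters: Dict[str, Tuple[Matrix, Matrix]],   # id -> (A: d_in x r, B: r x d_out)
) -> Matrix:
    # The base product is batched once across all requests, whatever adapter they use.
    base_out = matmul(xs, base_w)
    out = []
    for x, aid, base_row in zip(xs, adapter_ids, base_out):
        a, b = adapters[aid]                      # gather this request's adapter; rank may differ
        delta = matmul(matmul([x], a), b)[0]      # x @ A @ B, computed on the fly
        out.append([u + v for u, v in zip(base_row, delta)])
    return out


# Toy batch: d_in = d_out = 2, one rank-1 adapter and one rank-2 adapter.
base_w = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "rank1": ([[1.0], [0.0]], [[0.0, 2.0]]),
    "rank2": ([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.0], [0.0, 0.5]]),
}
xs = [[1.0, 1.0], [2.0, 4.0]]
result = batched_lora_forward(xs, ["rank1", "rank2"], base_w, adapters)
print(result)
```

The Python loop over requests is exactly the part a real kernel would fuse; the point of the sketch is only that two requests with different adapters and different ranks can share one batched base computation.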

Key Findings

S-LoRA scales to thousands of adapters on one machine.

Numbers: Served 2,000 adapters on a single A100 (80GB) in experiments

Practical Use: Store adapters in host RAM and page needed ones to GPU to support many user-specific or task-specific models without replicating the full base model.

Evidence Ref: Table 3; Section 7.2

Throughput improvements vs. baselines.

Numbers: Up to 4× vs vLLM-packed and up to 30× vs HuggingFace PEFT on evaluated traces

Practical Use: Use S-LoRA-style batching and memory management to materially increase requests/sec when serving many adapters.

Evidence Ref: Abstract; Table 3; Section 7.2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| throughput (requests/s) on Llama-7B, single A100 (80GB), small adapter set | S-LoRA 8.05 req/s; vLLM-packed 2.04 req/s; PEFT 0.88 req/s | vLLM-packed | ≈4× vs vLLM-packed; ≈9× vs PEFT in this setting | Table 3, S1 setting | Table 3 reports raw throughput values for S1 (n small). | Table 3 |
| scale of adapters served | Successfully served 2,000 adapters on a single GPU (A100 80GB) with small overhead | | | Synthetic workloads (Section 7.2) | Table 3 and Section 7.2 show S-LoRA serving 2000 adapters. | Table 3 |
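The delta values in the throughput row follow directly from the raw requests/s figures. The short check below recomputes them; the dictionary and the helper name `speedup` are illustrative only.

```python
# Throughput values (requests/s) from the Results row above (Table 3, S1 setting).
throughput = {"S-LoRA": 8.05, "vLLM-packed": 2.04, "PEFT": 0.88}


def speedup(system: str, baseline: str) -> float:
    """Ratio of two reported throughputs."""
    return throughput[system] / throughput[baseline]


print(f"S-LoRA vs vLLM-packed: {speedup('S-LoRA', 'vLLM-packed'):.2f}x")  # 3.95x
print(f"S-LoRA vs PEFT:        {speedup('S-LoRA', 'PEFT'):.2f}x")         # 9.15x
```

These ratios apply to this single setting; the "up to 30×" figure versus PEFT comes from other evaluated workloads.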

What To Try In 7 Days

Run the S-LoRA GitHub demo on a Llama-7B instance and replay a small real trace to measure throughput gains.

Measure host RAM vs GPU usage for your adapter set; confirm adapters fit host memory before deploying.

Prototype unified-paging: store adapters off-GPU and prefetch a hottest-set to test latency impact.
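For the last experiment, a toy simulation can help estimate how often requests would stall on adapter loads before touching real GPU code. The sketch below is an assumption-laden stand-in, not S-LoRA's implementation: `AdapterCache`, its LRU eviction policy, and the slot count are all hypothetical, and the "host-to-GPU copy" is just a dictionary assignment.

```python
from collections import OrderedDict
from typing import Dict, Iterable, List


class AdapterCache:
    """Toy model: every adapter lives in host memory; at most `gpu_slots` are GPU-resident."""

    def __init__(self, host_store: Dict[str, bytes], gpu_slots: int) -> None:
        self.host_store = host_store
        self.gpu_slots = gpu_slots
        self.on_gpu: "OrderedDict[str, bytes]" = OrderedDict()  # LRU order, oldest first
        self.misses = 0  # loads triggered by a request (these would stall on I/O)

    def _load(self, adapter_id: str) -> None:
        if adapter_id in self.on_gpu:
            self.on_gpu.move_to_end(adapter_id)      # mark as most recently used
            return
        if len(self.on_gpu) >= self.gpu_slots:
            self.on_gpu.popitem(last=False)          # evict the least recently used adapter
        self.on_gpu[adapter_id] = self.host_store[adapter_id]  # stands in for a host-to-GPU copy

    def prefetch(self, adapter_ids: Iterable[str]) -> None:
        """Load adapters ahead of demand; does not count as a miss."""
        for aid in adapter_ids:
            self._load(aid)

    def get(self, adapter_id: str) -> bytes:
        if adapter_id not in self.on_gpu:
            self.misses += 1
        self._load(adapter_id)
        return self.on_gpu[adapter_id]


host = {f"adapter-{i}": bytes(4) for i in range(100)}
cache = AdapterCache(host, gpu_slots=2)
cache.prefetch(["adapter-0", "adapter-1"])           # assumed hottest set
trace: List[str] = ["adapter-0", "adapter-1", "adapter-0", "adapter-7", "adapter-1"]
for aid in trace:
    cache.get(aid)
print("demand misses:", cache.misses)
```

Replaying your own request trace through a model like this gives a rough miss rate for a given GPU-resident set size, which you can then compare against measured latency on the real system.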

Optimization Features

Infra Optimization
LoRA: Non-contiguous memory-aware CUDA/Triton kernels to avoid padding overheads

System Optimization
Unified Paging: shared paged memory pool for KV cache and adapters
Adapter clustering and early-abort admission control to protect SLOs

Inference Optimization
LoRA: Prefetch adapters to overlap I/O with compute
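The shared-pool idea behind Unified Paging can be pictured with a small allocator sketch. This is an illustration under stated assumptions, not the system's memory manager: the class `UnifiedPagePool`, the page counts, and the "one page per cached token, one page per unit of adapter rank" mapping are all hypothetical choices made for the example.

```python
from typing import Dict, List, Tuple


class UnifiedPagePool:
    """Fixed-size pages shared by KV-cache entries and adapter weights."""

    def __init__(self, num_pages: int) -> None:
        self.free_pages: List[int] = list(range(num_pages))
        self.owners: Dict[Tuple[str, str], List[int]] = {}  # (kind, name) -> page indices

    def allocate(self, kind: str, name: str, pages_needed: int) -> List[int]:
        if pages_needed > len(self.free_pages):
            raise MemoryError(f"need {pages_needed} pages, only {len(self.free_pages)} free")
        # Pages need not be contiguous; any free page will do.
        pages = [self.free_pages.pop() for _ in range(pages_needed)]
        self.owners.setdefault((kind, name), []).extend(pages)
        return pages

    def release(self, kind: str, name: str) -> None:
        self.free_pages.extend(self.owners.pop((kind, name), []))


pool = UnifiedPagePool(num_pages=16)
pool.allocate("adapter", "adapter-A", pages_needed=8)  # assumption: rank 8, one page per rank unit
pool.allocate("kv", "request-1", pages_needed=5)       # assumption: 5 cached tokens, one page each
pool.release("adapter", "adapter-A")                   # adapter evicted back to host memory
# Freed adapter pages are immediately reusable by a KV cache; no region is reserved per kind.
kv_pages = pool.allocate("kv", "request-2", pages_needed=10)
print(len(pool.free_pages), "page(s) still free")
```

With two separate fixed-size regions, the final 10-page KV allocation could fail even though enough total memory is free; a single pool avoids that kind of fragmentation.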

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Requires host RAM large enough to hold all adapters; limited by main memory capacity.

Adds on-the-fly LoRA compute cost (xAB) compared to merged weights; trade-off depends on workload mix.
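The trade-off in the second limitation rests on a simple identity: a merged weight computes x(W + AB), while the unmerged path computes xW + xAB, and the two are equal. The tiny pure-Python check below uses made-up 2×2 values and hypothetical helper names (`matmul`, `add`) to show the extra work the unmerged path performs per request.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """(m x k) @ (k x n) -> (m x n) on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


W = [[1.0, 2.0], [3.0, 4.0]]   # base weight, d x d
A = [[1.0], [2.0]]             # LoRA down-projection, d x r with r = 1
B = [[0.5, -1.0]]              # LoRA up-projection, r x d
x = [[2.0, 1.0]]               # one input row

merged = matmul(x, add(W, matmul(A, B)))                 # merge once, then a single matmul
unmerged = add(matmul(x, W), matmul(matmul(x, A), B))    # base matmul plus two small matmuls
print(merged, unmerged)
```

The unmerged path adds two low-rank products per request, but it leaves W shared across adapters, which is what allows requests for different adapters to be batched together.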

When Not To Use

If you serve only one or a few adapters under strict single-adapter latency requirements; merging those adapters into the base weights may be faster.

When host memory is insufficient to hold the adapter catalog.
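A back-of-the-envelope estimate helps decide whether host memory is a constraint for your catalog. The sketch below assumes a Llama-7B-shaped model (hidden size 4096, 32 layers) with LoRA applied to four square projection matrices per layer, rank 8, and 2-byte (fp16) parameters. The choice of four matrices, the rank, and the dtype are assumptions for illustration; substitute your own adapter configuration.

```python
def adapter_bytes(
    hidden_dim: int,
    num_layers: int,
    matrices_per_layer: int,
    rank: int,
    bytes_per_param: int = 2,
) -> int:
    """Approximate size of one LoRA adapter applied to square d x d matrices."""
    # Each adapted matrix gets A (d x r) and B (r x d): 2 * d * r parameters.
    params = num_layers * matrices_per_layer * 2 * hidden_dim * rank
    return params * bytes_per_param


per_adapter = adapter_bytes(hidden_dim=4096, num_layers=32, matrices_per_layer=4, rank=8)
catalog_gib = 2000 * per_adapter / 2**30

print(f"one adapter:   {per_adapter / 2**20:.0f} MiB")
print(f"2000 adapters: {catalog_gib:.2f} GiB of host RAM")
```

Under these assumptions one adapter is 16 MiB and a 2,000-adapter catalog needs about 31 GiB of host RAM; higher ranks or more adapted matrices scale the figure linearly.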

Failure Modes

High adapter churn and poor prefetch predictions can cause I/O stalls and increased latency.

Insufficient host memory leads to inability to host many adapters or higher paging overhead.

Core Entities

Models

Llama-7B, Llama-13B, Llama-30B, Llama-70B

Metrics

throughput, average request latency, first token latency, SLO attainment, user satisfaction

Context Entities

Models

Megatron-LM partitioning (used as baseline TP design)

Metrics

throughput, SLO attainment