PETALS: run and fine-tune 50B+ LLMs by pooling unreliable consumer GPUs over the Internet

December 13, 2023 · 7 min

Overview

Production Readiness

0.75

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

13

Authors

Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, Colin Raffel

Why It Matters For Business

PETALS lets teams pool idle consumer GPUs to run 50B+ models interactively. This reduces the need for expensive multi‑GPU servers and gives lower inference latency than RAM offloading. The tradeoff is privacy and trust: inputs pass through machines that other people operate.

Summary TLDR

This paper introduces PETALS, a decentralized system and algorithms that let you run and fine-tune very large language models (50B+ params) by pooling consumer GPUs over the Internet. Key ideas: a fault‑tolerant pipeline-parallel inference algorithm (dual client/server caches), a decentralized load balancer that assigns contiguous transformer blocks to servers, and support for parameter‑efficient fine-tuning. In experiments PETALS runs Llama 2 (70B) and BLOOM (176B) across geo-distributed machines and reports ≥10× speedups versus single-GPU RAM offloading for interactive generation, while using quantization to cut memory and bandwidth needs.

Problem Statement

LLMs with 50B+ parameters need expensive multi‑GPU servers, and offloading parameters to RAM or SSD is too slow for interactive use. The paper tackles inference and parameter‑efficient fine-tuning on many unreliable, heterogeneous, geo-distributed consumer GPUs while handling node disconnections and uneven hardware.

Main Contribution

A fault‑tolerant pipeline‑parallel inference algorithm using dual caches (server-side and client-side) that recovers from server disconnects without restarting generation.

A fully decentralized load‑balancing protocol that assigns contiguous transformer blocks to servers to maximize tokens/sec under churn.

PETALS system implementing these algorithms, demonstrating practical runs of Llama 2 (70B) and BLOOM (176B) over the Internet.

Design and evaluation of parameter‑efficient fine‑tuning (clients store adapters/soft prompts) and use of quantization to reduce memory/bandwidth.
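The dual-cache idea can be illustrated with a toy simulation. This is a minimal sketch, not the PETALS implementation: `Server`, `Client`, and the arithmetic inside `step` are hypothetical stand-ins, and the server-side `history` list plays the role of attention keys/values. The point is that the client remembers every input it sent to each pipeline stage, so a replacement server can be brought up to date by replay instead of restarting generation.

```python
class Server:
    """Toy pipeline stage whose output depends on everything it has seen."""

    def __init__(self, weight):
        self.weight = weight
        self.history = []  # server-side cache (stand-in for attention keys/values)
        self.alive = True

    def step(self, x):
        if not self.alive:
            raise ConnectionError("server went offline")
        self.history.append(x)
        return self.weight * sum(self.history)


class Client:
    """Drives a chain of stages and keeps a client-side cache per stage."""

    def __init__(self, weights):
        self.weights = list(weights)
        self.servers = [Server(w) for w in self.weights]
        self.sent = [[] for _ in self.weights]  # inputs already sent to each stage

    def step(self, x):
        for i in range(len(self.servers)):
            try:
                y = self.servers[i].step(x)
            except ConnectionError:
                # Assumption: a replacement holding the same blocks is available.
                replacement = Server(self.weights[i])
                for past in self.sent[i]:
                    replacement.step(past)  # replay rebuilds the server-side cache
                self.servers[i] = replacement
                y = replacement.step(x)
            self.sent[i].append(x)
            x = y
        return x


reference, faulty = Client([2, 3]), Client([2, 3])
for t, token in enumerate([1, 2, 3, 4, 5]):
    if t == 3:
        faulty.servers[1].alive = False  # simulate a disconnect mid-generation
    assert reference.step(token) == faulty.step(token)
```

In the sketch, only the failed stage is replayed; the other stages keep their caches, which is why recovery is much cheaper than restarting the whole sequence.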

Key Findings

PETALS' distributed approach is substantially faster than single‑GPU offloading for interactive generation.

Numbers: ≥10× faster for autoregressive generation (paper claim)

Algorithm 1 retains throughput under server failures, while naive caching with restarts slows down sharply.

Numbers: BLOOM-7.1B, 128 tokens, failure rate 0.01: Algorithm 1 reaches 3.38 steps/s vs. 0.18 steps/s for caching with restarts

Greedy decentralized load balancing is near-optimal in practice.

Numbers: finds 90–100% of optimal throughput in simulations

Weight quantization has minimal quality cost while reducing memory.

Numbers: BLOOM zero‑shot average accuracy: 70.1 (16-bit) vs. 70.3 (8-bit)
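To make the memory side of the quantization finding concrete, here is a back-of-the-envelope sketch of weight storage at different precisions. The helper `weight_memory_gb` is hypothetical, and the estimate assumes weights only: it ignores quantization metadata (scales), activations, and attention caches, so real footprints are somewhat higher.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), ignoring all overhead."""
    return num_params * bits_per_weight / 8 / 1e9


MODELS = [("Llama 2 (70B)", 70e9), ("BLOOM (176B)", 176e9)]

for name, num_params in MODELS:
    sizes = {bits: weight_memory_gb(num_params, bits) for bits in (16, 8, 4)}
    print(name, {f"{bits}-bit": f"{gb:.0f} GB" for bits, gb in sizes.items()})
```

Under these assumptions BLOOM (176B) needs roughly 352 GB at 16-bit, 176 GB at 8-bit, and 88 GB at 4-bit, which is why halving the precision roughly halves the number of GPUs a swarm needs to hold one copy of the model.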

Results

Sequential inference (steps/s)

Value: Llama 2 (70B) on 3×T4: 2.29 steps/s (128 tokens, 1 Gbit/s, <5 ms RTT)

Baseline: Offloading: 0.139 steps/s

Sequential inference (steps/s)

Value: BLOOM (176B) on 3×A100: 1.71 steps/s (128 tokens, 1 Gbit/s, <5 ms RTT)

Baseline: Offloading (theoretical): 0.0495 steps/s

Parallel forward throughput (tokens/s)

Value: BLOOM (176B) on 3×A100: 70 tokens/s (batch 1×128)

Baseline: Offloading: 2.5 tokens/s (example row)
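The speedups implied by these rows follow from dividing each PETALS value by its offloading baseline. The short sketch below does that arithmetic using only the figures listed above; the `RESULTS` list and `speedup` helper are illustrative names.

```python
# (setting, PETALS value, offloading baseline), copied from the rows above
RESULTS = [
    ("Llama 2 (70B), sequential steps/s", 2.29, 0.139),
    ("BLOOM (176B), sequential steps/s", 1.71, 0.0495),
    ("BLOOM (176B), parallel forward tokens/s", 70.0, 2.5),
]


def speedup(value, baseline):
    """Ratio of distributed throughput to the offloading baseline."""
    return value / baseline


for setting, value, baseline in RESULTS:
    print(f"{setting}: {speedup(value, baseline):.1f}x")
```

The ratios come out at roughly 16.5×, 34.5×, and 28×, all consistent with the ≥10× claim. Note that the BLOOM sequential baseline is a theoretical figure, so that ratio is an estimate rather than a measured comparison.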

Who Should Care

What To Try In 7 Days

Install PETALS and run a small public model on a home lab to see pipeline behavior.

Benchmark a 7B/70B model over your network vs RAM offloading to quantify speedup.

Enable 8-bit or 4-bit quantization and compare quality on a few downstream tasks (zero-shot checks).
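For the benchmarking step, a small timing harness is enough to compare backends in the same steps/s unit the paper reports. This is a sketch under assumptions: `steps_per_second` is a hypothetical helper, and `step_fn` stands for whatever callable generates one token in your setup (a PETALS client call, an offloading baseline, and so on). No PETALS API is used here.

```python
import time


def steps_per_second(step_fn, num_steps=128, clock=time.perf_counter):
    """Run step_fn num_steps times and return the average steps per second."""
    start = clock()
    for _ in range(num_steps):
        step_fn()
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive; use more steps")
    return num_steps / elapsed


if __name__ == "__main__":
    # Placeholder workload; replace with one generation step of each backend.
    def fake_step():
        time.sleep(0.001)

    print(f"{steps_per_second(fake_step, num_steps=50):.1f} steps/s")
```

Measuring both backends with the same sequence length and the same number of steps keeps the comparison fair, since per-step cost grows with the amount of cached context.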

Optimization Features

Token Efficiency

  • client-side caching reduces per-step data to kilobytes
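The "kilobytes per step" figure follows from the fact that, during generation, only one token's hidden state crosses the network between pipeline stages. A rough sketch of that arithmetic is below. The hidden sizes are taken from the public model configurations rather than from this summary, and `activation_bytes` is a hypothetical helper that ignores protocol and serialization overhead.

```python
# Hidden sizes from the public model configs (assumption: not stated in this summary)
HIDDEN_SIZE = {"Llama 2 (70B)": 8192, "BLOOM (176B)": 14336}


def activation_bytes(hidden_size, bits_per_value=16, tokens=1):
    """Bytes needed to send the hidden states of `tokens` tokens to the next stage."""
    return hidden_size * bits_per_value // 8 * tokens


for name, hidden in HIDDEN_SIZE.items():
    kib = activation_bytes(hidden) / 1024
    print(f"{name}: {kib:.0f} KiB per generated token at 16-bit")
```

At 16-bit precision this gives about 16 KiB per step for Llama 2 (70B) and 28 KiB for BLOOM (176B) per hop, and 8-bit activation quantization would halve those figures.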

Infra Optimization

  • use of volunteer/spot GPUs to lower hardware costs

Model Optimization

  • 8-bit matrix decomposition
  • 4-bit NormalFloat

System Optimization

  • decentralized greedy load balancing
  • shortest-path routing (D* Lite) for chain selection
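The greedy load-balancing idea can be sketched as follows. This is a simplified illustration and not the paper's exact rule: swarm throughput is modeled as the minimum per-block throughput (the pipeline bottleneck), and a joining server picks the contiguous span that raises that minimum the most, breaking ties toward the least-served span. The function names and the `(start, end, throughput)` tuple layout are assumptions.

```python
def block_throughputs(num_blocks, servers):
    """Sum the throughput of all servers covering each block.

    servers: list of (start, end_exclusive, throughput) tuples.
    """
    totals = [0.0] * num_blocks
    for start, end, throughput in servers:
        for block in range(start, end):
            totals[block] += throughput
    return totals


def choose_span(num_blocks, servers, capacity, throughput):
    """Pick a contiguous span of `capacity` blocks for a joining server."""
    current = block_throughputs(num_blocks, servers)
    best_start, best_key = 0, None
    for start in range(num_blocks - capacity + 1):
        after = list(current)
        for block in range(start, start + capacity):
            after[block] += throughput
        # Primary goal: raise the bottleneck. Tie-break: help the weakest span.
        key = (min(after), -sum(current[start:start + capacity]))
        if best_key is None or key > best_key:
            best_start, best_key = start, key
    return best_start, best_start + capacity


swarm = [(0, 4, 10.0), (4, 8, 10.0), (0, 4, 5.0)]
print(choose_span(8, swarm, capacity=4, throughput=5.0))
```

In this example blocks 4 to 7 are served by a single server, so the joining server takes that span and lifts the bottleneck from 10 to 15. Each server can make this choice locally from the announced throughputs, which is what keeps the scheme decentralized.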

Training Optimization

  • parameter-efficient fine-tuning (adapters, soft prompts)
  • gradient checkpointing

Inference Optimization

  • pipeline parallelism across servers
  • dynamic blockwise quantization of activations
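Blockwise quantization of activations can be sketched in a few lines. This is a simplified stand-in: it uses a uniform int8 grid with one absmax scale per block, whereas the dynamic blockwise scheme referenced by the paper uses a non-uniform code. The function names and the small block size are illustrative assumptions.

```python
def quantize_blockwise(values, block_size=4):
    """Quantize floats to int8 codes with one absmax scale per block."""
    blocks = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        # Fall back to 1.0 so an all-zero block does not divide by zero.
        scale = max(abs(v) for v in block) / 127 or 1.0
        codes = [round(v / scale) for v in block]
        blocks.append((scale, codes))
    return blocks


def dequantize_blockwise(blocks):
    """Reconstruct approximate floats from (scale, codes) pairs."""
    return [code * scale for scale, codes in blocks for code in codes]


activations = [0.1, -0.2, 0.05, 0.4, 100.0, -50.0, 25.0, 0.0]
restored = dequantize_blockwise(quantize_blockwise(activations))
for original, approx in zip(activations, restored):
    print(f"{original:>8.3f} -> {approx:>8.3f}")
```

The per-block scale is the design choice that matters: with a single scale for the whole vector, the outlier 100.0 would force 0.1, -0.2, and 0.05 to round to zero, while separate scales keep the small values in the first block accurate.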

Reproducibility

Code Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Servers holding early model blocks can see client inputs, so privacy is a concern.
  • Malicious or faulty servers may return incorrect outputs; validators are proposed but not fully deployed.
  • System depends on enough volunteers; lack of supply hurts throughput.
  • Real-world heterogeneous setups are slower than ideal NVLink clusters; performance varies with network latency.

When Not To Use

  • Handling highly sensitive data without trusted or private peers.
  • When you need the absolute lowest single-node latency (NVLink/local multi‑GPU).
  • Small models that already fit on one GPU, where a distributed setup adds overhead without benefit.

Failure Modes

  • Servers returning incorrect results (malicious or broken)
  • High network latency or low bandwidth reducing interactive performance
  • Insufficient number of serving peers causes bottlenecks
  • Quantization edge cases slightly affecting model outputs

Core Entities

Models

  • Llama 2 (70B)
  • BLOOM (176B)
  • BLOOM (7.1B)

Metrics

  • steps/s
  • tokens/s
  • failure rate

Context Entities

Models

  • OPT-175B