Model‑agnostic hybrid sharding to run large models across heterogeneous, privacy-preserving nodes

July 29, 2024 · 8 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Claudio Angione, Yue Zhao, Harry Yang, Ahmad Farhan, Fielding Johnston, James Buban, Patrick Colangelo

Why It Matters For Business

BSNS lowers hardware and bandwidth barriers so companies can run large models across existing, heterogeneous machines while keeping user data private and model execution auditable.

Summary TLDR

This paper presents BSNS, a system that shards neural networks into contiguous blocks and runs them across diverse nodes (including consumer GPUs) using blockchain-aware routing and topology signals. Key ingredients: persistent homology to pick precomputed sharding schemes, a genetic optimizer for routing hyperparameters, KV caching to avoid recomputation, dynamic blockwise quantization and mixed matrix decomposition to shrink transfers and memory, and a stacked security layer (TEEs, CDV, ZKML, Split Learning, and a proposed Sequential Vector Encryption). Benchmarks show 16→8 bit compression gives negligible drops on several language tasks, and token throughput increases with batching in a 6‑node swarm.

Problem Statement

Large models are expensive, and centralized inference raises privacy, cost, and single‑point‑of‑failure concerns. Running state‑of‑the‑art models on many low‑power or geographically distributed machines requires automated sharding, low‑bandwidth transfers, and verifiable privacy, all without degrading model quality.

Main Contribution

BSNS: blockchain‑aware sequential sharding that maps contiguous model blocks to node chains using network topology and heuristics.

Topology‑aware routing: persistent homology features + DHT and a BRKGA (biased random‑key genetic algorithm) to pick near‑optimal precomputed shardings.

Efficiency stack: KV cache for transformer tokens, dynamic blockwise quantization, and mixed matrix decomposition (8‑bit with selective 16‑bit retention).

PEFT support and in‑network fine‑tuning: LoRA/adapters per shard with synchronized updates and ZeRO optimizer for memory splitting.

Security/privacy stack: TEEs + Consensus Distribution Verification (CDV), Split Learning (SL), Zero‑Knowledge ML for small private models, and Sequential Vector Encryption (SVE).

Key Findings

Switching model communication and weights from 16‑bit to 8‑bit caused a negligible accuracy drop on the evaluated NLP benchmarks.

Numbers: HellaSwag, Llama‑8B: 0.76 → 0.76; Mixtral 8x7B: 0.78 → 0.77 (Table 1)

Batching and network quality materially increase token throughput in a 6‑node swarm.

Numbers: Llama‑3 8B tokens/sec increased from 8 to 56 under 1 Gbit/s and <5 ms RTT (Table 2)

Mixed matrix decomposition reduced GPU storage needs by roughly 30% for a large Mixtral variant in the authors' tests.

Numbers: ≈30% GPU requirement reduction (Section 4)

The system pairs hardware TEEs with algorithmic verification for combined privacy and model verification.

Results

Accuracy

Value: Llama‑8B: 0.76 (16‑bit) → 0.76 (8‑bit)

Baseline: 16‑bit precision

Accuracy

Value: Mixtral 8x7B: 0.78 (16‑bit) → 0.77 (8‑bit)

Baseline: 16‑bit precision

Tokens per second (throughput)

Value: Llama‑3 8B under 1 Gbit/s & <5 ms RTT: 8 → 28 → 56 tokens/sec

Baseline: small batch / poor network

Memory / GPU requirement reduction

Value: ≈30% reduction for Mixtral 8x22B with mixed matrix decomposition

Baseline: full precision storage

Who Should Care

What To Try In 7 Days

Quantize a copy of a production model to 8‑bit and re‑run your key task evaluations to measure the accuracy impact, as in Table 1.

Prototype a 2–4 node sharded pipeline for a small transformer (split by blocks) and measure tokens/sec vs single‑node baseline.

Collect simple network‑topology features (latency, bandwidth, uptime) and try a basic genetic optimizer to pick node order for shards.
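The node-ordering experiment in the last item can be sketched with a small biased random-key genetic algorithm. This is a minimal illustration, not the authors' implementation: the latency matrix, population settings, and the "sum of link latencies" cost are all assumptions chosen for the example.

```python
import random

def decode(keys):
    """Random-key decoding: sorting the keys yields a node permutation."""
    return sorted(range(len(keys)), key=lambda i: keys[i])

def chain_latency(order, latency):
    """Assumed cost: sum of link latencies between consecutive nodes."""
    return sum(latency[a][b] for a, b in zip(order, order[1:]))

def brkga(latency, pop_size=30, elite_frac=0.2, mutant_frac=0.2,
          bias=0.7, generations=60, seed=0):
    rng = random.Random(seed)
    n = len(latency)
    n_elite = max(1, int(pop_size * elite_frac))
    n_mutant = max(1, int(pop_size * mutant_frac))

    def fitness(keys):
        return chain_latency(decode(keys), latency)

    pop = [[rng.random() for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        elite, rest = pop[:n_elite], pop[n_elite:]
        # Mutants are fresh random individuals that keep diversity up.
        mutants = [[rng.random() for _ in range(n)] for _ in range(n_mutant)]
        children = []
        while len(elite) + len(mutants) + len(children) < pop_size:
            e, o = rng.choice(elite), rng.choice(rest)
            # Biased uniform crossover: each key is taken from the
            # elite parent with probability `bias`.
            children.append([e[i] if rng.random() < bias else o[i]
                             for i in range(n)])
        pop = elite + mutants + children
    best = min(pop, key=fitness)
    return decode(best), fitness(best)

# Hypothetical symmetric latency matrix (ms) for 5 nodes.
LATENCY = [
    [0, 2, 9, 10, 7],
    [2, 0, 3, 8, 9],
    [9, 3, 0, 2, 8],
    [10, 8, 2, 0, 3],
    [7, 9, 8, 3, 0],
]
order, cost = brkga(LATENCY)
print(order, cost)
```

In a real experiment the cost function would likely also account for bandwidth and uptime; the random-key encoding stays the same because any vector of keys decodes to a valid node order.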

Optimization Features

Token Efficiency

  • KV cache reduces recomputation for token generation
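As a rough illustration of why a KV cache saves work, the sketch below runs a toy single-head attention layer in pure Python. The weights and token embeddings are made-up values, and the class and function names are hypothetical; the point is only that the cached path projects each token once while the uncached path re-projects the whole prefix at every step.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))]

class CachedAttention:
    def __init__(self, Wq, Wk, Wv):
        self.Wq, self.Wk, self.Wv = Wq, Wk, Wv
        self.keys, self.values = [], []
        self.projections = 0  # number of tokens projected to K/V

    def step(self, x):
        # Only the new token is projected; earlier K/V come from the cache.
        self.keys.append(matvec(self.Wk, x))
        self.values.append(matvec(self.Wv, x))
        self.projections += 1
        return attend(matvec(self.Wq, x), self.keys, self.values)

def uncached_step(Wq, Wk, Wv, prefix):
    """Recompute K/V for the entire prefix, as a cache-free decoder would."""
    keys = [matvec(Wk, x) for x in prefix]
    values = [matvec(Wv, x) for x in prefix]
    return attend(matvec(Wq, prefix[-1]), keys, values), len(prefix)

# Hypothetical 2-d projections and token embeddings.
Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[0.5, 0.1], [0.2, 0.3]]
Wv = [[1.0, 2.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

layer = CachedAttention(Wq, Wk, Wv)
cached_out = [layer.step(x) for x in tokens]

uncached_out, uncached_projections = [], 0
for t in range(1, len(tokens) + 1):
    out, n = uncached_step(Wq, Wk, Wv, tokens[:t])
    uncached_out.append(out)
    uncached_projections += n

print(layer.projections, uncached_projections)
```

Both paths produce the same attention outputs; the difference is that the uncached projection count grows quadratically with sequence length while the cached one grows linearly.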

Infra Optimization

  • Support for consumer GPUs and CPU fallback
  • Reduced data transfer via quantization and mixed decomposition
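To make the transfer-reduction idea concrete, here is a minimal sketch of blockwise absmax quantization to 8-bit integers. The paper's exact scheme is not described in this summary, so the block size, the symmetric int8 range, and the function names are assumptions for illustration only.

```python
def quantize_blockwise(values, block_size=4):
    """Absmax int8 quantization with one scale per block.

    Scales are computed from the data at call time, so a block of
    small values is not crushed by a large value elsewhere.
    """
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block)
        scale = absmax / 127 if absmax > 0 else 1.0
        q = [max(-127, min(127, round(v / scale))) for v in block]
        blocks.append((scale, q))
    return blocks

def dequantize_blockwise(blocks):
    return [scale * q for scale, qs in blocks for q in qs]

# Hypothetical activations: one block of small values, one of large values.
values = [0.01, -0.02, 0.03, 0.015, 5.0, -3.0, 1.0, 0.5]
blocks = quantize_blockwise(values, block_size=4)
restored = dequantize_blockwise(blocks)
max_error = max(abs(a - b) for a, b in zip(values, restored))
print(max_error)
```

Each element's reconstruction error is bounded by half of its own block's scale, which is why per-block scales preserve small activations better than a single scale for the whole tensor would.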

Model Optimization

  • Mixed matrix decomposition (selective 8‑bit with critical 16‑bit retention)
  • Dynamic blockwise quantization (16→8 bit transfers)
  • LoRA
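One plausible reading of "selective 8‑bit with critical 16‑bit retention" is an outlier-aware split of a matrix product: most input features go through an int8 path, while a few large-magnitude features bypass quantization. The sketch below follows that reading; it is an assumption, not the authors' algorithm, and the threshold, names, and example values are hypothetical. Plain Python floats stand in for the retained 16‑bit values.

```python
def quantize_vec(v):
    """Symmetric absmax quantization of a vector to the int8 range."""
    absmax = max((abs(a) for a in v), default=0.0)
    scale = absmax / 127 if absmax > 0 else 1.0
    return scale, [round(a / scale) for a in v]

def mixed_matvec(W, x, threshold=6.0):
    """Compute y = W @ x with outlier input features kept unquantized."""
    outliers = {j for j, xj in enumerate(x) if abs(xj) >= threshold}
    regular = [j for j in range(len(x)) if j not in outliers]
    sx, qx = quantize_vec([x[j] for j in regular])
    y = []
    for row in W:
        sw, qw = quantize_vec([row[j] for j in regular])
        # Integer dot product over regular features, rescaled once.
        low = sw * sx * sum(a * b for a, b in zip(qw, qx))
        # Outlier columns skip quantization entirely.
        high = sum(row[j] * x[j] for j in outliers)
        y.append(low + high)
    return y, sorted(outliers)

# Hypothetical weights and an activation vector with one large outlier.
W = [[0.5, -0.25, 0.1],
     [0.2, 0.3, -0.4]]
x = [1.0, -2.0, 40.0]
y, kept = mixed_matvec(W, x)
print(y, kept)
```

Without the split, the outlier would set the quantization scale for the whole vector and the small features would lose most of their precision.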

System Optimization

  • Persistent homology for topology fingerprints
  • DHT for routing and node discovery
  • Dynamic rebalancing to avoid bottlenecks
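For intuition about topology fingerprints, the simplest persistent-homology feature (dimension 0) can be computed from pairwise latencies with a union-find pass: every node starts as its own component, and a component "dies" when a link merges it into another. The sketch below is an assumption about what such a fingerprint could look like, not the paper's feature set; the node count and latencies are hypothetical.

```python
def h0_persistence(n, edges):
    """Death times of connected components as latency grows.

    edges: iterable of (latency, u, v). Processing edges in increasing
    latency and recording each merge gives the 0-dimensional barcode
    (all components are born at latency 0).
    """
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    deaths = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            deaths.append(w)
    return deaths

# Hypothetical swarm: nodes 0-2 share a LAN, nodes 3-4 share another,
# and the two sites are joined by slow WAN links (latencies in ms).
edges = [
    (1, 0, 1), (2, 1, 2), (2, 0, 2),
    (1, 3, 4),
    (40, 2, 3), (45, 0, 4),
]
deaths = h0_persistence(5, edges)
print(deaths)  # [1, 1, 2, 40]
```

The large gap before the final death time indicates two well-separated clusters, the kind of signal that could be matched against a precomputed sharding scheme for "two-site" topologies.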

Training Optimization

  • ZeRO optimizer for state sharding
  • Adapter synchronization across nodes
  • Split learning for private training

Inference Optimization

  • KV cache to avoid repeated attention work
  • Dynamic sequential block sharding (BSNS)
  • BRKGA for routing/hyperparameter tuning
  • Precomputed sharding schemas for common topologies
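A minimal sketch of sequential block sharding: given an ordered chain of nodes, assign each node a contiguous range of layers in proportion to its capacity. The proportional rule, the function name, and the capacities are assumptions for illustration; the paper's actual assignment also uses topology signals and heuristics not modeled here.

```python
def shard_contiguous(n_layers, capacities):
    """Split layers [0, n_layers) into contiguous ranges, one per node.

    capacities: per-node capacity (for example GPU memory in GB), listed
    in chain order. Returns a list of (start, end) half-open ranges.
    """
    total = sum(capacities)
    bounds, start, cum = [], 0, 0
    for i, cap in enumerate(capacities):
        cum += cap
        if i == len(capacities) - 1:
            end = n_layers  # last node takes whatever remains
        else:
            end = round(n_layers * cum / total)
        bounds.append((start, end))
        start = end
    return bounds

# Hypothetical 32-layer transformer on four nodes with mixed GPU memory.
shards = shard_contiguous(32, [24, 8, 16, 16])
print(shards)  # [(0, 12), (12, 16), (16, 24), (24, 32)]
```

Because each range is contiguous, only one activation tensor crosses each link per forward pass, which is what makes this layout compatible with low-bandwidth connections.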

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Sharding optimality is NP‑hard; precomputation and heuristics trade optimality for speed.
  • ZKML is acknowledged as costly and used only for small private models.
  • Security claims depend on TEEs and consensus mechanisms whose overheads are not fully benchmarked.
  • Quantization and decomposition were evaluated on selected tasks; edge cases may see larger quality drops.

When Not To Use

  • When strict formal verifiability of model execution is required at large scale and ZK proofs are mandatory (ZKML is too costly at that scale).
  • On extremely low‑bandwidth or high‑loss networks where streaming intermediate tensors remains impractical.
  • For architectures that cannot be cleanly sequenced into contiguous block shards (for example, very wide multimodal heads).

Failure Modes

  • A slow or overloaded node (straggler) becomes the pipeline bottleneck and reduces throughput.
  • A malicious node can return bad outputs when TEEs or CDV are not available, undermining trust in the results.
  • Network partitions break shard chains and may force fallbacks with high latency.
  • Quantization may induce subtle accuracy drops for out‑of‑distribution cases not covered in benchmarks.
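The straggler failure mode can be quantified with a simple pipeline model. This is a back-of-the-envelope sketch under stated assumptions (fixed per-stage times, link latency folded into each stage, enough concurrent sequences to keep the pipeline full); the stage timings are hypothetical and not taken from the paper.

```python
def single_stream_tps(stage_ms):
    """Tokens/sec for one sequence: every token crosses all stages in turn."""
    return 1000.0 / sum(stage_ms)

def pipelined_tps(stage_ms):
    """Tokens/sec with a full pipeline: the slowest stage sets the pace."""
    return 1000.0 / max(stage_ms)

# Hypothetical 6-node chain, 20 ms per stage.
healthy = [20, 20, 20, 20, 20, 20]
# Same chain with one overloaded node taking 60 ms.
straggler = [20, 20, 60, 20, 20, 20]

for name, stages in (("healthy", healthy), ("straggler", straggler)):
    print(name, round(single_stream_tps(stages), 2),
          round(pipelined_tps(stages), 2))
```

In this model a single slow node cuts full-pipeline throughput by the ratio of its stage time to the next-slowest stage, which is why dynamic rebalancing away from overloaded nodes matters more than average node speed.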

Core Entities

Models

  • Mixtral
  • Llama
  • Llama‑3
  • Lexi

Metrics

  • tokens/sec
  • steps/s
  • Accuracy
  • fairness
  • quality
  • creativity
  • generation performance

Benchmarks

  • HellaSwag
  • Lambada (OpenAI)
  • Causal Judgement
  • Disambiguation QA
  • Logical Deduction