Model‑agnostic hybrid sharding to run large models across heterogeneous, privacy-preserving nodes

July 29, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper presents a workable system design and promising benchmarks, but lacks public code, detailed deployment metrics at scale, and cryptographic cost measurements; treat it as a prototype whose applied ideas are worth testing.

Citations: 0

Evidence Strength: 0.50

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 50%

Novelty: 60%

Authors

Claudio Angione, Yue Zhao, Harry Yang, Ahmad Farhan, Fielding Johnston, James Buban, Patrick Colangelo

Why It Matters For Business

BSNS lowers hardware and bandwidth barriers so companies can run large models across existing, heterogeneous machines while keeping user data private and model execution auditable.

Summary TLDR

This paper presents BSNS, a system that shards neural networks into contiguous blocks and runs them across diverse nodes (including consumer GPUs) using blockchain-aware routing and topology signals. Key ingredients: persistent homology to pick precomputed sharding schemes, a genetic optimizer for routing hyperparameters, KV caching to avoid recomputation, dynamic blockwise quantization and mixed matrix decomposition to shrink transfers and memory, and a stacked security layer (TEEs, CDV, ZKML, Split Learning, and a proposed Sequential Vector Encryption). Benchmarks show that 16→8 bit compression causes negligible drops on several language tasks, and that token throughput increases with batching in a 6‑node swarm.

Problem Statement

Large models are expensive to run, and centralized inference raises privacy, cost, and single‑point‑of‑failure concerns. Running state‑of‑the‑art models on many low‑power or geographically distributed machines requires automated sharding, low‑bandwidth transfers, and verifiable privacy, all without degrading model quality.

Main Contribution

BSNS: blockchain‑aware sequential sharding that maps contiguous model blocks to node chains using network topology and heuristics.

Topology‑aware routing: persistent homology features + DHT and a BRKGA (biased random‑key genetic algorithm) to pick near‑optimal precomputed shardings.
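The contiguous-block mapping described above can be illustrated with a minimal sketch. It assumes a hypothetical per-node `capacities` score (for example, free GPU memory) and sizes each node's block in proportion to it; the paper's actual selection uses topology signals and precomputed schemes, which this sketch does not reproduce.

```python
from typing import List, Tuple


def contiguous_shards(num_layers: int, capacities: List[float]) -> List[Tuple[int, int]]:
    """Split layers 0..num_layers-1 into contiguous [start, end) ranges,
    one per node, sized roughly in proportion to each node's capacity.

    `capacities` is a hypothetical score per node, in chain order.
    """
    n = len(capacities)
    if num_layers < n:
        raise ValueError("need at least one layer per node")
    total = sum(capacities)
    shards = []
    start = 0
    cumulative = 0.0
    for i, cap in enumerate(capacities):
        cumulative += cap
        # Cumulative rounding keeps ranges contiguous and covers every layer.
        end = num_layers if i == n - 1 else round(num_layers * cumulative / total)
        # Give this node at least one layer and leave one for each later node.
        end = max(end, start + 1)
        end = min(end, num_layers - (n - 1 - i))
        shards.append((start, end))
        start = end
    return shards


if __name__ == "__main__":
    # 32 transformer blocks over four nodes with unequal capacity.
    print(contiguous_shards(32, [24, 8, 16, 16]))
```

Each tuple is the half-open layer range a node would host; activations flow from one range to the next along the chain.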

Key Findings

Switching model communication and weights from 16‑bit to 8‑bit caused a negligible accuracy drop on the evaluated NLP benchmarks.

Numbers: HellaSwag, 16‑bit → 8‑bit. Llama‑8B: 0.76 → 0.76; Mixtral 7x8B: 0.78 → 0.77 (Table 1)

Practical Use: You can use 8‑bit weight/activation transfers in the BSNS pipeline to cut bandwidth and memory without hurting accuracy on tasks similar to those evaluated.

Evidence Ref: Table 1, Section 4
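To make the 16→8 bit idea concrete, here is a minimal sketch of blockwise absmax quantization in plain Python. The block size, the signed range of ±127, and the function names are assumptions for illustration; the paper's dynamic scheme and its mixed decomposition are not specified in this summary and may differ.

```python
from typing import List, Tuple


def quantize_blockwise(values: List[float], block_size: int = 4) -> Tuple[List[int], List[float]]:
    """Quantize floats to signed 8-bit codes with one absmax scale per block."""
    codes: List[int] = []
    scales: List[float] = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        absmax = max(abs(v) for v in block)
        # An all-zero block gets scale 1.0 to avoid division by zero.
        scale = absmax / 127.0 if absmax > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-127, min(127, round(v / scale))) for v in block)
    return codes, scales


def dequantize_blockwise(codes: List[int], scales: List[float], block_size: int = 4) -> List[float]:
    """Reconstruct approximate floats from codes and per-block scales."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


if __name__ == "__main__":
    activations = [0.5, -1.0, 0.25, 0.0, 100.0, -50.0, 10.0, 1.0]
    codes, scales = quantize_blockwise(activations)
    restored = dequantize_blockwise(codes, scales)
    for original, approx in zip(activations, restored):
        print(f"{original:8.3f} -> {approx:8.3f}")
```

Per-block scales keep a large outlier in one block from flattening the resolution of small values in another, which is the usual motivation for quantizing blockwise rather than per tensor.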

Batching and network quality materially increase token throughput in a 6‑node swarm.

Numbers: Llama‑3 8B tokens/sec increased from 8 to 56 under 1 Gbit/s and <5 ms RTT (Table 2)

Practical Use: Prefer larger request batches and good network links to raise tokens/sec across shard chains; batching is an effective throughput lever.

Evidence Ref: Table 2, Section 5.1
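One way to reason about why batching helps is a simple cost model, sketched below. It assumes each decode step through the shard chain pays a fixed cost (network hops, kernel launches, weight reads) plus a small per-sequence cost; both parameters are hypothetical and are not fitted to Table 2.

```python
def tokens_per_second(batch_size: int, fixed_step_s: float, per_seq_s: float) -> float:
    """Tokens/sec for one decode step shared by `batch_size` sequences.

    fixed_step_s: assumed cost paid once per step regardless of batch size.
    per_seq_s:    assumed extra cost for each sequence in the batch.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    step_time = fixed_step_s + batch_size * per_seq_s
    return batch_size / step_time


if __name__ == "__main__":
    # Illustrative parameters only: 200 ms fixed cost, 10 ms per sequence.
    for batch in (1, 2, 4, 8, 16):
        print(batch, round(tokens_per_second(batch, 0.2, 0.01), 2))
```

Under this model throughput grows with batch size but saturates at `1 / per_seq_s`, so the gain from batching is largest when the fixed per-step cost dominates, as it tends to on multi-hop chains.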

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | Llama‑8B: 0.76 (16‑bit) → 0.76 (8‑bit) | 16‑bit precision | ≈0.00 | HellaSwag | Table 1 shows near‑identical performance after 16→8 bit quantization | Table 1
Accuracy | Mixtral 7x8B: 0.78 (16‑bit) → 0.77 (8‑bit) | 16‑bit precision | -0.01 | HellaSwag | Table 1 compression results | Table 1

What To Try In 7 Days

Quantize a copy of a production model to 8‑bit and re-run your key tasks to measure the accuracy impact, as in Table 1.

Prototype a 2–4 node sharded pipeline for a small transformer (split by blocks) and measure tokens/sec vs single‑node baseline.

Collect simple network‑topology features (latency, bandwidth, uptime) and try a basic genetic optimizer to pick node order for shards.
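For the node-ordering experiment, a simplified random-key genetic search in the spirit of BRKGA could look like the sketch below. The latency matrix, population sizes, and the cost function (sum of link latencies along the chain) are all assumptions for illustration, not the paper's configuration.

```python
import random
from typing import List, Sequence


def decode(keys: Sequence[float]) -> List[int]:
    """Random-key decoding: sorting the keys yields a node order."""
    return sorted(range(len(keys)), key=lambda i: keys[i])


def chain_latency(order: Sequence[int], latency_ms: Sequence[Sequence[float]]) -> float:
    """Total latency of the hops between consecutive nodes in the chain."""
    return sum(latency_ms[a][b] for a, b in zip(order, order[1:]))


def brkga_order(latency_ms, pop_size=30, elite=6, mutants=6, rho=0.7,
                generations=60, seed=0) -> List[int]:
    """Search for a low-latency node order with a biased random-key GA."""
    rng = random.Random(seed)
    n = len(latency_ms)

    def cost(keys):
        return chain_latency(decode(keys), latency_ms)

    pop = [[rng.random() for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        elites, rest = pop[:elite], pop[elite:]
        children = []
        for _ in range(pop_size - elite - mutants):
            e, o = rng.choice(elites), rng.choice(rest)
            # Biased crossover: each gene comes from the elite parent with prob. rho.
            children.append([e[g] if rng.random() < rho else o[g] for g in range(n)])
        fresh = [[rng.random() for _ in range(n)] for _ in range(mutants)]
        pop = elites + children + fresh  # elites survive unchanged
    return decode(min(pop, key=cost))


if __name__ == "__main__":
    # Hypothetical nodes placed on a line; latency is the distance between them.
    positions = [0, 30, 10, 20]
    latency = [[abs(a - b) for b in positions] for a in positions]
    order = brkga_order(latency)
    print(order, chain_latency(order, latency))
```

Encoding an order as a vector of keys means crossover and mutation always produce valid permutations, which is the main appeal of the random-key representation for ordering problems.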

Optimization Features

Token Efficiency
KV cache reduces recomputation for token generation
Infra Optimization
Support for consumer GPUs and CPU fallback; reduced data transfer via quantization and mixed decomposition
Model Optimization
Mixed matrix decomposition (selective 8‑bit with critical 16‑bit retention); dynamic blockwise quantization (16→8 bit transfers); LoRA
System Optimization
Persistent homology for topology fingerprints; DHT for routing and node discovery; dynamic rebalancing to avoid bottlenecks
Training Optimization
ZeRO optimizer for state sharding; adapter synchronization across nodes; split learning for private training
Inference Optimization
KV cache to avoid repeated attention work; dynamic sequential block sharding (BSNS); BRKGA for routing/hyperparameter tuning; precomputed sharding schemas for common topologies
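The KV cache item above can be illustrated with a toy single-head attention in plain Python. This is a minimal sketch under assumed names (`CachedAttention`, `uncached_step`) and tiny hand-written weight matrices; it only shows that caching keys and values gives the same output while projecting each token once.

```python
import math
from typing import List

Vector = List[float]


def project(x: Vector, w: List[Vector]) -> Vector:
    """Matrix-vector product w @ x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def attend(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Softmax attention of one query over all stored keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    z = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / z
            for d in range(len(values[0]))]


class CachedAttention:
    """Keeps past keys/values so each step projects only the new token."""

    def __init__(self, wq, wk, wv):
        self.wq, self.wk, self.wv = wq, wk, wv
        self.keys: List[Vector] = []
        self.values: List[Vector] = []
        self.kv_projections = 0  # counts key/value projection pairs computed

    def step(self, x: Vector) -> Vector:
        self.keys.append(project(x, self.wk))
        self.values.append(project(x, self.wv))
        self.kv_projections += 1
        return attend(project(x, self.wq), self.keys, self.values)


def uncached_step(tokens: List[Vector], wq, wk, wv) -> Vector:
    """Recomputes keys and values for the whole prefix at every step."""
    keys = [project(t, wk) for t in tokens]
    values = [project(t, wv) for t in tokens]
    return attend(project(tokens[-1], wq), keys, values)
```

For a sequence of n tokens the cached path performs n key/value projections in total, while the uncached path performs 1 + 2 + ... + n, which is the recomputation the cache avoids.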

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Finding an optimal sharding is NP‑hard; precomputation and heuristics trade optimality for speed.

ZKML is acknowledged as costly and used only for small private models.

When Not To Use

When strict formal verifiability of model execution is required at large scale and ZK proofs are mandatory; ZKML is too costly for that setting.

On extremely low bandwidth or high‑loss networks where streaming intermediate tensors remains impractical.
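A rough back-of-envelope calculation helps judge when a link is too slow for streaming intermediate tensors. The sketch below assumes a hypothetical hidden size of 4096 values per token and ignores protocol overhead and packet loss; the function name and parameters are illustrative.

```python
def hop_seconds(num_values: int, bits_per_value: int, bandwidth_bps: float,
                one_way_latency_s: float = 0.0) -> float:
    """Time to send one activation tensor across a single hop.

    Counts propagation latency plus serialization time only.
    """
    return one_way_latency_s + num_values * bits_per_value / bandwidth_bps


if __name__ == "__main__":
    hidden = 4096  # assumed activation size per token
    for label, bps in (("1 Gbit/s", 1e9), ("10 Mbit/s", 1e7), ("1 Mbit/s", 1e6)):
        t16 = hop_seconds(hidden, 16, bps)
        t8 = hop_seconds(hidden, 8, bps)
        print(f"{label}: 16-bit {t16 * 1e3:.3f} ms, 8-bit {t8 * 1e3:.3f} ms per token per hop")
```

Halving the bit width halves the serialization term but leaves latency untouched, so on high-latency or lossy links quantization alone may not make a multi-hop chain practical.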

Failure Modes

A slow or overloaded node (straggler) becomes the pipeline bottleneck and reduces throughput.

A malicious node can return bad outputs when TEEs or CDV are not available, which undermines trust in the results.
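The straggler failure mode follows from how a sequential pipeline behaves in steady state: the slowest stage sets the pace. The sketch below assumes hypothetical per-stage times (compute plus outgoing transfer) and enough in-flight requests to keep every stage busy.

```python
from typing import List, Tuple


def pipeline_throughput(stage_seconds: List[float]) -> float:
    """Steady-state items/sec of a sequential pipeline: 1 / slowest stage."""
    return 1.0 / max(stage_seconds)


def find_straggler(stage_seconds: List[float]) -> Tuple[int, float]:
    """Return (index, time) of the stage that limits throughput."""
    index = max(range(len(stage_seconds)), key=lambda i: stage_seconds[i])
    return index, stage_seconds[index]


if __name__ == "__main__":
    healthy = [0.02, 0.02, 0.02, 0.02]
    degraded = [0.02, 0.02, 0.10, 0.02]  # node 2 is overloaded
    print(pipeline_throughput(healthy), pipeline_throughput(degraded))
    print(find_straggler(degraded))
```

Measuring per-stage times like this is a cheap way to decide which node to rebalance or replace before tuning anything else.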

Core Entities

Models

Mixtral, Llama, Llama‑3, Lexi

Metrics

tokens/sec, steps/s, accuracy, fairness, quality, creativity, generation performance

Benchmarks

HellaSwag, Lambada (OpenAI), Causal Judgement, Disambiguation QA, Logical Deduction