Model‑agnostic hybrid sharding to run large models across heterogeneous, privacy-preserving nodes

Overview

Decision Snapshot: Needs Validation

The paper presents a workable system design and promising benchmarks, but it lacks public code, detailed deployment metrics at scale, and cryptographic cost measurements. Treat it as a prototype with applicable ideas to test.

Citations: 0

Evidence Strength: 0.50

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 50%

Novelty: 60%

Authors

Claudio Angione, Yue Zhao, Harry Yang, Ahmad Farhan, Fielding Johnston, James Buban, Patrick Colangelo

Why It Matters For Business

BSNS lowers hardware and bandwidth barriers so companies can run large models across existing, heterogeneous machines while keeping user data private and model execution auditable.

Who Should Care

CTO, ML Engineer, Product Manager, Founder

Summary TLDR

This paper presents BSNS, a system that shards neural networks into contiguous blocks and runs them across diverse nodes (including consumer GPUs) using blockchain‑aware routing and topology signals. Key ingredients: persistent homology to pick precomputed sharding schemes, a genetic optimizer for routing hyperparameters, KV caching to avoid recomputation, dynamic blockwise quantization and mixed matrix decomposition to shrink transfers and memory, and a stacked security layer (TEEs, CDV, ZKML, Split Learning, and a proposed Sequential Vector Encryption). Benchmarks show that 16→8 bit compression causes negligible accuracy drops on several language tasks, and that token throughput increases with batching in a 6‑node swarm.

Problem Statement

Large models are expensive to run, and centralized inference raises privacy, cost, and single‑point‑of‑failure concerns. Running state‑of‑the‑art models on many low‑power or geographically distributed machines requires automated sharding, low‑bandwidth transfers, and verifiable privacy, all without degrading model quality.

Main Contribution

BSNS: blockchain‑aware sequential sharding that maps contiguous model blocks to node chains using network topology and heuristics.

Topology‑aware routing: persistent homology features + DHT and a BRKGA (biased random‑key genetic algorithm) to pick near‑optimal precomputed shardings.
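To make the idea of mapping contiguous model blocks to a node chain concrete, here is a minimal sketch, assuming the node order is already fixed and each node has a known relative speed. It uses a standard dynamic program that minimizes the slowest stage; it is not the paper's routing method, and the names `partition_layers`, `layer_costs`, and `node_speeds` are hypothetical.

```python
from itertools import accumulate


def partition_layers(layer_costs, node_speeds):
    """Split layers into contiguous blocks, one per node in chain order.

    Minimizes the slowest stage time (block cost / node speed).
    Returns (bottleneck_time, [(start, end), ...]) with end exclusive.
    """
    n, k = len(layer_costs), len(node_speeds)
    if k > n:
        raise ValueError("more nodes than layers")
    prefix = [0.0] + list(accumulate(layer_costs))
    inf = float("inf")
    # best[j][i]: smallest bottleneck when the first j nodes hold the first i layers
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(j, n + 1):
            for s in range(j - 1, i):  # node j takes layers s .. i-1
                stage = (prefix[i] - prefix[s]) / node_speeds[j - 1]
                cand = max(best[j - 1][s], stage)
                if cand < best[j][i]:
                    best[j][i], cut[j][i] = cand, s
    # Walk the cut table backwards to recover the block boundaries.
    bounds, i = [], n
    for j in range(k, 0, -1):
        s = cut[j][i]
        bounds.append((s, i))
        i = s
    bounds.reverse()
    return best[k][n], bounds


# Eight equal layers, a slow node followed by a node three times faster.
print(partition_layers([1] * 8, [1, 3]))
```

With a fixed order the split is cheap to compute; choosing which nodes to use and in what order is the part that calls for heuristics.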

Key Findings

Switching model communication and weights from 16‑bit to 8‑bit caused a negligible accuracy drop on the evaluated NLP benchmarks.

Numbers: HellaSwag, Llama‑8B: 0.76 → 0.76; Mixtral 7x8B: 0.78 → 0.77 (Table 1)

Practical Use: You can use 8‑bit weight/activation transfers in the BSNS pipeline to cut bandwidth and memory without hurting accuracy on tasks similar to those evaluated.

Evidence Ref: Table 1, Section 4
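The 16→8 bit finding above relies on blockwise quantization. As a rough illustration, the sketch below uses plain linear absmax quantization with one scale per block, which is a simpler stand-in for the paper's dynamic blockwise scheme; `quantize_blockwise`, `dequantize_blockwise`, and the block size of 4 are assumptions for the example.

```python
def quantize_blockwise(values, block_size=4):
    """Quantize floats to int8-range codes with one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        # An all-zero block would give scale 0; fall back to 1.0 instead.
        scale = max(abs(v) for v in block) / 127.0 or 1.0
        scales.append(scale)
        codes.extend(round(v / scale) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Map codes back to floats using the scale of the block they came from."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


activations = [0.5, -1.0, 0.25, 2.0, 100.0, -50.0, 0.0, 25.0]
codes, scales = quantize_blockwise(activations)
restored = dequantize_blockwise(codes, scales)
worst = max(abs(a - b) for a, b in zip(activations, restored))
print(codes, worst)
```

Because each block carries its own scale, the large values in the second block do not reduce the resolution available to the small values in the first block.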

Batching and network quality materially increase token throughput in a 6‑node swarm.

Numbers: Llama‑3 8B throughput increased from 8 to 56 tokens/sec under 1 Gbit/s and <5 ms RTT (Table 2)

Practical Use: Prefer larger request batches and good network links to raise tokens/sec across shard chains; batching is an effective throughput lever.

Evidence Ref: Table 2, Section 5.1
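One way to see why batching helps in a shard chain is a toy cost model in which every decoding step pays a fixed price (network hops plus per-step overhead) that is shared by all sequences in the batch. The model and every parameter value below are illustrative assumptions; they are not fitted to Table 2.

```python
def tokens_per_second(batch_size, hops, rtt_s, fixed_compute_s, per_seq_compute_s):
    """Toy model: one decoding step emits one token per sequence in the batch.

    step time = network hops + fixed per-step overhead + per-sequence compute
    """
    step_time = hops * rtt_s + fixed_compute_s + batch_size * per_seq_compute_s
    return batch_size / step_time


# Hypothetical 6-node chain (5 hops), 5 ms per hop, made-up compute costs.
for batch in (1, 2, 4, 8, 16):
    rate = tokens_per_second(batch, hops=5, rtt_s=0.005,
                             fixed_compute_s=0.1, per_seq_compute_s=0.002)
    print(batch, round(rate, 1))
```

In this model throughput grows with batch size and levels off near `1 / per_seq_compute_s`, once the fixed per-step cost is fully amortized.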

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	Llama‑8B: 0.76 (16‑bit) → 0.76 (8‑bit)	16‑bit precision	≈0.00	HellaSwag	Table 1 shows near‑identical performance after 16→8 bit quantization	Table 1
Accuracy	Mixtral 7x8B: 0.78 (16‑bit) → 0.77 (8‑bit)	16‑bit precision	-0.01	HellaSwag	Table 1 compression results	Table 1

What To Try In 7 Days

Quantize a copy of a production model to 8‑bit and validate key tasks to measure accuracy impact similar to Table 1.

Prototype a 2–4 node sharded pipeline for a small transformer (split by blocks) and measure tokens/sec vs single‑node baseline.

Collect simple network‑topology features (latency, bandwidth, uptime) and try a basic genetic optimizer to pick node order for shards.
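The last suggestion above can be prototyped with a small random-key genetic algorithm in the spirit of BRKGA. This is a minimal sketch, assuming pairwise latency is the only topology feature and total chain latency is the cost; the function names, population settings, and cost function are illustrative and not taken from the paper.

```python
import random


def chain_latency(order, latency):
    """Sum of link latencies along the chain order[0] -> order[1] -> ..."""
    return sum(latency[a][b] for a, b in zip(order, order[1:]))


def decode(keys):
    """Random-key decoder: sorting node indices by key yields a permutation."""
    return sorted(range(len(keys)), key=keys.__getitem__)


def brkga_order(latency, pop_size=40, elite_frac=0.2, mutant_frac=0.2,
                rho=0.7, generations=60, seed=0):
    rng = random.Random(seed)
    n = len(latency)
    n_elite = max(1, int(pop_size * elite_frac))
    n_mutant = max(1, int(pop_size * mutant_frac))

    def fitness(keys):
        return chain_latency(decode(keys), latency)

    def new_keys():
        return [rng.random() for _ in range(n)]

    pop = [new_keys() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        elite, rest = pop[:n_elite], pop[n_elite:]
        children = []
        while len(children) < pop_size - n_elite - n_mutant:
            e, o = rng.choice(elite), rng.choice(rest)
            # Biased uniform crossover: take the elite parent's gene with probability rho.
            children.append([eg if rng.random() < rho else og for eg, og in zip(e, o)])
        # Elites survive unchanged; mutants are fresh random chromosomes.
        pop = elite + children + [new_keys() for _ in range(n_mutant)]
    order = decode(min(pop, key=fitness))
    return order, chain_latency(order, latency)


# Five nodes on a line: latency between nodes i and j is |i - j|.
latency = [[abs(i - j) for j in range(5)] for i in range(5)]
print(brkga_order(latency))
```

The random-key encoding keeps crossover simple, since any vector of keys decodes to a valid node order.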

Optimization Features

Token Efficiency

KV cache reduces recomputation for token generation
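A minimal single-head sketch of the KV cache idea, in pure Python: each new token is projected to a key and value once, appended to the cache, and attended over, instead of re-projecting the whole prefix on every step. The class and function names are hypothetical, and the example is not tied to the paper's implementation.

```python
import math


def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def attend(q, keys, values):
    """Softmax attention of one query over all given keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


class CachedAttention:
    def __init__(self, wq, wk, wv):
        self.wq, self.wk, self.wv = wq, wk, wv
        self.keys, self.values = [], []

    def step(self, x):
        """Process one new token: project it once, extend the cache, attend."""
        self.keys.append(matvec(self.wk, x))
        self.values.append(matvec(self.wv, x))
        return attend(matvec(self.wq, x), self.keys, self.values)


def recompute_last(tokens, wq, wk, wv):
    """Baseline without a cache: re-project every token on every step."""
    keys = [matvec(wk, t) for t in tokens]
    values = [matvec(wv, t) for t in tokens]
    return attend(matvec(wq, tokens[-1]), keys, values)
```

Both paths produce the same output for the newest token; the cached version does one key/value projection per step rather than one per prefix token.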

Infra Optimization

Support for consumer GPUs and CPU fallback

Reduced data transfer via quantization and mixed decomposition

Model Optimization

Mixed matrix decomposition (selective 8‑bit with critical 16‑bit retention)

Dynamic blockwise quantization (16→8 bit transfers)

LoRA
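The mixed matrix decomposition item above can be illustrated with a small matrix-vector product that routes large-magnitude input features through a high-precision path and everything else through an 8-bit path. This is a sketch under assumptions: the outlier threshold, the per-row absmax scaling, and the function names are illustrative, and Python floats stand in for the 16-bit path.

```python
def absmax_quantize(vec):
    """Return int8-range codes and the scale that maps them back to floats."""
    scale = max(abs(v) for v in vec) / 127.0 or 1.0
    return [round(v / scale) for v in vec], scale


def mixed_matvec(weights, x, threshold=6.0):
    """Compute weights @ x, keeping outlier input features in high precision."""
    outlier_set = {i for i, v in enumerate(x) if abs(v) >= threshold}
    outliers = sorted(outlier_set)
    regular = [i for i in range(len(x)) if i not in outlier_set]
    qx, sx = absmax_quantize([x[i] for i in regular]) if regular else ([], 1.0)
    result = []
    for row in weights:
        # High-precision path: exact products for the few outlier features.
        acc = sum(row[i] * x[i] for i in outliers)
        if regular:
            # Low-precision path: integer dot product, rescaled once at the end.
            qw, sw = absmax_quantize([row[i] for i in regular])
            acc += sum(a * b for a, b in zip(qw, qx)) * sw * sx
        result.append(acc)
    return result


w = [[1.0, 1.0, 1.0, 0.01]]
x = [0.1, -0.2, 0.3, 40.0]  # the last feature is an outlier
print(mixed_matvec(w, x))                          # outlier kept in high precision
print(mixed_matvec(w, x, threshold=float("inf")))  # everything forced through 8-bit
```

In this example the single outlier would otherwise dominate the shared scale and wipe out the small features, which is the motivation for retaining critical values at higher precision.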

System Optimization

Persistent homology for topology fingerprints

DHT for routing and node discovery

Dynamic rebalancing to avoid bottlenecks

Training Optimization

ZeRO optimizer for state sharding

Adapter synchronization across nodes

Split learning for private training

Inference Optimization

KV cache to avoid repeated attention work

Dynamic sequential block sharding (BSNS)

BRKGA for routing/hyperparameter tuning

Precomputed sharding schemas for common topologies

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Finding an optimal sharding is NP‑hard; precomputation and heuristics trade optimality for speed.

ZKML is acknowledged as costly and used only for small private models.

When Not To Use

When strict formal verifiability of model execution is required at large scale and ZK proofs are mandatory (ZKML is too costly at that scale).

On extremely low bandwidth or high‑loss networks where streaming intermediate tensors remains impractical.

Failure Modes

A slow or overloaded node (straggler) becomes the pipeline bottleneck and reduces throughput.

A malicious node returns bad outputs when TEEs or CDV are not available, degrading trust in results.
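For the straggler failure mode above, a simple monitoring sketch: flag nodes whose recent stage time is well above the median, and estimate throughput from the slowest stage. The threshold factor and function names are assumptions, and the throughput formula only holds when the pipeline is kept full (several micro-batches in flight).

```python
from statistics import median


def find_stragglers(stage_times, factor=2.0):
    """Return node ids whose stage time exceeds factor times the median."""
    typical = median(stage_times.values())
    return sorted(node for node, t in stage_times.items() if t > factor * typical)


def pipeline_throughput(stage_times):
    """Steps per second of a full pipeline, limited by its slowest stage."""
    return 1.0 / max(stage_times.values())


# Hypothetical per-stage times in seconds for a four-node chain.
times = {"a": 0.10, "b": 0.12, "c": 0.50, "d": 0.11}
print(find_stragglers(times), pipeline_throughput(times))
```

A flagged node is a candidate for rebalancing, for example by moving some of its blocks to a neighbor or swapping it out of the chain.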

Core Entities

Models

Mixtral, Llama, Llama‑3, Lexi

Metrics

tokens/sec, steps/s, accuracy, fairness, quality, creativity, generation performance

Benchmarks

HellaSwag, Lambada (OpenAI), Causal Judgement, Disambiguation QA, Logical Deduction
