Run full LLM inference on one wafer-scale chip: 10–20× end-to-end and up to 606× GEMV speedups vs GPUs

February 6, 2025 | 7 min read

Overview

Decision Snapshot: Ready For Pilot

Demonstrated on Cerebras WSE-2 with detailed microbenchmarks and open code. Production use requires wafer hardware access and handling current per-core memory and pipeline limitations.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.87

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 65%

Authors

Congjie He, Yeqi Huang, Pei Mu, Ziming Miao, Jilong Xue, Lingxiao Ma, Fan Yang, Luo Mai


Why It Matters For Business

Wafer-scale chips can cut per-request latency and cost per token for long outputs and high-throughput serving, making them worth testing for production LLM serving and cost-sensitive, long-context workloads.

Who Should Care

Summary TLDR

LLM inference is bandwidth-bound, and GPU-focused systems miss key trade-offs on wafer-scale chips. WaferLLM introduces PLMR, a device model (Parallelism, Latency, Memory, Routing), and builds four system pieces on it: wafer-scale LLM parallelism, MeshGEMM, MeshGEMV, and a shift-based KV cache. Implemented on a Cerebras WSE-2, WaferLLM runs full LLaMA-family models on-chip and reports large speedups (GEMV up to 606× vs a single A100 GPU; end-to-end 10–20× vs optimized A100 clusters) and about 2.5× better energy efficiency for long-output workloads. Code is open-sourced.

Problem Statement

Modern LLM runtimes are tuned for shared-memory GPUs. Wafer-scale accelerators use a massive mesh of small cores with local memory and limited routing. This creates huge non-uniform memory latencies and tight per-core memory and routing limits, so GPU-optimized designs waste bandwidth and underutilize wafer chips.
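The non-uniform latency described above can be illustrated with a toy model. This is a sketch under stated assumptions, not the paper's cost model: it assumes a 2D mesh where a message travels along rows and then columns, so the hop count equals the Manhattan distance, and it uses a hypothetical unit per-hop cost `alpha`. The 8×8 mesh size is illustrative only.

```python
def hops(src, dst):
    """Hop count between two cores on a 2D mesh (Manhattan distance).

    Assumes row-then-column routing; real routing may differ.
    """
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def access_latency(src, dst, alpha=1.0):
    """Toy latency: hypothetical per-hop cost alpha times the hop count."""
    return alpha * hops(src, dst)


MESH = 8  # illustrative mesh side length, not a real device size
origin = (0, 0)
nearest = access_latency(origin, (0, 1))
farthest = access_latency(origin, (MESH - 1, MESH - 1))
print(f"nearest neighbour: {nearest}, far corner: {farthest}, "
      f"ratio: {farthest / nearest:.0f}x")
```

Even on this small mesh the far corner is an order of magnitude further away than a neighbour, and the gap grows with the mesh side length. A runtime that treats all memory as equally distant cannot account for that spread.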

Main Contribution

PLMR device model that summarizes wafer-scale constraints: Parallelism, Latency, Memory, Routing.

Wafer-scale LLM parallelism policy for prefill/decode and a shift-based KV cache to balance cores.
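To see why a shift-based KV cache balances cores, the following toy simulation compares it with a concatenate-style baseline. It is a simplified 1D sketch based on one reading of the idea, not the WaferLLM implementation: the function names, the chain-of-cores layout, and the capacity of 3 entries per core are all hypothetical.

```python
from collections import deque


def concat_append(cores, capacity, token):
    """Baseline: always append to the last core; fail once it is full."""
    if len(cores[-1]) >= capacity:
        return False
    cores[-1].append(token)
    return True


def shift_append(cores, capacity, token):
    """Toy shift policy: insert at the last core, then pass each core's
    oldest entry one hop toward core 0 while that neighbour holds fewer."""
    if all(len(c) >= capacity for c in cores):
        return False
    cores[-1].append(token)
    for i in range(len(cores) - 1, 0, -1):
        if len(cores[i]) > len(cores[i - 1]):
            cores[i - 1].append(cores[i].popleft())
        else:
            break
    return True


def fill(policy, n_cores, capacity):
    """Append tokens until the policy refuses; return count and final layout."""
    cores = [deque() for _ in range(n_cores)]
    token = 0
    while policy(cores, capacity, token):
        token += 1
    return token, cores


concat_count, _ = fill(concat_append, n_cores=4, capacity=3)
shift_count, shift_cores = fill(shift_append, n_cores=4, capacity=3)
print("concat capacity:", concat_count)
print("shift capacity:", shift_count)
print("shift layout:", [list(c) for c in shift_cores])
```

In this toy model the baseline stops when a single core fills, while the shift policy uses every core and keeps tokens in sequence order across the chain. Each append moves at most one entry between neighbouring cores, which suits a mesh with cheap neighbour links.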

Key Findings

WaferLLM achieves far higher accelerator utilization than prior methods.

Numbers: up to 200× higher accelerator utilization vs SOTA methods

Practical Use: If you can run on a wafer-scale chip, redesign runtimes around PLMR to unlock large utilization gains.

Evidence Ref: Abstract; §7.1

MeshGEMV vastly accelerates the memory-bound decode step.

Numbers: GEMV 606× faster and 16× more energy-efficient vs single A100 (measured)

Practical Use: Use MeshGEMV or K-tree allreduce on mesh NoCs to cut decode latency and energy for long-output workloads.

Evidence Ref: Abstract; Table 6; §7.3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
GEMV latency (example) | A100 single GPU 0.336 ms vs MeshGEMV 0.0012 ms | A100 single GPU | ≈280–606× faster | GEMV microbenchmarks (Table 6) | Table 6; §7.3 | Table 6
End-to-end LLM throughput (TPR) | WaferLLM TPR example 604.4 vs SGLang 31.1 (LLaMA3-8B, 4096/128) | SGLang on A100 cluster | ≈10–20× faster (cluster-optimal) | End-to-end LLaMA3-8B and LLaMA2-13B (Table 2) | Table 2; §7.1 | Table 2
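As a quick check, the speedups implied by the example values in the two rows above can be recomputed directly. The variable names are illustrative, and the inputs are only the figures quoted in the table.

```python
# Example values quoted in the results table above.
gemv_a100_ms = 0.336   # A100 single-GPU GEMV latency
gemv_mesh_ms = 0.0012  # MeshGEMV latency
tpr_wafer = 604.4      # WaferLLM TPR (LLaMA3-8B, 4096/128)
tpr_sglang = 31.1      # SGLang TPR on an A100 cluster

gemv_speedup = gemv_a100_ms / gemv_mesh_ms
tpr_speedup = tpr_wafer / tpr_sglang

print(f"GEMV speedup for this example: {gemv_speedup:.0f}x")
print(f"End-to-end TPR speedup for this example: {tpr_speedup:.1f}x")
```

The GEMV example lands at the low end of the quoted 280–606× range, and the TPR example sits near the top of the 10–20× range. The headline figures should therefore be read as ranges across configurations.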

What To Try In 7 Days

Profile your workload to confirm decode (GEMV) is bandwidth-bound and benefits from on-chip memory.

Simulate shift-based KV layout on current infra to see memory balance gains before hardware access.

Run WaferLLM code or microbenchmarks on a rented WSE instance (or partner) to compare TPR and energy.
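For the first step, a back-of-the-envelope roofline check is often enough to confirm that decode is bandwidth-bound before any profiling. The sketch below is illustrative: the helper names are hypothetical, the peak figures are approximate public A100 datasheet numbers assumed here rather than taken from the paper, and the 4096×4096 FP16 matrix is an example shape.

```python
def gemv_arithmetic_intensity(rows, cols, bytes_per_elem=2):
    """FLOPs per byte for y = A @ x, reading A and x once and writing y once."""
    flops = 2 * rows * cols  # one multiply and one add per matrix element
    bytes_moved = bytes_per_elem * (rows * cols + cols + rows)
    return flops / bytes_moved


def is_bandwidth_bound(intensity, peak_flops, peak_bytes_per_s):
    """A kernel is bandwidth-bound if its intensity is below the ridge point."""
    return intensity < peak_flops / peak_bytes_per_s


# Assumed approximate peaks; substitute your own hardware's figures.
PEAK_FLOPS = 312e12       # FP16 tensor-core FLOP/s (assumption)
PEAK_BYTES_PER_S = 2.0e12  # HBM bandwidth in bytes/s (assumption)

ai = gemv_arithmetic_intensity(4096, 4096)
ridge = PEAK_FLOPS / PEAK_BYTES_PER_S
print(f"GEMV intensity: {ai:.2f} FLOP/byte, ridge point: {ridge:.0f} FLOP/byte")
print("bandwidth-bound:", is_bandwidth_bound(ai, PEAK_FLOPS, PEAK_BYTES_PER_S))
```

A GEMV performs roughly one FLOP per byte moved, far below the ridge point of a modern GPU, so its runtime is set by memory bandwidth rather than compute. Batching several requests raises the intensity, so repeat the check with your real batch size.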

Agent Features

Memory
local per-core SRAM constrained (48KB/core)
Tool Use
CSL on WSE-2, autotuning for core counts
Frameworks
SGLang, T10, Ladder
Architectures
mesh NoC, wafer-scale many-core

Optimization Features

Token Efficiency
KV cache shift increases decode capacity
Infra Optimization
two-hop interleave communication, K-tree allreduce
System Optimization
PLMR-guided partitioning, fine-grained replication for decode
Inference Optimization
MeshGEMM, MeshGEMV, shift-based KV cache, transpose-free placement

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluated only on Cerebras WSE-2; cross-vendor results may vary.

Per-core SRAM (48KB) forces pipeline parallelism and some underutilization.
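The per-core memory limit can be made concrete with a small calculation of how many cores one weight matrix needs. This is a sketch with assumptions: it treats 48KB as 48 × 1024 bytes, uses a 4096×4096 FP16 matrix as an illustrative shape, and introduces a hypothetical `usable_fraction` parameter for the share of SRAM left after activations, KV cache, and buffers.

```python
import math

SRAM_PER_CORE_BYTES = 48 * 1024  # 48KB per core, read as KiB (assumption)


def min_cores_for(rows, cols, bytes_per_elem=2, usable_fraction=1.0):
    """Lower bound on cores needed to hold a rows x cols matrix in local SRAM.

    usable_fraction is a hypothetical knob for the share of each core's
    SRAM that is actually available for weights.
    """
    total_bytes = rows * cols * bytes_per_elem
    return math.ceil(total_bytes / (SRAM_PER_CORE_BYTES * usable_fraction))


all_sram = min_cores_for(4096, 4096)
half_sram = min_cores_for(4096, 4096, usable_fraction=0.5)
print("cores needed using all SRAM for weights:", all_sram)
print("cores needed using half the SRAM for weights:", half_sram)
```

A single projection matrix of this size already spans hundreds of cores, and a model has many such matrices per layer. That is why partitioning and pipelining choices dominate the design.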

When Not To Use

If you lack access to a wafer-scale NoC device.

For very small models or single-GPU workloads where GPUs are cheaper and easier.

Failure Modes

Routing resource exhaustion if K is chosen too large for hardware limits.

Edge-core underutilization causing pipeline bubbles and lower throughput.

Core Entities

Models

LLaMA3-8B, LLaMA2-13B, CodeLLaMA-34B, QWen2-72B

Metrics

TPR (Throughput per Request), TPOT (Time per Output Token), GEMM/GEMV latency (ms), Energy ratio (A100/WSE-2), accelerator utilization

Benchmarks

prefill (GEMM) throughput, decode (GEMV) throughput, end-to-end Throughput per Request (TPR), KV cache max decode length

Context Entities

Models

A100 GPU, NVLink/InfiniBand multi-GPU clusters, Cerebras WSE-3 (future mention), Tesla Dojo (future comparison)

Metrics

per-hop latency (α), per-routing latency (β), routing paths per core