Overview
Demonstrated on Cerebras WSE-2 with detailed microbenchmarks and open code. Production use requires access to wafer-scale hardware and working around current per-core memory and pipeline limitations.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.87
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 5/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 65%
Why It Matters For Business
Wafer-scale chips can cut per-request latency and improve tokens-per-dollar for long outputs and high-throughput serving, which makes them worth testing for production LLM serving and cost-sensitive, long-context workloads.
Summary TLDR
LLM inference is bandwidth-bound, and GPU-focused systems miss key trade-offs on wafer-scale chips. WaferLLM introduces PLMR, a device model (Parallelism, Latency, Memory, Routing), and three system components: wafer-scale parallelism, MeshGEMM, and MeshGEMV, along with a shift-based KV cache. Implemented on a Cerebras WSE-2, WaferLLM runs full LLaMA-family models on-chip, yielding large speedups (GEMV up to 606× vs a single A100 GPU; end-to-end 10–20× vs optimized A100 clusters) and ~2.5× better energy efficiency for long-output workloads. The code is open source.
Problem Statement
Modern LLM runtimes are tuned for shared-memory GPUs. Wafer-scale accelerators use a massive mesh of small cores with local memory and limited routing. This creates huge non-uniform memory latencies and tight per-core memory and routing limits, so GPU-optimized designs waste bandwidth and underutilize wafer chips.
Main Contribution
PLMR device model that summarizes wafer-scale constraints: Parallelism, Latency, Memory, Routing.
Wafer-scale LLM parallelism policy for prefill/decode and a shift-based KV cache to balance cores.
Key Findings
WaferLLM achieves far higher accelerator utilization than prior methods.
MeshGEMV greatly accelerates the memory-bound decode step: GEMV microbenchmarks show roughly 280–606× lower latency than a single A100 GPU (Table 6).
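To make "memory-bound" concrete, the following minimal sketch estimates the arithmetic intensity of a GEMV and compares it with a device's roofline ridge point. The function names are hypothetical, and the A100 figures (about 312 TFLOP/s dense FP16 and about 1.555 TB/s HBM bandwidth) are approximate spec-sheet values used here as assumptions; substitute measured numbers for your own hardware.

```python
def gemv_arithmetic_intensity(rows, cols, bytes_per_elem=2):
    """FLOPs per byte for y = W @ x, with W of shape (rows, cols).

    Assumes every element of W, x, and y crosses the memory bus once.
    """
    flops = 2 * rows * cols  # one multiply and one add per weight
    bytes_moved = bytes_per_elem * (rows * cols + cols + rows)
    return flops / bytes_moved


def is_bandwidth_bound(intensity, peak_flops, peak_bytes_per_s):
    """True when the kernel sits left of the roofline ridge point."""
    ridge = peak_flops / peak_bytes_per_s
    return intensity < ridge


# Assumed, approximate A100 numbers; replace with your own measurements.
A100_PEAK_FLOPS = 312e12
A100_PEAK_BW = 1.555e12

intensity = gemv_arithmetic_intensity(4096, 4096)
print(f"GEMV intensity: {intensity:.3f} FLOP/byte")
print("bandwidth-bound:", is_bandwidth_bound(intensity, A100_PEAK_FLOPS, A100_PEAK_BW))
```

Under these assumptions an FP16 GEMV performs about one FLOP per byte, far below a ridge point of roughly 200 FLOP/byte, which is why decode speed tracks memory bandwidth rather than peak compute.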
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| GEMV latency (example) | MeshGEMV: 0.0012 ms | A100 single GPU: 0.336 ms | ≈280–606× faster | GEMV microbenchmarks (Table 6) | Table 6; §7.3 | Table 6 |
| End-to-end LLM throughput (TPR) | WaferLLM: 604.4 (LLaMA3-8B, 4096/128) | SGLang on A100 cluster: 31.1 | ≈10–20× faster (cluster-optimal) | End-to-end LLaMA3-8B and LLaMA2-13B (Table 2) | Table 2; §7.1 | Table 2 |
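The deltas can be sanity-checked against the example values in the table. The short sketch below (helper names are hypothetical) recomputes the two ratios; it uses only the four numbers quoted in the rows above and says nothing about the other configurations behind the reported ranges.

```python
def latency_speedup(baseline_ms, new_ms):
    """Speedup when lower is better (latency)."""
    return baseline_ms / new_ms


def throughput_speedup(new_value, baseline_value):
    """Speedup when higher is better (throughput)."""
    return new_value / baseline_value


# Example values copied from the Results table.
gemv = latency_speedup(baseline_ms=0.336, new_ms=0.0012)
e2e = throughput_speedup(new_value=604.4, baseline_value=31.1)
print(f"GEMV example: {gemv:.0f}x, end-to-end example: {e2e:.1f}x")
```

The GEMV example works out to about 280×, the low end of the reported 280–606× range, and the LLaMA3-8B example to about 19.4×, near the top of the reported 10–20× range.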
What To Try In 7 Days
Profile your workload to confirm decode (GEMV) is bandwidth-bound and benefits from on-chip memory.
Simulate shift-based KV layout on current infra to see memory balance gains before hardware access.
Run WaferLLM code or microbenchmarks on a rented WSE instance (or partner) to compare TPR and energy.
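For the KV layout experiment, a toy simulation is enough to see the balancing effect before touching hardware. The sketch below is an illustrative model, not WaferLLM's implementation: it assumes a one-dimensional chain of cores, a naive baseline that always appends at the last core, and a hypothetical shift rule in which a core hands its oldest entry to its previous neighbor whenever it holds more entries than that neighbor.

```python
from collections import deque


def append_tail(cores, token):
    """Naive baseline (assumed): every new entry lands on the last core."""
    cores[-1].append(token)


def append_shift(cores, token):
    """Toy shift policy: insert at the last core, then let each core pass its
    oldest entry to the previous neighbor whenever it holds more entries."""
    cores[-1].append(token)
    for i in range(len(cores) - 1, 0, -1):
        if len(cores[i]) > len(cores[i - 1]):
            cores[i - 1].append(cores[i].popleft())


def simulate(policy, num_cores, num_tokens):
    """Feed num_tokens entries through the policy and return per-core queues."""
    cores = [deque() for _ in range(num_cores)]
    for token in range(num_tokens):
        policy(cores, token)
    return cores


tail = simulate(append_tail, num_cores=4, num_tokens=10)
shift = simulate(append_shift, num_cores=4, num_tokens=10)
print("tail loads: ", [len(c) for c in tail])
print("shift loads:", [len(c) for c in shift])
```

In this model the shift rule keeps per-core occupancy within one entry of each other and preserves token order across cores, using only neighbor-to-neighbor transfers. Real layouts are two-dimensional and constrained by routing, so treat the result as a qualitative check only.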
Risks & Boundaries
Limitations
Evaluated only on Cerebras WSE-2; cross-vendor results may vary.
Per-core SRAM (48 KB) forces pipeline parallelism and leaves some cores underutilized.
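A back-of-the-envelope calculation shows how tight the 48 KB budget is. The sketch below computes a lower bound on the number of cores needed just to hold FP16 weights. The core count (about 850,000) is an assumption taken from public WSE-2 descriptions, the parameter counts are nominal, and the helper name is hypothetical.

```python
import math

SRAM_PER_CORE_BYTES = 48 * 1024  # 48 KB per core, as stated above
ASSUMED_CORE_COUNT = 850_000     # assumption: approximate public WSE-2 figure


def min_cores_for_weights(num_params, bytes_per_param=2):
    """Lower bound on cores needed to hold the weights alone (no KV cache,
    activations, or code), assuming perfectly even sharding."""
    return math.ceil(num_params * bytes_per_param / SRAM_PER_CORE_BYTES)


for name, params in [("8B model", 8_000_000_000), ("13B model", 13_000_000_000)]:
    cores = min_cores_for_weights(params)
    share = cores / ASSUMED_CORE_COUNT
    print(f"{name}: >= {cores:,} cores ({share:.0%} of the assumed total)")
```

Because the bound ignores KV cache, activations, and program code, the real footprint is larger, which gives a rough sense of why a model may need to be split into stages rather than kept resident all at once.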
When Not To Use
If you lack access to a wafer-scale NoC device.
For very small models or single-GPU workloads where GPUs are cheaper and easier.
Failure Modes
Routing resource exhaustion if K is chosen too large for hardware limits.
Edge-core underutilization causing pipeline bubbles and lower throughput.

