Overview
Production Readiness
0.7
Novelty Score
0.65
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
Wafer-scale chips can cut per-request latency and improve tokens per dollar for long outputs and high-throughput serving, so they are worth testing for production LLM serving and cost-sensitive, long-context workloads.
Summary TLDR
LLM inference is bandwidth-bound, and GPU-focused systems miss key trade-offs on wafer-scale chips. WaferLLM introduces PLMR, a device model (Parallelism, Latency, Memory, Routing), and four system pieces: a wafer-scale parallelism policy, MeshGEMM, MeshGEMV, and a shift-based KV cache. Implemented on a Cerebras WSE-2, WaferLLM runs full LLaMA-family models on-chip, yielding large speedups (GEMV up to 606× vs. an A100 GPU; end-to-end 10–20× vs. optimized A100 clusters) and ~2.5× better energy efficiency for long-output workloads. Code is open-sourced.
Problem Statement
Modern LLM runtimes are tuned for shared-memory GPUs. Wafer-scale accelerators instead use a massive mesh of small cores, each with local memory and limited routing. The result is highly non-uniform memory latency plus tight per-core memory and routing limits, so GPU-optimized designs waste bandwidth and underutilize wafer-scale chips.
Main Contribution
PLMR device model that summarizes wafer-scale constraints: Parallelism, Latency, Memory, Routing.
Wafer-scale LLM parallelism policy for prefill/decode and a shift-based KV cache to balance cores.
MeshGEMM and MeshGEMV: new GEMM/GEMV algorithms designed to bound communication hops and routing paths.
An implementation on Cerebras WSE-2 with end-to-end evaluation and open-source code.
Key Findings
WaferLLM achieves far higher accelerator utilization than prior methods.
MeshGEMV vastly accelerates the memory-bound decode step.
End-to-end LLM inference is much faster on wafer-scale hardware.
Prefill (GEMM) benefits from MeshGEMM.
Shift-based KV cache increases token capacity dramatically.
Energy efficiency gains hold end-to-end for long outputs.
Results
GEMV latency (example)
End-to-end LLM throughput (TPR)
Prefill throughput (GEMM)
KV cache capacity (max decode length)
End-to-end energy efficiency
Who Should Care
What To Try In 7 Days
Profile your workload to confirm decode (GEMV) is bandwidth-bound and benefits from on-chip memory.
Simulate shift-based KV layout on current infra to see memory balance gains before hardware access.
Run WaferLLM code or microbenchmarks on a rented WSE instance (or partner) to compare TPR and energy.
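The first step above (checking that decode is bandwidth-bound) can be approximated on paper before any profiling. The following is a minimal sketch under a simple roofline model, not a measurement: it assumes fp16 operands, counts each matrix as moved once, and uses round placeholder device numbers that do not describe any specific accelerator. The function names are hypothetical.

```python
def arithmetic_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul.

    Assumes each operand and the result cross the memory interface once.
    """
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def is_bandwidth_bound(intensity, peak_flops, peak_bytes_per_s):
    """True when the kernel sits left of the roofline ridge point."""
    ridge = peak_flops / peak_bytes_per_s
    return intensity < ridge


# Placeholder device numbers (assumptions): substitute your own hardware specs.
PEAK_FLOPS = 300e12       # FLOP/s
PEAK_BANDWIDTH = 2e12     # bytes/s

hidden = 4096  # illustrative hidden size
decode = arithmetic_intensity(1, hidden, hidden)      # GEMV: one token at a time
prefill = arithmetic_intensity(2048, hidden, hidden)  # GEMM: 2048-token prompt

print(f"decode  intensity: {decode:.2f} FLOP/byte, "
      f"bandwidth-bound: {is_bandwidth_bound(decode, PEAK_FLOPS, PEAK_BANDWIDTH)}")
print(f"prefill intensity: {prefill:.2f} FLOP/byte, "
      f"bandwidth-bound: {is_bandwidth_bound(prefill, PEAK_FLOPS, PEAK_BANDWIDTH)}")
```

If the decode intensity for your model shapes sits far below the ridge point of your current device, memory bandwidth rather than compute is the likely bottleneck, which is the case where on-chip memory should help most. Confirm with a real profiler before drawing conclusions.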
Agent Features
Memory
- per-core local SRAM is constrained (48 KB/core)
Tool Use
- CSL on WSE-2
- autotuning for core counts
Frameworks
- SGLang
- T10
- Ladder
Architectures
- mesh NoC
- wafer-scale many-core
Optimization Features
Token Efficiency
- KV cache shift increases decode capacity
Infra Optimization
- two-hop interleave communication
- K-tree allreduce
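To give a feel for why a K-ary tree helps reductions across many cores, here is a generic K-ary tree reduction in plain Python. It is a textbook-style sketch, not the paper's K-tree allreduce, whose mesh mapping and routing details this summary does not give. The function name `k_tree_reduce` is hypothetical. It shows only that the number of reduction stages shrinks as K grows; each stage then needs a larger fan-in, which is where routing limits would come into play.

```python
def k_tree_reduce(values, k):
    """Sum `values` with a K-ary tree and return (total, number_of_stages).

    Each stage combines groups of up to `k` neighbors into one partial sum.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    level = list(values)
    if not level:
        raise ValueError("values must be non-empty")
    stages = 0
    while len(level) > 1:
        # One tree level: every group of k partial sums collapses into one.
        level = [sum(level[i:i + k]) for i in range(0, len(level), k)]
        stages += 1
    return level[0], stages


partials = list(range(16))  # stand-in for per-core partial results
for k in (2, 4, 16):
    total, stages = k_tree_reduce(partials, k)
    print(f"K={k:2d}: total={total}, stages={stages}")
```

With 16 inputs, K=2 needs four stages and K=4 needs two, while K=16 finishes in one stage at the cost of a 16-way fan-in.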
System Optimization
- PLMR-guided partitioning
- fine-grained replication for decode
Inference Optimization
- MeshGEMM
- MeshGEMV
- shift-based KV cache
- transpose-free placement
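The load-balancing idea behind a shift-based KV cache can be explored with a toy simulation. The sketch below is an assumption-heavy model, not the WaferLLM implementation: it treats the cache as a column of core rows with equal capacity, appends every new token at the last row, and compares a "concat" policy (entries stay where they land) with a "shift" policy (a fuller row hands its oldest entry to the row above). The function `simulate` and its parameters are hypothetical.

```python
from collections import deque


def simulate(num_rows, capacity, prefill_len, policy):
    """Return how many decode tokens fit before some row exceeds `capacity`.

    policy: "concat" keeps new entries in the last row;
            "shift" passes oldest entries upward to keep rows balanced.
    """
    rows = [deque() for _ in range(num_rows)]
    # Spread prefill tokens evenly, oldest tokens in row 0.
    per_row = max(1, -(-prefill_len // num_rows))
    for t in range(prefill_len):
        rows[t // per_row].append(t)

    generated = 0
    token = prefill_len
    while True:
        rows[-1].append(token)  # new KV entry always arrives at the last row
        if policy == "shift":
            # Cascade upward: a fuller row hands its oldest entry to its neighbor.
            for i in range(num_rows - 1, 0, -1):
                if len(rows[i]) > len(rows[i - 1]):
                    rows[i - 1].append(rows[i].popleft())
        if max(len(r) for r in rows) > capacity:
            return generated  # this token did not fit
        generated += 1
        token += 1


print("concat:", simulate(num_rows=4, capacity=8, prefill_len=8, policy="concat"))
print("shift: ", simulate(num_rows=4, capacity=8, prefill_len=8, policy="shift"))
```

In this toy setup the concat policy stops once the last row fills, while the shift policy keeps going until every row is full, so usable decode length scales with the total capacity rather than with one row's capacity.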
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluated only on Cerebras WSE-2; cross-vendor results may vary.
- Per-core SRAM (48KB) forces pipeline parallelism and some underutilization.
- Some speedups depend on on-chip NoC properties not present in all accelerators.
When Not To Use
- If you lack access to a wafer-scale NoC device.
- For very small models or single-GPU workloads where GPUs are cheaper and easier.
- When latency for very short outputs (a few tokens) matters more than long-output throughput.
Failure Modes
- Routing resource exhaustion if K is chosen too large for hardware limits.
- Edge-core underutilization causing pipeline bubbles and lower throughput.
- Autotuning mismatch for variable-length workloads leading to suboptimal core choice.
Core Entities
Models
- LLaMA3-8B
- LLaMA2-13B
- CodeLLaMA-34B
- QWen2-72B
Metrics
- TPR (Throughput per Request)
- TPOT (Time per Output Token)
- GEMM/GEMV latency (ms)
- Energy ratio (A100/WSE-2)
- accelerator utilization
Benchmarks
- prefill (GEMM) throughput
- decode (GEMV) throughput
- end-to-end Throughput per Request (TPR)
- KV cache max decode length
Context Entities
Models
- A100 GPU
- NVLink/InfiniBand multi-GPU clusters
- Cerebras WSE-3 (future mention)
- Tesla Dojo (future comparison)
Metrics
- per-hop latency (α)
- per-routing latency (β)
- routing paths per core

